arxiv: v2 [cs.cv] 15 Apr 2015
|
|
|
- Oscar Gilmore
- 9 years ago
- Views:
Transcription
1 Published as a conference paper at ICLR 215 OBJECT DETECTORS EMERGE IN DEEP SCENE CNNS Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba Computer Science and Artificial Intelligence Laboratory, MIT {bolei,khosla,agata,oliva,torralba}@mit.edu ABSTRACT arxiv: v2 [cs.cv] 15 Apr 215 With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful objects detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects. 1 INTRODUCTION Current deep neural networks achieve remarkable performance at a number of vision tasks surpassing techniques based on hand-crafted features. However, while the structure of the representation in hand-crafted features is often clear and interpretable, in the case of deep networks it remains unclear what the nature of the learned representation is and why it works so well. A convolutional neural network (CNN) trained on ImageNet (Deng et al., 29) significantly outperforms the best hand crafted features on the ImageNet challenge (Russakovsky et al., 214). But more surprisingly, the same network, when used as a generic feature extractor, is also very successful at other tasks like object detection on the PASCAL VOC dataset (Everingham et al., 21). A number of works have focused on understanding the representation learned by CNNs. The work by Zeiler & Fergus (214) introduces a procedure to visualize what activates each unit. Recently Yosinski et al. (214) use transfer learning to measure how generic/specific the learned features are. In Agrawal et al. (214) and Szegedy et al. (213), they suggest that the CNN for ImageNet learns a distributed code for objects. They all use ImageNet, an object-centric dataset, as a training set. When training a CNN to distinguish different object classes, it is unclear what the underlying representation should be. Objects have often been described using part-based representations where parts can be shared across objects, forming a distributed code. However, what those parts should be is unclear. For instance, one would think that the meaningful parts of a face are the mouth, the two eyes, and the nose. However, those are simply functional parts, with words associated with them; the object parts that are important for visual recognition might be different from these semantic parts, making it difficult to evaluate how efficient a representation is. In fact, the strong internal configuration of objects makes the definition of what is a useful part poorly constrained: an algorithm can find different and arbitrary part configurations, all giving similar recognition performance. Learning to classify scenes (i.e., classifying an image as being an office, a restaurant, a street, etc) using the Places dataset (Zhou et al., 214) gives the opportunity to study the internal representation learned by a CNN on a task other than object recognition. In the case of scenes, the representation is clearer. Scene categories are defined by the objects they contain and, to some extent, by the spatial configuration of those objects. For instance, the important parts of a bedroom are the bed, a side table, a lamp, a cabinet, as well as the walls, floor and ceiling. Objects represent therefore a distributed code for scenes (i.e., object classes are shared across different scene categories). Importantly, in scenes, the spatial configuration of objects, 1
2 Published as a conference paper at ICLR 215 Table 1: The parameters of the network architecture used for ImageNet-CNN and Places-CNN. Layer conv1 conv2 conv5 fc6 fc7 Units Feature although compact, has a much larger degree of freedom. It is this loose spatial dependency that, we believe, makes scene representation different from most object classes (most object classes do not have a loose interaction between parts). In addition to objects, other feature regularities of scene categories allow for other representations to emerge, such as textures (Renninger & Malik, 24), GIST (Oliva & Torralba, 26), bag-of-words (Lazebnik et al., 26), part-based models (Pandey & Lazebnik, 211), and ObjectBank (Li et al., 21). While a CNN has enough flexibility to learn any of those representations, if meaningful objects emerge without supervision inside the inner layers of the CNN, there will be little ambiguity as to which type of representation these networks are learning. The main contribution of this paper is to show that object detection emerges inside a CNN trained to recognize scenes, even more than when trained with ImageNet. This is surprising because our results demonstrate that reliable object detectors are found even though, unlike ImageNet, no supervision is provided for objects. Although object discovery with deep neural networks has been shown before in an unsupervised setting (Le, 213), here we find that many more objects can be naturally discovered, in a supervised setting tuned to scene classification rather than object classification. Importantly, the emergence of object detectors inside the CNN suggests that a single network can support recognition at several levels of abstraction (e.g., edges, texture, objects, and scenes) without needing multiple outputs or a collection of networks. Whereas other works have shown that one can detect objects by applying the network multiple times in different locations (Girshick et al., 214), or focusing attention (Tang et al., 214), or by doing segmentation (Grangier et al., 29; Farabet et al., 213), here we show that the same network can do both object localization and scene recognition in a single forward-pass. Another set of recent works (Oquab et al., 214; Bergamo et al., 214) demonstrate the ability of deep networks trained on object classification to do localization without bounding box supervision. However, unlike our work, these require object-level supervision while we only use scenes. 2 IMAGENET-CNN AND PLACES-CNN Convolutional neural networks have recently obtained astonishing performance on object classification (Krizhevsky et al., 212) and scene classification (Zhou et al., 214). The ImageNet-CNN from Jia (213) is trained on 1.3 million images from 1 object categories of ImageNet (ILSVRC 212) and achieves a top-1 accuracy of 57.4%. With the same network architecture, Places-CNN is trained on 2.4 million images from 25 scene categories of Places Database (Zhou et al., 214), and achieves a top-1 accuracy of 5.%. The network architecture used for both CNNs, as proposed in (Krizhevsky et al., 212), is summarized in Table 1 1. Both networks are trained from scratch using only the specified dataset. The deep features from Places-CNN tend to perform better on scene-related recognition tasks compared to the features from ImageNet-CNN. For example, as compared to the Places-CNN that achieves 5.% on scene classification, the ImageNet-CNN combined with a linear SVM only achieves 4.8% on the same test set 2 illustrating the importance of having scene-centric data. To further highlight the difference in representations, we conduct a simple experiment to identify the differences in the type of images preferred at the different layers of each network: we create a set of 2k images with an approximately equal distribution of scene-centric and object-centric images 3, and run them through both networks, recording the activations at each layer. For each layer, we obtain the top 1 images that have the largest average activation (sum over all spatial locations for 1 We use unit to refer to neurons in the various layers and features to refer to their activations. 2 Scene recognition demo of Places-CNN is available at html. The demo has 77.3% top-5 recognition rate in the wild estimated from 968 anonymous user responses. 3 1k object-centric images from the test set of ImageNet LSVRC212 and 18k scene-centric images from the SUN dataset (Xiao et al., 214). 2
3 Published as a conference paper at ICLR 215 fc7 ImageNet-CNN Places-CNN Figure 1: Top 3 images producing the largest activation of units in each layer of ImageNet-CNN (top) and Places-CNN (bottom). a given layer). Fig. 1 shows the top 3 images for each layer. We observe that the earlier layers such as and prefer similar images for both networks while the later layers tend to be more specialized to the specific task of scene or object categorization. For layer, 55% and 47% of the top-1 images belong to the ImageNet dataset for ImageNet-CNN and Places-CNN. Starting from layer, we observe a significant difference in the number of top-1 belonging to each dataset corresponding to each network. For fc7, we observe that 78% and 24% of the top-1 images belong to the ImageNet dataset for the ImageNet-CNN and Places-CNN respectively, illustrating a clear bias in each network. In the following sections, we further investigate the differences between these networks, and focus on better understanding the nature of the representation learned by Places-CNN when doing scene classification in order to clarify some part of the secret to their great performance. 3 U NCOVERING THE CNN REPRESENTATION The performance of scene recognition using Places-CNN is quite impressive given the difficulty of the task. In this section, our goal is to understand the nature of the representation that the network is learning. 3.1 S IMPLIFYING THE INPUT IMAGES Simplifying images is a well known strategy to test human recognition. For example, one can remove information from the image to test if it is diagnostic or not of a particular object or scene (for a review see Biederman (1995)). A similar procedure was also used by Tanaka (1993) to understand the receptive fields of complex cells in the inferior temporal cortex (IT). Inspired by these approaches, our idea is the following: given an image that is correctly classified by the network, we want to simplify this image such that it keeps as little visual information as possible while still having a high classification score for the same category. This simplified image (named minimal image representation) will allow us to highlight the elements that lead to the high classification score. In order to do this, we manipulate images in the gradient space as typically done in computer graphics (Pe rez et al., 23). We investigate two different approaches described below. In the first approach, given an image, we create a segmentation of edges and regions and remove segments from the image iteratively. At each iteration we remove the segment that produces the smallest decrease of the correct classification score and we do this until the image is incorrectly classified. At the end, we get a representation of the original image that contains, approximately, the minimal amount of information needed by the network to correctly recognize the scene category. In Fig. 2 we show some examples of these minimal image representations. Notice that objects seem to contribute important information for the network to recognize the scene. For instance, in the case of bedrooms these minimal image representations usually contain the region of the bed, or in the art gallery category, the regions of the paintings on the walls. Based on the previous results, we hypothesized that for the Places-CNN, some objects were crucial for recognizing scenes. This inspired our second approach: we generate the minimal image representations using the fully annotated image set of SUN Database (Xiao et al., 214) (see section 4.1 for details on this dataset) instead of performing automatic segmentation. We follow the same procedure as the first approach using the ground-truth object segments provided in the database. This led to some interesting observations: for bedrooms, the minimal representations retained the bed in 87% of the cases. Other objects kept in bedrooms were wall (28%) and window (21%). 3
4 Published as a conference paper at ICLR 215 Figure 2: Each pair of images shows the original image (left) and a simplified image (right) that gets classified by the Places-CNN as the same scene category as the original image. From top to bottom, the four rows show different scene categories: bedroom, auditorium, art gallery, and dining room. discrepancy maps for top 1 images sliding-window stimuli calibrated discrepancy maps receptive field Figure 3: The pipeline for estimating the RF of each unit. Each sliding-window stimuli contains a small randomized patch (example indicated by red arrow) at different spatial locations. By comparing the activation response of the sliding-window stimuli with the activation response of the original image, we obtain a discrepancy map for each image (middle top). By summing up the calibrated discrepancy maps (middle bottom) for the top ranked images, we obtain the actual RF of that unit (right). For art gallery the minimal image representations contained paintings (81%) and pictures (58%); in amusement parks, carousel (75%), ride (64%), and roller coaster (5%); in bookstore, bookcase (96%), books (68%), and shelves (67%). These results suggest that object detection is an important part of the representation built by the network to obtain discriminative information for scene classification. 3.2 VISUALIZING THE RECEPTIVE FIELDS OF UNITS AND THEIR ACTIVATION PATTERNS In this section, we investigate the shape and size of the receptive fields (RFs) of the various units in the CNNs. While theoretical RF sizes can be computed given the network architecture (Long et al., 214), we are interested in the actual, or empirical size of the RFs. We expect the empirical RFs to be better localized and more representative of the information they capture than the theoretical ones, allowing us to better understand what is learned by each unit of the CNN. Thus, we propose a data-driven approach to estimate the learned RF of each unit in each layer. It is simpler than the deconvolutional network visualization method (Zeiler & Fergus, 214) and can be easily extended to visualize any learned CNNs 4. The procedure for estimating a given unit s RF, as illustrated in Fig. 3, is as follows. As input, we use an image set of 2k images with a roughly equal distribution of scenes and objects (similar to Sec. 2). Then, we select the top K images with the highest activations for the given unit. 4 More visualizations are available at 4
5 Published as a conference paper at ICLR 215 Figure 4: The RFs of 3 units of,,, and layers respectively for ImageNetand Places-CNNs, along with the image patches corresponding to the top activation regions inside the RFs. Table 2: Comparison of the theoretical and empirical sizes of the RFs for Places-CNN and ImageNet-CNN at different layers. Note that the RFs are assumed to be square shaped, and the sizes reported below are the length of each side of this square, in pixels. Theoretic size Places-CNN actual size 17.8± ± ±1.6 6.± ± 2. ImageNet-CNN actual size 17.9± ± ± ± ± 21.6 For each of the K images, we now want to identify exactly which regions of the image lead to the high unit activations. To do this, we replicate each image many times with small random occluders (image patches of size 11 11) at different locations in the image. Specifically, we generate occluders in a dense grid with a stride of 3. This results in about 5 occluded images per original image. We now feed all the occluded images into the same network and record the change in activation as compared to using the original image. If there is a large discrepancy, we know that the given patch is important and vice versa. This allows us to build a discrepancy map for each image. Finally, to consolidate the information from the K images, we center the discrepancy map around the spatial location of the unit that caused the maximum activation for the given image. Then we average the re-centered discrepancy maps to generate the final RF. In Fig. 4 we visualize the RFs for units from 4 different layers of the Places-CNN and ImageNet- CNN, along with their highest scoring activation regions inside the RF. We observe that, as the layers go deeper, the RF size gradually increases and the activation regions become more semantically meaningful. Further, as shown in Fig. 5, we use the RFs to segment images using the feature maps of different units. Lastly, in Table 2, we compare the theoretical and empirical size of the RFs at different layers. As expected, the actual size of the RF is much smaller than the theoretical size, especially in the later layers. Overall, this analysis allows us to better understand each unit by focusing precisely on the important regions of each image. Places-CNN ImageNet-CNN Figure 5: Segmentation based on RFs. Each row shows the 4 most confident images for some unit. 5
6 Published as a conference paper at ICLR 215 Figure 6: AMT interface for unit concept annotation. There are three tasks in each annotation. 3.3 IDENTIFYING THE SEMANTICS OF INTERNAL UNITS In Section 3.2, we found the exact RFs of units and observed that activation regions tended to become more semantically meaningful with increasing depth of layers. In this section, our goal is to understand and quantify the precise semantics learned by each unit. In order to do this, we ask workers on Amazon Mechanical Turk (AMT) to identify the common theme or concept that exists between the top scoring segmentations for each unit. We expect the tags provided by naive annotators to reduce biases. Workers provide tags without being constrained to a dictionary of terms that could bias or limit the identification of interesting properties. Specifically, we divide the task into three main steps as shown in Fig. 6. We show workers the top 6 segmented images that most strongly activate one unit and we ask them to (1) identify the concept, or semantic theme given by the set of 6 images e.g., car, blue, vertical lines, etc, (2) mark the set of images that do not fall into this theme, and (3) categorize the concept provided in (1) to one of 6 semantic groups ranging from low-level to high-level: simple elements and colors (e.g., horizontal lines, blue), materials and textures (e.g., wood, square grid), regions ans surfaces (e.g., road, grass), object parts (e.g., head, leg), objects (e.g., car, person), and scenes (e.g., kitchen, corridor). This allows us to obtain both the semantic information for each unit, as well as the level of abstraction provided by the labeled concept. To ensure high quality of annotation, we included 3 images with high negative scores that the workers were required to identify as negatives in order to submit the task. Fig. 7 shows some example annotations by workers. For each unit, we measure its precision as the percentage of images that were selected as fitting the labeled concept. In Fig. 8.(a) we plot the average precision for ImageNet- CNN and Places-CNN for each layer. In Fig. 8.(b-c) we plot the distribution of concept categories for ImageNet-CNN and Places-CNN at each layer. For this plot we consider only units that had a precision above 75% as provided by the AMT workers. Around 6% of the units on each layer where above that threshold. For both networks, units at the early layers (, ) have more units responsive to simple elements and colors, while those at later layers (, ) have more high-level semantics (responsive more to objects and scenes). Furthermore, we observe that and units in Places-CNN have higher ratios of high-level semantics as compared to the units in ImageNet-CNN. Fig. 9 provides a different visualization of the same data as in Fig. 8.(b-c). This plot better reveals how different levels of abstraction emerge in different layers of both networks. The vertical axis indicates the percentage of units in each layer assigned to each concept category. ImageNet-CNN has more units tuned to simple elements and colors than Places-CNN while Places-CNN has more objects and scenes. ImageNet-CNN has more units tuned to object parts (with the maximum around ). It is interesting to note that Places-CNN discovers more objects than ImageNet-CNN despite having no object-level supervision. 6
7 Published as a conference paper at ICLR 215 Pool5, unit 76; Label: ocean; Type: scene; Precision: 93% Pool5, unit 13; Label: Lamps; Type: object; Precision: 84% Pool5, unit 77; Label:legs; Type: object part; Precision: 96% Pool5, unit 22; Label: dinner table; Type: scene; Precision: 6% Pool5, unit 112; Label: pool table; Type: object; Precision: 7% Pool5, unit 168; Label: shrubs; Type: object; Precision: 54% Figure 7: Examples of unit annotations provided by AMT workers for 6 units from in PlacesCNN. For each unit the figure shows the label provided by the worker, the type of label, the images selected as corresponding to the concept (green box) and the images marked as incorrect (red box). The precision is the percentage of correct images. The top three units have high performance while the bottom three have low performance (< 75%). 7
8 Average precision per layer Number of units (p>75%) Number of units (p>75%) Published as a conference paper at ICLR Places CNN ImageNet CNN 1 8 ImageNet CNN 1 8 Places CNN simple elements & color texture materials region surface object part object scene a) b) c) Figure 8: (a) Average precision of all the units in each layer for both networks as reported by AMT workers. (b) and (c) show the number of units providing different levels of semantics for ImageNet- CNN and Places-CNN respectively. Simple elements & colors Texture materials Region or surface Object part Object Scene percent units (perf>75%) places-cnn imagenet-cnn Figure 9: Distribution of semantic types found for all the units in both networks. From left to right, each plot corresponds to the distribution of units in each layer assigned to simple elements or colors, textures or materials, regions or surfaces, object parts, objects, and scenes. The vertical axis is the percentage of units with each layer assigned to each type of concept. 4 EMERGENCE OF OBJECTS AS THE INTERNAL REPRESENTATION As shown before, a large number of units in are devoted to detecting objects and sceneregions (Fig. 9). But what categories are found? Is each category mapped to a single unit or are there multiple units for each object class? Can we actually use this information to segment a scene? 4.1 WHAT OBJECT CLASSES EMERGE? To answer the question of why certain objects emerge from, we tested ImageNet-CNN and Places-CNN on fully annotated images from the SUN database (Xiao et al., 214). The SUN database contains 822 fully annotated images from the same 25 place categories used to train Places-CNN. There are no duplicate images between SUN and Places. We use SUN instead of COCO (Lin et al., 214) as we need dense object annotations to study what the most informative object classes for scene categorization are, and what the natural object frequencies in scene images are. For this study, we manually mapped the tags given by AMT workers to the SUN categories. Fig. 1(a) shows the distribution of objects found in of Places-CNN. Some objects are detected by several units. For instance, there are 15 units that detect buildings. Fig. 11 shows some units from the Places-CNN grouped by the type of object class they seem to be detecting. Each row shows the top five images for a particular unit that produce the strongest activations. The segmentation shows the regions of the image for which the unit is above a certain threshold. Each unit seems to be selective to a particular appearance of the object. For instance, there are 6 units that detect lamps, each unit detecting a particular type of lamp providing finer-grained discrimination; there are 9 units selective to people, each one tuned to different scales or people doing different tasks. 8
9 building tree grass floor mountain person plant water window ceiling lamp pitch road arcade bridge cabinet chair food lighthouse path sky tower wall water tower bed bookcase car ceiling cementery column desk desk lamp field grandstand ground iceberg phone booth railing river rocks sand screen sea seats showcase snowy ground street swimming pool tent text wardrobe waterfall windmill dog bird person wheel animal body flower ground head legs animal face animal head building car cat ceiling face human face leg monkey plant plants pot road sea tower tree water window Counts Counts Published as a conference paper at ICLR a) b) Figure 1: Object counts of CNN units discovering each object class for (a) Places-CNN and (b) ImageNet-CNN. Fig. 1(b) shows the distribition of objects found in of ImageNet-CNN. ImageNet has an abundance of animals among the categories present: in the ImageNet-CNN, out of the 256 units in, there are 15 units devoted to detecting dogs and several more detecting parts of dogs (body, legs,...). The categories found in tend to follow the target categories in ImageNet. Why do those objects emerge? One possibility is that the objects that emerge in correspond to the most frequent ones in the database. Fig. 12(a) shows the sorted distribution of object counts in the SUN database which follows Zipf s law. Fig. 12(b) shows the counts of units found in for each object class (same sorting as in Fig. 12(a)). The correlation between object frequency in the database and object frequency discovered by the units in is.54. Another possibility is that the objects that emerge are the objects that allow discriminating among scene categories. To measure the set of discriminant objects we used the ground truth in the SUN database to measure the classification performance achieved by each object class for scene classification. Then we count how many times each object class appears as the most informative one. This measures the number of scene categories a particular object class is the most useful for. The counts are shown in Fig. 12(c). Note the similarity between Fig. 12(b) and Fig. 12(c). The correlation is.84 indicating that the network is automatically identifying the most discriminative object categories to a large extent. Note that there are 115 units in of Places-CNN not detecting objects. This could be due to incomplete learning or a complementary texture-based or part-based representation of the scenes. Therefore, although objects seem to be a key part of the representation learned by the network, we cannot rule out other representations being used in combination with objects. 4.2 OBJECT LOCALIZATION WITHIN THE INNER LAYERS Places-CNN is trained to do scene classification using the output of the final layer of logistic regression and achieves state-of-the-art performance. From our analysis above, many of the units in the inner layers could perform interpretable object localization. Thus we could use this single Places- CNN with the annotation of units to do both scene recognition and object localization in a single forward-pass. Fig. 13 shows an example of the output of different layers of the Places-CNN using the tags provided by AMT workers. Bounding boxes are shown around the areas where each unit is activated within its RF above a certain threshold. In Fig. 14 we provide the segmentation performance of the objects discovered in using the SUN database. The performance of many units is very high which provides strong evidence that they are indeed detecting those object classes despite being trained for scene classification. 5 CONCLUSION We find that object detectors emerge as a result of learning to classify scene categories, showing that a single network can support recognition at several levels of abstraction (e.g., edges, textures, objects, and scenes) without needing multiple outputs or networks. While it is common to train a network to do several tasks and to use the final layer as the output, here we show that reliable outputs can be extracted at each layer. As objects are the parts that compose a scene, detectors tuned to the objects that are discriminant between scenes are learned in the inner layers of the network. Note 9
10 Published as a conference paper at ICLR 215 Buildings 56) building Indoor objects 182) food Furniture 18) billard table Outdoor objects 87) car 12) arcade 46) painting 155) bookcase 61) road 8) bridge 16) screen 116) bed 96) swimming pool 123) building 53) staircase 38) cabinet 28) water tower 119) building 17) wardrobe 85) chair 6) windmill 9) lighthouse People 3) person Lighting 55) ceiling lamp Nature 195) grass Scenes 145) cementery 49) person 174) ceiling lamp 89) iceberg 127) street 138) person 223) ceiling lamp 14) mountain 218) pitch 1) person 13) desk lamp 159) sand Figure 11: Segmentations using units from Places-CNN. Many classes are encoded by several units covering different object appearances. Each row shows the 5 most confident images for each unit. The number represents the unit number in. that only informative objects for specific scene recognition tasks will emerge. Future work should explore which other tasks would allow for other object classes to be learned without the explicit supervision of object labels. ACKNOWLEDGMENTS This work is supported by the National Science Foundation under Grant No to A.O, ONR MURI N to A.T, as well as MIT Big Data Initiative at CSAIL, Google and Xerox Awards, a hardware donation from NVIDIA Corporation, to A.O and A.T. REFERENCES Agrawal, Pulkit, Girshick, Ross, and Malik, Jitendra. neural networks for object recognition. ECCV, 214. Analyzing the performance of multilayer Bergamo, Alessandro, Bazzani, Loris, Anguelov, Dragomir, and Torresani, Lorenzo. Self-taught object localization with deep networks. arxiv preprint arxiv: , 214. Biederman, Irving. Visual object recognition, volume 2. MIT press Cambridge, Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale hierarchical image database. In CVPR, 29. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The pascal visual object classes challenge. IJCV, 21. Farabet, Clement, Couprie, Camille, Najman, Laurent, and LeCun, Yann. Learning hierarchical features for scene labeling. TPAMI,
11 Published as a conference paper at ICLR Object counts in SUN 1 5 a) 2 wall window chair building floor tree ceiling lamp cabinet ceiling person plant cushion sky picture curtain painting door desk lamp side table table bed books pillow mountain car pot armchair box vase flowers road grass bottle shoes sofa outlet worktop sign book sconce plate mirror column rug basket ground desk coffee table clock shelves Counts of CNN units discovering each object class b) 5 Object counts of most informative objects for scene recognition 3 wall window chair building floor tree ceiling lamp cabinet ceiling person plant cushion sky picture curtain painting door desk lamp side table table bed books pillow mountain car pot armchair box vase flowers road grass bottle shoes sofa outlet worktop sign book sconce plate mirror column rug basket ground desk coffee table clock shelves 2 1 c) wall window chair building floor tree ceiling lamp cabinet ceiling person plant cushion sky picture curtain painting door desk lamp side table table bed books pillow mountain car pot armchair box vase flowers road grass bottle shoes sofa outlet worktop sign book sconce plate mirror column rug basket ground desk coffee table clock shelves Figure 12: (a) Object frequency in SUN (only top 5 objects shown), (b) Counts of objects discovered by in Places-CNN. (c) Frequency of most informative objects for scene classification. Figure 13: Interpretation of a picture by different layers of the Places-CNN using the tags provided by AMT workers. The first shows the final layer output of Places-CNN. The other three show detection results along with the confidence based on the units activation and the semantic tags. Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation Grangier, D., Bottou, L., and Collobert, R. Deep convolutional networks for scene parsing. TPAMI, 29. Jia, Yangqing. Caffe: An open source convolutional architecture for fast feature embedding, 213. Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, 212. Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 26. Le, Quoc V. Building high-level features using large scale unsupervised learning. In ICASSP,
12 Counts Precision Published as a conference paper at ICLR 215 Fireplace (J=5.3%, AP=22.9%) Wardrobe (J=4.2%, AP=12.7%) Billiard table (J=3.2%, AP=42.6%) Building (J=14.6%, AP=47.2%) Bed (J=24.6%, AP=81.1%) Mountain (J=11.3%, AP=47.6%) Sofa (J=1.8%, AP=36.2%) Washing machine (J=3.2%, AP=34.4%) b) Sofa Car Bed Desk lamp Swimming pool Recall 4 2 a) c) Average precision (AP) Figure 14: (a) Segmentation of images from the SUN database using of Places-CNN (J = Jaccard segmentation index, AP = average precision-recall.) (b) Precision-recall curves for some discovered objects. (c) Histogram of AP for all discovered object classes. Li, Li-Jia, Su, Hao, Fei-Fei, Li, and Xing, Eric P. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In NIPS, pp , 21. Lin, Tg-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollr, Piotr, and Zitnick, C. Lawrence. Microsoft COCO: Common objects in context. In ECCV, 214. Long, Jonathan, Zhang, Ning, and Darrell, Trevor. Do convnets learn correspondence? In NIPS, 214. Oliva, A. and Torralba, A. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 26. Oquab, Maxime, Bottou, Léon, Laptev, Ivan, Sivic, Josef, et al. Weakly supervised object recognition with convolutional neural networks. In NIPS Pandey, M. and Lazebnik, S. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 211. Pérez, Patrick, Gangnet, Michel, and Blake, Andrew. Poisson image editing. ACM Trans. Graph., 23. Renninger, Laura Walker and Malik, Jitendra. When is scene identification just texture recognition? Vision research, 44(19): , 24. Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 214. Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, and Fergus, Rob. Intriguing properties of neural networks. arxiv preprint arxiv: , 213. Tanaka, Keiji. Neuronal mechanisms of object recognition. Science, 262(5134): , Tang, Yichuan, Srivastava, Nitish, and Salakhutdinov, Ruslan R. Learning generative models with visual attention. In NIPS Xiao, J, Ehinger, K A., Hays, J, Torralba, A, and Oliva, A. collection of scene categories. IJCV, 214. SUN database: Exploring a large Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, and Lipson, Hod. How transferable are features in deep neural networks? In NIPS, 214. Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 214. Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, and Oliva, Aude. Learning deep features for scene recognition using places database. In NIPS,
CS 1699: Intro to Computer Vision. Deep Learning. Prof. Adriana Kovashka University of Pittsburgh December 1, 2015
CS 1699: Intro to Computer Vision Deep Learning Prof. Adriana Kovashka University of Pittsburgh December 1, 2015 Today: Deep neural networks Background Architectures and basic operations Applications Visualizing
Convolutional Feature Maps
Convolutional Feature Maps Elements of efficient (and accurate) CNN-based object detection Kaiming He Microsoft Research Asia (MSRA) ICCV 2015 Tutorial on Tools for Efficient Object Detection Overview
Compacting ConvNets for end to end Learning
Compacting ConvNets for end to end Learning Jose M. Alvarez Joint work with Lars Pertersson, Hao Zhou, Fatih Porikli. Success of CNN Image Classification Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,
Module 5. Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016
Module 5 Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016 Previously, end-to-end.. Dog Slide credit: Jose M 2 Previously, end-to-end.. Dog Learned Representation Slide credit: Jose
arxiv:1506.03365v2 [cs.cv] 19 Jun 2015
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop Fisher Yu Yinda Zhang Shuran Song Ari Seff Jianxiong Xiao arxiv:1506.03365v2 [cs.cv] 19 Jun 2015 Princeton
Pedestrian Detection with RCNN
Pedestrian Detection with RCNN Matthew Chen Department of Computer Science Stanford University [email protected] Abstract In this paper we evaluate the effectiveness of using a Region-based Convolutional
Image Classification for Dogs and Cats
Image Classification for Dogs and Cats Bang Liu, Yan Liu Department of Electrical and Computer Engineering {bang3,yan10}@ualberta.ca Kai Zhou Department of Computing Science [email protected] Abstract
Bert Huang Department of Computer Science Virginia Tech
This paper was submitted as a final project report for CS6424/ECE6424 Probabilistic Graphical Models and Structured Prediction in the spring semester of 2016. The work presented here is done by students
CNN Based Object Detection in Large Video Images. WangTao, [email protected] IQIYI ltd. 2016.4
CNN Based Object Detection in Large Video Images WangTao, [email protected] IQIYI ltd. 2016.4 Outline Introduction Background Challenge Our approach System framework Object detection Scene recognition Body
Deformable Part Models with CNN Features
Deformable Part Models with CNN Features Pierre-André Savalle 1, Stavros Tsogkas 1,2, George Papandreou 3, Iasonas Kokkinos 1,2 1 Ecole Centrale Paris, 2 INRIA, 3 TTI-Chicago Abstract. In this work we
Lecture 6: CNNs for Detection, Tracking, and Segmentation Object Detection
CSED703R: Deep Learning for Visual Recognition (206S) Lecture 6: CNNs for Detection, Tracking, and Segmentation Object Detection Bohyung Han Computer Vision Lab. [email protected] 2 3 Object detection
Learning and transferring mid-level image representions using convolutional neural networks
Willow project-team Learning and transferring mid-level image representions using convolutional neural networks Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic 1 Image classification (easy) Is there
CAP 6412 Advanced Computer Vision
CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong Jan 26, 2016 Today Administrivia A bigger picture and some common questions Object detection proposals, by Samer
Object Recognition. Selim Aksoy. Bilkent University [email protected]
Image Classification and Object Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] Image classification Image (scene) classification is a fundamental
Object Detection in Video using Faster R-CNN
Object Detection in Video using Faster R-CNN Prajit Ramachandran University of Illinois at Urbana-Champaign [email protected] Abstract Convolutional neural networks (CNN) currently dominate the computer
Transfer Learning by Borrowing Examples for Multiclass Object Detection
Transfer Learning by Borrowing Examples for Multiclass Object Detection Joseph J. Lim CSAIL, MIT [email protected] Ruslan Salakhutdinov Department of Statistics, University of Toronto [email protected]
Image and Video Understanding
Image and Video Understanding 2VO 710.095 WS Christoph Feichtenhofer, Axel Pinz Slide credits: Many thanks to all the great computer vision researchers on which this presentation relies on. Most material
Transform-based Domain Adaptation for Big Data
Transform-based Domain Adaptation for Big Data Erik Rodner University of Jena Judy Hoffman Jeff Donahue Trevor Darrell Kate Saenko UMass Lowell Abstract Images seen during test time are often not from
Lecture 6: Classification & Localization. boris. [email protected]
Lecture 6: Classification & Localization boris. [email protected] 1 Agenda ILSVRC 2014 Overfeat: integrated classification, localization, and detection Classification with Localization Detection. 2 ILSVRC-2014
Administrivia. Traditional Recognition Approach. Overview. CMPSCI 370: Intro. to Computer Vision Deep learning
: Intro. to Computer Vision Deep learning University of Massachusetts, Amherst April 19/21, 2016 Instructor: Subhransu Maji Finals (everyone) Thursday, May 5, 1-3pm, Hasbrouck 113 Final exam Tuesday, May
Do Convnets Learn Correspondence?
Do Convnets Learn Correspondence? Jonathan Long Ning Zhang Trevor Darrell University of California Berkeley {jonlong, nzhang, trevor}@cs.berkeley.edu Abstract Convolutional neural nets (convnets) trained
Scalable Object Detection by Filter Compression with Regularized Sparse Coding
Scalable Object Detection by Filter Compression with Regularized Sparse Coding Ting-Hsuan Chao, Yen-Liang Lin, Yin-Hsi Kuo, and Winston H Hsu National Taiwan University, Taipei, Taiwan Abstract For practical
Learning to Process Natural Language in Big Data Environment
CCF ADL 2015 Nanchang Oct 11, 2015 Learning to Process Natural Language in Big Data Environment Hang Li Noah s Ark Lab Huawei Technologies Part 1: Deep Learning - Present and Future Talk Outline Overview
Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report
Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles [email protected] and lunbo
3D Model based Object Class Detection in An Arbitrary View
3D Model based Object Class Detection in An Arbitrary View Pingkun Yan, Saad M. Khan, Mubarak Shah School of Electrical Engineering and Computer Science University of Central Florida http://www.eecs.ucf.edu/
arxiv:1409.1556v6 [cs.cv] 10 Apr 2015
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION Karen Simonyan & Andrew Zisserman + Visual Geometry Group, Department of Engineering Science, University of Oxford {karen,az}@robots.ox.ac.uk
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected]
Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected] Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
Semantic Recognition: Object Detection and Scene Segmentation
Semantic Recognition: Object Detection and Scene Segmentation Xuming He [email protected] Computer Vision Research Group NICTA Robotic Vision Summer School 2015 Acknowledgement: Slides from Fei-Fei
arxiv:1312.6034v2 [cs.cv] 19 Apr 2014
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps arxiv:1312.6034v2 [cs.cv] 19 Apr 2014 Karen Simonyan Andrea Vedaldi Andrew Zisserman Visual Geometry Group,
Pedestrian Detection using R-CNN
Pedestrian Detection using R-CNN CS676A: Computer Vision Project Report Advisor: Prof. Vinay P. Namboodiri Deepak Kumar Mohit Singh Solanki (12228) (12419) Group-17 April 15, 2016 Abstract Pedestrian detection
InstaNet: Object Classification Applied to Instagram Image Streams
InstaNet: Object Classification Applied to Instagram Image Streams Clifford Huang Stanford University [email protected] Mikhail Sushkov Stanford University [email protected] Abstract The growing
Studying Relationships Between Human Gaze, Description, and Computer Vision
Studying Relationships Between Human Gaze, Description, and Computer Vision Kiwon Yun1, Yifan Peng1, Dimitris Samaras1, Gregory J. Zelinsky2, Tamara L. Berg1 1 Department of Computer Science, 2 Department
Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang
Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform
Taking Inverse Graphics Seriously
CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously Geoffrey Hinton Department of Computer Science University of Toronto The representation used by the neural nets that work best
Fast R-CNN Object detection with Caffe
Fast R-CNN Object detection with Caffe Ross Girshick Microsoft Research arxiv code Latest roasts Goals for this section Super quick intro to object detection Show one way to tackle obj. det. with ConvNets
High Level Describable Attributes for Predicting Aesthetics and Interestingness
High Level Describable Attributes for Predicting Aesthetics and Interestingness Sagnik Dhar Vicente Ordonez Tamara L Berg Stony Brook University Stony Brook, NY 11794, USA [email protected] Abstract
Edge Boxes: Locating Object Proposals from Edges
Edge Boxes: Locating Object Proposals from Edges C. Lawrence Zitnick and Piotr Dollár Microsoft Research Abstract. The use of object proposals is an effective recent approach for increasing the computational
Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object
SSD: Single Shot MultiBox Detector
SSD: Single Shot MultiBox Detector Wei Liu 1, Dragomir Anguelov 2, Dumitru Erhan 3, Christian Szegedy 3, Scott Reed 4, Cheng-Yang Fu 1, Alexander C. Berg 1 1 UNC Chapel Hill 2 Zoox Inc. 3 Google Inc. 4
Introduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Deep Learning Barnabás Póczos & Aarti Singh Credits Many of the pictures, results, and other materials are taken from: Ruslan Salakhutdinov Joshua Bengio Geoffrey
arxiv:1604.08893v1 [cs.cv] 29 Apr 2016
Faster R-CNN Features for Instance Search Amaia Salvador, Xavier Giró-i-Nieto, Ferran Marqués Universitat Politècnica de Catalunya (UPC) Barcelona, Spain {amaia.salvador,xavier.giro}@upc.edu Shin ichi
Character Image Patterns as Big Data
22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,
Multi-view Face Detection Using Deep Convolutional Neural Networks
Multi-view Face Detection Using Deep Convolutional Neural Networks Sachin Sudhakar Farfade Yahoo [email protected] Mohammad Saberian Yahoo [email protected] Li-Jia Li Yahoo [email protected]
Applications of Deep Learning to the GEOINT mission. June 2015
Applications of Deep Learning to the GEOINT mission June 2015 Overview Motivation Deep Learning Recap GEOINT applications: Imagery exploitation OSINT exploitation Geospatial and activity based analytics
Fast R-CNN. Author: Ross Girshick Speaker: Charlie Liu Date: Oct, 13 th. Girshick, R. (2015). Fast R-CNN. arxiv preprint arxiv:1504.08083.
Fast R-CNN Author: Ross Girshick Speaker: Charlie Liu Date: Oct, 13 th Girshick, R. (2015). Fast R-CNN. arxiv preprint arxiv:1504.08083. ECS 289G 001 Paper Presentation, Prof. Lee Result 1 67% Accuracy
Object Categorization using Co-Occurrence, Location and Appearance
Object Categorization using Co-Occurrence, Location and Appearance Carolina Galleguillos Andrew Rabinovich Serge Belongie Department of Computer Science and Engineering University of California, San Diego
Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks
1 Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks Tomislav Hrkać, Karla Brkić, Zoran Kalafatić Faculty of Electrical Engineering and Computing University of
Localizing 3D cuboids in single-view images
Localizing 3D cuboids in single-view images Jianxiong Xiao Bryan C. Russell Antonio Torralba Massachusetts Institute of Technology University of Washington Abstract In this paper we seek to detect rectangular
Group Sparse Coding. Fernando Pereira Google Mountain View, CA [email protected]. Dennis Strelow Google Mountain View, CA strelow@google.
Group Sparse Coding Samy Bengio Google Mountain View, CA [email protected] Fernando Pereira Google Mountain View, CA [email protected] Yoram Singer Google Mountain View, CA [email protected] Dennis Strelow
Network Morphism. Abstract. 1. Introduction. Tao Wei
Tao Wei [email protected] Changhu Wang [email protected] Yong Rui [email protected] Chang Wen Chen [email protected] Microsoft Research, Beijing, China, 18 Department of Computer Science and Engineering,
Task-driven Progressive Part Localization for Fine-grained Recognition
Task-driven Progressive Part Localization for Fine-grained Recognition Chen Huang Zhihai He [email protected] University of Missouri [email protected] Abstract In this paper we propose a task-driven
Segmentation as Selective Search for Object Recognition
Segmentation as Selective Search for Object Recognition Koen E. A. van de Sande Jasper R. R. Uijlings Theo Gevers Arnold W. M. Smeulders University of Amsterdam University of Trento Amsterdam, The Netherlands
Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006
Object Recognition Large Image Databases and Small Codes for Object Recognition Pixels Description of scene contents Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)
Semantic Image Segmentation and Web-Supervised Visual Learning
Semantic Image Segmentation and Web-Supervised Visual Learning Florian Schroff Andrew Zisserman University of Oxford, UK Antonio Criminisi Microsoft Research Ltd, Cambridge, UK Outline Part I: Semantic
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
The Delicate Art of Flower Classification
The Delicate Art of Flower Classification Paul Vicol Simon Fraser University University Burnaby, BC [email protected] Note: The following is my contribution to a group project for a graduate machine learning
An Analysis of Single-Layer Networks in Unsupervised Feature Learning
An Analysis of Single-Layer Networks in Unsupervised Feature Learning Adam Coates 1, Honglak Lee 2, Andrew Y. Ng 1 1 Computer Science Department, Stanford University {acoates,ang}@cs.stanford.edu 2 Computer
Local features and matching. Image classification & object localization
Overview Instance level search Local features and matching Efficient visual recognition Image classification & object localization Category recognition Image classification: assigning a class label to
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15
Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15 GENIVI is a registered trademark of the GENIVI Alliance in the USA and other countries Copyright GENIVI Alliance
The Visual Internet of Things System Based on Depth Camera
The Visual Internet of Things System Based on Depth Camera Xucong Zhang 1, Xiaoyun Wang and Yingmin Jia Abstract The Visual Internet of Things is an important part of information technology. It is proposed
Real-Time Grasp Detection Using Convolutional Neural Networks
Real-Time Grasp Detection Using Convolutional Neural Networks Joseph Redmon 1, Anelia Angelova 2 Abstract We present an accurate, real-time approach to robotic grasp detection based on convolutional neural
Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not?
Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not? Erjin Zhou [email protected] Zhimin Cao [email protected] Qi Yin [email protected] Abstract Face recognition performance improves rapidly
3 An Illustrative Example
Objectives An Illustrative Example Objectives - Theory and Examples -2 Problem Statement -2 Perceptron - Two-Input Case -4 Pattern Recognition Example -5 Hamming Network -8 Feedforward Layer -8 Recurrent
Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers
Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers Fan Yang 1,2, Wongun Choi 2, and Yuanqing Lin 2 1 Department of Computer Science,
Fast Matching of Binary Features
Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been
Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network
Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network Anthony Lai (aslai), MK Li (lilemon), Foon Wang Pong (ppong) Abstract Algorithmic trading, high frequency trading (HFT)
Mean-Shift Tracking with Random Sampling
1 Mean-Shift Tracking with Random Sampling Alex Po Leung, Shaogang Gong Department of Computer Science Queen Mary, University of London, London, E1 4NS Abstract In this work, boosting the efficiency of
Fast Accurate Fish Detection and Recognition of Underwater Images with Fast R-CNN
Fast Accurate Fish Detection and Recognition of Underwater Images with Fast R-CNN Xiu Li 1, 2, Min Shang 1, 2, Hongwei Qin 1, 2, Liansheng Chen 1, 2 1. Department of Automation, Tsinghua University, Beijing
Cees Snoek. Machine. Humans. Multimedia Archives. Euvision Technologies The Netherlands. University of Amsterdam The Netherlands. Tree.
Visual search: what's next? Cees Snoek University of Amsterdam The Netherlands Euvision Technologies The Netherlands Problem statement US flag Tree Aircraft Humans Dog Smoking Building Basketball Table
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski [email protected] Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems
T O B C A T C A S E G E O V I S A T DETECTIE E N B L U R R I N G V A N P E R S O N E N IN P A N O R A MISCHE BEELDEN
T O B C A T C A S E G E O V I S A T DETECTIE E N B L U R R I N G V A N P E R S O N E N IN P A N O R A MISCHE BEELDEN Goal is to process 360 degree images and detect two object categories 1. Pedestrians,
Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall
Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin
Data Mining and Predictive Analytics - Assignment 1 Image Popularity Prediction on Social Networks
Data Mining and Predictive Analytics - Assignment 1 Image Popularity Prediction on Social Networks Wei-Tang Liao and Jong-Chyi Su Department of Computer Science and Engineering University of California,
arxiv:1504.08083v2 [cs.cv] 27 Sep 2015
Fast R-CNN Ross Girshick Microsoft Research [email protected] arxiv:1504.08083v2 [cs.cv] 27 Sep 2015 Abstract This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object
Big Data: Image & Video Analytics
Big Data: Image & Video Analytics How it could support Archiving & Indexing & Searching Dieter Haas, IBM Deutschland GmbH The Big Data Wave 60% of internet traffic is multimedia content (images and videos)
Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite
Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite Philip Lenz 1 Andreas Geiger 2 Christoph Stiller 1 Raquel Urtasun 3 1 KARLSRUHE INSTITUTE OF TECHNOLOGY 2 MAX-PLANCK-INSTITUTE IS 3
Unsupervised Discovery of Mid-Level Discriminative Patches
Unsupervised Discovery of Mid-Level Discriminative Patches Saurabh Singh, Abhinav Gupta, and Alexei A. Efros Carnegie Mellon University, Pittsburgh, PA 15213, USA http://graphics.cs.cmu.edu/projects/discriminativepatches/
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
Analecta Vol. 8, No. 2 ISSN 2064-7964
EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
Simultaneous Deep Transfer Across Domains and Tasks
Simultaneous Deep Transfer Across Domains and Tasks Eric Tzeng, Judy Hoffman, Trevor Darrell UC Berkeley, EECS & ICSI {etzeng,jhoffman,trevor}@eecs.berkeley.edu Kate Saenko UMass Lowell, CS [email protected]
Finding people in repeated shots of the same scene
Finding people in repeated shots of the same scene Josef Sivic 1 C. Lawrence Zitnick Richard Szeliski 1 University of Oxford Microsoft Research Abstract The goal of this work is to find all occurrences
EdVidParse: Detecting People and Content in Educational Videos
EdVidParse: Detecting People and Content in Educational Videos by Michele Pratusevich S.B., Massachusetts Institute of Technology (2013) Submitted to the Department of Electrical Engineering and Computer
Blocks that Shout: Distinctive Parts for Scene Classification
Blocks that Shout: Distinctive Parts for Scene Classification Mayank Juneja 1 Andrea Vedaldi 2 C. V. Jawahar 1 Andrew Zisserman 2 1 Center for Visual Information Technology, International Institute of
Object Class Recognition using Images of Abstract Regions
Object Class Recognition using Images of Abstract Regions Yi Li, Jeff A. Bilmes, and Linda G. Shapiro Department of Computer Science and Engineering Department of Electrical Engineering University of Washington
Privacy-CNH: A Framework to Detect Photo Privacy with Convolutional Neural Network Using Hierarchical Features
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Privacy-CNH: A Framework to Detect Photo Privacy with Convolutional Neural Network Using Hierarchical Features Lam Tran
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Selecting Receptive Fields in Deep Networks
Selecting Receptive Fields in Deep Networks Adam Coates Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Andrew Y. Ng Department of Computer Science Stanford
Large Scale Semi-supervised Object Detection using Visual and Semantic Knowledge Transfer
Large Scale Semi-supervised Object Detection using Visual and Semantic Knowledge Transfer Yuxing Tang 1 Josiah Wang 2 Boyang Gao 1,3 Emmanuel Dellandréa 1 Robert Gaizauskas 2 Liming Chen 1 1 Ecole Centrale
