We don't need no bounding-boxes: Training object class detectors using only human verification


We don't need no bounding-boxes: Training object class detectors using only human verification

Dim P. Papadopoulos    Jasper R. R. Uijlings    Frank Keller    Vittorio Ferrari
dim.papadopoulos@ed.ac.uk    jrr.uijlings@ed.ac.uk    keller@inf.ed.ac.uk    vferrari@inf.ed.ac.uk
University of Edinburgh

Abstract

Training object class detectors typically requires a large set of images in which objects are annotated by bounding-boxes. However, manually drawing bounding-boxes is very time consuming. We propose a new scheme for training object detectors which only requires annotators to verify bounding-boxes produced automatically by the learning algorithm. Our scheme iterates between re-training the detector, re-localizing objects in the training images, and human verification. We use the verification signal both to improve re-training and to reduce the search space for re-localisation, which makes these steps different to what is normally done in a weakly supervised setting. Extensive experiments on PASCAL VOC 2007 show that (1) using human verification to update detectors and reduce the search space leads to the rapid production of high-quality bounding-box annotations; (2) our scheme delivers detectors performing almost as well as those trained in a fully supervised setting, without ever drawing any bounding-box; (3) as the verification task is very quick, our scheme substantially reduces total annotation time, by a factor of 6-9.

1. Introduction

Object class detection is a central problem in computer vision. Training a detector typically requires a large set of images in which objects have been manually annotated with bounding-boxes [10, 16, 18, 19, 20, 34, 53, 57, 62]. Bounding-box annotation is tedious, time consuming and expensive. For instance, annotating ILSVRC, currently the most popular object class detection dataset, required 42 seconds per bounding-box when crowd-sourced on Mechanical Turk [40], using a technique specifically developed for efficient bounding-box annotation [50].

In order to reduce the cost of bounding-box annotation, researchers have mainly focused on two strategies. The first is learning in the weakly supervised setting, i.e. given only labels indicating which object classes are present in an image. While this setting is much cheaper, it produces lower quality detectors, typically performing only about half as well as when trained from bounding-boxes [4, 5, 9, 12, 42, 47, 48, 49, 60]. The second strategy is active learning, where the computer requests human annotators to draw bounding-boxes on a subset of images actively selected by the learner itself. This strategy can produce high quality detectors, but it still requires humans to draw a lot of bounding-boxes in order to get there, leading to limited gains in terms of total annotation time [56, 63].

In this paper we propose a new scheme for learning object detectors which only requires humans to verify bounding-boxes produced automatically by the learning algorithm: the annotator merely needs to decide whether a bounding-box is correct or not. Crucially, answering this verification question takes much less time than actually drawing the bounding-box.

Given a set of training images with image-level labels, our scheme iteratively alternates between updating object detectors, re-localizing objects in the training images, and querying humans for verification. At each iteration we use the verification signal in two ways. First, we update the object class detector using only positively verified bounding-boxes.
This makes it stronger than when using all detected bounding-boxes, as is commonly done in the weakly supervised setting, because many of those are typically incorrect. Moreover, once the object location in an image has been positively verified, it can be fixed and removed from consideration in subsequent iterations. Second, we observe that bounding-boxes judged as incorrect still provide valuable information about where the object is not. Building on this observation, we use the negatively verified bounding-boxes to reduce the search space of possible object locations in subsequent iterations. Both these points help to rapidly find more objects in the remaining images. This results in a framework for training object detectors which minimizes human annotation effort and eliminates the need to draw any bounding-box.

In extensive experiments on the popular PASCAL VOC 2007 dataset [16] with both simulated and actual annotators, we show that: (1) using human verification to update detectors and reduce the search space leads to the rapid production of high-quality bounding-box annotations; (2) our scheme delivers object class detectors performing almost as well as those trained in a fully supervised setting, without ever drawing any bounding-box; (3) as the verification task is very quick, our scheme substantially reduces total annotation time, by a factor of 6-9.

Figure 1: Our framework iterates between (A) re-training object detectors, (B) re-localising objects, and (C) querying annotators for verification. The verification signal resulting from (C) is used in both (A) and (B).

2. Related Work

Weakly-supervised object localization (WSOL). Many previous techniques [4, 5, 8, 12, 42, 47, 48, 49, 60] try to learn object class detectors in the weakly supervised setting, i.e. given training images known to contain instances of a certain object class, but not their location. The task is both to localize the objects in the training images and to learn an object detector for localizing instances in new images. Recent work on WSOL [4, 5, 8, 48, 49, 60] has shown remarkable progress, mainly because of the use of Convolutional Neural Nets (CNNs [20, 28]), which greatly improve visual recognition tasks. However, learning a detector without location annotation is difficult, and performance is still quite low compared to fully supervised methods (typically about half the mAP of the same detection model trained on the same images but with manual bounding-box annotation [4, 5, 8, 12, 42, 47, 48, 49, 60]).

Often WSOL is conceptualised as Multiple Instance Learning (MIL) [4, 8, 12, 13, 44, 47, 48, 49]. Images are treated as bags of windows (instances). A negative image contains only negative instances. A positive image contains at least one positive instance, mixed in with a majority of negative ones. The goal is to find the true positive instances from which to learn a window classifier for the object class. This typically happens by iteratively alternating between (A) re-training the object detector given the current selection of positive instances and (B) re-localising instances in the positive images using the current object detector. Our proposed scheme also contains these steps. However, we introduce a human verification step, whose signal is fed into both (A) and (B), which fundamentally alters these steps. The resulting framework leads to substantially better object detectors with modest additional annotation effort (sec. 5.3).

Humans in the loop. Human-machine collaboration approaches have been successfully used in tasks that are currently too difficult to be solved by computer vision alone, such as fine-grained visual recognition [7, 11, 58, 59], semi-supervised clustering [30], and attribute-based image classification [6, 36, 37]. These works combine the responses of pre-trained computer vision models on a new test image with human input to fully solve the task. In the domain of object detection, Russakovsky et al. [41] propose such a scheme to fully detect all objects in images of complex scenes. Importantly, their object detectors are pre-trained on bounding-boxes from the large training set of ILSVRC 2014 [40], as their goal is not to devise an efficient training scheme.

Active learning. Active learning schemes iteratively train models while requesting humans to annotate a subset of the data points actively selected by the learner as being the most informative. Previous active learning work has mainly focussed on image classification [25, 26, 27, 39] and free-form region labelling [45, 54, 55].
A few works have proposed active learning schemes specifically for training object class detectors [56, 63]. Vijayanarasimhan and Grauman [56] propose an approach where the training images do not come from a predefined dataset but are crawled from the web. Here annotators are asked to draw many bounding-boxes around the target objects (for about one third of the training images [56]). Yao et al. [63] propose to manually correct bounding-boxes detected in video. While both [56, 63] produce high quality detectors, they achieve only moderate gains in annotation time, because drawing or correcting bounding-boxes is expensive. In contrast, our scheme only asks annotators to verify bounding-boxes, never to draw. This leads to more substantial reductions in annotation time.

Other ways to reduce annotation effort. A few authors have tried to learn object detectors from videos, where the spatio-temporal coherence of the video frames facilitates object localization [32, 38, 52]. An alternative is transfer learning, where learning a model for a new class is helped by labeled examples of related classes [2, 17, 21, 29, 31]. Hoffman et al. [24] proposed an algorithm that transforms an image classifier into an object detector without bounding-box annotated data, using domain adaptation. Other types of data, such as text from web pages or newspapers [3, 15, 22, 33] or eye-tracking data [35], have also been used as a weak annotation signal to train object detectors.

3. Method

In this paper we are given a training set with image-level labels. Our goal is to obtain object instances annotated by bounding-boxes and to train good object detectors while minimizing human annotation effort. We therefore propose a framework where annotators only need to verify bounding-boxes automatically produced by our scheme. Our framework iteratively alternates between (A) re-training object detectors, (B) re-localizing objects in the training images, and (C) querying annotators for verification (fig. 1). Importantly, we use the verification signals to help both re-training and re-localisation.

More formally, let I_n be the set of images for which we do not yet have positively verified bounding-boxes at iteration n, and let S_n be the corresponding set of possible object locations. Initially, I_0 is the complete training set and S_0 is a complete set of object proposals [1, 14, 53] extracted from these images (we use EdgeBoxes [14]). To facilitate exposition, we describe our framework starting from the verification step (C, sec. 3.1). At iteration n we have a set of automatically detected bounding-boxes D_n which are given to annotators to be verified. Detections judged to be correct, D_n^+ ⊆ D_n, are used for re-training the object detectors (A) in the next iteration (sec. 3.2). The verification signal is also used to reduce the search space S_{n+1} for re-localisation (B, sec. 3.3). We describe our three main steps below. We defer to sec. 4 a description of the object detection model we use, and of how to automatically obtain the initial detections D_0 that start the process.

3.1. Verification by annotators

In this phase, we ask annotators to verify the automatically generated detections D_n at iteration n. For this we explore two strategies (fig. 2): simple Yes/No verification, and a more elaborate verification in which annotators are asked to categorize the type of error.

Figure 2: Our two verification strategies for some images of the dog class. Yes/No verification (left): verify a detection as either correct (Yes) or incorrect (No). YPCMM verification (right): label a detection as Yes, Part, Container, Mixed or Missed.

Yes/No Verification. In this task the annotators are shown a detection l_d and a class label. They are instructed to respond Yes if the detection correctly localizes an object of that class, and No otherwise. This splits the set of object detections D_n into D_n^+ and D_n^-. We define correct localization based on the standard PASCAL Intersection-over-Union (IoU) criterion [16]. Let l_d be the detected object bounding-box and l_gt the actual object bounding-box (which is not given to the annotator). Let IoU(l_a, l_b) = |l_a ∩ l_b| / |l_a ∪ l_b|, where |·| denotes area. If IoU(l_gt, l_d) ≥ 0.5, the detected bounding-box is considered correct and the annotator should answer Yes. Intuitively, this Yes/No verification is a relatively simple task, which should translate into fast annotation times.
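For concreteness, the IoU test that defines a correct detection can be sketched as follows. This is an illustrative sketch under our own conventions (boxes as (x1, y1, x2, y2) tuples and our own function names), not the authors' released code:

```python
def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(a, b):
    """Intersection box of a and b, or None if they do not overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def iou(a, b):
    """Intersection-over-Union: |a n b| / |a u b|."""
    inter = intersection(a, b)
    if inter is None:
        return 0.0
    inter_area = area(inter)
    return inter_area / (area(a) + area(b) - inter_area)

def ideal_yes_no(l_gt, l_d):
    """The answer an ideal Yes/No verifier would give (IoU >= 0.5)."""
    return "yes" if iou(l_gt, l_d) >= 0.5 else "no"
```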
Yes/Part/Container/Mixed/Missed Verification. In this task, we ask the annotators to label an object detection l_d as Yes (correct), Part, Container, Mixed, or Missed. Yes is defined as above (IoU(l_gt, l_d) ≥ 0.5). For incorrect detections the annotators are asked to diagnose the error as Part if the detection contains part of the target object and no background; Container if it contains the whole object and some background; Mixed if it contains part of the object and some background; or Missed if the object was completely missed. This verification step splits D_n into D_n^+ and D_n^ypcmm. Intuitively, determining the type of error is more difficult, leading to longer annotation times, but it also brings more information that we can use in the next steps.

3.2. Re-training object detectors

In this step we re-train the object detectors. After the verification step we know that D_n^+ contains well localized object instances, while D_n^- or D_n^ypcmm do not. Hence we train using only the bounding-boxes D_1^+ ∪ ... ∪ D_n^+ that have been positively verified in past iterations. To obtain background training samples, we sample proposals which have an IoU in the range [0, 0.5) with positively verified bounding-boxes. Note that it is common in WSOL works [4, 5, 8, 12, 42, 47, 48, 49, 60] to also have a re-training step. However, they typically use all detected bounding-boxes D_n. Since in WSOL generally fewer than half of them are correct, this leads to rather weak detectors. In contrast, our verification step enables us to train purely from correct bounding-boxes, resulting in stronger, more reliable object detectors.
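The sample selection in this re-training step can be sketched as follows; this is our own illustrative data layout (one image at a time, reusing iou() from the sketch in sec. 3.1), not the paper's implementation:

```python
def training_samples(verified_boxes, proposals):
    """Select training windows for one image.

    verified_boxes: boxes of this image positively verified so far.
    proposals: the object proposals of this image.
    Positives are the verified boxes themselves; background windows are
    proposals whose IoU with every verified box lies in [0, 0.5).
    """
    positives = list(verified_boxes)
    background = [p for p in proposals
                  if all(iou(p, v) < 0.5 for v in verified_boxes)]
    return positives, background
```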

3.3. Re-localizing objects by search space reduction

In this step we re-localize objects in the training images. For each image, we apply the current object detector to score the object proposals in it, and select the proposal with the highest score as the new detection for that image. Importantly, we do not evaluate all proposals S_0; instead, we use the verification signal to reduce the search space by removing proposals. Positively verified detections D_n^+ are correct by definition, and therefore their images need not be considered in subsequent iterations, neither in the re-localization step nor in the verification step. For negatively verified detections we reduce the search space depending on the verification strategy, as described below.

Yes/No Verification. When the annotator judges a detection as incorrect (D_n^-), we can simply eliminate its proposal from the search space. This results in the updated search space S_{n+1}, where one proposal has been removed from each image with an incorrect detection. However, we can make better use of the negative verification signal. Since an incorrect detection has an IoU < 0.5 with the true bounding-box, we can eliminate all proposals with an IoU ≥ 0.5 with it. This is a more aggressive reduction of the search space. While it may remove some proposals which are correct according to the IoU criterion, it will not remove the best possible proposal. Importantly, this strategy eliminates those areas of the search space that matter: high scoring locations which are unlikely to contain the object. In sec. 5.2 we investigate which way of using negatively verified detections performs better in practice.

Yes/Part/Container/Mixed/Missed Verification. When annotators categorize incorrect detections as Part/Container/Mixed/Missed, we can use the type of error to obtain an even greater reduction of the search space. Depending on the type of error we eliminate different proposals (fig. 3): Part: eliminate all proposals which do not contain the detection; Container: eliminate all proposals which are not inside the detection; Mixed: eliminate all proposals which are inside the detection, or contain it, or have zero IoU with it, or have IoU ≥ 0.5 with it; Missed: eliminate all proposals which have non-zero IoU with the detection.

To precisely determine what is inside and what contains, we introduce the Intersection-over-Area measure IoA(l_a, l_b) = |l_a ∩ l_b| / |l_a|. Note that IoA(l_gt, l_d) = 1 if the detection l_d contains the true object bounding-box l_gt, whereas IoA(l_d, l_gt) = 1 if l_d covers a part of l_gt. In practice, if l_d is judged to be a Part by the annotator, we eliminate all proposals l_s with IoA(l_d, l_s) < 0.9. Similarly, if l_d is judged to be a Container, we eliminate all proposals l_s with IoA(l_s, l_d) < 0.9. We keep the tolerance threshold of 0.9 fixed in all experiments.

Figure 3: Visualisation of the search space reduction induced by YPCMM verification on some images of the cat class (part, container, mixed, and missed). In the last row, the search space reduction steers the re-localization process towards the small cat on the right of the image and away from the dog on the left.

Note that in WSOL there is also a re-localisation step. However, because there is no verification signal, there is also no search space reduction: each iteration needs to consider the complete set S_0 of proposals. In contrast, in our work the search space reduction greatly facilitates re-localization.
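The elimination rules above can be condensed into a short sketch, again under our own naming and reusing area(), intersection() and iou() from the sketch in sec. 3.1. For a plain No answer it implements the aggressive variant that removes all proposals overlapping the rejected detection:

```python
def ioa(a, b):
    """Intersection-over-Area: |a n b| / |a|; equals 1 when a is inside b."""
    inter = intersection(a, b)
    return area(inter) / area(a) if inter is not None else 0.0

def reduce_search_space(proposals, l_d, answer, tol=0.9):
    """Keep only proposals still compatible with the verification answer."""
    if answer == "no":          # rejected: true box has IoU < 0.5 with l_d
        return [p for p in proposals if iou(p, l_d) < 0.5]
    if answer == "part":        # l_d lies inside the object: keep containers of l_d
        return [p for p in proposals if ioa(l_d, p) >= tol]
    if answer == "container":   # object lies inside l_d: keep proposals inside l_d
        return [p for p in proposals if ioa(p, l_d) >= tol]
    if answer == "mixed":       # keep partial overlaps that neither contain
        return [p for p in proposals               # l_d nor lie inside it
                if 0.0 < iou(p, l_d) < 0.5
                and ioa(l_d, p) < tol and ioa(p, l_d) < tol]
    if answer == "missed":      # object is elsewhere: keep non-overlapping only
        return [p for p in proposals if iou(p, l_d) == 0.0]
    return proposals            # 'yes': the image leaves the pool anyway
```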
4. Implementation details

We summarize here the two existing state-of-the-art components that we use in our framework: the object detection model, and the WSOL algorithm which we use to obtain the initial object detections D_0.

4.1. Object class detector

As object detector we use Fast R-CNN [19], which combines object proposals [1, 14, 53] with CNNs [23, 28, 46]. Instead of Selective Search [53] we use EdgeBoxes [14], which gives us an objectness measure [1] that we use in the initialization phase described below. For simplicity of implementation, in the re-training step (sec. 3.2) we omit bounding-box regression, so that the set of object proposals stays fixed throughout all iterations. For evaluation on the test set, we then train detectors with bounding-box regression. In most of our experiments, we use AlexNet [28] as the underlying CNN architecture.

Figure 4: Comparing the search process of Yes/No verification with Yes/Part/Container/Mixed/Missed on an image of the bird class. The extra signal of YPCMM allows for a more targeted search, resulting in fewer verification steps to find the object.

4.2. Initialization by Multiple Instance Learning

We perform multiple instance learning (MIL) for weakly supervised object localisation [4, 8, 48] to obtain the initial set of detections D_0. We start with the training images I_0 and the set of object proposals S_0 extracted using EdgeBoxes [14]. Following [4, 20, 48, 49, 60] we extract CNN features, on top of which we train an SVM. We iterate between (A) re-training object detectors and (B) re-localizing objects in the training images. We stop when two subsequent re-localization steps yield the same detections, which typically happens within 10 iterations. These detections become D_0. In the very first iteration, we train the classifier using complete images as positive training examples [8, 42].

We apply two improvements to the standard MIL framework. First, in a high dimensional feature space the discriminative SVM classifier can relatively easily separate any positive examples from negative examples, which means that most positive examples are far from the decision hyperplane. Hence the same positive training examples used for re-training (A) are often re-localised in (B), leading to premature locked-in behaviour. To prevent this, Cinbis et al. [8, 9] introduced multi-fold MIL: similar to cross-validation, the dataset is split into 10 subsets, and the re-localisation on each subset is done using detectors trained on the union of all other subsets. Second, as in [9, 12, 21, 38, 43, 44, 47, 51, 61], we combine the object detector score with a general measure of objectness [1], which measures how likely it is that a proposal tightly encloses an object of any class (e.g. bird, car, sheep), as opposed to background (e.g. sky, water, grass). In this paper we use the recent objectness measure of [14].
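Putting secs. 3 and 4 together, the full procedure can be outlined as below. This is a schematic sketch under our own naming: mil_init stands in for the MIL initialization (sec. 4.2), train_detector(positives) for detector re-training returning a scoring function score(img, box), and annotator(img, box) for the verification step; background sampling (sec. 3.2) is omitted for brevity, and reduce_search_space is the sketch from sec. 3.3:

```python
def train_with_verification(images, proposals, annotator,
                            mil_init, train_detector, max_iters=10):
    """Schematic outline of the full framework (sec. 3).

    images: list of training images still lacking a verified box.
    proposals: dict mapping each image to its remaining search space S_n.
    """
    verified, detector = {}, None
    detections = mil_init(images, proposals)          # D_0 (sec. 4.2)
    for _ in range(max_iters):
        # (C) verification: ask about the current detection of each image
        for img, box in list(detections.items()):
            answer = annotator(img, box)
            if answer == "yes":
                verified[img] = box                   # fix the location and
                images.remove(img)                    # retire the image
                del detections[img]
            else:                                     # shrink S_{n+1}
                proposals[img] = reduce_search_space(proposals[img], box, answer)
        if not images:
            break
        # (A) re-train using only positively verified boxes
        detector = train_detector(list(verified.values()))
        # (B) re-localize: highest-scoring remaining proposal per image
        detections = {img: max(proposals[img], key=lambda p: detector(img, p))
                      for img in images}
    return verified, detector
```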
5. Experimental Results

5.1. Dataset and evaluation protocol

PASCAL VOC 2007. We perform experiments on PASCAL VOC 2007 [16], which consists of 20 classes. The trainval set contains 5011 images, while the test set contains 4952 images. We use the trainval set with its accompanying image-level labels to train object detectors, and measure their performance on the test set. Following the common protocol for WSOL experiments [8, 9, 12, 42, 60], we exclude trainval images that contain only difficult and truncated instances, ending up with 3550 images. In sections 5.2 and 5.3 we carry out a detailed analysis of our system in this setting, using AlexNet as the CNN architecture [28]. For completeness, in section 5.4 we also present results when using the complete trainval set and the deeper VGG16 as the CNN architecture [46].

Evaluation. Given a training set with image-level labels, our goal is to localize the object instances in this set and to train good object detectors, while minimizing human annotation effort. We evaluate this by exploring the trade-off between localization performance and quality of the object detectors versus the required annotation effort. We quantify localization performance on the training set with the Correct Localization (CorLoc) measure [4, 5, 8, 9, 12, 42, 47, 60]. CorLoc is the percentage of images in which the bounding-box returned by the algorithm correctly localizes an object of the target class (i.e., IoU ≥ 0.5; see the sketch at the end of this subsection). We quantify object detection performance on the test set using mean average precision (mAP), as standard for PASCAL VOC 2007. We quantify annotation effort both in terms of the number of verifications and in terms of actual human time measurements.

Like most previous WSOL methods [4, 5, 8, 9, 12, 42, 47, 48, 49, 60], our scheme returns exactly one bounding-box per class per training image. This enables clean comparisons to previous work in terms of CorLoc on the training set, and keeps the human verification tasks simple (as we do not need to ask the annotators whether they see additional instances in an image). Note that at test time the detector is capable of localizing multiple objects of the same class in the same image (and this is captured in the mAP measure).

Compared methods. We compare our approach to the fully supervised alternative by training the same object detector (sec. 4.1) on the same training images, but with manual bounding-boxes (again, one bounding-box per class per image). On the other end of the supervision spectrum, we also compare to a modern MIL-based WSOL technique (sec. 4.2) run on the same training images, but without human verification. Since that technique also forms the initialization step of our method, this comparison reveals how much farther we can go with human verification.
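Under the standard definition used above, CorLoc can be computed as in this small sketch (our own helper, assuming one returned box and a list of ground-truth boxes per image, and reusing iou() from sec. 3.1):

```python
def corloc(returned_boxes, gt_boxes_per_image):
    """Percentage of images whose returned box hits a ground-truth object.

    returned_boxes: one box per image (the algorithm's output).
    gt_boxes_per_image: ground-truth boxes of the target class per image.
    A hit is IoU >= 0.5 with any ground-truth instance.
    """
    hits = sum(any(iou(box, gt) >= 0.5 for gt in gts)
               for box, gts in zip(returned_boxes, gt_boxes_per_image))
    return 100.0 * hits / len(returned_boxes)
```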

Figure 5: Trade-off between the number of verifications (% of checked images) and CorLoc for the simulated verification case on PASCAL VOC 2007, for strategies (I) only retrain, (II) retrain + remove Neg, (III) retrain + remove ExtNeg, and (IV) retrain + remove PCMM.

For MIL WSOL, the effort to draw bounding-boxes is zero. For fully supervised learning we take the actual annotation times for ILSVRC from [50]: they report 26 seconds for drawing one bounding-box without quality control, and 42 seconds with quality control. These timings are also representative for PASCAL VOC, since it is of comparable difficulty and its annotations are of comparable quality, as discussed in [40]. The bounding-boxes in both datasets are of high quality and precisely match the object extent.

5.2. Simulated verification

We first use simulated verification to determine how best to use the verification signal. We simulate human verification by using the available ground-truth bounding-boxes. Note that these are not given to the learning algorithm; they are only used to derive the verification signals of sec. 3.1 (a sketch of this simulated annotator is given at the end of this subsection).

Fig. 5 compares four ways to use the verification signal in terms of the trade-off between the number of verifications and CorLoc (sec. 3.3): (I) only retrain: use the verification signal only to retrain the object detector (on positively verified detections D_n^+); (II) retrain + remove Neg: for Yes/No verification, retrain and reduce the search space by eliminating one proposal for each negatively verified detection; (III) retrain + remove ExtNeg: for Yes/No verification, retrain and eliminate all proposals overlapping with a negatively verified detection; (IV) retrain + remove PCMM: for YPCMM verification, retrain and eliminate proposals according to the type of error.

As fig. 5 shows, even using verification just to retrain the object detector (I) drastically boosts CorLoc from the initial 43% (achieved by MIL WSOL) up to 82%. This requires checking each training image on average 4 times. Using the verification signal in the re-localisation step by reducing the search space (II-IV) reaches this CorLoc substantially faster (1.6-2 checks per image). Moreover, the final CorLoc is much higher when we reduce the search space. Removing negatively verified detections brings a considerable jump to 92% CorLoc (II); removing all proposals around negatively verified detections further increases CorLoc to 95% (III); while the Yes/Part/Container/Mixed/Missed strategy appears to be the most promising, achieving a near-perfect 96% CorLoc after checking each image only 2.5 times on average. These results show that feeding the verification signal into both the re-training and re-localisation steps quickly results in a large number of correctly located object instances.

Figure 6: (a) Percentage of positively verified detections as a function of ground-truth IoU, for each annotator. (b) Average human response time as a function of ground-truth IoU. (c) Percentage of incorrectly verified detections as a function of object area (relative to image area).

Fig. 4 compares the search processes of the Yes/No and YPCMM verification strategies on a bird example. The second detection is diagnosed in YPCMM as a container, which focuses the search on that particular part of the image. In contrast, the detections in the Yes/No case jump around before finding the object. This shows that the search process of YPCMM is more targeted. However, in both cases the target object location is found rather quickly.

In conclusion, YPCMM is the most promising verification strategy, followed by Yes/No with removing all proposals overlapping with a negatively verified detection (III). Since Yes/No verification is intuitively easier and faster, we try both strategies in the experiments with human annotators.
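The simulated annotator used above can be approximated as follows, reusing iou() and ioa() from the sketches in secs. 3.1 and 3.3. Only the IoU ≥ 0.5 acceptance rule and the verbal error definitions are given in the text, so the tolerances for diagnosing Part and Container are our own assumption:

```python
def simulated_annotator(l_gt, l_d, tol=0.9):
    """Oracle answering YPCMM from the ground-truth box (assumed thresholds).

    Yes: correct localization. Part: detection mostly inside the object.
    Container: object mostly inside the detection. Missed: no overlap.
    Mixed: everything else (part of the object plus background).
    """
    if iou(l_gt, l_d) >= 0.5:
        return "yes"
    if ioa(l_d, l_gt) >= tol:     # IoA(l_d, l_gt) = 1 when l_d lies inside l_gt
        return "part"
    if ioa(l_gt, l_d) >= tol:     # IoA(l_gt, l_d) = 1 when l_gt lies inside l_d
        return "container"
    if iou(l_gt, l_d) == 0.0:
        return "missed"
    return "mixed"
```

For the Yes/No strategy, the oracle reduces to ideal_yes_no() from sec. 3.1.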
5.3. Human verification

Instructions and interface. For the Yes/No and YPCMM verification tasks, we used five annotators from our university, who were given examples to learn about the IoU criterion. For both tasks we created a full-screen interface. All images of a target class were shown in sequence, with the current detection superimposed. For Yes/No verification, annotators were asked to press 1 for Yes and 0 for No. This simple task took on average 1.6 seconds per verification. For YPCMM verification, the annotators were asked to click on one of five on-screen buttons corresponding to Yes, Part, Container, Mixed and Missed. This more elaborate task took on average 2.4 seconds per verification.

Analysis of human verification. Fig. 6a reports the percentage of positively verified detections as a function of their IoU with the ground-truth bounding-box. We observe that humans behave quite closely to the desired PASCAL criterion (IoU ≥ 0.5) which we use in our simulations.

Figure 7: Examples of objects localized using our proposed Yes/No human verification scheme on the trainval set of PASCAL VOC 2007 (sec. 5.3). For each example, we compare the final output of our scheme (green box) to the output of the reference multiple instance learning (MIL) weakly-supervised object localization approach (red box) (sec. 4.2).

All annotators behave identically on easy cases (IoU < 0.25 or IoU > 0.75). On boundary cases (IoU ≈ 0.5) we observe some annotator bias. For example, anno2 tends to judge boundary cases as wrong detections, whereas anno3 and anno4 judge them more frequently as correct. Overall, the percentages of incorrect Yes and No judgements are 14.8% and 8.5%, respectively. There is therefore a slight bias towards Yes, i.e. humans tend to be slightly more lenient than the IoU ≥ 0.5 criterion. While the average human response time is 1.6 s for Yes/No verification, the response time for verifying difficult detections (IoU ≈ 0.5) is significantly higher (2.2 s, fig. 6b). This shows that the difficulty of the verification task is directly linked to the IoU of the detection, which is reflected in the time measurements. We also found that human verification errors strongly correlate with the area of objects: 48% of all errors are made when objects occupy less than 10% of the image area (fig. 6c).

Simulated vs. human annotators. We first compare simulated and human annotators by plotting CorLoc and mAP against actual annotation time (rather than against the number of verifications as in sec. 5.2). For simulated verification, we use the average human annotation times reported above. Fig. 8 shows the results on a log scale. While human annotators are somewhat worse than simulations in terms of CorLoc on the training set, the mAP of the resulting object detectors on the test set is comparable. The diverging results for CorLoc and mAP arise because human judgement errors are generally made on boundary cases, with bounding-boxes that only approximately cover an object (fig. 6a). Using these cases either as positive or negative training examples, the object detector remains equally strong. To conclude, in terms of training high quality object detectors, simulated annotators reliably deliver similar results to actual annotators.

In sec. 5.2 we observed that YPCMM needs fewer verifications than Yes/No. However, in terms of total annotation time, the Yes/No task has the more favourable trade-off: Yes/No achieves 83% CorLoc and 45% mAP in 5.8 hours of annotation time, while YPCMM achieves 81% CorLoc and 45% mAP in 7.7 hours (fig. 8). Hence we conclude that the Yes/No task is preferable for human annotation, as it is easier and faster.

Weak supervision vs. verification. We now compare the reference MIL WSOL approach that we use to initialize our process (sec. 4.2; magenta diamond in fig. 8) to the final output of our Yes/No human verification scheme (solid orange line in fig. 8). While MIL WSOL achieves 43% CorLoc and 27% mAP, using human verification brings a massive jump in performance, to 83% CorLoc and 45% mAP. Hence at a modest cost of 5.8 hours of annotation time we achieve substantial performance gains. The examples in fig. 7 show that our approach localizes objects more accurately and succeeds in more challenging conditions, e.g. when the object is very small and appears in a cluttered scene.
The state-of-the-art WSOL approaches perform as follows: Cinbis et al. [9] achieve 52.0% CorLoc and 30.2% mAP, Bilen et al. [5] achieve 43.7% CorLoc and 27.7% mAP, and Wang et al. [60] achieve 48.5% CorLoc and 31.6% mAP. Our method using human verification substantially outperforms all of them, reaching 83% CorLoc and 45% mAP (our method and [5, 9, 60] all use AlexNet). Hence at a modest extra annotation cost, we obtain many more correct object locations and train better detectors.

Full supervision vs. verification. We now compare our Yes/No human verification scheme (solid orange line in fig. 8) to standard fully supervised learning with manual bounding-boxes (solid green lines).

Figure 8: Evaluation on PASCAL VOC 2007: CorLoc and mAP against human annotation time in hours (log scale). All orange and red curves are variations of our proposed scheme, with simulated ("sim") and real ("real") annotators. "Draw" indicates learning from manual bounding-boxes (full supervision). "MIL" indicates learning under weak supervision only, without human verification (sec. 4.2). The fully supervised approach needs a factor 6-9 more annotation time to obtain performance similar to our framework.

The object detectors learned by our scheme achieve 45% mAP, almost as good as the fully supervised ones (51% mAP). Importantly, fully supervised training needs 33 hours of annotation time (assuming an optimistic 26 s per image), or even 53 hours (assuming a more realistic 42 s per image). Our method instead requires only 5.8 hours, a reduction in human effort by a factor of 6-9. From a different perspective, when given the same human annotation time as our approach (5.8 hours), the fully supervised detector only achieves 33% mAP (at 26 s per bounding-box) or 30% mAP (at 42 s). We conclude that with just an inexpensive verification component, we can train strong object detectors at little cost. This is significant since it enables the cheap creation of high quality object detectors for a large variety of classes, bypassing the need for massive annotation efforts such as ImageNet [40].

5.4. Complete training set and VGG16

Our experiments are based on the Fast R-CNN detector [19]. In order to have a clean comparison between our verification-based scheme and the fully supervised results of [19], we re-ran our experiments using the complete trainval set of PASCAL VOC 2007 (i.e. 5011 images, table 1). Under full supervision, [19] reports 57% mAP based on AlexNet. Training Fast R-CNN from one bounding-box per class per image results in 55% mAP, while our Yes/No human verification scheme reaches 50% mAP. Additionally, we experimented with VGG16 instead of AlexNet, with the same settings. Training with full supervision leads to 66% mAP, while our verification scheme delivers 58% mAP. Hence, with both CNN architectures our verification-based training scheme produces high quality detectors, achieving about 90% of the mAP of their fully supervised counterparts.

Table 1: Comparison of mAP results between our Yes/No human verification scheme and full supervision (FS), using different training sets and different network architectures. "Reduced training set": excluding trainval images containing only difficult and truncated instances (3550 images); "complete training set": all trainval images (5011).

            reduced training set     complete training set
            Yes/No     FS            Yes/No     FS
AlexNet     45%        51%           50%        55%
VGG16       55%        61%           58%        66%

6. Conclusions

We proposed a scheme for training object class detectors which uses a human verification step to improve the re-training and re-localisation steps common to most weakly supervised approaches. Experiments on PASCAL VOC 2007 show that our scheme produces detectors performing almost as well as those trained in a fully supervised setting, without ever drawing any bounding-box. As the verification task is very quick, our scheme reduces total human annotation time by a factor of 6-9.

Acknowledgement. This work was supported by the ERC Starting Grant VisCul.

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, 2010.
[2] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In ICCV, 2011.
[3] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller, and D. Forsyth. Names and faces in the news. In CVPR, 2004.
[4] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with posterior regularization. In BMVC, 2014.
[5] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with convex clustering. In CVPR, 2015.
[6] A. Biswas and D. Parikh. Simultaneous active learning of classifiers & attributes via relative feedback. In CVPR, 2013.
[7] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In ECCV, 2010.
[8] R. Cinbis, J. Verbeek, and C. Schmid. Multi-fold MIL training for weakly supervised object localization. In CVPR, 2014.
[9] R. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. arXiv preprint, 2015.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[11] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013.
[12] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.
[13] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31-71, 1997.
[14] P. Dollár and C. Zitnick. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
[15] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, 2002.
[16] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 2010.
[17] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. CVIU, 2007.
[18] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. on PAMI, 32(9), 2010.
[19] R. Girshick. Fast R-CNN. In ICCV, 2015.
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[21] M. Guillaumin and V. Ferrari. Large-scale knowledge transfer for object localization in ImageNet. In CVPR, 2012.
[22] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparators for learning visual classifiers. In ECCV, 2008.
[23] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[24] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, and J. Donahue. LSDA: Large scale detection through adaptation. In NIPS, 2014.
[25] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In CVPR, 2009.
[26] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with Gaussian processes for object categorization. In ICCV, 2007.
[27] A. Kovashka, S. Vijayanarasimhan, and K. Grauman. Actively selecting annotations among objects and attributes. In ICCV, 2011.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[29] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[30] S. Lad and D. Parikh. Interactively guiding semi-supervised clustering via attribute-based explanations. In ECCV, 2014.
[31] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[32] C. Leistner, M. Godec, S. Schulter, A. Saffari, and H. Bischof. Improving classifiers with weakly-related videos. In CVPR, 2011.
[33] J. Luo, B. Caputo, and V. Ferrari. Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In NIPS, 2009.
[34] T. Malisiewicz, A. Gupta, and A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
[35] D. P. Papadopoulos, A. D. F. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. In ECCV, 2014.
[36] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[37] A. Parkash and D. Parikh. Attributes for classifier feedback. In ECCV, 2012.
[38] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.
[39] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, and H.-J. Zhang. Two-dimensional active learning for image classification. In CVPR, 2008.
[40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[41] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In CVPR, 2015.
[42] O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei. Object-centric spatial pooling for image classification. In ECCV, 2012.
[43] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori. Similarity constrained latent support vector machine: An application to weakly supervised action classification. In ECCV, 2012.
[44] Z. Shi, P. Siva, and T. Xiang. Transfer learning by ranking for weakly supervised object annotation. In BMVC, 2012.
[45] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.
[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[47] P. Siva and T. Xiang. Weakly supervised object detector learning with model drift detection. In ICCV, 2011.
[48] H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, 2014.
[49] H. Song, Y. Lee, S. Jegelka, and T. Darrell. Weakly-supervised discovery of visual pattern configurations. In NIPS, 2014.
[50] H. Su, J. Deng, and L. Fei-Fei. Crowdsourcing annotations for visual object detection. In AAAI Human Computation Workshop, 2012.
[51] K. Tang, A. Joulin, L.-J. Li, and L. Fei-Fei. Co-localization in real-world images. In CVPR, 2014.
[52] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative segment annotation in weakly labeled video. In CVPR, 2013.
[53] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 2013.
[54] S. Vijayanarasimhan and K. Grauman. Multi-level active prediction of useful image annotations for recognition. In NIPS, 2008.
[55] S. Vijayanarasimhan and K. Grauman. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In CVPR, 2009.
[56] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. IJCV, 108(1-2):97-114, 2014.
[57] P. A. Viola, J. Platt, and C. Zhang. Multiple instance boosting for object detection. In NIPS, 2005.
[58] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. In ICCV, 2011.
[59] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie. Similarity comparisons for interactive fine-grained categorization. In CVPR, 2014.
[60] C. Wang, W. Ren, J. Zhang, K. Huang, and S. Maybank. Large-scale weakly supervised object localization via latent category learning. IEEE Transactions on Image Processing, 24(4), 2015.
[61] L. Wang, G. Hua, R. Sukthankar, J. Xue, and J. Zheng. Video object discovery and co-segmentation with extremely weak supervision. In ECCV, 2014.
[62] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In ICCV, 2013.
[63] A. Yao, J. Gall, C. Leistner, and L. Van Gool. Interactive object detection. In CVPR, 2012.


More information

The Visual Internet of Things System Based on Depth Camera

The Visual Internet of Things System Based on Depth Camera The Visual Internet of Things System Based on Depth Camera Xucong Zhang 1, Xiaoyun Wang and Yingmin Jia Abstract The Visual Internet of Things is an important part of information technology. It is proposed

More information

Fast Accurate Fish Detection and Recognition of Underwater Images with Fast R-CNN

Fast Accurate Fish Detection and Recognition of Underwater Images with Fast R-CNN Fast Accurate Fish Detection and Recognition of Underwater Images with Fast R-CNN Xiu Li 1, 2, Min Shang 1, 2, Hongwei Qin 1, 2, Liansheng Chen 1, 2 1. Department of Automation, Tsinghua University, Beijing

More information

Administrivia. Traditional Recognition Approach. Overview. CMPSCI 370: Intro. to Computer Vision Deep learning

Administrivia. Traditional Recognition Approach. Overview. CMPSCI 370: Intro. to Computer Vision Deep learning : Intro. to Computer Vision Deep learning University of Massachusetts, Amherst April 19/21, 2016 Instructor: Subhransu Maji Finals (everyone) Thursday, May 5, 1-3pm, Hasbrouck 113 Final exam Tuesday, May

More information

Object Categorization using Co-Occurrence, Location and Appearance

Object Categorization using Co-Occurrence, Location and Appearance Object Categorization using Co-Occurrence, Location and Appearance Carolina Galleguillos Andrew Rabinovich Serge Belongie Department of Computer Science and Engineering University of California, San Diego

More information

R-CNN minus R. 1 Introduction. Karel Lenc http://www.robots.ox.ac.uk/~karel. Department of Engineering Science, University of Oxford, Oxford, UK.

R-CNN minus R. 1 Introduction. Karel Lenc http://www.robots.ox.ac.uk/~karel. Department of Engineering Science, University of Oxford, Oxford, UK. LENC, VEDALDI: R-CNN MINUS R 1 R-CNN minus R Karel Lenc http://www.robots.ox.ac.uk/~karel Andrea Vedaldi http://www.robots.ox.ac.uk/~vedaldi Department of Engineering Science, University of Oxford, Oxford,

More information

Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization

Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization Yong Li 1, Jing Liu 1, Yuhang Wang 1, Bingyuan Liu 1, Jun Fu 1, Yunze Gao 1, Hui Wu 2, Hang Song 1, Peng Ying 1, and Hanqing

More information

arxiv:1312.6034v2 [cs.cv] 19 Apr 2014

arxiv:1312.6034v2 [cs.cv] 19 Apr 2014 Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps arxiv:1312.6034v2 [cs.cv] 19 Apr 2014 Karen Simonyan Andrea Vedaldi Andrew Zisserman Visual Geometry Group,

More information

Taking a Deeper Look at Pedestrians

Taking a Deeper Look at Pedestrians Taking a Deeper Look at Pedestrians Jan Hosang Mohamed Omran Rodrigo Benenson Bernt Schiele Max Planck Institute for Informatics Saarbrücken, Germany firstname.lastname@mpi-inf.mpg.de Abstract In this

More information

SEMANTIC CONTEXT AND DEPTH-AWARE OBJECT PROPOSAL GENERATION

SEMANTIC CONTEXT AND DEPTH-AWARE OBJECT PROPOSAL GENERATION SEMANTIC TEXT AND DEPTH-AWARE OBJECT PROPOSAL GENERATION Haoyang Zhang,, Xuming He,, Fatih Porikli,, Laurent Kneip NICTA, Canberra; Australian National University, Canberra ABSTRACT This paper presents

More information

Learning Spatial Context: Using Stuff to Find Things

Learning Spatial Context: Using Stuff to Find Things Learning Spatial Context: Using Stuff to Find Things Geremy Heitz Daphne Koller Department of Computer Science, Stanford University {gaheitz,koller}@cs.stanford.edu Abstract. The sliding window approach

More information

Large-scale Knowledge Transfer for Object Localization in ImageNet

Large-scale Knowledge Transfer for Object Localization in ImageNet Large-scale Knowledge Transfer for Object Localization in ImageNet Matthieu Guillaumin ETH Zürich, Switzerland Vittorio Ferrari University of Edinburgh, UK Abstract ImageNet is a large-scale database of

More information

Weakly Supervised Object Boundaries Supplementary material

Weakly Supervised Object Boundaries Supplementary material Weakly Supervised Object Boundaries Supplementary material Anna Khoreva Rodrigo Benenson Mohamed Omran Matthias Hein 2 Bernt Schiele Max Planck Institute for Informatics, Saarbrücken, Germany 2 Saarland

More information

CNN Based Object Detection in Large Video Images. WangTao, wtao@qiyi.com IQIYI ltd. 2016.4

CNN Based Object Detection in Large Video Images. WangTao, wtao@qiyi.com IQIYI ltd. 2016.4 CNN Based Object Detection in Large Video Images WangTao, wtao@qiyi.com IQIYI ltd. 2016.4 Outline Introduction Background Challenge Our approach System framework Object detection Scene recognition Body

More information

Do Convnets Learn Correspondence?

Do Convnets Learn Correspondence? Do Convnets Learn Correspondence? Jonathan Long Ning Zhang Trevor Darrell University of California Berkeley {jonlong, nzhang, trevor}@cs.berkeley.edu Abstract Convolutional neural nets (convnets) trained

More information

arxiv:1504.08083v2 [cs.cv] 27 Sep 2015

arxiv:1504.08083v2 [cs.cv] 27 Sep 2015 Fast R-CNN Ross Girshick Microsoft Research rbg@microsoft.com arxiv:1504.08083v2 [cs.cv] 27 Sep 2015 Abstract This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object

More information

Group Sparse Coding. Fernando Pereira Google Mountain View, CA pereira@google.com. Dennis Strelow Google Mountain View, CA strelow@google.

Group Sparse Coding. Fernando Pereira Google Mountain View, CA pereira@google.com. Dennis Strelow Google Mountain View, CA strelow@google. Group Sparse Coding Samy Bengio Google Mountain View, CA bengio@google.com Fernando Pereira Google Mountain View, CA pereira@google.com Yoram Singer Google Mountain View, CA singer@google.com Dennis Strelow

More information

Object Recognition. Selim Aksoy. Bilkent University saksoy@cs.bilkent.edu.tr

Object Recognition. Selim Aksoy. Bilkent University saksoy@cs.bilkent.edu.tr Image Classification and Object Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Image classification Image (scene) classification is a fundamental

More information

Finding people in repeated shots of the same scene

Finding people in repeated shots of the same scene Finding people in repeated shots of the same scene Josef Sivic 1 C. Lawrence Zitnick Richard Szeliski 1 University of Oxford Microsoft Research Abstract The goal of this work is to find all occurrences

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

Learning Detectors from Large Datasets for Object Retrieval in Video Surveillance

Learning Detectors from Large Datasets for Object Retrieval in Video Surveillance 2012 IEEE International Conference on Multimedia and Expo Learning Detectors from Large Datasets for Object Retrieval in Video Surveillance Rogerio Feris, Sharath Pankanti IBM T. J. Watson Research Center

More information

Latest Advances in Deep Learning. Yao Chou

Latest Advances in Deep Learning. Yao Chou Latest Advances in Deep Learning Yao Chou Outline Introduction Images Classification Object Detection R-CNN Traditional Feature Descriptor Selective Search Implementation Latest Application Deep Learning

More information

Keypoint Density-based Region Proposal for Fine-Grained Object Detection and Classification using Regions with Convolutional Neural Network Features

Keypoint Density-based Region Proposal for Fine-Grained Object Detection and Classification using Regions with Convolutional Neural Network Features Keypoint Density-based Region Proposal for Fine-Grained Object Detection and Classification using Regions with Convolutional Neural Network Features JT Turner 1, Kalyan Gupta 1, Brendan Morris 2, & David

More information

Learning Motion Categories using both Semantic and Structural Information

Learning Motion Categories using both Semantic and Structural Information Learning Motion Categories using both Semantic and Structural Information Shu-Fai Wong, Tae-Kyun Kim and Roberto Cipolla Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, UK {sfw26,

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

More information

Transfer Learning by Borrowing Examples for Multiclass Object Detection

Transfer Learning by Borrowing Examples for Multiclass Object Detection Transfer Learning by Borrowing Examples for Multiclass Object Detection Joseph J. Lim CSAIL, MIT lim@csail.mit.edu Ruslan Salakhutdinov Department of Statistics, University of Toronto rsalakhu@utstat.toronto.edu

More information

Multi-Scale Improves Boundary Detection in Natural Images

Multi-Scale Improves Boundary Detection in Natural Images Multi-Scale Improves Boundary Detection in Natural Images Xiaofeng Ren Toyota Technological Institute at Chicago 1427 E. 60 Street, Chicago, IL 60637, USA xren@tti-c.org Abstract. In this work we empirically

More information

Behavior Analysis in Crowded Environments. XiaogangWang Department of Electronic Engineering The Chinese University of Hong Kong June 25, 2011

Behavior Analysis in Crowded Environments. XiaogangWang Department of Electronic Engineering The Chinese University of Hong Kong June 25, 2011 Behavior Analysis in Crowded Environments XiaogangWang Department of Electronic Engineering The Chinese University of Hong Kong June 25, 2011 Behavior Analysis in Sparse Scenes Zelnik-Manor & Irani CVPR

More information

Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not?

Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not? Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not? Erjin Zhou zej@megvii.com Zhimin Cao czm@megvii.com Qi Yin yq@megvii.com Abstract Face recognition performance improves rapidly

More information

What is an object? Abstract

What is an object? Abstract What is an object? Bogdan Alexe, Thomas Deselaers, Vittorio Ferrari Computer Vision Laboratory, ETH Zurich {bogdan, deselaers, ferrari}@vision.ee.ethz.ch Abstract We present a generic objectness measure,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Studying Relationships Between Human Gaze, Description, and Computer Vision

Studying Relationships Between Human Gaze, Description, and Computer Vision Studying Relationships Between Human Gaze, Description, and Computer Vision Kiwon Yun1, Yifan Peng1, Dimitris Samaras1, Gregory J. Zelinsky2, Tamara L. Berg1 1 Department of Computer Science, 2 Department

More information

InstaNet: Object Classification Applied to Instagram Image Streams

InstaNet: Object Classification Applied to Instagram Image Streams InstaNet: Object Classification Applied to Instagram Image Streams Clifford Huang Stanford University chuang8@stanford.edu Mikhail Sushkov Stanford University msushkov@stanford.edu Abstract The growing

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Manifold regularized kernel logistic regression for web image annotation

Manifold regularized kernel logistic regression for web image annotation Manifold regularized kernel logistic regression for web image annotation W. Liu 1, H. Liu 1, D.Tao 2*, Y. Wang 1, K. Lu 3 1 China University of Petroleum (East China) 2 *** 3 University of Chinese Academy

More information

How To Use A Near Neighbor To A Detector

How To Use A Near Neighbor To A Detector Ensemble of -SVMs for Object Detection and Beyond Tomasz Malisiewicz Carnegie Mellon University Abhinav Gupta Carnegie Mellon University Alexei A. Efros Carnegie Mellon University Abstract This paper proposes

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

How To Generate Object Proposals On A Computer With A Large Image Of A Large Picture

How To Generate Object Proposals On A Computer With A Large Image Of A Large Picture Geodesic Object Proposals Philipp Krähenbühl 1 and Vladlen Koltun 2 1 Stanford University 2 Adobe Research Abstract. We present an approach for identifying a set of candidate objects in a given image.

More information

Classification using Intersection Kernel Support Vector Machines is Efficient

Classification using Intersection Kernel Support Vector Machines is Efficient To Appear in IEEE Computer Vision and Pattern Recognition 28, Anchorage Classification using Intersection Kernel Support Vector Machines is Efficient Subhransu Maji EECS Department U.C. Berkeley smaji@eecs.berkeley.edu

More information

Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite

Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite Philip Lenz 1 Andreas Geiger 2 Christoph Stiller 1 Raquel Urtasun 3 1 KARLSRUHE INSTITUTE OF TECHNOLOGY 2 MAX-PLANCK-INSTITUTE IS 3

More information

MULTI-LEVEL SEMANTIC LABELING OF SKY/CLOUD IMAGES

MULTI-LEVEL SEMANTIC LABELING OF SKY/CLOUD IMAGES MULTI-LEVEL SEMANTIC LABELING OF SKY/CLOUD IMAGES Soumyabrata Dev, Yee Hui Lee School of Electrical and Electronic Engineering Nanyang Technological University (NTU) Singapore 639798 Stefan Winkler Advanced

More information

Classifying Manipulation Primitives from Visual Data

Classifying Manipulation Primitives from Visual Data Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if

More information

Vehicle Tracking by Simultaneous Detection and Viewpoint Estimation

Vehicle Tracking by Simultaneous Detection and Viewpoint Estimation Vehicle Tracking by Simultaneous Detection and Viewpoint Estimation Ricardo Guerrero-Gómez-Olmedo, Roberto López-Sastre, Saturnino Maldonado-Bascón, and Antonio Fernández-Caballero 2 GRAM, Department of

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Learning to Process Natural Language in Big Data Environment

Learning to Process Natural Language in Big Data Environment CCF ADL 2015 Nanchang Oct 11, 2015 Learning to Process Natural Language in Big Data Environment Hang Li Noah s Ark Lab Huawei Technologies Part 1: Deep Learning - Present and Future Talk Outline Overview

More information

Evaluating Sources and Strategies for Learning Video Concepts from Social Media

Evaluating Sources and Strategies for Learning Video Concepts from Social Media Evaluating Sources and Strategies for Learning Video Concepts from Social Media Svetlana Kordumova Intelligent Systems Lab Amsterdam University of Amsterdam The Netherlands Email: s.kordumova@uva.nl Xirong

More information

Fast Matching of Binary Features

Fast Matching of Binary Features Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been

More information

Discovering objects and their location in images

Discovering objects and their location in images Discovering objects and their location in images Josef Sivic Bryan C. Russell Alexei A. Efros Andrew Zisserman William T. Freeman Dept. of Engineering Science CS and AI Laboratory School of Computer Science

More information

The goal is multiply object tracking by detection with application on pedestrians.

The goal is multiply object tracking by detection with application on pedestrians. Coupled Detection and Trajectory Estimation for Multi-Object Tracking By B. Leibe, K. Schindler, L. Van Gool Presented By: Hanukaev Dmitri Lecturer: Prof. Daphna Wienshall The Goal The goal is multiply

More information

Simultaneous Deep Transfer Across Domains and Tasks

Simultaneous Deep Transfer Across Domains and Tasks Simultaneous Deep Transfer Across Domains and Tasks Eric Tzeng, Judy Hoffman, Trevor Darrell UC Berkeley, EECS & ICSI {etzeng,jhoffman,trevor}@eecs.berkeley.edu Kate Saenko UMass Lowell, CS saenko@cs.uml.edu

More information