Semantic Recognition: Object Detection and Scene Segmentation Xuming He xuming.he@nicta.com.au Computer Vision Research Group NICTA Robotic Vision Summer School 2015 Acknowledgement: Slides from Fei-Fei Li, R. Fergus, A. Torralba, K. Grauman.
Semantic scene understanding Semantic recognition tasks: Object detection Semantic segmentation
Semantic scene understanding Semantic recognition tasks: Object detection Semantic segmentation Activity recognition Scene layout estimation Scene categorization Geo-localization
Outline Object (category) detection Basic framework of object detection Case study: Viola-Jones, DPM and R-CNN Discussions Semantic scene segmentation Pixel labeling and CRF Design of CRF models Discussions
Kristen Grauman Object detection: why is it so hard? Illumination Object pose Clutter Occlusions Intra-class appearance variation Viewpoint
Kristen Grauman Object detection: why is it so hard? Realistic scenes are crowded, cluttered, and have overlapping objects.
What works reliably today (Semi-)rigid objects E.g., face, car, pedestrian, license plate / traffic sign
Kristen Grauman Generic object detection: basic framework Build/train object model (training stage) Choose an object representation Learn or fit parameters of object model Generate candidates in new image Score the candidates (inference/prediction stage)
Case study I: Viola-Jones face detector Overview: A seminal approach to real-time object detection (Viola and Jones, IJCV 2004) Object / feature representation: Global template with Haar wavelet features Object model and scoring: Classifying object candidates into face/non-face Use boosted combination of discriminative features as final classifier
Kristen Grauman Viola-Jones detector: Haar wavelets Rectangular filters Feature output is difference between adjacent regions Efficiently computable with integral image: any sum can be computed in constant time. Value at (x,y) is sum of pixels above and to the left of (x,y) Integral image
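The integral-image trick above can be sketched in a few lines of Python (function names and the zero-padding layout are my own, not from the slides):

```python
def integral_image(img):
    """Integral image: ii[y][x] = sum of img over rows < y, cols < x.
    Padded with an extra zero row/column so box sums need no bounds checks."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] from four lookups (constant time)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

A Haar-wavelet feature is then just the difference of two (or three) `box_sum` calls over adjacent rectangles, so every feature costs a handful of array lookups regardless of rectangle size.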
Viola-Jones detector: features Which subset of these features should we use to determine if a window has a face? Considering all possible filter parameters: position, scale, and type: 180,000+ possible features associated with each 24 x 24 window Use AdaBoost both to select the informative features and to build the classifier Kristen Grauman
Viola-Jones detector: Boosting Defines a classifier using an additive model: H(x) = sign( Σ_t α_t h_t(x) ), with strong classifier H, weights α_t, and weak classifiers h_t over the feature vector x. Training: incrementally selecting weak classifiers. During each step, we select a weak learner that does well on examples that were hard for the previous weak learners. Hardness is captured by weights attached to training examples.
Kristen Grauman Viola-Jones detector: AdaBoost Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error. Resulting weak classifier: Outputs of a possible rectangle feature on faces and non-faces. For the next round, reweight the examples according to errors, choose another filter/threshold combination.
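One round of this stump-selection-and-reweighting loop can be sketched as follows (a brute-force illustrative version: the stump search is O(features × thresholds), and all names are assumptions, not the detector's actual code):

```python
import math

def best_stump(features, labels, weights):
    """Pick the (feature index, threshold, polarity) decision stump
    minimizing weighted error. labels are in {+1, -1}."""
    n_feat = len(features[0])
    best = None
    for j in range(n_feat):
        vals = sorted(set(f[j] for f in features))
        thresholds = [v - 0.5 for v in vals] + [vals[-1] + 0.5]
        for t in thresholds:
            for pol in (+1, -1):
                err = sum(w for f, y, w in zip(features, labels, weights)
                          if pol * (1 if f[j] > t else -1) != y)
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best

def adaboost_round(features, labels, weights):
    """One boosting step: fit a stump, compute its vote alpha,
    then up-weight the examples it got wrong and renormalize."""
    err, j, t, pol = best_stump(features, labels, weights)
    err = max(err, 1e-10)                       # avoid log(0) on perfect stumps
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [w * math.exp(-alpha * y * pol * (1 if f[j] > t else -1))
             for f, y, w in zip(features, labels, weights)]
    z = sum(new_w)
    return (j, t, pol, alpha), [w / z for w in new_w]
```

Repeating `adaboost_round` T times and summing `alpha * stump(x)` gives the strong classifier from the previous slide.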
Viola-Jones detector: Learned model First two features selected
Kristen Grauman Viola-Jones detector: Candidate generation Sliding window at multiple scales face/non-face Classifier
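The multi-scale sliding-window enumeration can be sketched like this (the 24-pixel base window and 1.25 scale step follow common Viola-Jones settings; the stride fraction is an assumed tuning knob):

```python
def sliding_windows(img_w, img_h, base=24, scale_step=1.25, stride_frac=0.1):
    """Yield (x, y, size) candidate square windows over an image pyramid:
    scan the image at the base size, grow the window, scan again."""
    size = base
    while size <= min(img_w, img_h):
        stride = max(1, int(size * stride_frac))   # step proportional to scale
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size
        size = int(size * scale_step)
```

Every yielded window is passed to the face/non-face classifier, which is why per-window cost must be tiny: even a modest image generates tens of thousands of candidates.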
Kristen Grauman Cascading classifiers for detection Form a cascade with low false negative rates early on Apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative
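The control flow of the cascade is simple to state in code (a minimal sketch; `stages` as a list of (classifier, threshold) pairs is my own framing):

```python
def cascade_detect(window, stages):
    """Evaluate a cascade of (classifier, threshold) stages.
    Reject as soon as any stage scores below its threshold; only
    windows surviving every stage are reported as detections."""
    for clf, thresh in stages:
        if clf(window) < thresh:
            return False          # early reject: most windows exit here
    return True
```

The design point is that early stages use very few features, with thresholds tuned for near-zero false negatives, so the vast majority of (negative) windows are discarded after a handful of operations.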
Detecting profile faces? Can we use the same detector?
Viola-Jones detector: Strengths Sliding window detection and global appearance descriptors: Simple detection protocol to implement Good feature choices critical Past successes for certain classes Implementation: OpenCV Kristen Grauman
Viola-Jones detector: Limitations Non-rigid, deformable objects are not captured well by representations assuming a fixed 2D structure, or one must assume a fixed viewpoint Objects with less-regular textures are not captured well by holistic appearance-based descriptions Kristen Grauman
Case Study II: Deformable Part-based Models Overview: Felzenszwalb et al. PAMI 2010, and winner of the PASCAL detection challenge (2008, 2009) Part-based representation: Global (root) template + deformable parts Trained from global bounding-boxes only
DPMs: Part-based representation Objects are decomposed into parts and spatial relations among parts (Fischler and Elschlager, 1973)
DPMs: Object representation Based on HOG features (1 root + 6 parts) Full model is a mixture of deformable part-based models
DPMs: Object model Object candidates obtained in a multi-scale fashion
DPMs: Object model Score of candidates
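In symbols, the score of a candidate placement, following Felzenszwalb et al. (PAMI 2010), sums filter responses minus quadratic deformation costs:

```latex
\mathrm{score}(p_0, \ldots, p_n) \;=\; \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
\;-\; \sum_{i=1}^{n} d_i \cdot \left( dx_i,\; dy_i,\; dx_i^2,\; dy_i^2 \right) \;+\; b
```

Here $F_0$ is the root filter and $F_1, \ldots, F_n$ the part filters, $\phi(H, p_i)$ are HOG features at placement $p_i$ in the feature pyramid $H$, $d_i$ are learned deformation weights, $(dx_i, dy_i)$ is the displacement of part $i$ from its anchor relative to the root, and $b$ is a bias term used when combining mixture components.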
DPMs: Learning object models Training data consists of images with labeled bounding boxes Need to learn the model structure, filters and deformation costs: Latent SVM
DPMs: Candidate generation and scoring Detection: Defined by a high-scoring root location Relies on an overall score based on the placement of the parts Efficient inference via dynamic programming / belief propagation
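For a single part along one dimension, the quantity the dynamic program maximizes looks like this (a brute-force O(L²)-per-part illustration of the objective only; the DPM code replaces this inner search with a linear-time generalized distance transform):

```python
def best_part_placement(response, anchor, dcost):
    """Best 1-D part placement: maximize, over positions p,
    filter response minus quadratic deformation cost from the anchor."""
    best_p, best_s = None, float("-inf")
    for p, r in enumerate(response):
        s = r - dcost * (p - anchor) ** 2
        if s > best_s:
            best_p, best_s = p, s
    return best_p, best_s
```

Because the parts are attached to the root in a star (tree) structure, each part's best placement can be solved independently given the root location, and the per-part maxima are simply summed into the root score.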
Deformable Part-based Models: Results Car detections
Deformable Part-based Models: Results Person detections
Other part-based representations Tree model Articulated objects
Part-based models: Pose estimation Pose estimation in video (Ramanan et al, 2007) Running on street Dancing
DPMs: Strengths Part-based representation : Flexible object model with deformation/pose Discriminatively learned with bounding box annotation Past successes in PASCAL detection challenges. Implementation: http://www.cs.berkeley.edu/~rbg/latent/ Kristen Grauman
DPMs: Limitations Manually designed feature (HOG) Pre-defined object-part structure Trainable classifier is often generic (e.g. SVM) Where next? Better classifiers? Or keep building more features? Object candidates Hand-designed feature extraction Trainable classifier Object Class Kristen Grauman
Case Study III: Regions with CNN features Overview: Girshick et al. CVPR 2014, a significant improvement on the PASCAL VOC benchmark Learned object representation based on a Convolutional Neural Network Candidate generation by region proposals (objectness)
R-CNN: Object representation Learn a feature hierarchy all the way from pixels to classifier Each layer extracts features from the output of the previous layer Train all layers jointly Pipeline: object candidates → Layer 1 → Layer 2 → Layer 3 → simple classifier
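The "each layer filters the previous layer's output" idea can be shown with a toy 1-D analogue (a sketch only, nothing like the real AlexNet-style network R-CNN uses; all names are assumptions):

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: one 'feature map' per layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Elementwise nonlinearity between layers."""
    return [max(0.0, x) for x in xs]

def feature_hierarchy(signal, kernels):
    """Stack conv + ReLU layers: each layer consumes the previous output,
    so deeper layers respond to increasingly larger input regions."""
    out = signal
    for kern in kernels:
        out = relu(conv1d(out, kern))
    return out
```

The key property mirrored here is compositionality: a unit two layers deep depends on a wider window of the input than a unit one layer deep, which is why the visualized top-activating patches grow from edges to parts to whole objects.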
Layer 1: Top-9 Patches Patches from validation images that give maximal activation of a given feature map
Layer 2: Top-9 Patches
Layer 3: Top-9 Patches
R-CNN: Candidate generation Class-generic object detection, or objectness (e.g., Alexe, Deselaers, and Ferrari, 2010): saliency, edge map, segments R-CNN uses the Selective Search method (van de Sande, Uijlings, Gevers, Smeulders, 2011)
R-CNN: Detection pipeline and results Strength: Significant improvement on public benchmarks (mAP = 53.7% on PASCAL VOC 2010 vs. ~35% with DPM) Implementation: https://github.com/rbgirshick/rcnn
Outline Object (category) detection Basic framework of object detection Case study: Viola-Jones, DPM and R-CNN Discussions Semantic scene segmentation Pixel labeling and CRF Design of CRF models Discussions
Semantic scene understanding Semantic recognition tasks: Object detection Semantic segmentation
Pixel labeling problem Problem formulation Assign predefined labels to image elements Multiple label spaces Typical settings Semantic object class Segmentation + recognition Object instance labeling Geometric class labeling etc. (Gould and He, CACM 2014)
Pixel labeling problem Surface layout (Hoiem, Efros & Hebert ICCV05; Gupta et al, ECCV 2010) Sky Non-Planar Porous Vertical Non-Planar Solid Support Planar (Left/Center/Right) Geometry + semantic segmentation (Gould et al, ICCV09)
A local solution Multiclass segmentation Image inputs A pixel-wise classifier energy function
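One way to write this local model's energy (notation assumed here, chosen to match the CRF formulation later in the talk):

```latex
E(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{i} \psi_i(y_i; \mathbf{x}),
\qquad \psi_i(y_i; \mathbf{x}) \;=\; -\log P(y_i \mid \mathbf{x})
```

Because the energy decomposes over pixels with no interaction terms, minimizing it reduces to taking the independent per-pixel argmax of the classifier, which is exactly what makes this solution "local".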
Challenges in local approaches Local cues can be ambiguous for scene analysis: locally, sky and water patches can look identical, so a local classifier may assign each a probability of 0.5 for either label. Objects/regions are correlated in a scene. (He et al, CVPR 2004)
Adding contextual information Incorporating spatial context Labels are generally spatially smooth Image inputs Local image cues Contextual information
An example: A simple smooth model Same labeling for neighboring pixels unless an intensity gradient exists Unary only + Pairwise (Shotton et al, ECCV2006)
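The "smooth unless there is an intensity gradient" pairwise term is a contrast-sensitive Potts potential, which can be sketched as (parameter values `lam` and `beta` are illustrative assumptions; in practice they are tuned per dataset):

```python
import math

def contrast_potts(label_i, label_j, intensity_i, intensity_j,
                   lam=1.0, beta=0.05):
    """Contrast-sensitive Potts pairwise potential: zero cost when
    neighboring pixels take the same label; otherwise a penalty that
    decays with the intensity difference, so label boundaries are
    cheap where the image itself has an edge."""
    if label_i == label_j:
        return 0.0
    return lam * math.exp(-beta * (intensity_i - intensity_j) ** 2)
```

Adding this term over all 4-connected neighbor pairs to the unary energy yields the simple smoothing CRF of Shotton et al.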
Conditional Random Field framework Input Output CRF model Unary potential Pairwise potential Higher-order potential Examples: surface normal, object class, depth, etc.
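In symbols, the CRF energy combining the three potential types above (notation assumed):

```latex
E(\mathbf{y} \mid \mathbf{x}) \;=\;
\sum_{i \in \mathcal{V}} \psi_i(y_i; \mathbf{x})
\;+\; \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j; \mathbf{x})
\;+\; \sum_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c; \mathbf{x})
```

with unary potentials over nodes $\mathcal{V}$ (pixels or superpixels), pairwise potentials over neighbor edges $\mathcal{E}$, and higher-order potentials over cliques $\mathcal{C}$ such as superpixel regions.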
Conditional Random Field framework Energy minimization perspective Label prediction: MAP estimation Global optimization of combinatorial problems Design choices in scene modeling Feature representation Modeling context Integrating top-down information
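As a concrete (if weak) energy minimizer, iterated conditional modes updates one node at a time to its locally cheapest label; this is a minimal sketch, not the solver used in practice (graph cuts or belief propagation are standard for these combinatorial problems):

```python
def icm(unary, neighbors, pairwise, iters=10):
    """Iterated conditional modes for E(y) = sum_i unary[i][y_i]
    + sum over neighbor pairs of pairwise(y_i, y_j).
    unary: per-node lists of label costs; neighbors: adjacency lists.
    Greedy coordinate descent: converges to a local minimum only."""
    labels = [min(range(len(u)), key=lambda l: u[l]) for u in unary]
    for _ in range(iters):
        changed = False
        for i, u in enumerate(unary):
            def cost(l):
                return u[l] + sum(pairwise(l, labels[j]) for j in neighbors[i])
            best = min(range(len(u)), key=cost)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

On a 3-node chain with a Potts pairwise term, the middle node's weak unary preference is overridden by agreement pressure from whichever neighbor it can match cheaply, which is exactly the context effect the slides motivate.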
Image features and unary potentials Manually designed features Stuff class: local features Thing class: + shape cues Global image features Deep network features (Long, Shelhamer and Darrell, arXiv 2014) (Farabet et al, PAMI 2013)
Image features and unary potentials
Pixel vs. superpixel Pixel representation: redundancy, leading to complex models Superpixel representation (over-segmentation): reduced model size; larger support for feature extraction; fast and regular-shaped, e.g. SLIC (Achanta et al, PAMI 2012), implemented in VLFeat; but yields an irregular graph and inaccurate object boundaries Better to combine both representations.
Modeling context Local context Bottom-up grouping Superpixels to labels Regional context Pairwise interaction between neighboring regions Spatial relations between regions (Galleguillos et al., ICCV07, CVPR08)
Modeling longer-range context Fully-connected CRFs (Krähenbühl and Koltun, NIPS 2011) Higher-order models (Kohli et al., CVPR08; Park & Gould, ECCV12)
Integrating object-specific cues Previous potentials: smoothing Object shape mask as a top-down cue Integrating scene classification, object detection, etc (Yao, et al. CVPR 2012)
Integrating object-specific cues Object shape mask as a top-down cue Integrating object detection with semantic video labeling (Liu, et al. WACV & CVPR 2015) NICTA Copyright 2012 From imagination to impact
Datasets and software Datasets Stanford Background Dataset ; Microsoft Research Cambridge Dataset Pascal VOC; Labelme Dataset; MSCOCO Dataset Software packages Darwin software framework http://drwn.anu.edu.au ALE (Automatic Labeling Environment) http://cms.brookes.ac.uk/staff/philiptorr/ale.htm
Summary Pixel labeling and CRF framework Design choices in semantic segmentation Image feature representation Modeling context (short-range vs. long-range) Integrating top-down information at object and scene level Ongoing research directions Deep network features and CRF framework Nonparametric label transfer Multiple modalities in scene labeling Gould and He, Scene Understanding by Labeling Pixels. Communications of the ACM, 2014