Semantic Recognition: Object Detection and Scene Segmentation Xuming He xuming.he@nicta.com.au Computer Vision Research Group NICTA Robotic Vision Summer School 2015 Acknowledgement: Slides from Fei-Fei Li, R. Fergus, A. Torralba, K. Grauman.
Semantic scene understanding Semantic recognition tasks: Object detection Semantic segmentation
Semantic scene understanding Semantic recognition tasks: Object detection Semantic segmentation Activity recognition Scene layout estimation Scene categorization Geo-localization
Outline Object (category) detection Basic framework of object detection Case study: Viola-Jones, DPM and R-CNN Discussions Semantic scene segmentation Pixel labeling and CRF Design of CRF models Discussions
Kristen Grauman Object detection: why is it so hard? Illumination Object pose Clutter Occlusions Intra-class appearance variation Viewpoint
Kristen Grauman Object detection: why is it so hard? Realistic scenes are crowded, cluttered, and have overlapping objects.
What works reliably today (Semi-)rigid objects E.g., face, car, pedestrian, license plate / traffic sign
Kristen Grauman Generic object detection: basic framework Build/train object model (training stage) Choose an object representation Learn or fit parameters of object model Generate candidates in new image Score the candidates (inference/prediction stage)
Case study I: Viola-Jones face detector Overview: A seminal approach to real-time object detection (Viola and Jones, IJCV 2004) Object / feature representation: Global template with Haar wavelet features Object model and scoring: Classifying object candidates into face/non-face Use boosted combination of discriminative features as final classifier
Kristen Grauman Viola-Jones detector: Haar wavelets Rectangular filters Feature output is difference between adjacent regions Efficiently computable with integral image: any sum can be computed in constant time. Value at (x,y) is sum of pixels above and to the left of (x,y) Integral image
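The integral-image trick above can be sketched in a few lines of Python (function names and the zero-padding layout are my own, not from the slides):

```python
def integral_image(img):
    """Integral image: ii[y][x] = sum of img over rows < y, cols < x.
    Padded with an extra zero row/column so box sums need no bounds checks."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] from four lookups (constant time)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

A Haar-wavelet feature is then just the difference of two (or three) `box_sum` calls over adjacent rectangles, so every feature costs a handful of array lookups regardless of rectangle size.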
Viola-Jones detector: features Which subset of these features should we use to determine if a window has a face? Considering all possible filter parameters: position, scale, and type: 180,000+ possible features associated with each 24 x 24 window Use AdaBoost both to select the informative features and to build the classifier Kristen Grauman
Viola-Jones detector: Boosting Defines a classifier using an additive model: H(x) = sign( Σ_t α_t h_t(x) ), with strong classifier H, weights α_t, and weak classifiers h_t over the feature vector x. Training: incrementally selecting weak classifiers. During each step, we select a weak learner that does well on examples that were hard for the previous weak learners. Hardness is captured by weights attached to training examples.
Kristen Grauman Viola-Jones detector: AdaBoost Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error. Resulting weak classifier: Outputs of a possible rectangle feature on faces and non-faces. For the next round, reweight the examples according to errors, choose another filter/threshold combination.
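One round of this stump-selection-and-reweighting loop can be sketched as follows (a brute-force illustrative version: the stump search is O(features × thresholds), and all names are assumptions, not the detector's actual code):

```python
import math

def best_stump(features, labels, weights):
    """Pick the (feature index, threshold, polarity) decision stump
    minimizing weighted error. labels are in {+1, -1}."""
    n_feat = len(features[0])
    best = None
    for j in range(n_feat):
        vals = sorted(set(f[j] for f in features))
        thresholds = [v - 0.5 for v in vals] + [vals[-1] + 0.5]
        for t in thresholds:
            for pol in (+1, -1):
                err = sum(w for f, y, w in zip(features, labels, weights)
                          if pol * (1 if f[j] > t else -1) != y)
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best

def adaboost_round(features, labels, weights):
    """One boosting step: fit a stump, compute its vote alpha,
    then up-weight the examples it got wrong and renormalize."""
    err, j, t, pol = best_stump(features, labels, weights)
    err = max(err, 1e-10)                       # avoid log(0) on perfect stumps
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [w * math.exp(-alpha * y * pol * (1 if f[j] > t else -1))
             for f, y, w in zip(features, labels, weights)]
    z = sum(new_w)
    return (j, t, pol, alpha), [w / z for w in new_w]
```

Repeating `adaboost_round` T times and summing `alpha * stump(x)` gives the strong classifier from the previous slide.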
Viola-Jones detector: Learned model First two features selected
Kristen Grauman Viola-Jones detector: Candidate generation Sliding window at multiple scales face/non-face Classifier
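The multi-scale sliding-window enumeration can be sketched like this (the 24-pixel base window and 1.25 scale step follow common Viola-Jones settings; the stride fraction is an assumed tuning knob):

```python
def sliding_windows(img_w, img_h, base=24, scale_step=1.25, stride_frac=0.1):
    """Yield (x, y, size) candidate square windows over an image pyramid:
    scan the image at the base size, grow the window, scan again."""
    size = base
    while size <= min(img_w, img_h):
        stride = max(1, int(size * stride_frac))   # step proportional to scale
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size
        size = int(size * scale_step)
```

Every yielded window is passed to the face/non-face classifier, which is why per-window cost must be tiny: even a modest image generates tens of thousands of candidates.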
Kristen Grauman Cascading classifiers for detection Form a cascade with low false negative rates early on Apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative
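The control flow of the cascade is simple to state in code (a minimal sketch; `stages` as a list of (classifier, threshold) pairs is my own framing):

```python
def cascade_detect(window, stages):
    """Evaluate a cascade of (classifier, threshold) stages.
    Reject as soon as any stage scores below its threshold; only
    windows surviving every stage are reported as detections."""
    for clf, thresh in stages:
        if clf(window) < thresh:
            return False          # early reject: most windows exit here
    return True
```

The design point is that early stages use very few features, with thresholds tuned for near-zero false negatives, so the vast majority of (negative) windows are discarded after a handful of operations.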
Detecting profile faces? Can we use the same detector?
Viola-Jones detector: Strengths Sliding window detection and global appearance descriptors: Simple detection protocol to implement Good feature choices critical Past successes for certain classes Implementation: OpenCV Kristen Grauman
Viola-Jones detector: Limitations Non-rigid, deformable objects are not captured well by representations assuming a fixed 2D structure, or one must assume a fixed viewpoint Objects with less-regular textures are not captured well by holistic appearance-based descriptions Kristen Grauman
Case Study II: Deformable Part-based Models Overview: Felzenszwalb et al. PAMI 2010, and winner of the PASCAL detection challenge (2008, 2009) Part-based representation: Global (root) template + deformable parts Trained from global bounding-boxes only
DPMs: Part-based representation Objects are decomposed into parts and spatial relations among parts (Fischler and Elschlager, 1973)
DPMs: Object representation Based on HOG features (1 root + 6 parts) Full model is a mixture of deformable part-based models
DPMs: Object model Object candidates obtained in a multi-scale fashion
DPMs: Object model Score of candidates
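In symbols, the score of a candidate placement, following Felzenszwalb et al. (PAMI 2010), sums filter responses minus quadratic deformation costs:

```latex
\mathrm{score}(p_0, \ldots, p_n) \;=\; \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
\;-\; \sum_{i=1}^{n} d_i \cdot \left( dx_i,\; dy_i,\; dx_i^2,\; dy_i^2 \right) \;+\; b
```

Here $F_0$ is the root filter and $F_1, \ldots, F_n$ the part filters, $\phi(H, p_i)$ are HOG features at placement $p_i$ in the feature pyramid $H$, $d_i$ are learned deformation weights, $(dx_i, dy_i)$ is the displacement of part $i$ from its anchor relative to the root, and $b$ is a bias term used when combining mixture components.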
DPMs: Learning object models Training data consists of images with labeled bounding boxes Need to learn the model structure, filters and deformation costs: Latent SVM
DPMs: Candidate generation and scoring Detection: Defined by a high-scoring root location Relies on an overall score based on the placement of the parts Efficient inference via dynamic programming / belief propagation
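For a single part along one dimension, the quantity the dynamic program maximizes looks like this (a brute-force O(L²)-per-part illustration of the objective only; the DPM code replaces this inner search with a linear-time generalized distance transform):

```python
def best_part_placement(response, anchor, dcost):
    """Best 1-D part placement: maximize, over positions p,
    filter response minus quadratic deformation cost from the anchor."""
    best_p, best_s = None, float("-inf")
    for p, r in enumerate(response):
        s = r - dcost * (p - anchor) ** 2
        if s > best_s:
            best_p, best_s = p, s
    return best_p, best_s
```

Because the parts are attached to the root in a star (tree) structure, each part's best placement can be solved independently given the root location, and the per-part maxima are simply summed into the root score.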
Deformable Part-based Models: Results Car detections
Deformable Part-based Models: Results Person detections
Other part-based representations Tree model Articulated objects
Part-based models: Pose estimation Pose estimation in video (Ramanan et al, 2007) Running on street Dancing
DPMs: Strengths Part-based representation : Flexible object model with deformation/pose Discriminatively learned with bounding box annotation Past successes in PASCAL detection challenges. Implementation: http://www.cs.berkeley.edu/~rbg/latent/ Kristen Grauman
DPMs: Limitations Manually designed feature (HOG) Pre-defined object-part structure Trainable classifier is often generic (e.g. SVM) Where next? Better classifiers? Or keep building more features? Object candidates Hand-designed feature extraction Trainable classifier Object Class Kristen Grauman
Case Study III: Regions with CNN features Overview: Girshick et al. CVPR 2014, a significant improvement on the PASCAL VOC benchmark Learned object representation based on a Convolutional Neural Network Candidate generation by region proposals (objectness)
R-CNN: Object representation Learn a feature hierarchy all the way from pixels to classifier Each layer extracts features from the output of the previous layer Train all layers jointly Pipeline: object candidates → Layer 1 → Layer 2 → Layer 3 → simple classifier
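The "each layer filters the previous layer's output" idea can be shown with a toy 1-D analogue (a sketch only, nothing like the real AlexNet-style network R-CNN uses; all names are assumptions):

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: one 'feature map' per layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Elementwise nonlinearity between layers."""
    return [max(0.0, x) for x in xs]

def feature_hierarchy(signal, kernels):
    """Stack conv + ReLU layers: each layer consumes the previous output,
    so deeper layers respond to increasingly larger input regions."""
    out = signal
    for kern in kernels:
        out = relu(conv1d(out, kern))
    return out
```

The key property mirrored here is compositionality: a unit two layers deep depends on a wider window of the input than a unit one layer deep, which is why the visualized top-activating patches grow from edges to parts to whole objects.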
Layer 1: Top-9 Patches Patches from validation images that give maximal activation of a given feature map
Layer 2: Top-9 Patches
Layer 3: Top-9 Patches
R-CNN: Candidate generation Class-generic object detection, or objectness (e.g., Alexe, Deselaers, and Ferrari, 2010): saliency, edge map, segments R-CNN uses the Selective Search method (van de Sande, Uijlings, Gevers, Smeulders, 2011)
R-CNN: Detection pipeline and results Strength: Significant improvement on public benchmarks (mAP = 53.7% on PASCAL VOC 2010 vs. ~35% with DPM) Implementation: https://github.com/rbgirshick/rcnn
Outline Object (category) detection Basic framework of object detection Case study: Viola-Jones, DPM and R-CNN Discussions Semantic scene segmentation Pixel labeling and CRF Design of CRF models Discussions
Semantic scene understanding Semantic recognition tasks: Object detection Semantic segmentation
Pixel labeling problem Problem formulation Assign predefined labels to image elements Multiple label spaces Typical settings Semantic object class Segmentation + recognition Object instance labeling Geometric class labeling etc. (Gould and He, CACM 2014)
Pixel labeling problem Surface layout (Hoiem, Efros & Hebert ICCV05; Gupta et al, ECCV 2010) Sky Non-Planar Porous Vertical Non-Planar Solid Support Planar (Left/Center/Right) Geometry + semantic segmentation (Gould et al, ICCV09)
A local solution Multiclass segmentation Image inputs A pixel-wise classifier energy function
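One way to write this local model's energy (notation assumed here, chosen to match the CRF formulation later in the talk):

```latex
E(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{i} \psi_i(y_i; \mathbf{x}),
\qquad \psi_i(y_i; \mathbf{x}) \;=\; -\log P(y_i \mid \mathbf{x})
```

Because the energy decomposes over pixels with no interaction terms, minimizing it reduces to taking the independent per-pixel argmax of the classifier, which is exactly what makes this solution "local".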
Challenges in local approaches Local cues can be ambiguous for scene analysis: locally, sky and water patches can look identical, so a local classifier may assign each a probability of 0.5 for either label. Objects/regions are correlated in a scene. (He et al, CVPR 2004)
Adding contextual information Incorporating spatial context Labels are generally spatially smooth Image inputs Local image cues Contextual information
An example: A simple smooth model Same labeling for neighboring pixels unless an intensity gradient exists Unary only + Pairwise (Shotton et al, ECCV2006)
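The "smooth unless there is an intensity gradient" pairwise term is a contrast-sensitive Potts potential, which can be sketched as (parameter values `lam` and `beta` are illustrative assumptions; in practice they are tuned per dataset):

```python
import math

def contrast_potts(label_i, label_j, intensity_i, intensity_j,
                   lam=1.0, beta=0.05):
    """Contrast-sensitive Potts pairwise potential: zero cost when
    neighboring pixels take the same label; otherwise a penalty that
    decays with the intensity difference, so label boundaries are
    cheap where the image itself has an edge."""
    if label_i == label_j:
        return 0.0
    return lam * math.exp(-beta * (intensity_i - intensity_j) ** 2)
```

Adding this term over all 4-connected neighbor pairs to the unary energy yields the simple smoothing CRF of Shotton et al.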
Conditional Random Field framework Input Output CRF model Unary potential Pairwise potential Higher-order potential Examples: surface normal, object class, depth, etc.
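In symbols, the CRF energy combining the three potential types above (notation assumed):

```latex
E(\mathbf{y} \mid \mathbf{x}) \;=\;
\sum_{i \in \mathcal{V}} \psi_i(y_i; \mathbf{x})
\;+\; \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j; \mathbf{x})
\;+\; \sum_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c; \mathbf{x})
```

with unary potentials over nodes $\mathcal{V}$ (pixels or superpixels), pairwise potentials over neighbor edges $\mathcal{E}$, and higher-order potentials over cliques $\mathcal{C}$ such as superpixel regions.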
Conditional Random Field framework Energy minimization perspective Label prediction: MAP estimation Global optimization of combinatorial problems Design choices in scene modeling Feature representation Modeling context Integrating top-down information
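As a concrete (if weak) energy minimizer, iterated conditional modes updates one node at a time to its locally cheapest label; this is a minimal sketch, not the solver used in practice (graph cuts or belief propagation are standard for these combinatorial problems):

```python
def icm(unary, neighbors, pairwise, iters=10):
    """Iterated conditional modes for E(y) = sum_i unary[i][y_i]
    + sum over neighbor pairs of pairwise(y_i, y_j).
    unary: per-node lists of label costs; neighbors: adjacency lists.
    Greedy coordinate descent: converges to a local minimum only."""
    labels = [min(range(len(u)), key=lambda l: u[l]) for u in unary]
    for _ in range(iters):
        changed = False
        for i, u in enumerate(unary):
            def cost(l):
                return u[l] + sum(pairwise(l, labels[j]) for j in neighbors[i])
            best = min(range(len(u)), key=cost)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

On a 3-node chain with a Potts pairwise term, the middle node's weak unary preference is overridden by agreement pressure from whichever neighbor it can match cheaply, which is exactly the context effect the slides motivate.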
Image features and unary potentials Manually designed features Stuff class: local features Thing class: + shape cues Global image features Deep network features (Long, Shelhamer and Darrell, arXiv 2014) (Farabet et al, PAMI 2013)
Image features and unary potentials
Pixel vs. superpixel Pixel representation: redundancy, leading to complex models Superpixel representation (over-segmentation): reduced model size; larger support for feature extraction; fast and regular-shaped, e.g. SLIC (Achanta et al, PAMI 2012), implemented in VLFeat; but yields an irregular graph and inaccurate object boundaries Better to combine both representations.
Modeling context Local context Bottom-up grouping Superpixels to labels Regional context Pairwise interaction between neighboring regions Spatial relations between regions (Galleguillos et al., ICCV07, CVPR08)
Modeling longer-range context Fully-connected CRFs (Krähenbühl and Koltun, NIPS 2011) Higher-order models (Kohli et al., CVPR08; Park & Gould, ECCV12)
Integrating object-specific cues Previous potentials: smoothing Object shape mask as a top-down cue Integrating scene classification, object detection, etc (Yao, et al. CVPR 2012)
Integrating object-specific cues Object shape mask as a top-down cue Integrating object detection with semantic video labeling (Liu, et al. WACV & CVPR 2015) NICTA Copyright 2012 From imagination to impact
Datasets and software Datasets Stanford Background Dataset ; Microsoft Research Cambridge Dataset Pascal VOC; Labelme Dataset; MSCOCO Dataset Software packages Darwin software framework http://drwn.anu.edu.au ALE (Automatic Labeling Environment) http://cms.brookes.ac.uk/staff/philiptorr/ale.htm
Summary Pixel labeling and CRF framework Design choices in semantic segmentation Image feature representation Modeling context (short-range vs. long-range) Integrating top-down information at object and scene level Ongoing research directions Deep network features and CRF framework Nonparametric label transfer Multiple modalities in scene labeling Gould and He, Scene Understanding by Labeling Pixels. Communications of the ACM, 2014