Semantic Recognition: Object Detection and Scene Segmentation

Semantic Recognition: Object Detection and Scene Segmentation. Xuming He (xuming.he@nicta.com.au), Computer Vision Research Group, NICTA. Robotic Vision Summer School 2015. Acknowledgement: slides from Fei-Fei Li, R. Fergus, A. Torralba, K. Grauman.

Semantic scene understanding. Semantic recognition tasks: object detection, semantic segmentation, activity recognition, scene layout estimation, scene categorization, geo-localization.

Outline. Object (category) detection: basic framework of object detection; case studies (Viola-Jones, DPM and R-CNN); discussions. Semantic scene segmentation: pixel labeling and CRF; design of CRF models; discussions.

Object detection: why is it so hard? Illumination, object pose, clutter, occlusions, intra-class appearance, viewpoint.

Object detection: why is it so hard? Realistic scenes are crowded and cluttered, and contain overlapping objects.

What works reliably today: (semi-)rigid objects, e.g. faces, cars, pedestrians, license plates and traffic signs.

Generic object detection: basic framework. Training stage: build/train the object model (choose an object representation; learn or fit the parameters of the object model). Inference/prediction stage: generate candidates in a new image; score the candidates.

Case study I: Viola-Jones face detector. Overview: a seminal approach to real-time object detection (Viola and Jones, IJCV 2004). Object/feature representation: global template with Haar wavelet features. Object model and scoring: classify object candidates into face/non-face, using a boosted combination of discriminative features as the final classifier.

Viola-Jones detector: Haar wavelets. Rectangular filters: the feature output is the difference between the pixel sums of adjacent regions. Efficiently computable with an integral image, whose value at (x,y) is the sum of the pixels above and to the left of (x,y); any rectangular sum can then be computed in constant time.
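
The constant-time rectangle sum can be illustrated with a minimal NumPy sketch. It is not taken from the slides or from OpenCV: the function names are hypothetical, and the integral image is padded with a zero row and column (an assumption made here so that no boundary checks are needed), so entry [y, x] holds the sum of the pixels strictly above and to the left.

```python
import numpy as np

def integral_image(img):
    """Padded integral image: ii[y, x] is the sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum
```

Each feature costs a fixed number of array lookups regardless of the rectangle size, which is what makes evaluating many features per window affordable.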

Viola-Jones detector: features. Which subset of these features should we use to determine whether a window contains a face? Considering all possible filter parameters (position, scale, and type), there are 180,000+ possible features associated with each 24 x 24 window. Use AdaBoost both to select the informative features and to build the classifier.

Viola-Jones detector: Boosting. Defines a classifier using an additive model, $F(x) = \alpha_1 f_1(x) + \alpha_2 f_2(x) + \alpha_3 f_3(x) + \dots$, where $F$ is the strong classifier, $x$ the feature vector, $\alpha_t$ a weight, and $f_t$ a weak classifier. Training: incrementally select weak classifiers. At each step, we select a weak learner that does well on the examples that were hard for the previous weak learners. Hardness is captured by weights attached to the training examples.

Viola-Jones detector: AdaBoost. We want to select the single rectangle feature and threshold that best separate positive (face) and negative (non-face) training examples in terms of weighted error. The resulting weak classifier thresholds the output of that feature. [Figure: outputs of a possible rectangle feature on faces and non-faces.] For the next round, reweight the examples according to the errors and choose another filter/threshold combination.
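
The select-then-reweight loop can be sketched as follows. This is textbook discrete AdaBoost with threshold stumps, written as an illustration under stated assumptions: it is not the exact weight update of the Viola-Jones paper, the function names are hypothetical, and the exhaustive threshold search is only practical for toy data.

```python
import numpy as np

def stump_predict(values, theta, polarity):
    """Weak classifier: +1 on one side of the threshold, -1 on the other."""
    return np.where(polarity * values >= polarity * theta, 1, -1)

def best_stump(features, labels, weights):
    """Return (weighted error, feature index, threshold, polarity) of the best stump."""
    best = (np.inf, None, None, None)
    for j in range(features.shape[1]):
        for theta in np.unique(features[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(features[:, j], theta, polarity)
                err = weights[pred != labels].sum()
                if err < best[0]:
                    best = (err, j, theta, polarity)
    return best

def adaboost(features, labels, n_rounds):
    """features: (n_examples, n_features) filter outputs; labels in {+1, -1}."""
    n = len(labels)
    weights = np.full(n, 1.0 / n)           # start with uniform example weights
    model = []
    for _ in range(n_rounds):
        err, j, theta, polarity = best_stump(features, labels, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        pred = stump_predict(features[:, j], theta, polarity)
        weights = weights * np.exp(-alpha * labels * pred)  # raise weight of mistakes
        weights /= weights.sum()
        model.append((alpha, j, theta, polarity))
    return model

def strong_classify(model, features):
    """Sign of the weighted sum of weak classifier outputs."""
    score = np.zeros(features.shape[0])
    for alpha, j, theta, polarity in model:
        score += alpha * stump_predict(features[:, j], theta, polarity)
    return np.where(score >= 0, 1, -1)
```

Because each round picks exactly one feature, the list of selected stumps doubles as the feature selection step mentioned in the slides.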

Viola-Jones detector: learned model. [Figure: the first two features selected.]

Viola-Jones detector: candidate generation. Sliding window at multiple scales; each window is passed to the face/non-face classifier.
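
A multi-scale sliding window can be sketched as a simple generator. The 24-pixel base size follows the 24 x 24 window mentioned earlier; the scale step of 1.25, the stride of 10% of the window size, and the function name are assumptions chosen for illustration.

```python
def sliding_windows(image_w, image_h, base=24, scale_step=1.25, stride_frac=0.1):
    """Yield (x, y, size) for square windows at increasing scales."""
    size = float(base)
    while size <= min(image_w, image_h):
        s = int(round(size))
        stride = max(1, int(round(s * stride_frac)))  # larger windows move in larger steps
        for y in range(0, image_h - s + 1, stride):
            for x in range(0, image_w - s + 1, stride):
                yield x, y, s
        size *= scale_step
```

Every yielded window would then be scored by the face/non-face classifier, which is why the number of candidates grows quickly with image size.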

Cascading classifiers for detection. Form a cascade with low false negative rates early on. Apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative.
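
The early-rejection idea can be sketched in a few lines. The stage representation, a list of (scoring function, threshold) pairs ordered from cheapest to most expensive, is an assumption made for this illustration.

```python
def cascade_classify(window, stages):
    """Run stages in order; reject at the first stage whose score is below threshold.

    Returns (accepted, number of stages evaluated).
    """
    for n_evaluated, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, n_evaluated   # discarded early, later stages never run
    return True, len(stages)
```

Since most windows in an image are clearly negative, most of them leave after the first cheap stage, and only a few reach the expensive classifiers.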

Detecting profile faces? Can we use the same detector?

Viola-Jones detector: strengths. Sliding window detection and global appearance descriptors: simple detection protocol to implement; good feature choices are critical; past successes for certain classes. Implementation: OpenCV.

Viola-Jones detector: limitations. Non-rigid, deformable objects are not captured well by representations that assume a fixed 2D structure or a fixed viewpoint. Objects with less regular textures are not captured well by holistic appearance-based descriptions.

Case study II: Deformable Part-based Models. Overview: Felzenszwalb et al., PAMI 2010; winner of the PASCAL detection challenge (2008, 2009). Part-based representation: global (root) template + deformable parts. Trained from global bounding boxes only.

DPMs: part-based representation. Objects are decomposed into parts and spatial relations among parts (Fischler and Elschlager, 1973).

DPMs: object representation. Based on HOG features (1 root + 6 parts). The full model is a mixture of deformable part-based models.

DPMs: object model. Object candidates are obtained in a multi-scale fashion.

DPMs: Object model Score of candidates
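
The scoring formula itself did not survive in this transcript. For reference, the score of a placement in the cited Felzenszwalb et al. paper has the following form, reproduced here from that paper rather than from the slide: $F_i$ are the root ($i = 0$) and part filters, $\phi(H, p_i)$ the HOG features at placement $p_i$, $d_i$ the deformation weights, $(dx_i, dy_i)$ the displacement of part $i$ from its anchor, and $b$ a bias.

```latex
\mathrm{score}(p_0, \dots, p_n)
  = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
  - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b,
\qquad
\phi_d(dx, dy) = (dx,\; dy,\; dx^2,\; dy^2)
```

The first sum rewards appearance matches of the root and parts; the second penalizes parts that drift from their anchor positions.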

DPMs: learning object models. Training data consists of images with labeled bounding boxes. The model structure, filters and deformation costs must be learned: latent SVM.

DPMs: candidate generation and scoring. A detection is defined by a high-scoring root location and relies on an overall score based on the placement of the parts. Efficient inference by dynamic programming/BP (belief propagation).
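
How a root location is scored from part placements can be sketched with a brute-force search over part displacements. This illustration deliberately swaps the efficient dynamic programming mentioned above for an exhaustive loop, and it simplifies the model by assuming that parts live at the same resolution as the root. The function names, the search radius and the data layout are hypothetical.

```python
import numpy as np

def best_part_score(part_resp, anchor_y, anchor_x, deform, radius=2):
    """Max over displacements of (filter response - deformation cost).

    deform = (a, b, c, d) weights the terms (dx, dy, dx^2, dy^2).
    """
    h, w = part_resp.shape
    best = -np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = anchor_y + dy, anchor_x + dx
            if 0 <= y < h and 0 <= x < w:
                cost = (deform[0] * dx + deform[1] * dy
                        + deform[2] * dx * dx + deform[3] * dy * dy)
                best = max(best, part_resp[y, x] - cost)
    return best

def dpm_score(root_resp, root_y, root_x, parts, bias=0.0):
    """parts: list of (response map, (anchor offset y, x), deformation weights)."""
    total = root_resp[root_y, root_x] + bias
    for part_resp, (off_y, off_x), deform in parts:
        total += best_part_score(part_resp, root_y + off_y, root_x + off_x, deform)
    return total
```

Each part independently picks its best displacement given the root, which is the structure that makes efficient inference possible in a star-shaped model.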

Deformable Part-based Models: results. [Figure: car detections.]

Deformable Part-based Models: results. [Figure: person detections.]

Other part-based representations: tree models; articulated objects.

Part-based models: pose estimation. Pose estimation in video (Ramanan et al., 2007). [Figure panels: running on street; dancing.]

DPMs: strengths. Part-based representation: flexible object model with deformation/pose; discriminatively learned with bounding-box annotation; past successes in PASCAL detection challenges. Implementation: http://www.cs.berkeley.edu/~rbg/latent/

DPMs: limitations. Manually designed feature (HOG); pre-defined object-part structure; the trainable classifier is often generic (e.g. SVM). Where next? Better classifiers? Or keep building more features? Pipeline: object candidates -> hand-designed feature extraction -> trainable classifier -> object class.

Case study III: Regions with CNN features (R-CNN). Overview: Girshick et al., CVPR 2014; significant improvement on PASCAL VOC. Learned object representation based on a Convolutional Neural Network. Candidate generation by region proposal (objectness).

R-CNN: object representation. Learn a feature hierarchy all the way from pixels to classifier. Each layer extracts features from the output of the previous layer. Train all layers jointly. Pipeline: object candidates -> layer 1 -> layer 2 -> layer 3 -> simple classifier.

Layer 1: top-9 patches. Patches from validation images that give maximal activation of a given feature map.

Layer 2: Top-9 Patches

Layer 3: Top-9 Patches

R-CNN: candidate generation. Class-generic object detection, or objectness (e.g. Alexe, Deselaers, and Ferrari, 2010). Cues: saliency, edge map, segments. R-CNN uses the Selective Search method (van de Sande, Uijlings, Gevers, Smeulders, 2011).

R-CNN: detection pipeline and results. Strength: significant improvement on public benchmarks (mAP = 53.7% on PASCAL VOC 2010 vs. ~35% with DPM). Implementation: https://github.com/rbgirshick/rcnn
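
The overall shape of a proposal-based pipeline (propose regions, extract learned features per region, apply a simple per-class classifier) can be sketched as below. This is a structural illustration only, not the code in the linked repository: `propose` and `extract_features` are hypothetical callables standing in for a region proposal method and a trained CNN, and the linear scoring layer is an assumption.

```python
import numpy as np

def detect(image, propose, extract_features, class_weights, class_bias):
    """Score every proposed box for every class.

    propose(image) -> list of (x0, y0, x1, y1) boxes  (hypothetical callable)
    extract_features(crop) -> 1-D feature vector      (hypothetical callable)
    class_weights: (n_classes, n_features); class_bias: (n_classes,)
    Returns (boxes, scores) with scores of shape (n_boxes, n_classes).
    """
    boxes = propose(image)
    feats = np.stack([extract_features(image[y0:y1, x0:x1])
                      for (x0, y0, x1, y1) in boxes])
    scores = feats @ class_weights.T + class_bias
    return boxes, scores
```

Compared with a sliding window, the proposal step keeps the number of regions small enough that an expensive feature extractor can be run on each one.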

Outline. Object (category) detection: basic framework of object detection; case studies (Viola-Jones, DPM and R-CNN); discussions. Semantic scene segmentation: pixel labeling and CRF; design of CRF models; discussions.

Pixel labeling problem. Problem formulation: assign predefined labels to image elements; multiple label spaces. Typical settings: semantic object class (segmentation + recognition), object instance labeling, geometric class labeling, etc. (Gould and He, CACM 2014).

Pixel labeling problem. Surface layout (Hoiem, Efros & Hebert, ICCV 2005; Gupta et al., ECCV 2010). [Figure legend: Sky; Non-Planar Porous; Vertical; Non-Planar Solid; Support; Planar (Left/Center/Right).] Geometry + semantic segmentation (Gould et al., ICCV 2009).

A local solution: multiclass segmentation. The image inputs are labeled by a pixel-wise classifier, which defines the energy function.

Challenges in local approaches. Local cues can be ambiguous for scene analysis. Objects/regions are correlated in a scene. [Figure: locally ambiguous patches, each labeled sky or water with P = .5.] (He et al., CVPR 2004)

Adding contextual information. Incorporating spatial context: labels are generally spatially smooth. Image inputs: local image cues + contextual information.

An example: a simple smooth model. Neighboring pixels take the same label unless an intensity gradient exists. [Figure panels: unary only; + pairwise.] (Shotton et al., ECCV 2006)
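
One common way to encode "same label unless there is an intensity gradient" is a contrast-sensitive Potts term added to the per-pixel costs. The sketch below evaluates such an energy on a 4-connected grid. The exponential weight, the parameters `lam` and `beta`, and the function names are assumptions for illustration and are not claimed to match the cited paper exactly.

```python
import numpy as np

def pairwise_weight(intensity_a, intensity_b, beta=10.0):
    """Close to 1 for similar neighbours, close to 0 across a strong edge."""
    return np.exp(-beta * (intensity_a - intensity_b) ** 2)

def energy(labels, unary, image, lam=1.0, beta=10.0):
    """Unary cost of the chosen labels plus penalties for differing 4-neighbours.

    labels: (H, W) ints; unary: (H, W, K) costs; image: (H, W) intensities.
    """
    h, w = labels.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)[None, :]
    total = unary[rows, cols, labels].sum()
    # Horizontal neighbours
    differ = labels[:, :-1] != labels[:, 1:]
    total += lam * (differ * pairwise_weight(image[:, :-1], image[:, 1:], beta)).sum()
    # Vertical neighbours
    differ = labels[:-1, :] != labels[1:, :]
    total += lam * (differ * pairwise_weight(image[:-1, :], image[1:, :], beta)).sum()
    return total
```

On a uniform image, an isolated pixel whose unary cost slightly prefers a different label raises the total energy through four pairwise penalties, so the smooth labeling wins; across a strong intensity edge the penalty almost vanishes and label changes are cheap.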

Conditional Random Field framework. Input -> CRF model -> output. Potentials: unary, pairwise, higher-order. Examples: surface normal, object class, depth, etc.

Conditional Random Field framework. Energy minimization perspective: label prediction as MAP estimation; global optimization of combinatorial problems. Design choices in scene modeling: feature representation; modeling context; integrating top-down information.
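
The formulas on these slides were lost in transcription. A standard way to write a CRF with unary, pairwise and higher-order potentials, and the corresponding MAP prediction, is given below. The notation is an assumption: $\mathbf{x}$ is the image, $\mathbf{y}$ the labeling, $i$ indexes pixels or superpixels, $(i,j)$ neighbouring pairs, and $c$ larger cliques.

```latex
\begin{align}
E(\mathbf{y} \mid \mathbf{x}) &= \sum_{i} \psi_i(y_i; \mathbf{x})
  + \sum_{(i,j)} \psi_{ij}(y_i, y_j; \mathbf{x})
  + \sum_{c} \psi_c(\mathbf{y}_c; \mathbf{x}) \\
P(\mathbf{y} \mid \mathbf{x}) &= \frac{1}{Z(\mathbf{x})} \exp\{-E(\mathbf{y} \mid \mathbf{x})\} \\
\mathbf{y}^{*} &= \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})
  = \arg\min_{\mathbf{y}} E(\mathbf{y} \mid \mathbf{x})
\end{align}
```

Because $Z(\mathbf{x})$ does not depend on $\mathbf{y}$, MAP estimation reduces to minimizing the energy, which is a combinatorial problem over all labelings.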

Image features and unary potentials. Manually designed features: stuff classes use local features; thing classes add shape cues; global image features. Deep network features (Long, Shelhamer and Darrell, arXiv 2014; Farabet et al., PAMI 2013).

Image features and unary potentials

Pixel vs. superpixel. Pixel representation: redundancy, leading to complex models. Superpixel representation: over-segmentation; reduced model size; larger support for feature extraction; fast and regular-shaped, e.g. SLIC (Achanta et al., PAMI 2012), implemented in VLFeat; but irregular graph and inaccurate object boundaries. Better to combine both representations.
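
The reduction in model size can be illustrated by pooling per-pixel class costs over a given over-segmentation. The sketch assumes the superpixel map has already been computed by some method (for example SLIC) and is supplied as an integer array; the function names are hypothetical.

```python
import numpy as np

def pool_unary_over_superpixels(unary, segments):
    """Average per-pixel class costs inside each superpixel.

    unary: (H, W, K) per-pixel costs; segments: (H, W) integer ids 0..S-1.
    Returns an (S, K) array: one cost vector per superpixel.
    """
    k = unary.shape[-1]
    ids = segments.ravel()
    n_segments = ids.max() + 1
    sums = np.zeros((n_segments, k))
    np.add.at(sums, ids, unary.reshape(-1, k))      # accumulate costs per segment
    counts = np.bincount(ids, minlength=n_segments)[:, None]
    return sums / np.maximum(counts, 1)

def labels_from_superpixels(pooled, segments):
    """Give every pixel the lowest-cost label of its superpixel."""
    return pooled.argmin(axis=1)[segments]
```

The model now has one variable per superpixel instead of one per pixel, at the price that a superpixel straddling an object boundary forces a single label on both sides.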

Modeling context. Local context: bottom-up grouping; superpixels to labels. Regional context: pairwise interaction between neighboring regions; spatial relations between regions (Galleguillos et al., ICCV 2007, CVPR 2008).

Modeling longer-range context. Fully-connected CRFs (Krähenbühl and Koltun, NIPS 2011). Higher-order models (Kohli et al., CVPR 2008; Park & Gould, ECCV 2012).
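
For context, the fully-connected CRF of Krähenbühl and Koltun connects every pair of pixels with a pairwise potential built from Gaussian kernels. The form below is recalled from that paper rather than taken from the slide: $p_i$ is the position and $I_i$ the color of pixel $i$, $\mu$ is a label compatibility function, and $w^{(1)}, w^{(2)}, \theta_\alpha, \theta_\beta, \theta_\gamma$ are parameters.

```latex
\psi_{ij}(y_i, y_j) = \mu(y_i, y_j) \left[
  w^{(1)} \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2}
                       -\frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2} \right)
  + w^{(2)} \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\gamma^2} \right)
\right]
```

The first kernel encourages nearby pixels of similar color to share a label; the second is a pure smoothness term that removes small isolated regions.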

Integrating object-specific cues. Previous potentials: smoothing. Object shape mask as a top-down cue. Integrating scene classification, object detection, etc. (Yao et al., CVPR 2012).

Integrating object-specific cues. Object shape mask as a top-down cue. Integrating object detection with semantic video labeling (Liu et al., WACV & CVPR 2015). NICTA Copyright 2012. From imagination to impact.

Datasets and software. Datasets: Stanford Background Dataset; Microsoft Research Cambridge Dataset; Pascal VOC; LabelMe Dataset; MS COCO Dataset. Software packages: Darwin software framework, http://drwn.anu.edu.au; ALE (Automatic Labeling Environment), http://cms.brookes.ac.uk/staff/philiptorr/ale.htm

Summary. Pixel labeling and CRF framework. Design choices in semantic segmentation: image feature representation; modeling context (short-range vs. long-range); integrating top-down information at object and scene level. Ongoing research directions: deep network features and CRF framework; nonparametric label transfer; multiple modalities in scene labeling. Reference: Gould and He, Scene Understanding by Labeling Pixels, Communications of the ACM, 2014.