Using geometry and related things

Region labels + Boundaries and objects Stronger geometric constraints from domain knowledge Reasoning on aspects and poses 3D point clouds Qualitative More quantitative more precise Explicit

[D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75(1):151 172, 2007] What level of representation? How qualitative? What type of training information is available? Assumptions (camera geometry, etc )? Learning from image features to depth + MRF: A. Saxena, et al.. 3-D depth reconstruction from a single still image. IJCV, 76, 2007. Make3D: Learning 3D Scene Structure from a Single Still Image: A. Saxena, et al. TPAMI, 2010. Stage classes: Nedovic, V., Smeulders, A., Redert, A., Geusebroek, J.: Stages as models of scene geometry. In: PAMI (2010)

Next. Can coarse surface labels be used for improving object recognition and scene analysis performance through better geometric reasoning?

Object Detection Surface Estimates Viewpoint Prior Local Car Detector From geometry to objects and back? Distributions versus decisions? Local Ped Detector D. Hoiem, A. Efros, M. Hebert. Putting objects in perspective. IJCV 2009 S.Y. Bao, M. Sun, S.Savarese. Toward Coherent Object Detection And Scene Layout Understanding. CVPR 2010. B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool. Dynamic 3D Scene Analysis from a Moving Vehicle. CVPR07

Is a more precise representation possible? Boundaries, interposition, relative depth ordering. Still low-level: Can we combine reasoning about semantic labels Region labels + More geom. relations and semantic labels Stronger geometric constraints from domain knowledge Reasoning on aspects and poses 3D point clouds Qualitative More quantitative more precise Explicit

Iterative refinement Iter 1 Iter 2 Toward Final true integration of geometric D. Hoiem, A. A. Efros, and M. Hebert. Closing the loop on scene interpretation. In CVPR, 2008 semantic cues: How to beat the intractable Global nature optimization of the problem? What features/cues can be used? F(D I,L) = p 1(d p ) + pqr 2(d p,d q,d r ) + p 3(d p,d g ) F( I,L,S) = p 1( i ) + i 2( i ) + ij 3( i, j) i T x = 1 j T x = 1 B. Liu, S. Gould, D. Koller. Single image depth estimation from predicted semantic labels. CVPR 2010 S. Gould, R. Fulton, D. Koller. Decomposing a scene into geometric and semantically consistent regions. ICCV 2009

Still mostly bottom-up classification approach No use of domain constraints or constraints governing the physical world Region labels + Boundaries and objects Stronger geometric constraints from domain knowledge Reasoning on aspects and poses 3D point clouds Qualitative More quantitative more precise Explicit

Lines Faces Score How to generate and search through hypotheses (in a tractable manner)? How to evaluate score? How to avoid early decisions? Classifiers f(x,y,w) = w T (x,y) D. Lee, T. Kanade, M. Hebert. Geometric Reasoning for Single Image Structure Recovery. CVPR09. V. Hedau, D. Hoiem, D.Forsyth, Recovering the Spatial Layout of Cluttered Rooms, IEEE International Conference on Computer Vision (ICCV), 2009. How to represent constraints in a more general way? H. Wang, S. Gould, D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. ECCV 2010. Learned weight vector Feature vector measuring agreement between lines, faces and labels

Region labels + Boundaries and objects Stronger geometric constraints from domain knowledge + more constraints 3D point clouds Qualitative More quantitative more precise Explicit

Finite volume Spatial exclusion Containment Stability Contact Proximity.

Search through hypothesis space Input image Line segments and Vanishing points Room hypotheses Reject invalid configurations Geometric context Orientation map Object hypotheses Compatibility of image data with geometric configuration Penalty term for incompatible configurations f x, y w T x, y w T y Features from image (surface labels, vanishing points, etc.) Hypothesis: Scene layout+ object hypothesis D. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. Advances in Neural Information Processing Systems (NIPS), Vol. 24, 2010.

Surface Layout Density Map Bag of Segments Frontal Front-Right Front-Left Left-Right Left-Occluded Right-Occluded Porous Solid Catalogue

above sky above above above Medium Medium Medium Medium Infront High Infront Original Image Pointsupported Pointsupported supported High Ground 3D Parse Graph supported A. Gupta, A. Efros, and M. Hebert. Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics. ECCV 2010. http://www.cs.cmu.edu/~abhinavg/blocksworld

Direct search through hypothesis space Sampling Sample room boundary L. Del Pero, J. Guan, E. Brau, J. Schlecht, K. Barnard. Sampling Bedrooms. CVPR 2011. Diffusion moves: Sample camera Change r = (x,y,z,w,h,l, ) Change c = f Sample object parameters Sample over a block edge only Change o = (x,y,z,w,h,l) Jump moves:

Direct search through hypothesis space Sampling (Constrained) object detection P(O 1,..,O N,L,H I) = P(H)P(L H,I) i P(O i L,H,I) V. Hedau, D. Hoiem, D. Forsyth, Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry, European Conference on Computer Vision (ECCV), 2010.

Direct search through hypothesis space Sampling Object detection Grammars Truth Max Window Wall Balcony Door Roof = L. Simon, O. Teboul, P. Koutsourakis and N. Paragios. Random Exploration of the Procedural Space for Single-View 3D Modeling of Buildings. International Journal of Computer Vision (IJCV), 2010. O. Teboul, I. Kokkinos, P. Koutsourakis, L. Simon and N. Paragios. Shape Grammar Parsing via Reinforcement Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011.

Direct search through hypothesis space Sampling Object detection Grammars What constraints? How to combine low level classifiers with higher-level reasoning? What are the right tools to search through and score hypotheses? How to represent partial interpretations without early commitment?

Region labels Qualitative + Boundaries and objects Stronger geometric constraints from domain knowledge More quantitative more precise + sparse/partial 3D data + more constraints 3D + point other clouds constraints + (large) prior data Explicit G. Tsai et al. Realtime Indoor Scene Understanding using Bayesian Filtering with Motion Cues. ICCV 2011.

What level of representation? Do we need explicit parse of the input or how far can we go with associations? Hierarchical, region labels, associations,... How to incorporate knowledge/context external to the input image? Task, geometry, contextual info (scene type, location), text,... Should we use global models vs. sequences of simpler models? Is the problem too hard as posed, i.e., intractable? What are the right definitions of actions, activities, behaviors? How to combine temporal (actions) and spatial (scenes, objects) information effectively? What should be evaluated and what stage? bounding boxes, pixelwise labels, 3D models, actions/behaviors, predictions?