Seminars, XXIII cycle
Tracking in 3D video streams
Ing.
Tutors: Prof. Tullio Salmon Cinotti, Prof. Luigi Di Stefano
The Tracking problem
Detection
  Object model
  Track initiation
  Track termination
Tracking
  Object motion model
  Model update
Multi-target tracking / Data association
  Occlusion handling
  Combinatorial problem (exponential complexity with a growing number of targets)
2D Tracking
State-of-the-art performance in 2D videos
Main idea: Tracking-by-Detection
  A reliable detector is used in every frame: Implicit Shape Model (ISM), Histogram of Oriented Gradients (HOG), etc.
  Tracking is reformulated as data association across frames
Limitations
  People pose
  Occlusions & clutter
  Illumination changes
  Output is 2D
Leibe et al., IJCV 08; Breitenstein et al., ICCV 09
Why not just one image?
From a single view it is not possible to unambiguously reconstruct the 3D structure of the scene.
This is due to the perspective projection, which maps points of 3D space onto a 2D space (the image plane of the camera): all the points along one viewing ray fall on the same image point.
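The ambiguity can be made concrete with a minimal sketch of an ideal pinhole camera. The function name `project`, the focal length value and the two sample points are illustrative assumptions, not taken from the slides.

```python
def project(point, focal_length=1.0):
    """Ideal pinhole projection of a 3D point (X, Y, Z) onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Two different 3D points on the same viewing ray (hypothetical values).
near = (1.0, 0.5, 2.0)
far = tuple(3.0 * c for c in near)  # three times farther from the camera

# Both points land on the same image coordinates, so depth cannot be
# recovered from this single view.
same_pixel = project(near) == project(far)
```

Here `same_pixel` is true although the two points are four units apart in depth, which is why a second view or a depth sensor is needed.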
3D acquisition devices
3D data and previous work
Most exploited approach
  Camera calibrated with respect to the ground plane
  People detected with background subtraction
  2D projection of 3D data
  Tracking in the 2D plan view
Limitations
  Assumes a static camera
  Requires a background model
  Requires calibration
  Bottom-up approach
Beymer & Konolige 2000; Iocchi & Bolles, ICIP 2005; Harville & Li, CVPR 04; Yous et al., ECCV WS 2008
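The "2D projection of 3D data" step can be sketched as a plan-view occupancy map. This is only an illustration of the general idea, not the method of any of the cited works; it assumes the points are already foreground and already expressed in a ground-aligned frame (X and Z on the ground plane, Y the height). The function name, the grid parameters and the sample points are hypothetical.

```python
def plan_view_occupancy(points, cell_size, x_range, z_range, min_height=0.0):
    """Count 3D points per ground-plane cell (Y is the height above the ground)."""
    cols = int((x_range[1] - x_range[0]) / cell_size)
    rows = int((z_range[1] - z_range[0]) / cell_size)
    grid = [[0] * cols for _ in range(rows)]
    for x, y, z in points:
        if y < min_height:
            continue  # discard points below the ground plane (noise)
        col = int((x - x_range[0]) // cell_size)
        row = int((z - z_range[0]) // cell_size)
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] += 1  # points outside the mapped area are ignored
    return grid


# Hypothetical foreground points: two vertical clusters plus two outliers.
points = [
    (0.1, 0.2, 1.2), (0.1, 0.9, 1.2), (0.1, 1.6, 1.2),  # first person
    (-1.3, 0.5, 3.1), (-1.3, 1.4, 3.1),                 # second person
    (0.0, -0.1, 1.0),                                   # below the ground
    (5.0, 1.0, 1.0),                                    # outside the area
]
grid = plan_view_occupancy(points, cell_size=0.5,
                           x_range=(-2.0, 2.0), z_range=(0.0, 4.0))
```

Each person collapses into a compact peak of the map, and tracking then operates on those peaks in the 2D plan view.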
My contribution
Design an enhanced people detector, exploiting the full potential of 3D data
  Toward this goal, propose a new 3D descriptor of local shape suitable for our task
  Design a theoretically sound and adaptive way to merge 2D and 3D information for the purpose of people detection (i.e. object category recognition)
Plug this into a tracking framework conceived for time-critical, online applications
  No global optimization
  More emphasis on tracking than on data association
  Recursive Bayesian Estimation (RBE) methods
  Enhance RBE via machine learning
3D shape descriptor
Our proposal, dubbed HON: Histogram of Normals
Designed to be
  Fast
  Robust to noise and clutter
  Robust to sampling density variations
Definition of a new, robust way to compute an invariant local reference frame
Inspired by successful approaches for 2D texture description (Lowe, IJCV 04)
Figure label: cos θ
HON: Results on noise and clutter (plots: recall vs. 1-precision)
HON: Results on sampling density (plots: recall vs. 1-precision)
Recursive Bayesian Estimation
RBE provides a theoretically sound conceptual solution to the problem of state estimation in the presence of uncertainty.
RBE is widely employed in the context of visual tracking and motion analysis.
In this framework the system is completely specified by a first-order Markov model composed of
  a transition model in state space: $x_k = f_k(x_{k-1}, \upsilon_k)$, i.e. $p(x_k \mid x_{k-1})$
  a measurement model: $z_k = h_k(x_k, \eta_k)$, i.e. $p(z_k \mid x_k)$
  an initial state $x_0$, with prior $p(x_0)$
Practical instantiations
  the Kalman filter (linear & Gaussian scenario, optimal solution)
  the particle filter (non-linear / non-Gaussian scenario, sub-optimal solution)
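For reference, the conceptual solution mentioned above is the standard two-step recursion, written here with the transition, measurement and prior densities of the model. The notation $z_{1:k}$ for the measurements collected up to time $k$ is the usual convention and is assumed here.

```latex
% Prediction: propagate the previous posterior through the transition model
p(x_k \mid z_{1:k-1}) = \int p(x_k \mid x_{k-1}) \, p(x_{k-1} \mid z_{1:k-1}) \, dx_{k-1}

% Update: correct the prediction with the new measurement via Bayes' rule
p(x_k \mid z_{1:k}) =
  \frac{p(z_k \mid x_k) \, p(x_k \mid z_{1:k-1})}
       {\int p(z_k \mid x_k) \, p(x_k \mid z_{1:k-1}) \, dx_k}
```

The recursion starts from $p(x_0)$. The Kalman filter evaluates both steps in closed form, while the particle filter approximates them with weighted samples.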
Motivations
A major limitation of RBE is the requirement to specify the transition model a priori.
In most cases this model is unknown: it is either selected empirically from a restricted set of standard models or learned off-line.
Neither approach allows the transition model to change over time, although this would be beneficial, and neither the conceptual solution nor the solving algorithms require the model to stay fixed.
Proposal
In the case of a completely observable system, we propose to learn the transition model on-line.
In such a case, the transition model is directly related to the dynamics exhibited by the measurements. Hence, it is possible to exploit their temporal evolution in order to learn the function $f_{k,z_{1:k-1}}(x_{k-1}, \upsilon_k)$ and, implicitly, the PDF $p_{k,z_{1:k-1}}(x_k \mid x_{k-1})$.
Furthermore, we propose to learn the motion model using a Support Vector Machine in ε-regression mode (SVR)
  SVR theoretical properties minimize the risk of overfitting
  SVR can learn non-linear mappings effectively via the kernel trick
  SVR can be trained very efficiently by exploiting SMO (Sequential Minimal Optimization)
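As a reminder of what ε-regression optimises, the standard ε-SVR primal problem is given below. The pairing of training samples is an assumption made for illustration: each input $u_i$ is a past state (a measurement, since the system is completely observable) and each target $y_i$ is its successor. The symbols $w$, $b$, $\phi$, $C$, $\xi_i$ and $\xi_i^*$ are the usual SVR notation and do not appear in the slides.

```latex
\min_{w,\, b,\, \xi,\, \xi^*} \;
  \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right)

\text{subject to} \quad
\begin{aligned}
  y_i - \langle w, \phi(u_i) \rangle - b &\le \varepsilon + \xi_i \\
  \langle w, \phi(u_i) \rangle + b - y_i &\le \varepsilon + \xi_i^* \\
  \xi_i,\ \xi_i^* &\ge 0, \qquad i = 1, \dots, n
\end{aligned}
```

Errors smaller than $\varepsilon$ cost nothing, the norm term keeps the learned function flat (which is what limits overfitting), and $\phi$ is only ever accessed through a kernel, which is what allows non-linear mappings.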
Support Vector Kalman
In the linear & Gaussian scenario, RBE reduces to the Kalman filter.
In this case, the PDF we want to estimate becomes
  $p_{k,z_{1:k-1}}(x_k \mid x_{k-1}) = N(x_k; \mu, \Sigma) = N(x_k; F_k x_{k-1}, Q_k)$
Therefore, we use SVRs to estimate
  the transition matrix, $F_k$
  the associated noise covariance matrix, $Q_k$
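The structure of such a filter can be sketched for a scalar, completely observable state ($H = 1$). To keep the example self-contained, ordinary least squares over a sliding window is used as a stand-in for the SVR, so this is not the estimator described in the slides; the function names, the window length, the measurement noise `R` and the floor on `Q` are all illustrative assumptions.

```python
def estimate_transition(history):
    """Fit x_k ~ F * x_{k-1} by least squares over a window of past states.

    Returns (F, Q): the scalar transition and the mean squared residual.
    Least squares is a stand-in here; the slides use SVR for this step.
    """
    prev, curr = history[:-1], history[1:]
    denom = sum(p * p for p in prev)
    F = sum(p * c for p, c in zip(prev, curr)) / denom if denom > 0 else 1.0
    residuals = [c - F * p for p, c in zip(prev, curr)]
    Q = sum(r * r for r in residuals) / len(residuals)
    return F, Q


def adaptive_kalman(measurements, R=0.01, window=5, q_floor=1e-4):
    """Scalar Kalman filter (H = 1) whose F_k and Q_k are re-fitted each step."""
    x, P = measurements[0], R  # initialise from the first measurement
    estimates, models = [x], []
    for k in range(1, len(measurements)):
        # Only measurements before time k are used, i.e. z_{1:k-1}.
        past = measurements[max(0, k - window):k]
        F, Q = estimate_transition(past) if len(past) >= 2 else (1.0, q_floor)
        Q = max(Q, q_floor)  # keep the filter responsive
        # Prediction with the transition model learned on-line
        x_pred = F * x
        P_pred = F * P * F + Q
        # Update with the new measurement
        K = P_pred / (P_pred + R)
        x = x_pred + K * (measurements[k] - x_pred)
        P = (1.0 - K) * P_pred
        estimates.append(x)
        models.append((F, Q))
    return estimates, models


# Hypothetical noise-free sequence with multiplicative dynamics x_k = 1.05 x_{k-1}.
measurements = [1.05 ** k for k in range(30)]
estimates, models = adaptive_kalman(measurements)
```

On this sequence the fitted transition settles on the true factor as soon as two past measurements are available, with no motion model specified in advance. Replacing `estimate_transition` with an SVR-based estimator would keep the rest of the loop unchanged.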
Simulations
Mean Shift Tracking
Future work
Design an enhanced people detector, exploiting the full potential of 3D data
  Toward this goal, propose a new 3D descriptor of local shape suitable for our task
  Design a theoretically sound and adaptive way to merge 2D and 3D information for the purpose of people detection (i.e. object category recognition)
Plug this into a tracking framework conceived for time-critical, online applications
  No global optimization
  More emphasis on tracking than on data association
  Recursive Bayesian Estimation (RBE) methods
  Enhance RBE via machine learning