FACTS - A Computer Vision System for 3D Recovery and Semantic Mapping of Human Factors

Lucas Paletta, Katrin Santner, Gerald Fritz, Albert Hofmann, Gerald Lodron, Georg Thallinger, Heinz Mayer

2 Human Attention & Environment. Selectively attending to one aspect of the environment; study of joint attention for communication about objects. Human factors in the context of environments: study of attention, workload, memory, stress, emotion and decision making; study of wayfinding systems, marketing concepts, and usability of user interfaces and products.

3 Wearable Eye Tracking Glasses HD camera

4 Eye Tracking Glasses (SMI ETG): wearable, 30 Hz, binocular. Suite of (Wearable) Sensors: arousal (Affectiva Q); computational audition; biosensor; pulse sensor; acceleration; galvanic skin response; temperature; limb motion 6DOF. Eye Tracker (SMI RED 500): static, 500 Hz, binocular.

5 Human Factors Analysis, User Modeling, and Simulation: Wearable Multimodal Sensing; User Interaction & Human Factors Analysis; User Model; Attention Model; Simulation in 3D Model; Statistical Analysis; 3D Model

6 Motivation: 3D Gaze Estimation. Understanding behavior in task-specific environments; localising real human gaze in the 3D environment; saliency map on attended infrastructure (VRVis, JR, AIT, 2006)

7 Previous Work on 3D Attention Mapping. Munn et al. [ETRA 2008]: introduced monocular eye tracking and triangulation of 2D gaze positions from subsequent key frames of the eye-tracking system's scene video. Reconstructed only single 3D points, without reference to a complete 3D model, achieving an angular error of 3.8° (ours: 0.6°). Voßkühler et al. [ECEM 2009], Pirri et al. [CVPR 2011]: require a special, not mass-marketed stereo rig in addition to a commercial eye-tracking device. The achieved indoor accuracy is 3.6 cm at 2 m distance to the target (ours: 0.9 cm at the same distance with the proposed workflow). No reference to a 3D model.

8 Workflow: Recovery of 3D Gaze & Semantics

9 3D Model Generation: RGB-D based Map Building. Depth association by means of stereo calibration; point cloud; pose trajectory on ground plane

10 3D Model Generation: Methodology. Fully automated 3D model generation: grabbing RGB-D images of the environment with a Kinect; performing depth-based visual SLAM using both image and depth information [*]; reconstruction of a sparse point cloud consisting of 3D feature points. Each feature point is attached to a SIFT descriptor for robust data association during pose estimation. Pose estimation uses sliding window bundle adjustment, minimizing reprojection error and depth discrepancy over 2D-3D correspondences. [*] K. Pirker, G. Schweighofer, M. Rüther, H. Bischof: GPSlam: Marrying Sparse Geometric and Dense Probabilistic Visual Mapping, Proc. 22nd British Machine Vision Conference (BMVC), 2011.
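
The cost that such a bundle adjustment minimizes can be sketched for a single camera pose as follows. This is only an illustration under assumptions that are not from the slides: a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)`, a world-to-camera pose given as rotation matrix `R` and translation `t`, and a hypothetical `depth_weight` balancing pixel and depth terms. The actual optimizer and parameterization of the system are not shown in the source.

```python
def transform(R, t, p):
    """Map a world point p into the camera frame: R * p + t."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p_cam
    return (fx * x / z + cx, fy * y / z + cy)


def residual(R, t, p_world, uv_obs, depth_obs, intrinsics, depth_weight=1.0):
    """Reprojection error (2 values, pixels) plus weighted depth discrepancy."""
    p_cam = transform(R, t, p_world)
    u, v = project(p_cam, *intrinsics)
    return (u - uv_obs[0], v - uv_obs[1], depth_weight * (p_cam[2] - depth_obs))


def window_cost(R, t, correspondences, intrinsics, depth_weight=1.0):
    """Sum of squared residuals over all 2D-3D correspondences of one frame."""
    total = 0.0
    for p_world, uv_obs, depth_obs in correspondences:
        r = residual(R, t, p_world, uv_obs, depth_obs, intrinsics, depth_weight)
        total += sum(c * c for c in r)
    return total
```

A sliding window variant would sum this cost over the most recent frames and hand it to a nonlinear least-squares solver; that step is omitted here.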

11 3D Model Generation: Loop Closing. Loop closure detection through vocabulary tree search: for a query frame, the vocabulary tree returns potential loop closing candidates, with a probability for each image in the map/tree. A geometric consistency check delivers the candidate frame. Low memory and fast computation time.

12 3D Model Generation: Dense Model. For human attention analysis and realistic surface reconstruction, a dense environment model is constructed afterwards using probabilistic occupancy grid mapping. Every depth image is inserted into the voxel space using the pyramidal approach presented in [*]; a GPU implementation gives real-time performance. Surface reconstruction is handled by the standard marching cubes algorithm [**]. [*] K. Pirker, G. Schweighofer, M. Rüther, H. Bischof: Fast and Accurate Environment Modeling using Three-Dimensional Occupancy Grids, Proc. 1st IEEE/ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011. [**] W. E. Lorensen, H. E. Cline: Marching Cubes: A High Resolution 3D Surface Construction Algorithm, Computer Graphics, vol. 21, 1987, pp. 163-169.
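
The core of probabilistic occupancy grid mapping is a per-voxel log-odds update, sketched below. This is a generic textbook formulation, not the pyramidal GPU method of [*]. The class name `OccupancyGrid`, the voxel size, and the sensor model values `p_hit` and `p_miss` are assumptions chosen for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


class OccupancyGrid:
    """Sparse voxel grid storing occupancy as log-odds (0.0 = unknown)."""

    def __init__(self, voxel_size=0.05, p_hit=0.7, p_miss=0.4, clamp=10.0):
        self.voxel_size = voxel_size      # edge length in metres (assumed)
        self.l_hit = logit(p_hit)         # evidence added by a depth return
        self.l_miss = logit(p_miss)       # evidence added by a pass-through
        self.clamp = clamp                # keeps cells able to change later
        self.log_odds = {}

    def voxel_of(self, point):
        """Integer voxel index containing a 3D point."""
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def update(self, point, hit):
        """Fuse one measurement: hit=True for a surface, False for free space."""
        key = self.voxel_of(point)
        l = self.log_odds.get(key, 0.0) + (self.l_hit if hit else self.l_miss)
        self.log_odds[key] = max(-self.clamp, min(self.clamp, l))

    def probability(self, point):
        """Occupancy probability of the voxel containing the point."""
        l = self.log_odds.get(self.voxel_of(point), 0.0)
        return 1.0 / (1.0 + math.exp(-l))
```

Inserting a depth image would call `update(..., hit=True)` for each back-projected depth pixel and `update(..., hit=False)` for voxels along the ray in front of it; the ray traversal is left out of this sketch.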

13 Result: 3D Model

14 Image based Pose Estimation: Matching Process. Matching against the point cloud; results in a pose for every ETG frame

15 Image based Pose Estimation [**]. Estimate the user's pose within the previously reconstructed area. The sparse three-dimensional point cloud and its SIFT keypoints build the matching model. ETG 2D image descriptors are matched against those in the 3D point cloud (global/local). Pose estimation through a perspective-n-point algorithm [*]; RANSAC is used to eliminate matching outliers. [*] V. Lepetit, F. Moreno-Noguer, P. Fua: EPnP: An Accurate O(n) Solution to the PnP Problem, International Journal of Computer Vision, pp. 155-166, 2009. [**] K. Santner, L. Paletta, G. Fritz, H. Mayer: Visual Recovery of Saliency Maps from Human Attention in 3D Environments, Proc. ICRA 2013.

16 Image based Pose Estimation: Issues. Point cloud: 200 out of 2200 poses could not be estimated (~90% coverage). Causes: few image feature points (textureless areas); rapid head movements (motion blur)

17 6 DOF Reconstruction of Human Gaze. Given the estimated camera pose: intersection of the viewing ray with the dense environment model; fast interference detection using an oriented bounding box tree (OBB-tree) [*]. [*] S. Gottschalk, M. C. Lin, D. Manocha: OBB-Tree: A Hierarchical Structure for Rapid Interference Detection, Proc. 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996.
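
The ray/model intersection can be illustrated with a brute-force search over mesh triangles using the Möller-Trumbore test. This is a sketch under assumptions: the model is given as a plain list of triangles, and the function names `ray_triangle` and `gaze_hit` are hypothetical. The OBB-tree of [*] would replace the linear loop with a hierarchical search and is not implemented here.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: ray parameter t of the hit, or None if no hit."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:               # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv              # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv      # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None    # ignore hits behind the eye


def gaze_hit(origin, direction, triangles):
    """Nearest intersection of the viewing ray with the mesh, or None."""
    best = None
    for v0, v1, v2 in triangles:
        t = ray_triangle(origin, direction, v0, v1, v2)
        if t is not None and (best is None or t < best):
            best = t
    if best is None:
        return None
    return tuple(o + best * d for o, d in zip(origin, direction))
```

Here `origin` would be the estimated camera position and `direction` the gaze direction rotated into world coordinates by the estimated pose.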

18 Reconstruction of Human Gaze

19 Reconstruction of Human Gaze

20 Precision of Gaze Mapping: angular error max. 0.6°; Euclidean error max. 1.1 cm
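
For reference, the two error measures can be computed from an estimated 3D gaze point and a ground-truth target as sketched below. The evaluation protocol is not described on the slide, so the inputs (eye position, estimated point, true point, all in metres) and the function names are assumptions.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _norm(a):
    return math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)


def euclidean_error(estimated, truth):
    """Distance between estimated and true gaze point (same unit as input)."""
    return _norm(_sub(estimated, truth))


def angular_error_deg(eye, estimated, truth):
    """Angle in degrees between the rays eye->estimated and eye->truth."""
    a, b = _sub(estimated, eye), _sub(truth, eye)
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    # atan2 of |a x b| and a . b stays accurate for very small angles
    return math.degrees(math.atan2(_norm(cross), dot))
```

As a plausibility check, a lateral offset of 2 cm on a target 2 m straight ahead corresponds to an angular error of about 0.57°.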

21 Continuous Estimation of 3D Attention

22 Large 3D Model

23 Mapping of Gaze and Arousal in Large Environments 3D attention shop

24 Attention Guided Behaviors: Exploration and Visual Search

25 ROIs for Visual Search. Detection of regions (= objects) of interest (ROIs); annotation in 2D; annotation in 3D

26 Towards Cognition from Attention Mapping. Dwell time: gaze / points of regard (PORs) fall consecutively within a region of interest (ROI). Dwell times on an ROI indicate conscious processing of object information (e.g., ROI #1).
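
Dwell times can be derived from a per-frame sequence of ROI labels, as in the following sketch. It assumes each gaze sample has already been assigned an ROI identifier (or `None` when the POR lies outside every ROI) and uses the 30 Hz rate of the eye tracking glasses as the default; the function name and the `min_samples` threshold are hypothetical.

```python
def dwell_times(roi_labels, sample_rate_hz=30.0, min_samples=1):
    """Return (roi, duration_in_seconds) for each uninterrupted run of PORs
    on the same ROI. Samples labelled None (no ROI) end the current dwell."""
    dwells = []
    current, count = None, 0
    for label in roi_labels:
        if label == current:
            count += 1
            continue
        if current is not None and count >= min_samples:
            dwells.append((current, count / sample_rate_hz))
        current, count = label, 1
    # flush the run that is still open at the end of the recording
    if current is not None and count >= min_samples:
        dwells.append((current, count / sample_rate_hz))
    return dwells
```

Raising `min_samples` discards very short runs, which is one simple way to separate brief glances from longer dwells; the threshold that marks conscious processing is not given in the source.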

27 Related work: Context of the FACTS System. Computer vision / multisensor analysis applied to eye-tracking videos. Driver analysis: driver distraction analysis; usability engineering: mobile user behavior analysis; user modeling: eye contact behavior analysis

28 Related work: Driver Distraction Analysis. Driver with Eye Tracking Glasses; gaze tracked with optical flow analysis; projection onto reference images; collective saliency map onto environment; time analysis

29 Related work: Mobile User Behavior Analysis. Localisation of smartphone in eye-tracking videos; attention on display vs. environment; marker-free tracking of the smartphone; saliency mapping on display (image capture, rectified); behavior analysis. Smartphone eye-tracking; smartphone saliency mapping

30 Related work: Eye Contact - Behavior Analysis (Eyben, Schuller, Paletta, et al., submitted to IEEE Pervasive Computing 2013)
Subject                            A     B     C     D     Mean
Unweighted average recall (UAR)    70 %  67 %  65 %  68 %  67.4 % ±.02
Area under the ROC (AUC)           77 %  71 %  68 %  78 %  73.2 % ±.05

31 System Components

32 Summary & Conclusions. Summary: recovery of 3D gaze; automated reconstruction of a 3D model; automated mapping of gaze into a 3D model; full recovery of semantic analysis (in the frame of ROIs); system approach for various applications. Future work: multisensor positioning (accelerometer, vision); computational attention model using 3D information

Thank you for your attention Dr. Lucas Paletta +43 664 602 876 1769 lucas.paletta@joanneum.at JOANNEUM RESEARCH Forschungsgesellschaft mbh Institute for Information and Communication Technologies www.joanneum.at/digital