A Prototype For Eye-Gaze Corrected Video Chat on Graphics Hardware Maarten Dumont, Steven Maesen, Sammy Rogmans and Philippe Bekaert
Introduction Traditional webcam video chat: No eye contact. No extensive context information. Long-term goal: full immersive augmented environment where participants can communicate and cooperate as if they were in the same room.
Overview Related Work System Architecture Preprocessing View Interpolation Joint View/Depth Refinement Movement Analysis Eye Tracking Networking Results Conclusion
Related Work Implemented on commodity CPUs, low framerate [Criminisi et al., 2003] Expensive dedicated hardware [Baker et al., 2002] Impractical camera setup [Schreer et al., 2001] Optimize parts of the application instead of end-to-end performance: Multi-camera video coding [Chien et al., 2003; Guo et al., 2005] Real-time view synthesis [Yang and Pollefeys, 2003; Geys and Van Gool, 2004; Nozick et al., 2006]
Our Solution Peer-to-peer eye-gaze corrected video chat. N input images I 1, …, I N are fetched from N cameras C 1, …, C N that are closely aligned along the screen. A virtual camera viewpoint is interpolated to restore eye contact. Implemented on the GPU for real-time performance.
System Architecture Four consecutive GPU processing modules: Preprocessing: lens correction and background/foreground segmentation. View Interpolation: interpolate the eye-gaze corrected view. Joint View/Depth Refinement of the interpolated view. Movement Analysis: avoid heavy constraints on the user's movements. Concurrent CPU processing: Eye Tracking Networking Stand-alone processing modules.
Preprocessing Radial distortion for each input image is corrected according to the Brown-Conrady distortion model [Brown, 1966]. [Figure: with radial distortion vs. radial distortion corrected]
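As an illustration, the radial term of the Brown-Conrady model can be sketched as follows (a minimal Python sketch; the coefficient names k1, k2 and the distortion center (cx, cy) are assumptions, and the actual module applies the correction per-pixel on the GPU):

```python
def undistort(x, y, k1, k2, cx=0.0, cy=0.0):
    """Apply the radial scaling of the Brown-Conrady model to a
    normalized image point (x, y). k1, k2 are the first two radial
    distortion coefficients; (cx, cy) is the distortion center
    (hypothetical parameter names). Whether this maps distorted to
    corrected coordinates or the reverse depends on the convention."""
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy                    # squared radius from center
    scale = 1.0 + k1 * r2 + k2 * r2 * r2      # radial scaling factor
    return cx + dx * scale, cy + dy * scale
```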
Preprocessing Each input image I i is segmented into a binary foreground/background silhouette. The consecutive processing modules will rely on these silhouettes. Background subtraction:
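A minimal background-subtraction sketch (CPU/NumPy version; the per-pixel color-difference threshold is a hypothetical parameter, whereas the actual module runs on the GPU):

```python
import numpy as np

def segment_foreground(frame, background, threshold=30):
    """Binary foreground/background silhouette by per-pixel
    background subtraction. frame, background: HxWx3 uint8 arrays;
    threshold is a hypothetical color-difference cutoff."""
    # Compute absolute per-channel difference in a signed dtype
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    # A pixel is foreground if its summed channel difference is large
    return diff.sum(axis=2) > threshold  # HxW boolean silhouette
```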
Preprocessing Greenscreening: Very precise silhouettes make it easy to develop the subsequent processing modules and thereby reduce the design-space complexity.
View Interpolation Interpolate an image I v (and consistent depth map Z v ) as seen with a virtual camera C v that is positioned behind the screen. The image I v is computed as if the camera C v captured it through a completely transparent screen and is thus eye-gaze corrected.
View Interpolation Plane sweep approach [Yang et al., 2002]. The 3D space is discretized into M planes {D 1, …, D M} parallel to the image plane of the virtual camera C v.
View Interpolation For each plane D j, every pixel f v of the virtual camera image I v is backprojected onto the plane D j and reprojected into the input images I i. For each pixel on each plane D j, the interpolated color Ψ and the matching cost Κ are computed, and the best color consensus (i.e. minimum cost) is selected. N cameras are used to interpolate the color, instead of stereo-interpolation as in [Yang et al., 2002].
View Interpolation However, points on the plane D j that project outside a foreground silhouette in at least one of the input images are immediately rejected. This leverages both: Speed: all further operations on rejected points are automatically discarded by the GPU hardware. Quality: segmentation noise will, with high probability, not be present in all N cameras.
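The per-pixel plane sweep with silhouette rejection can be sketched as follows (a simplified CPU version; the sampler abstraction and the variance-based matching cost are assumptions standing in for the actual projective texturing on the GPU):

```python
import numpy as np

def plane_sweep_pixel(samplers, depths):
    """Plane-sweep color/depth estimation for one virtual-view pixel.

    samplers: list of N callables; samplers[i](d) returns the RGB color
    that input camera i observes when the pixel is backprojected onto
    depth plane d and reprojected into image i, or None if the
    projection falls outside the foreground silhouette.
    depths: the M candidate depth planes D_1 .. D_M.

    Returns (best_color, best_depth): the color consensus Psi with the
    minimum matching cost K (here: variance across the N samples)."""
    best = (None, None, float("inf"))
    for d in depths:
        colors = [s(d) for s in samplers]
        if any(c is None for c in colors):   # silhouette rejection
            continue
        colors = np.array(colors, dtype=float)
        mean = colors.mean(axis=0)           # color consensus Psi
        cost = ((colors - mean) ** 2).sum()  # matching cost K
        if cost < best[2]:
            best = (mean, d, cost)
    return best[0], best[1]
```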
View Interpolation Result: interpolated eye-gaze corrected image I v and joint depth map Z v.
Joint View/Depth Refinement The interpolated view still contains visually disturbing artifacts. Interpolated image I v and depth map Z v are jointly linked. Errors are even more apparent in the depth map Z v. Detect and restore errors in the depth map Z v, then restore the link between I v and Z v by recoloring Z v.
Joint View/Depth Refinement Two types of errors: Erroneous patches Speckle noise Due to illumination changes, partially occluded areas and natural homogeneous texturing of the human face. [Figure: patch error vs. noise error]
Joint View/Depth Refinement Erroneous patches solution: Naive Gaussian smoothing to remove patches does not work. Photometric outlier detection algorithm that (a) detects and (b) restores erroneous patches in the depth map Z v. [Figure: Gaussian smoothing vs. outlier detection]
Joint View/Depth Refinement Erroneous patch filtering: (a) Centers of patches are detected and morphologically grown from center to border.
Joint View/Depth Refinement Erroneous patch filtering: (b) Patches are filled with reliable depth values from their neighbourhood by a reverse morphological grow from border to center.
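The reverse morphological grow of step (b) can be sketched as an iterative fill from the patch border inward (a simplified CPU version; averaging the valid 4-neighbours is an assumption about how "reliable depth values" are combined):

```python
import numpy as np

def fill_from_border(depth, invalid):
    """Restore an erroneous patch in the depth map by growing reliable
    depth values inward from the patch border (reverse morphological
    grow). depth: HxW float array; invalid: HxW boolean patch mask."""
    depth, invalid = depth.copy(), invalid.copy()
    h, w = depth.shape
    while invalid.any():
        progress = False
        for y, x in zip(*np.nonzero(invalid)):
            # Collect depth values from valid 4-neighbours
            vals = [depth[yy, xx]
                    for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= yy < h and 0 <= xx < w and not invalid[yy, xx]]
            if vals:
                depth[y, x] = float(np.mean(vals))
                invalid[y, x] = False  # pixel becomes reliable
                progress = True
        if not progress:
            break  # isolated region with no valid border: give up
    return depth
```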
Joint View/Depth Refinement Speckle noise solution: Large homogeneous texture regions of the human face cause the depth map to contain spatial high-frequency speckle noise that can be smoothed with a Gaussian low-pass filter. Gaussian smoothing sacrifices the geometric correctness of the depth map, but enhances perceptual visual quality.
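A minimal sketch of such a Gaussian low-pass on the depth map (separable CPU version; the sigma and radius defaults are hypothetical):

```python
import numpy as np

def gaussian_smooth_depth(depth, sigma=1.5, radius=3):
    """Separable Gaussian low-pass filter for the depth map Z_v.
    Removes high-frequency speckle noise at the cost of geometric
    accuracy (sigma and radius are hypothetical defaults)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x * x / (2.0 * sigma * sigma))
    kernel /= kernel.sum()  # normalize so constant regions are preserved
    # Horizontal then vertical pass; edge padding keeps the output size
    pad = np.pad(depth, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, "valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, "valid"), 0, rows)
```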
Joint View/Depth Refinement Joint View/Depth Refinement example result.
Movement Analysis Because the position of the user relative to the screen is not known, a large depth range has to be scanned. Problems: High probability of mismatches (bad visual quality). Real-time performance is endangered.
Movement Analysis Solution: limit the effective depth range to narrowly encompass the user's head: Lower probability of mismatches (much better visual quality). Improves real-time performance.
Movement Analysis Problem: a small depth range heavily constrains the user's movements. Solution: dynamically adjust the depth range to track the user's head and narrowly encompass it at all times.
Movement Analysis How: the peak of a Gaussian distribution G(μ,σ) fitted to the depth map histogram indicates the position of the user. Dynamically place the depth range around this peak. Three separate cases: Forward: User moves forward and exits the active scanning range. Peak towards the front of the histogram. Stable: User remains stationary. Clear peak in the middle. Backward: User moves backward and exits the active scanning range. Peak towards the back of the histogram.
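Fitting G(μ,σ) and re-centering the scan range can be sketched as follows (a minimal CPU sketch; estimating μ and σ as the mean and standard deviation of the foreground depth samples, and the margin factor k, are assumptions):

```python
import numpy as np

def update_depth_range(depth_map, foreground, k=3.0):
    """Re-center the plane-sweep depth range on the user's head.
    A Gaussian G(mu, sigma) is fitted to the foreground depth samples
    (here via their mean and standard deviation); the new range
    [mu - k*sigma, mu + k*sigma] narrowly encompasses the head.
    k is a hypothetical margin factor."""
    samples = depth_map[foreground]      # use foreground pixels only
    mu, sigma = samples.mean(), samples.std()
    return mu - k * sigma, mu + k * sigma
```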
Movement Analysis The histogram can be efficiently implemented on the GPU. Optimizations: Fewer bins. Fewer samples. The approximated peak location remains virtually the same: a quality vs. complexity trade-off.
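The fewer-bins/fewer-samples optimization can be sketched like this (a CPU sketch; the bin count and subsampling stride are hypothetical values):

```python
import numpy as np

def approx_depth_histogram(depth_map, bins=16, stride=4):
    """Approximate depth histogram using fewer bins and subsampled
    pixels (every stride-th row and column), trading accuracy for
    speed; the peak location barely moves for coarse subsampling."""
    samples = depth_map[::stride, ::stride].ravel()
    hist, edges = np.histogram(samples, bins=bins)
    peak = edges[hist.argmax()]  # left edge of the most populated bin
    return hist, peak
```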
Eye Tracking The virtual camera C v needs to look directly into the user's eyes at all times to ensure eye contact. Eye tracking module concurrent on the CPU: Face and eye candidates are detected in every input image. Eye candidates are used to triangulate the 3D positions of the eyes. The 3D eye position is expressed in a coordinate system relative to the screen.
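The triangulation step can be sketched with a standard midpoint method (a common linear triangulation; the paper's exact method may differ, and parallel rays are not handled):

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    """Triangulate a 3D eye position from two camera rays.
    o1, o2: camera centers; d1, d2: unit viewing directions toward the
    detected eye candidate in each image. Returns the midpoint of the
    shortest segment connecting the two rays. Assumes non-parallel,
    unit-length directions."""
    o1, d1, o2, d2 = map(np.asarray, (o1, d1, o2, d2))
    # Solve for t1, t2 minimizing |o1 + t1*d1 - (o2 + t2*d2)|
    b = o2 - o1
    c = d1 @ d2
    denom = 1.0 - c * c
    t1 = (b @ d1 - (b @ d2) * c) / denom
    t2 = ((b @ d1) * c - b @ d2) / denom
    return 0.5 * ((o1 + t1 * d1) + (o2 + t2 * d2))
```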
Networking Only the interpolated image I v (instead of N images) and the eye coordinates are sent over the network: Minimal network communication allows for real-time speeds over various types of networks.
Results Setup: N = 6 auto-synchronized PGR Grasshopper cameras mounted closely around the screen. Few occlusions. No extrapolations. Can be integrated into the monitor frame (avoiding tedious calibration procedures). Still allows real-time processing (as opposed to much larger N).
Results Workload profiling on NVIDIA GeForce 8800 GTX, 800x600 @ 15 Hz cameras. 33 ms processing time for a single frame, a theoretical speed of 30 fps. Image Download / Readback (54%): Demonstrates the importance of data locality, justifies porting all processing to the GPU. Preprocessing (15%) and View Interpolation (7%): Computational complexity is linear in N (the number of cameras). Joint View/Depth Refinement (15%) and Movement Analysis (9%): Improve the quality independently of the number of input images.
Results Although minor artifacts remain, the results yield high perceptual visual quality. Participants convincingly seem to be making eye contact.
Conclusion Prototype for eye-gaze correction between two video chat participants: Convenient camera setup. Minimal constraints, large freedom of movement. Real-time performance through GPGPU. High perceptual visual quality. Practical usability. Future work: Improving the movement analysis. Multi-party video conferencing. Interpolate the background with correct motion parallax. Create the immersive effect of a virtual window into the world of the other participant!
Demo
Thank you! Questions?