PROJECT REPORT CSE 527 : Introduction to Computer Vision Nafees Ahmed : 107403294
Abstract
Skeleton reconstruction is an integral part of gesture-driven computer interfaces, where input comes from human body movement. Gesture recognition and classification require the identification of features, and in most cases the first step toward that is reconstructing the human skeleton from the given input. Depending on the type of input setup, the problem takes many different forms. In this project, we concentrate on the self-occlusion of the human body that arises when motion is captured from a single perspective. As a solution, we propose an application-specific, data-model-driven approach for reconstructing occluded skeleton joints and assess its performance against existing methods.

Introduction
Computer interfaces driven by human gestures rely on features derived from skeletal structures reconstructed from captures of the human body. Depending on the type of capture, this reconstruction problem can be posed in many different ways: reconstruction can be done from a single image, from multiple images over time, or from different perspectives using multiple cameras. How faithfully we can extract and recreate the human skeletal form depends on how much information we have about the human structure and on what we seek from that input. We can choose to reconstruct a 2D skeleton from single or multiple images, or a 3D skeleton from one or more perspectives. Figure 1 shows an example of 3D reconstruction from a single image.
Figure 1: 3D reconstruction from images

When we reconstruct a 3D skeleton from source images or videos, one of the major obstacles is self-occlusion of the human body. If we rely only on the images, the only information available for reconstruction is what the camera sees from its predefined position. Because the camera position is fixed, not all parts of the body will be clearly visible; in many articulated movements, a hand, leg, or other part of the body is occluded or ambiguously placed. The problem is less prominent in a multi-perspective setup, where pictures taken from several locations give the system insight into positions that would otherwise be occluded. The presence of a depth image also provides more information for a faithful reconstruction. Recently, Microsoft introduced the Kinect, a commercially available and comparatively low-priced monocular depth sensor. The Kinect projects an IR pattern and uses an IR sensor to produce a low-spatial-resolution depth map of the objects in front of it. The introduction of this cheap depth sensor has created many opportunities for both games and useful human-motion-driven interaction tools. Figure 2 shows a Kinect device and a simple setup.
Figure 2: (Top Left) Kinect device; (Top Right) simple Kinect setup; (Bottom) skeleton tracking using Kinect

Since the Kinect is a single-perspective device, skeleton reconstruction from its depth image faces the standard occlusion problem. Given only a single frame and no other information, any reconstruction algorithm will do its best to fit a skeleton to the frame, identifying some joints fully and the rest only within some error range. To understand the kind of problem faced when the reconstruction algorithm uses only the image data, we consider a specific application: tracking skeletons in the game of cricket, where the Kinect is used to reconstruct the skeletons of batsmen and bowlers. Cricket is chosen because the postures of batsmen and bowlers during gameplay involve partial self-occlusions that make reconstruction from a single-perspective device genuinely hard. Figure 3 shows the setup of a real cricket game; Figures 4 and 5 show examples of the motions players generate in real life.
Figure 3: Snapshot from a cricket game
Figure 4: One of the many possible bowling motions, seen from the side
Figure 5: Several examples of batting motions. Identifying a shot requires identifying both hands, legs, and wrists, and shots can be played all around, through 360 degrees.
Now, if we track the skeleton of a batsman by placing a Kinect in front of him, occlusion will in most cases make the reconstruction incomplete, and recovering the correct position and movement of the bat will be hard. As an example, consider the specific batting position shown in Figure 6.

Figure 6: Front-foot defense (occluded limbs vs. visible limbs)
Figure 7: Skeleton reconstruction by OpenNI using Kinect

Figure 7 shows the output of the standard skeleton reconstruction for this shot, based only on the image captured by the Kinect. The gray lines identify joints whose reconstruction confidence level is less than 1.0. To address this problem, we observe that we are not using all the information at hand: the reconstruction algorithm uses only the input depth image from the sensor to produce the joint positions of the human skeleton. But in a scenario like this, if we know beforehand that the reconstruction is purely for tracking the motion of a cricket batsman with a specified range of shots, then the unoccluded joints should provide a very good cue about where the occluded joints are. To derive such probabilistic values for occluded joints, we need a model built from batting-motion captures of many different shots, from which the unknown positions can be interpolated. In this project we explore such possibilities and show an example reconstruction of a batting motion. The framework is summarized in Figure 8. The system works in two phases.

Model Construction Phase: We let the batsman play many shots in a controlled environment, capture the motions, and keep only the poses in which every joint has full confidence. Using these known values (relative joint positions, orientations, velocities, etc.) as features, we construct a model.

User Tracking Phase: We use the Kinect in its standard setup. For each capture, we first use OpenNI to reconstruct the skeleton. From that skeleton, we identify which joints came from an occluded field of view using the confidence level. Then, using the unoccluded joint positions as input, we interpolate the most probable values of the occluded joint positions. Merging these two sets of values, we produce the final skeleton.
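The tracking-phase split-and-merge step above can be sketched as follows. This is a minimal illustration, not the OpenNI API: the skeleton representation (a mapping from joint name to position and confidence), the 1.0 full-confidence convention, and the model's `approximate` interface are all assumptions made for the sketch.

```python
# Sketch of the tracking-phase pipeline: separate occluded joints from
# visible ones by confidence, approximate the occluded joints with the
# model, and merge both sets into the final skeleton.
CONFIDENCE_FULL = 1.0  # joints below this threshold are treated as occluded

def split_by_confidence(skeleton):
    """Partition a skeleton {joint: (position, confidence)} into
    confidently tracked joints and (likely occluded) low-confidence ones."""
    visible, occluded = {}, []
    for joint, (pos, conf) in skeleton.items():
        if conf >= CONFIDENCE_FULL:
            visible[joint] = pos
        else:
            occluded.append(joint)
    return visible, occluded

def reconstruct(skeleton, model):
    """Replace low-confidence joints with model-driven estimates,
    then merge them with the confidently tracked joints."""
    visible, occluded = split_by_confidence(skeleton)
    approximated = {j: model.approximate(j, visible) for j in occluded}
    return {**visible, **approximated}
```

Any of the models described in the next section can be plugged in as `model`, as long as it can produce an estimate for a named joint given the visible joint positions.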
Figure 8: Framework. Model construction: filter captures to extract full skeletons with complete confidence values, then compute the model parameters. Tracking: separate occluded joints from visible joints, approximate the occluded joints using the model, and merge the visible joints with the approximated ones.
The performance of the system depends a great deal on the kind of model construction adopted. Many different approaches can be taken and many features can be considered. In this project, we test a very simple linear system and compare it with some trivial models. The models considered are listed below.

Model 0 : The Zero Model. Whenever a joint is occluded, replace its value with (0, 0, 0).

Model 1 : Last Known Position. Whenever the system captures a joint with full confidence, store it in the model. Whenever the system fails to provide a confident joint position, replace it with the stored value. This keeps the occluded limb at its last seen location.

Model 2 : Last Known Orientation. Almost the same as the previous one, but the system stores only the last valid orientation and drives the occluded joint with that value.

Model 3 : Linear Interpolation. We build this model on the following principle: given the orientations and relative positions of the unoccluded joints, a proper model makes it possible to interpolate the most probable orientations of the occluded joints. During the model construction phase, we capture joint positions for different batting postures, compute each joint's orientation relative to its parent body part in the scene graph of the human body, and store them. During the tracking phase, we resolve an occluded joint by first computing the orientation values of the unoccluded joints, then searching for the two nearest points in the model's high-dimensional feature space, and finally linearly interpolating between those points to obtain the value of the occluded joint.
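The nearest-two-points lookup in Model 3 can be sketched as below. This is an illustrative sketch, assuming each training sample pairs a feature vector (the visible joints' orientation values, flattened into a tuple) with the stored value of one occluded joint; the function names and the inverse-distance weighting are assumptions of the sketch, not a prescribed implementation.

```python
# Model 3 sketch: find the two training samples nearest to the query in
# feature space and linearly interpolate the occluded joint between them.
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def interpolate_joint(query, samples):
    """samples: list of (features, joint_value) pairs. Returns the joint
    value linearly interpolated between the two nearest samples, weighted
    so that the closer sample contributes more."""
    (d1, v1), (d2, v2) = sorted((dist(query, f), v) for f, v in samples)[:2]
    if d1 + d2 == 0:        # both samples coincide with the query
        return v1
    w = d2 / (d1 + d2)      # closer sample (smaller d1) gets larger weight
    return tuple(w * a + (1 - w) * b for a, b in zip(v1, v2))
```

A practical model would hold one such sample set per joint and could replace the linear scan with a k-d tree once the capture database grows large.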
Results (reconstructed skeletons): Model 0 (The Zero Model), Model 1 (Last Known Position), Model 2 (Last Known Orientation), Model 3 (Linear Interpolation).
The results clearly show that, when trained on proper data and driven by a flexible enough model, the reconstruction algorithm can provide a better skeletal structure for a specific application. In this project, we demonstrated one such model that improves upon the present reconstruction method. Many other approaches to model construction and occluded-joint interpolation are possible, and each may hold an advantage over the others in particular settings and applications. As future work, we intend to explore such models.