Hand Detection and Tracking Using Depth and Color Information

Minsun Park, Md. Mehedi Hasan, Jaemyun Kim and Oksam Chae
Department of Computer Engineering, Kyung Hee University, 446-701, Seocheon-dong, Yongin-si, Gyeonggi-do, Republic of Korea
{ romana2ms, mehedi, sense21c, oschae } @ khu.ac.kr

Abstract - The detection and tracking of a hand is an emerging research issue nowadays for controlling devices by hand motion. Conventional hand detection methods use color and shape information from an RGB camera. With the recent advent of the depth camera, some researchers have shown that hand detection performance can be improved by combining color (or intensity) information with information from the depth camera. In this paper, we propose a novel method for hand detection using both color and depth information from Microsoft's Kinect device. The proposed method extracts candidate hand regions from the depth image and selects the best candidate based on the color and shape features of each candidate region. The contour of the selected candidate is then refined in the higher resolution RGB image to improve positional accuracy. For tracking the detected hand, we propose a boundary tracking method based on the Generalized Hough Transform (GHT). Experimental results show that the proposed method improves the accuracy of hand motion detection over conventional methods.

Keywords: Hand detection, Depth and color, Kinect, Histogram, Tracking.

1 Introduction

The advent of relatively low resolution image and depth sensors has spurred research in the field of object tracking and gesture recognition. Building an interface that controls a device by detecting and tracking gestures is an emerging research issue. Among the various body parts used to interface with devices, the hand is the most convenient and has been the most widely utilized.
Microsoft Kinect is one of the most popular devices for this kind of research, since its sensors capture both RGB and depth information. Many experiments have been conducted to detect hands based on skin color or on depth information from the Kinect. Locating the hand is the initial step in detecting and tracking hand gestures, and it is more challenging than face or body recognition, because the Kinect depth image has very low resolution and it is hard to detect and track small objects that cover a much smaller area than the background. The most common way to detect a hand is to threshold on depth, which raises the further challenge of choosing an adaptive threshold for selecting hand regions. This involves cropping out those pixels whose z-value (depth) deviates too far from an estimated hand depth. It works well for more expensive cameras with high spatial and depth resolution (images with dimensions on the order of hundreds to thousands of pixels, and depth resolution on the order of a few millimeters [1]). In Kinect images, however, the hand spans only tens of pixels, and the depth image provided by the sensor has a nominal accuracy of 3 mm. The depth information is also susceptible to infrared occlusion and other noise effects. Skin color-based hand detection [2] has the advantage of making hand region detection relatively easy because it uses color information; for the same reason, it is very difficult to differentiate a hand from overlapping hands or from objects with similar skin color. Color is also sensitive to illumination variations and noise, which is another drawback. The depth image can overcome these drawbacks: with distance information we can distinguish an object from the background much more easily, and depth is less sensitive to lighting changes and complicated backgrounds.
However, it is still hard to distinguish different objects at the same distance, or to extract the shape of an object in detail, because of the low resolution of the depth image. Moreover, a great deal of noise is introduced by the infrared camera, and occlusion occurs, requiring additional work for compensation. To overcome these obstacles, we incorporate RGB information in addition to depth information to improve our estimate of hand locations. In this paper, a novel method of hand detection and tracking is proposed that detects hands faster and more accurately than conventional methods. We use the Kinect to obtain color and depth images at the same time. Our method first calculates a histogram of the depth image according to distance information, analyzes the histogram to find appropriate candidate values, and generates a criterion function to extract hand regions from the image. The proposed method can extract hand regions adaptively, without the preprocessing used by conventional methods.

2 Related Works

In recent years, research on accurately recognizing hands using color and depth information, and on compensating the
shortcomings of both color and depth information, has been studied. In [3], a hand detection method combining a stereo and an RGB camera is introduced, and a method combining a ToF and an RGB camera is illustrated in [4]. In general, the conventional method first finds the face or body, which is easier than detecting hands; then, based on the distance data from the camera, a threshold is defined to separate the hands from the body. In such methods, the color information obtained from a conventional camera serves to select candidates by skin color and to compensate for the shortcomings of the distance information. In [2], a hand detection method that calculates a gray scale histogram from a range image is introduced; a noise threshold is defined in the image to locate humans. A hand detection method based on the user's face color, which can flexibly adjust the range of skin colors, is introduced in [5]. As mentioned earlier, the existing camera types are stereo and ToF (Time of Flight) cameras. A stereo camera consists of two conventional cameras and obtains depth information from this structure, but it is not suitable for real-time tracking because of installation and calibration problems. A ToF camera can obtain distance information directly by using infrared, but its adoption has been limited by its high price. In this paper, Microsoft's Kinect, which has pioneered this area since its release, is used. The Kinect obtains color and depth information at the same time and also provides a basic library that solves problems such as calibration easily. The device has therefore been welcomed by many developers and researchers in this area, especially because of its low cost and easy installation.
In [6], hand detection is performed using functions provided by OpenNI, and [7] introduces a hand detection method based on skin detection, followed by estimating the hand position relative to the human body. Figure 1 shows the conventional approach for detecting hands. It finds the face or body first, which is easier than detecting hands directly because the human face and body have more distinguishable features, but this preprocessing step of detecting the face or body requires considerably more time and computation.

Figure 1. Conventional hand detection approach

3 Proposed Method

In this paper we propose an adaptive hand detection approach using 3-dimensional information from the Kinect, and track the hand using a GHT-based method. Figure 2 illustrates the system overview of the proposed method. When obtaining both color and depth images from the Kinect at the same time, synchronization and registration between the images must be considered because color and depth information are used simultaneously. To summarize the proposed algorithm briefly: first we detect candidate hand regions from the histogram of the depth image, and rank each candidate region using color information to reduce the number of candidates. We then obtain the boundary of the hand for exact positional accuracy. The depth image includes many unwanted portions of the hand regions because of noise and low resolution, so we use color information to compensate for these shortcomings of the depth image and improve the accuracy of the extracted hand contour. Finally, we perform tracking using an edge segment based tracking algorithm.
Figure 2. Overview of the proposed system

3.1 Candidate Hand Region Selection from Depth Image

To use color and depth information at the same time for detecting and tracking moving hands, initialization needs to be considered. The first issue is synchronization between the color and depth images; measuring the number of frames per second of each stream shows that synchronization is handled automatically by the Kinect. The other issue is registration between the color and depth images. On the Kinect, the color image has a resolution of 640×480 while the depth image resolution is 320×240, so the two images must be brought to a common resolution. Moreover, the RGB camera lens and the depth camera lens are not at exactly the same position. To solve this problem, we use the registration functions supported by OpenNI. After this initialization, the hand detection and tracking algorithm begins.

A depth image is composed of eight-bit gray values. An object near the camera has a value close to zero, values tend toward 255 farther from the camera, and ranges that cannot be measured are treated as zero. Thus the pixel value of the depth image increases from 0 to 255 with the increasing distance of the object from the sensor. Figure 3 shows the color information and depth information from the camera.
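In our system the depth-to-color registration itself is handled by OpenNI; the resolution alignment alone can be sketched as follows. This is a minimal nearest-neighbor upsampling, assuming a 320×240 depth map and a 640×480 color image; the function and variable names are illustrative, not part of OpenNI:

```python
import numpy as np

def upsample_depth_to_color(depth, color_shape=(480, 640)):
    """Nearest-neighbor upsample of a 320x240 depth map to the
    640x480 color resolution, keeping 0 as the 'unmeasured' value."""
    h, w = depth.shape
    ch, cw = color_shape
    rows = (np.arange(ch) * h) // ch   # map each color row to a depth row
    cols = (np.arange(cw) * w) // cw   # map each color column to a depth column
    return depth[rows[:, None], cols[None, :]]

depth = np.zeros((240, 320), dtype=np.uint8)
depth[100:140, 150:200] = 80           # a synthetic object at mid-range
aligned = upsample_depth_to_color(depth)
valid = aligned > 0                    # 0 means "could not be measured"
```

In a real setup the lens offset between the RGB and depth cameras would also have to be corrected, which is exactly what the OpenNI registration functions do.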
Figure 3. The high resolution color image and low resolution depth image generated by Kinect

From the depth image, we calculate a histogram according to distance and eliminate unnecessary noise regions before analyzing it. Bins with counts of less than ten are eliminated, and histogram smoothing is performed to make the analysis more convenient. Before analyzing the histogram and generating the threshold to separate hand regions from the background, we assume that the hand is in front of the body and that the user is extending his/her hand while standing at a fixed position. From the preprocessed histogram, we search for the area that corresponds to hand regions; that is, we find an appropriate threshold value to separate the hand region from the background. From an accumulated histogram we can determine the distances from the camera to the hand, body and background, as well as the size of each object: each of these regions produces a rapid increase in the accumulated histogram. Figure 4 illustrates the accumulated histogram calculated from the depth image, in which three slopes can be recognized. Since the area of the hand is small and the hand is in front of the body, it covers less area than the body. Going deeper, the accumulated counts grow as more area is covered by the arm and then the body. Finally, the background produces a steep slope and a very high accumulation, indicated by the third arrow in the figure.

Figure 4. An accumulated and modified histogram from the depth image

To generate a criterion function, we choose candidate points at a fixed interval along the histogram and calculate the difference between each point and the following candidate point. If we get two increasing differences one after another, we calculate the first order derivative; if the slope is greater than a certain value, the region is selected as a candidate region for differentiating a hand from the body and background. The following equation is the criterion function, or threshold, that detects hand regions. If x, y and z are three consecutive candidate points, with p the difference between x and y and q the difference between y and z, and if p < q, then

m = (z2 − x2) / (z1 − x1)    (1)

where z1 ≠ x1, and (x1, x2) and (z1, z2) are the coordinates of the points x and z. When m > Th_xz, the region is selected as a candidate region for thresholding. In our experiment we choose y as the deciding threshold point. Unwanted noise portions are then filtered out by considering shape or area information. Figure 5 shows the extracted regions (in yellow) selected after computing equation (1) and noise filtering. As the figure shows, unwanted regions are also selected after thresholding the depth image.

Figure 5. A depth image where candidate regions are selected in yellow color

3.2 Measure Skin Color from Candidate Regions

Since there may be objects surrounding the user at a similar distance to the user's hand, unwanted portions are also selected as hand regions; all of these are called candidate hand regions. We select the hand region from the candidates by the following steps:

I. Accumulate shape information of the regions extracted from the depth image.
II. Measure the color similarity of each candidate region with skin in the color image.
III. Rank the candidates based on color and shape similarity.
IV. Select the best candidate region as the hand region.

To get the shape information, we map the detected regions into the color image and use the shape information extracted from the depth image. Then we measure color similarity. Since the human hand is skin-colored, we use a skin detection algorithm; many well-known skin detection algorithms exist, and in this paper we use the Bayesian skin color detection algorithm of [8], one of the most popular skin detection algorithms, which is accurate and fast. It uses skin and non-skin color models to design a skin pixel classifier with an equal error rate of 88%, surprisingly good performance given the unconstrained nature of Web images; visualization studies demonstrate the separation between skin and non-skin color distributions that makes this performance possible. Using this per-pixel skin classifier, the authors also construct a system for detecting images containing naked people, based on simple aggregate properties of the skin classifier output. This detector compares favorably to the systems of Forsyth et al. [9] and Wang et al. [10], which are based on complex image features, and because it performs pixel-wise classification it is extremely fast. These experiments suggest that skin color is a more powerful cue for detecting people in unconstrained imagery than was previously suspected. Figure 6 describes the standard likelihood ratio approach to classifying skin, and equation (2) is the skin classifier used in our method:

P_GP(skin | rgb) = P_hist(rgb | skin) / P_hist(rgb | ¬skin) ≥ Θ    (2)

3.3 Tracing of Hand Contour

A function that combines a number of entities to form a closed polyline consisting of individual segments is called contour tracing. Since the depth image has low resolution, it is difficult to extract the boundary of a hand distinctly. A more accurate hand contour can be extracted from the color image by using the low resolution contour determined from the depth image: we overlay the low resolution candidate contour on the color image, define the search area for contour tracing, and then trace the hand contour using skin color information within the search area. Contour tracing [11] involves the extraction of edge lines from an image. In our algorithm, we do not need to search for the region of interest in the whole image.
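The Bayesian skin classification of Section 3.2 (the likelihood ratio test of equation (2)) can be sketched with histogram-based class likelihoods. This is a minimal illustration with coarse 3D color histograms; the bin count, threshold Θ and toy training samples are placeholders, not the models of [8]:

```python
import numpy as np

BINS = 8  # coarse 8x8x8 RGB histograms per class (illustrative size)

def color_histogram(pixels):
    """Normalized 3D color histogram, i.e. P_hist(rgb | class)."""
    idx = (pixels // (256 // BINS)).astype(int)
    hist = np.zeros((BINS, BINS, BINS))
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return hist / hist.sum()

def is_skin(pixel, skin_hist, nonskin_hist, theta=1.0):
    """Likelihood-ratio test of equation (2):
    P_hist(rgb|skin) / P_hist(rgb|non-skin) >= theta."""
    r, g, b = (np.asarray(pixel) // (256 // BINS)).astype(int)
    num, den = skin_hist[r, g, b], nonskin_hist[r, g, b]
    if den == 0:
        return num > 0  # seen only in the skin model
    return num / den >= theta

# Toy training data: reddish "skin" samples vs. bluish background samples.
skin_px = np.array([[210, 150, 130]] * 50 + [[200, 140, 120]] * 50)
bg_px = np.array([[40, 60, 200]] * 100)
skin_h, bg_h = color_histogram(skin_px), color_histogram(bg_px)

print(is_skin([205, 145, 125], skin_h, bg_h))  # -> True (skin-like color)
print(is_skin([45, 55, 200], skin_h, bg_h))    # -> False (background color)
```

In [8] the histograms are trained on a large labeled corpus of skin and non-skin pixels; the structure of the decision rule is the same.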
From the previous steps, the search area is defined by the depth image, and within that search region we search using the skin color classification criterion. Because the search regions of the high resolution image are defined from the candidate region in the low resolution depth image, the algorithm tracks hands faster; it also adds confidence to the detection process and gains higher accuracy. Figure 8 shows the hand contour selection process.

Figure 6. A skin classifier derived from the standard likelihood ratio approach

In equation (2), rgb is the color of a pixel, P_GP is the Gaussian probability that the pixel is skin, and P_hist(rgb|skin) and P_hist(rgb|¬skin) are the histogram-based probabilities that rgb belongs to the skin and non-skin classes, respectively. After measuring color similarity with skin detection, we rank each candidate region by color and shape similarity and choose the region with the highest possibility. The result of this process is shown in Figure 7; as can be seen, the hand regions become more specific and accurate by using color information.

Figure 7. (a) Mapping the depth image to the color image and (b) the best candidate region selected after skin detection and ranking

Figure 8. Tracing of the hand contour: (a) search area defined by the depth image; (b) contour selection from the color image

3.4 Tracking Based on Generalized Hough Transform

In the tracking step, we use the GHT based moving object tracking algorithm suggested in [12]. Briefly, it tracks moving objects robustly by generating a reference pattern and updating the pattern during the matching step, which minimizes the effect of background pixels. The target matching scheme is based on the Generalized Hough Transform (GHT) [13]; it overcomes edge distortion and can find a match from partial information in relatively little time.
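The core GHT matching idea can be illustrated with a toy, translation-only example: displacement vectors from template edge points to a reference point are stored in an R-table, and each observed edge point then votes for candidate reference locations. This is a sketch with synthetic data, not the weighted edge-segment tracker of [12]:

```python
import numpy as np

def build_r_table(edge_points, ref_point):
    """Store displacement vectors from each template edge point to the
    reference point (translation-only R-table, no orientation bins)."""
    return [(ref_point[0] - y, ref_point[1] - x) for y, x in edge_points]

def ght_vote(edge_points, r_table, shape):
    """Each observed edge point votes for possible reference locations;
    the accumulator peak is the best match."""
    acc = np.zeros(shape, dtype=int)
    for y, x in edge_points:
        for dy, dx in r_table:
            ry, rx = y + dy, x + dx
            if 0 <= ry < shape[0] and 0 <= rx < shape[1]:
                acc[ry, rx] += 1
    return np.unravel_index(acc.argmax(), acc.shape), acc.max()

# Template: a tiny L-shaped "hand boundary" with reference point (2, 2).
template = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
r_table = build_r_table(template, ref_point=(2, 2))

# The same shape shifted by (+10, +7) in the frame, with one point missing:
# the GHT still matches from partial edge information.
observed = [(10, 7), (11, 7), (12, 7), (12, 8)]
(peak_y, peak_x), votes = ght_vote(observed, r_table, shape=(32, 32))
print(peak_y, peak_x, votes)  # -> 12 9 4
```

Even with one edge point missing, the four remaining points agree on the shifted reference location, which is why GHT matching tolerates partial or distorted edges.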
Among the various feature-based methods, it uses edge information, which requires relatively little computation compared with other features. To overcome missing edge pixels, it calculates a weight for each edge indicating the persistence of the edge pixels over time; this weight is used when generating the reference pattern and matching the target, to compensate for obscured parts. Based on the GHT, it can efficiently use partial edge information that would otherwise be lost to noise. The algorithm is well suited to devices with limited computing power, such as mobile devices, digital cameras and smart
phones for tracking continuous subjects in real time. Figure 9 illustrates the overview of the GHT based tracking algorithm.

Figure 9. Overview of the GHT based tracking algorithm

In this paper, edge segment based tracking is used to track the hand in real time. Figure 10 shows the result of tracking the hand using the Generalized Hough Transform.

Figure 10. Tracking the hand using GHT

4 Experimental Results

To set up the experimental environment we use the Microsoft Kinect, which is capable of 3D sensing and generates a depth image and a color image simultaneously. OpenNI functions are used to synchronize the depth image coordinates with the color image coordinates for mapping. We set up the experiments in two parts. First we verify our hand detection approach for different hand directions, compute its accuracy, and compare our result with other popular hand detection algorithms. In the second part we evaluate the hand tracking approach based on edge segments.

4.1 Experiment on Hand Detection

Figure 11 shows the result of hand detection. We stand in front of the camera, move back and forth, and rotate the hand in different directions to test the proposed algorithm. Figure 11(a) verifies that hand detection works well even when the distance changes, because the threshold value is obtained flexibly. Figure 11(b) shows that our method can detect the hand even when the user rotates his/her hand in any direction. The experiments show that, as long as the hand is closest to the camera, the proposed method can detect the hand in any situation, under the assumption that the hand is positioned in front of the body.

Figure 11. Results of hand detection in different situations: (a) hand detection at different distances; (b) hand detection under different movements

To compute the accuracy of our experiment, we measure the number of detected pixels as a ratio of the number of original hand pixels. We compute the accuracy for the different steps of our algorithm, as shown in Figure 12.

Figure 12. Detection rate on a normal sequence

To compare our algorithm with other approaches, the basic measures of detection accuracy are recall, precision and accuracy, shown in equations (3) and (4). Recall quantifies what proportion of the correct entities (number of pixels) is detected, while precision quantifies what proportion of the detected entities is correct. Accuracy reflects the temporal correctness of the detected results. Let P denote the pixel positions correctly detected by the algorithm, P_M the number of missed detections (pixels that should have been detected but were not) and P_F the number of false detections (positions that should not have been detected but were). Then

Recall = P / (P + P_M)    (3)

Precision = P / (P + P_F)    (4)

Table 1 shows the comparison of our method with Van den Bergh et al. [4] in terms of precision and recall. The results show that our method determines the hand region more accurately than the other method.

Table 1. Comparison with different approaches

Approaches                  Recall    Precision
RGB + ToF (t_s = 20) [4]    80.05%    82.36%
RGB + ToF (t_s = 15) [4]    74.32%    78.76%
Proposed Method             81.06%    86.42%

In the table we compare our result with another method for different t_s values; naturally, different values of the static threshold give different accuracies. Our algorithm, however, is adaptive, requires no manual parameter tuning, and also achieves higher accuracy than the other method.

4.2 Experiment on Hand Tracking

Figure 13 shows the result of hand tracking, using edge segments based on the GHT. Tracking of the hand works well, and is fast and accurate in general. Since the features of the hand are very small, tracking is sometimes difficult, but our tracking process performs well in both accuracy and speed.

Figure 13. Hand tracking results for different video frames (frames 132, 136, 148, 152, 166 and 178)

More accurate hand contours are extracted from the hand contour of the color image.
We overlay the low resolution candidate contour on the color image and define the search area for contour tracing. In our algorithm, we restrict the search area using the region information gathered during the hand detection process: the search area is defined from the depth image, and within that region we search using the skin color classification criterion. Because the search regions of the high resolution image are defined from the candidate region in the low resolution depth image, we concentrate only on a small region, and our hand tracking becomes faster and more accurate than conventional algorithms. Experimental results show that our tracking is two to four times faster than conventional hand tracking algorithms.
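The pixel-level recall and precision of equations (3) and (4) can be computed directly from a binary detection mask and a ground-truth mask. A minimal sketch with synthetic masks (the mask contents are illustrative, not our experimental data):

```python
import numpy as np

def recall_precision(detected, ground_truth):
    """Pixel-level recall = P/(P+P_M) and precision = P/(P+P_F),
    where P = correct detections, P_M = misses, P_F = false detections."""
    p = np.logical_and(detected, ground_truth).sum()     # correctly detected
    p_m = np.logical_and(~detected, ground_truth).sum()  # missed
    p_f = np.logical_and(detected, ~ground_truth).sum()  # false detections
    return p / (p + p_m), p / (p + p_f)

# Synthetic masks: a ground-truth hand of 100 pixels; the detector finds
# 80 of them and adds 20 false pixels elsewhere.
gt = np.zeros((20, 20), dtype=bool)
gt[0:10, 0:10] = True                  # 100 true hand pixels
det = np.zeros((20, 20), dtype=bool)
det[0:10, 0:8] = True                  # 80 correct detections
det[15:20, 15:19] = True               # 20 false detections
recall, precision = recall_precision(det, gt)
print(round(recall, 2), round(precision, 2))  # -> 0.8 0.8
```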
5 Conclusions

We proposed a novel method for extracting hand features quickly and accurately from color and depth images obtained simultaneously from the Kinect, for real-time tracking. The proposed method analyzes the histogram of the depth image and finds an appropriate threshold value to extract the hand region, and uses the color image to improve the accuracy of hand detection, overcoming the low resolution of the depth image. Even though the proposed method works in a restricted environment, it detects the hand directly, without searching for the face or body as conventional methods do. Because of its low complexity and speed, our algorithm can serve as a new interface alongside the keyboard and mouse. Reducing the environmental restrictions on hand detection and tracking will be the subject of future research.

6 References

[1] A. A. Argyros and M. I. A. Lourakis, "Binocular hand tracking and reconstruction based on 2D shape matching", in Proc. International Conference on Pattern Recognition (ICPR), Hong Kong, China, 2006.
[2] M. Van den Bergh, F. Bosché, E. Koller-Meier and L. Van Gool, "Haarlet-based hand gesture recognition for 3D interaction", Workshop on Applications of Computer Vision (WACV), pp. 1-8, December 2009.
[3] S. I. Kang, A. Roh and H. Hong, "Using depth and skin color for hand gesture classification", 2011 IEEE International Conference on Consumer Electronics (ICCE), pp. 155-156, January 2011.
[4] M. Van den Bergh and L. Van Gool, "Combining RGB and ToF cameras for real-time 3D hand gesture interaction", 2011 IEEE Workshop on Applications of Computer Vision (WACV), pp. 66-72, January 2011.
[5] R. R. Igorevich, P. Park, D. Min, Y. Park, J. Choi and E. Choi, "Hand gesture recognition algorithm based on grayscale histogram of the image", 4th International Conference on Application of Information and Communication Technologies (AICT), pp. 1-4, October 2010.
[6] M. Van den Bergh, D. Carton, R. De Nijs, N. Mitsou, C. Landsiedel, K. Kuehnlenz, D.
Wollherr, L. Van Gool and M. Buss, "Real-time 3D hand gesture interaction with a robot for understanding directions from humans", RO-MAN 2011, IEEE, pp. 357-362, July 31-August 3, 2011.
[7] M. J. Jones and J. M. Rehg, "Statistical color models with application to skin detection", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, June 1999.
[8] M. J. Jones and J. M. Rehg, "Statistical color models with application to skin detection", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, June 1999.
[9] D. A. Forsyth and M. M. Fleck, "Automatic detection of human nudes", International Journal of Computer Vision, 32(1):63-77, August 1999.
[10] J. Z. Wang, J. Li, G. Wiederhold and O. Firschein, "System for screening objectionable images using Daubechies wavelets and color histograms", in Proc. of the International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services, pp. 20-30, 1997.
[11] G. Simion, V. Gui and M. Otesteanu, "Finger detection based on hand contour and color information", IEEE International Symposium on Applied Computational Intelligence and Informatics, May 19-21, 2011.
[12] J. Kim and O. Chae, "Moving object tracking using edge segment matching for mobile devices", 23rd KSPC Conference, Vol. 23, No. 1, p. 381, October 2010.
[13] D. H. Ballard, "Generalizing the Hough transform to detect arbitrary shapes", Pattern Recognition, Vol. 13, No. 2, pp. 111-122, 1981.
[14] M. Yokoyama and T. Poggio, "A contour-based moving object detection and tracking", IEEE Int'l Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 271-276, China, October 2005.
[15] J. Canny, "A computational approach to edge detection", IEEE Trans. Pattern Anal. Mach. Intell., Vol. 8, No. 6, pp. 679-698, November 1986.
[16] G. Borgefors, "Hierarchical chamfer matching: A parametric edge matching algorithm", IEEE Trans. Pattern Anal. Mach.
Intell., Vol. 10, No. 6, pp. 849-865, November 1988.
[17] L. Xia, C. C. Chen and J. K. Aggarwal, "Human detection using depth information by Kinect", 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 15-22, June 2011.
[18] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake, "Real-time human pose recognition in parts from single depth images", 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1297-1304, June 2011.