Sign Language Phoneme Transcription with Rule-based Hand Trajectory Segmentation


J Sign Process Syst (2010) 59

W. W. Kong · Surendra Ranganath

Received: 21 May 2008 / Revised: 31 July 2008 / Accepted: 23 September 2008 / Published online: 16 October
© Springer Science + Business Media, LLC. Manufactured in The United States

Abstract  A common approach to extract phonemes of sign language is to use an unsupervised clustering algorithm to group the sign segments. However, simple clustering algorithms based on distance measures usually do not work well on temporal data, and more complex algorithms are typically required. In this paper, we present a simple and effective approach to extract phonemes from American sign language sentences. We first apply a rule-based segmentation algorithm to segment the hand motion trajectories of signed sentences. We then extract feature descriptors based on principal component analysis to represent the segments efficiently. The segments are clustered by k-means using these high-level features to derive phonemes. 25 different continuously signed sentences from a deaf signer were used to perform the analysis. After phoneme transcription, we trained Hidden Markov Models to recognize the sequence of phonemes in the sentences. Overall, our automatic approach yielded 165 segments, and 58 phonemes were obtained based on these segments. The average number of recognition errors was 18.8 (11.4%). In comparison, completely manual trajectory segmentation and phoneme transcription, involving considerable labor, yielded 173 segments and 57 phonemes, and the average number of recognition errors was 33.8 (19.5%).

W. W. Kong · S. Ranganath (corresponding author)
Department of Electrical and Computer Engineering, National University of Singapore, 4 Engineering Drive 3, Singapore
S. Ranganath: elesr@nus.edu.sg
W. W. Kong: g @nus.edu.sg

Keywords  American sign language (ASL) · Phoneme transcription · Trajectory segmentation · Principal component analysis (PCA) · Hidden Markov models (HMM)

1 Introduction

One of the key issues in sign language recognition is scalability to large vocabulary. In speech recognition, this problem is handled by using phoneme-based modeling. Back in 1959, Fry [1] and Denes [2] built the first phoneme-based speech recognizer to recognize four vowels and nine consonants, and this approach is in common use for speech recognition today. The same strategy also has potential for sign language recognition. However, the major difficulty in sign language is that, unlike speech phonemes, which are linguistically well defined and can be used without ambiguity to transcribe speech sentences, there is no consistent phoneme lexicon in sign language. There are approximately 40 phonemes in the English language, such as \i\, \f\, \aa\, etc., but phonemes in sign language are usually loosely defined, and depend on the modeling approach and features used in different sign language recognition systems. Also, there is no standard way of defining phonemes in sign language, and different schemes have been used for phoneme transcription.

Sign language communication involves manual and non-manual gestures. The latter consist of facial expressions, and head and torso movements, while manual gestures use the hand and arm to convey lexical meaning. Our focus is on manual signs, and in this context, sign linguists such as Stokoe [3] and Liddell and Johnson [4] offer some guidelines to model the phonemes by distinguishing the basic components of a sign gesture as consisting of the handshape, hand orientation, location, and movement.

Stokoe emphasized the simultaneous organization of these components, while Liddell and Johnson's Movement-Hold model emphasized sequential organization.

An automatic phoneme transcription procedure is an essential step towards building practical sign language recognition systems that scale well with vocabulary size. As there is no phoneme lexicon or a standard approach to transcribe phonemes for sign language, it is important to devise an efficient strategy for consistent phoneme transcription, including an automatic segmentation procedure to work with continuously signed sentences. Towards this end, we propose an effective automatic procedure to transcribe phonemes from continuous American sign language (ASL) sentences which are signed naturally by a deaf signer rather than from textbook definitions of signs. The lexical meaning of a sign is inferred by recognizing the four parallel components of handshape, location, orientation and movement, and thus, different types of phonemes can be defined for each of the components. Sign language sentences can then be labeled with a sequence of these phonemes. In this paper, we consider transcribing phonemes for the hand movement trajectory only, as the other three components, handshape, hand orientation, and position, are simpler to deal with, and corresponding phonemes can be defined easily by using simple clustering algorithms.

Currently, there are two main approaches for phoneme transcription, viz., 1) transcription based on sign language models defined by sign linguists, and 2) transcription which is dependent on the data collected and features used. In the first approach, the sign components are quantized into limited categories and sign language models such as Stokoe's or Liddell's are used to label the signs. Vogler and Metaxas [5] adopted this approach and defined the phonemes for movement and location, using Liddell's model to recognize 22 ASL signs based on these phonemes. The small vocabulary size makes the transcription rather straightforward. Wang et al. [6] defined a phoneme as the smallest unit that has meaning and can distinguish one sign from another. They performed an extensive study of Chinese sign language (CSL) and explicitly defined about 2400 phonemes for CSL. However, the transcription process and the resulting phonemes are not clearly described. In this strategy, a manual transcription process can be very laborious and time-consuming, and when the vocabulary size is large this approach is unreliable and impractical. Hence, it is important to devise an automatic method to define phonemes for recognizing SL sentences.

In the second approach, many works use unsupervised learning to obtain phonemes automatically without using any sign language models. Walter et al. [7] adopted a mixture density-based clustering approach for transcribing phonemes from gesture trajectory segments. Mixture parameters were determined using expectation maximization, and minimum description length was used as the criterion to automatically determine the number of clusters. Bauer and Kraiss [8] used k-means clustering to self-organize trajectory segments into fenones. In this approach, the fenones formed usually do not relate to phonetic concepts. Also, temporal segments may not be properly aligned when segments are obtained from continuously signed sentences.
This poses problems for clustering algorithms such as k-means which use the Euclidean distance measure. Hence, a complex clustering algorithm is often required to handle the problem. Wang et al. [9] adopted dynamic programming to segment the data streams, and a hybrid of neural networks and k-means was used to cluster the segments. Fang et al. [10] proposed a temporal clustering algorithm to group segments using concatenated handshape, position and orientation features of both hands. The temporal clustering algorithm was based on a modified k-means algorithm proposed by Wilpon and Rabiner [11]. However, these approaches are complex and computationally expensive.

In this paper, we propose an automatic procedure to perform phoneme transcription. The automatic procedure saves the time and intensive labor required for manual transcription. There are two steps in transcribing phonemes from continuously signed sentences, viz., segmentation of the hand trajectories, followed by phoneme transcription. Several works have considered automatic trajectory segmentation for various purposes. Sagawa and Takeuchi [12] considered minima of hand velocity and large changes in hand motion trajectory angle as candidates for segment boundaries. Wang et al. [13] also used a similar method for trajectory segmentation. Gibet and Marteau [14] identified boundary points where the radius of curvature became small and there was a decrease in velocity. They used the product of velocity and curvature as the measure in their segmentation algorithm to detect boundary points. Rao et al. [15] used the spatio-temporal curvature of the motion trajectory to describe a dynamic instant, which is taken to be an important change in motion characteristics such as speed, direction and acceleration. These changes were captured by identifying maxima of spatio-temporal curvature. Walter et al. [7] used a two-step segmentation algorithm for 2-D hand motion. They first detected rest and pause positions by identifying points where the velocity dropped below a pre-set threshold. After this, they identified discontinuities in orientation to recover strokes (movement and hold) by applying Asada and Brady's Curvature Primal Sketch [16].

For phoneme transcription, we extracted principal component analysis (PCA)-based features and clustered them. Our approach alleviates some of the problems in [7–10] by using PCA features, which allows us to use simple k-means, rather than complex algorithms, to cluster the temporal segments. Further, unlike the phonemes obtained in [8], the phonemes yielded by our approach are related to phonetic concepts which are more meaningful for describing sign language. In related work, Vogler [5] used the first and the second eigenvalues from PCA to differentiate between lines and curves and used them as global features for sign language recognition. However, this was not explored further. We believe that this is a good starting point to facilitate phoneme transcription. Several other works have also used PCA-based features to perform gesture or sign language recognition. Nam and Wohn [17] projected the 3-D hand trajectory to a plane found by PCA, and used a chain encoding scheme for describing the hand movement path for recognition.

We used Sagawa and Takeuchi's [12] approach of using minimum velocity and maximum change of directional angle as the basis for segmentation, but found that it over-segmented the hand trajectories. However, the true segmentation points are a subset of this initial segmentation, and we devised rules based on the characteristics of the boundary points obtained to identify the true segmentation points and minimize the false alarms. Next, we extracted feature descriptors from these segments using PCA. Even though the hand trajectory of a complete sentence may be a complex 3-D curve, we can expect that the segments obtained will correspond to lines or planar curves. Hence, PCA of these segments will directly yield the directions of the lines and the planes of the curves. We apply PCA to each segment, and cluster the features by k-means to specify the phonemes and give geometric meaning to them. After the phonemes are automatically defined, they are used to label the sentences and train Hidden Markov Models (HMMs) for recognition. Of course, once the HMMs are trained, input sentences are implicitly segmented and recognized.

The remainder of this paper is organized as follows. In Section 2, the automatic rule-based segmentation algorithm is presented. Phoneme specification and transcription are described in Section 3. Recognition by HMMs is explained in Section 4. Experimental results are presented and discussed in Section 5. Section 6 gives the conclusions.

2 Automatic Rule-based Trajectory Segmentation

Automatic trajectory segmentation is performed in two steps. First, we obtain initial segmentation points based on minimal velocity and maximal change of directional angle. Typically, the trajectories are over-segmented by this procedure. These over-segmented trajectories are then processed by rules to identify the true segmentation points and minimize false alarms.

2.1 Initial Segmentation

Temporal segmentation is implemented by detecting points of minimal velocity and maximal change of directional angle. The continuous raw 3-D hand trajectory data is first interpolated and smoothed using splines. Figure 1 shows an example of the original and splined hand trajectories of a sentence.
This step is useful for more accurate and reliable velocity and directional angle computation. The velocity v_t is estimated as

    v_t = ||p_{t+1} - p_t||                                             (1)

where p_t = (x_t, y_t, z_t) is the 3-D position at time t.

Figure 1  Original and splined trajectories.
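As a concrete illustration of this preprocessing step, the sketch below smooths a raw 3-D trajectory with splines, computes the velocity of Eq. (1), and collects local velocity minima as candidate boundary points. It is only a sketch under our own assumptions: the (n, 3) array layout, the function names, and the smoothing and upsampling parameters are illustrative choices, not values reported in the paper.

```python
# Hypothetical helpers illustrating the smoothing and velocity step;
# parameter values are placeholders, not the paper's settings.
import numpy as np
from scipy.interpolate import splprep, splev
from scipy.signal import argrelextrema

def smooth_trajectory(traj, upsample=2, s=1.0):
    """Interpolate and smooth a raw (n, 3) hand trajectory with a spline."""
    tck, _ = splprep(traj.T, s=s)                  # fit a smoothing spline in 3-D
    u_new = np.linspace(0.0, 1.0, upsample * len(traj))
    return np.array(splev(u_new, tck)).T           # (m, 3) smoothed positions

def velocity(traj):
    """Eq. (1): v_t = ||p_{t+1} - p_t||."""
    return np.linalg.norm(np.diff(traj, axis=0), axis=1)

def velocity_minima(v):
    """Indices of local velocity minima, i.e. candidate segment boundaries."""
    return argrelextrema(v, np.less)[0]

# Usage sketch:
# smoothed = smooth_trajectory(raw_xyz)            # raw_xyz: (n, 3) tracker samples
# candidates = velocity_minima(velocity(smoothed))
```

Maxima of the directional angle change, defined in Eq. (2) below, would be gathered in the same way and merged with these candidates.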

Figure 2  Directional angle.

The directional angle change, θ, is computed as the angle between two vectors formed by three consecutive 3-D positions, as shown in Fig. 2. Thus

    cos(θ) = (u_1 · u_2) / (||u_1|| ||u_2||)                            (2)

where u_1 = p_t - p_{t-1} and u_2 = p_{t+1} - p_t. The initial segment boundaries are marked at the points of local velocity minima and maxima of directional angle change. These are processed by rules to minimize the false boundary points.

2.2 Rule Formulation

The rules are specified by observation and by using features that characterize the boundary points adequately; these features are summarized in Table 1. minvel and maxang are binary features to indicate a point of minimal velocity and a point of maximal change of directional angle, respectively. normvel is the normalized velocity with respect to the peak and lies in [0, 1], and dirang is the absolute directional angle change of a point in [0°, 180°]. lftvalley (P_vl/H_vl) and rgtvalley (P_vr/H_vr) characterize the valley associated with a velocity minimum. Similarly, lftpeak (P_al/H_al) and rgtpeak (P_ar/H_ar) characterize the peak associated with an angle maximum. Figure 3 illustrates the idea.

Table 1  Features characterizing velocity minima and maxima of directional angle change.

Feature     Description
minvel      Point is a local minimum of velocity or not.
maxang      Point is a local maximum of directional angle change or not.
normvel     Normalized velocity values.
dirang      Absolute angle values.
lftvalley   P_vl/H_vl (see Fig. 3).
rgtvalley   P_vr/H_vr (see Fig. 3).
lftpeak     P_al/H_al (see Fig. 3).
rgtpeak     P_ar/H_ar (see Fig. 3).

Figure 3  Definition of parameters for the features described in Table 1.

The rules are summarized in Table 2. Rule 1 checks if a boundary point corresponds to a local minimum of velocity and a maximum change of directional angle, and indicates a strong potential boundary point if both are true. Rules 2, 3 and 4 examine the characteristics of the valley of the minimal velocity and the peak of the maximal change of directional angle. A true detection should be characterized by a deep valley, while a shallow valley is possibly a false alarm. A true maximal angle change is characterized by a relatively sharp peak. Rule 5 checks the values of the normalized velocity and directional angle change. A point with a high velocity value and a low directional angle change is likely to be a false alarm, while a point with a low velocity value and a high directional angle change is a potential boundary point. However, we relax these conditions, and accept a point with a very low velocity (T_5) but moderately high directional angle change (T_6) as a true boundary point. On the other hand, if this condition is not met, but the point exhibits a very high directional angle change (T_7) and moderately low velocity (T_8), we also consider it as a true boundary point. The threshold values (T_i) are found empirically as described in Section 5.

Table 2  Formulated rules.

Rule 1: if (minvel = TRUE) and (maxang = TRUE), check Rule 2
        elseif (minvel = TRUE) and (maxang = FALSE), check Rule 3
        else check Rule 4
Rule 2: if (lftvalley > T_1 or rgtvalley > T_2) and (lftpeak > T_3 or rgtpeak > T_4), detection = TRUE POINT
        else detection = FALSE ALARM
Rule 3: if (lftvalley > T_1 or rgtvalley > T_2), check Rule 5
        else detection = FALSE ALARM
Rule 4: if (lftpeak > T_3 or rgtpeak > T_4), check Rule 5
        else detection = FALSE ALARM
Rule 5: if (normvel <= T_5 and dirang >= T_6) or (dirang >= T_7 and normvel <= T_8), detection = TRUE POINT
        else detection = FALSE ALARM

Note: T_i, i = 1, 2, ..., 8, are thresholds found empirically, with T_5 < T_8 and T_7 > T_6. The condition (minvel = FALSE) and (maxang = FALSE) cannot occur.
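To make the cascade concrete, the sketch below expresses the decision logic of Table 2 as a single function. The interface is our own assumption: each candidate point arrives as a dictionary keyed by the Table 1 feature names, and the thresholds T1 to T8 are placeholders to be set empirically as described in Section 5.

```python
# Sketch of the Table 2 rule cascade; feature names follow Table 1 and the
# threshold dictionary T (keys "T1".."T8") is assumed to be given, set empirically.
def is_true_boundary(f, T):
    """Return True if the candidate point f is accepted as a segment boundary."""
    deep_valley = f["lftvalley"] > T["T1"] or f["rgtvalley"] > T["T2"]
    sharp_peak = f["lftpeak"] > T["T3"] or f["rgtpeak"] > T["T4"]
    # Rule 5: very low velocity with moderately high angle change, or
    # very high angle change with moderately low velocity (T5 < T8, T7 > T6).
    rule5 = ((f["normvel"] <= T["T5"] and f["dirang"] >= T["T6"]) or
             (f["dirang"] >= T["T7"] and f["normvel"] <= T["T8"]))

    if f["minvel"] and f["maxang"]:      # Rule 1 -> Rule 2: both cues present
        return deep_valley and sharp_peak
    elif f["minvel"]:                    # Rule 1 -> Rule 3: velocity cue only
        return deep_valley and rule5
    else:                                # Rule 1 -> Rule 4: angle cue only
        return sharp_peak and rule5
```

Because Rules 2 to 4 only decide which evidence (valley depth, peak sharpness, or both) must back up the candidate, the whole table collapses into the three branches of Rule 1.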

3 Phoneme Transcription

The segments can have different lengths, locations, orientations and directions of motion in the 3-D signing space. Also, the segments obtained from continuous sentence trajectories can be noisy, in the sense that the segmentation algorithm does not always give the exact boundary points, and slight deviations in segment boundaries are usually obtained. Movement epenthesis in naturally signed continuous sentences also contributes to this. Movement epenthesis is an extra segment that occurs between signs and contains no information related to the sign meaning. For example, Fig. 4 shows a segment which is essentially a straight line, but has a small extraneous part that arises from movement epenthesis.

Figure 4  Straight line segment with a small portion arising from movement epenthesis.

3.1 Descriptors for Trajectory Segments

There are too many variations in the segments of naturally signed sentences, which makes it difficult to cluster them directly. Hence, we suggest a better representation that will enable the use of simple clustering algorithms. We characterize a trajectory segment by the plane in which it lies, and by its shape, direction of motion, size and position. Curves are described by all the above features, while lines are described only by their direction, size and position. PCA can easily differentiate lines (1-D) from curves (2-D) based on eigenvalues. For a line, the first eigenvalue (when ordered from largest to smallest) greatly exceeds the second, and we use this fact to easily separate lines and curves. Based on the normalized eigenvalues

    E_i = λ_i / (λ_1 + λ_2 + λ_3),   i = 1, 2, 3                        (3)

a segment is determined to be a line if E_1 > 0.95, and a 2-D curve otherwise. Following this determination, a set of features is extracted as described below.

Plane of the Trajectory Segment

The normal to the plane in which the curve lies in 3-D space can be obtained by the vector cross product

    n^i = e_1^i × e_2^i                                                 (4)

where n^i is the normal to the plane, and (e_1^i, e_2^i) are the first and second eigenvectors of the i-th segment. As there are two possible directions for n^i in 3-D, we adopt a fixed convention to choose its direction. Since two combinations of ±e_1^i and ±e_2^i correspond to the normal direction chosen, we use one of these pairs as our first and second eigenvectors.

Direction of Motion

We use the dominant motion direction to describe direction for lines, and the clockwise/anticlockwise sense for circles. For arcs, both are used.

Dominant Direction  Though the direction of a line can be simply computed from the starting position to the ending position of the trajectory segment, to reduce sensitivity to noise, the dominant direction is obtained from the first eigenvector, e_1^i, which is along the direction of the largest variance in the data. As both e_1^i and -e_1^i can be considered to be valid directions of maximum variance, we resolve this ambiguity as follows:

i) Compute a unit vector from the starting point to the ending point of the segment as

    w^i = (p_n^i - p_1^i) / ||p_n^i - p_1^i||                           (5)
where p_1^i and p_n^i are the starting and ending points of the i-th segment, respectively.

ii) Compute

    θ_1 = cos^{-1}(w^i · e_1^i)                                         (6)
    θ_2 = cos^{-1}(w^i · (-e_1^i))                                      (7)

The dominant direction is chosen to point in the eigenvector direction that is closer to w^i, by choosing e_1^i if θ_1 is smaller than θ_2, and -e_1^i otherwise.

Clockwise and Anticlockwise Motion  We use the projected 2-D curves to determine whether the motion is clockwise or anticlockwise, as follows:

i) The first turning point, q, of the curve is located, for example, as in Fig. 5a or b. The curve is then rotated so that q lies on the positive horizontal axis. The corresponding rotated trajectories are as shown in Fig. 5c and d, respectively.

ii) The clockwise or anticlockwise motion sense can then be found from the joint signs of the changes in the rotated x- and y-coordinates as the curve is traversed:

    motion = clockwise      if (x, y) vary with one pair of sign patterns as t increases
             anticlockwise  if (x, y) vary with the complementary pair of sign patterns   (8)

Shape

Both arcs and circles are initially classified as curves, but they need to be distinguished based on the shape of the segments in the 2-D principal subspace. This is done with Fourier descriptors, which are extracted following the steps below.

i) The trajectory segment is resampled to a fixed number of samples, N, equally spaced in arc length. N is chosen to be a power of 2 to facilitate the application of the fast Fourier transform. We used N = 64.

ii) The projected 2-D curve coordinates are used to define a complex signal

    z_t = x_t + i·y_t,   t = 0, 1, ..., N - 1                           (9)

where x and y are the x- and y-coordinates in the projected plane.

iii) The motion direction of the projected trajectory segment (clockwise or anticlockwise) affects the ordering of the Fourier descriptors. Hence, to remove this sensitivity, we re-ordered the projected segment from the last sample to the first if its motion sense was found to be anticlockwise.

Figure 5  (a, b) Projected trajectories and (c, d) corresponding rotated trajectories.

iv) The DFT of z = [z_0, z_1, ..., z_{N-1}] is obtained as F = [f_0, f_1, ..., f_{N-1}].

v) Invariance to translation is obtained by removing the first element (the DC component) in F. Rotation invariance is achieved by removing the phase information, i.e. using only the absolute values of the f_i. Scale normalization is obtained by dividing the Fourier coefficients by |f_1|. The final Fourier descriptors are given as

    F̄ = [ |f_2|/|f_1|, |f_3|/|f_1|, ..., |f_{N-1}|/|f_1| ]              (10)

For discriminating only between circles and arcs, the first and last k elements in F̄ were used, and k = 5 was found to be sufficient.

Size and Position

The maximum range in each of the x-, y-, and z-coordinates is found, and the largest range is taken to represent the size. Position is described by using only the starting and ending positions of the segments. As the segments obtained are noisy, we represent the start and end positions of a segment by the mean values of the first and last 5% of the segment points.

3.2 Defining Phonemes with K-means

There are two alternatives for defining phonemes by clustering. We can either concatenate the extracted features and cluster these vectors, or cluster each feature separately. We adopt the latter approach as it is simpler and allows simple geometric labeling of the clusters. Figure 6 shows the transcription procedure. The 3-D trajectory segments are first segregated into lines or curves based on the principal eigenvalue found by PCA of each segment. The features used for lines are dominant direction, size and position. All the features in Section 3.1 are used for arcs and circles, with the exception of dominant direction for circles. The individual features are clustered by k-means. Table 3 summarizes the possible clusters for each feature and serves as a guideline to determine the number of clusters for each feature. The actual number of clusters is found empirically.

Table 3  Possible clusters for the descriptors.

Descriptor           Clusters
Plane                xy-, yz-, xz-, ±45°-planes
Shape                circles and arcs
Dominant direction   up, down, left, right, away, toward
Motion sense         clockwise, anticlockwise
Size                 large, small
Position             12 positions (refer to [3])

The phonemes are then defined by grouping the trajectory segments which have the same geometric feature descriptions. For example, all the trajectory segments which are identified as lines with Dominant Direction = down, Size = small, and Position = mouth are considered as one cluster (phoneme).

Figure 6  Phoneme transcription procedure.
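The sketch below ties together the Section 3.1 descriptors for a single segment: the line/curve decision from the normalized eigenvalues (Eq. 3), the plane normal (Eq. 4), the dominant direction (Eqs. 5-7), the size and position measures, and the Fourier descriptors of Eqs. (9)-(10). The 0.95 eigenvalue threshold, N = 64, k = 5 and the 5% end-point windows follow the text; the function names and the dictionary interface are our own assumptions.

```python
# Illustrative descriptor extraction for one (n, 3) trajectory segment.
import numpy as np

def pca(seg):
    """Eigen-decomposition of the segment covariance, sorted by decreasing eigenvalue."""
    centered = seg - seg.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(centered.T))
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]           # eigenvalues, column eigenvectors

def describe_segment(seg):
    vals, vecs = pca(seg)
    E = vals / vals.sum()                        # Eq. (3): normalized eigenvalues
    desc = {"shape": "line" if E[0] > 0.95 else "curve"}

    # Dominant direction (Eqs. 5-7): pick the sign of e1 closer to the chord
    # from the start point to the end point (ignored later for circles).
    chord = seg[-1] - seg[0]
    w = chord / np.linalg.norm(chord)
    e1 = vecs[:, 0]
    desc["direction"] = e1 if np.dot(w, e1) >= np.dot(w, -e1) else -e1

    if desc["shape"] == "curve":
        desc["plane_normal"] = np.cross(vecs[:, 0], vecs[:, 1])   # Eq. (4)

    # Size: largest coordinate range; position: means of first/last 5% of points.
    desc["size"] = np.ptp(seg, axis=0).max()
    k = max(1, len(seg) // 20)
    desc["start"], desc["end"] = seg[:k].mean(axis=0), seg[-k:].mean(axis=0)
    return desc

def fourier_descriptors(xy, N=64, k=5):
    """Eqs. (9)-(10): invariant shape descriptors of a projected 2-D curve."""
    # Resample to N points equally spaced in arc length.
    d = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(xy, axis=0), axis=1))]
    t = np.linspace(0.0, d[-1], N)
    z = np.interp(t, d, xy[:, 0]) + 1j * np.interp(t, d, xy[:, 1])
    F = np.fft.fft(z)
    mags = np.abs(F[1:]) / np.abs(F[1])          # drop DC, drop phase, scale by |f1|
    fd = mags[1:]                                # [|f2|, ..., |f_{N-1}|] / |f1|
    return np.r_[fd[:k], fd[-k:]]                # first and last k descriptors
```

In the per-feature clustering of Section 3.2, each of these fields would then be quantized separately (for example with k-means on the direction vectors and positions), and a phoneme corresponds to one combination of the quantized labels.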

4 Continuous Sentence Recognition with HMMs

We train HMMs [18, 19] using continuous sentences labeled by the transcribed phonemes. Raw 3-D trajectory positions are used as the observation sequences to train the model parameters, M = (π, A, B), of left-to-right HMMs. Each sentence is modeled as a sequence of phonemes. Each phoneme is modeled by 3-5 states, and each state is represented by a single Gaussian with a full covariance matrix. We employ Viterbi training and decoding, where the phoneme boundaries are detected implicitly.

Initialization of the HMMs is done as follows. As our signed sentences always start at the same spatial position, the initial state's prior probability is set to 1. The transition probabilities are set to be equi-probable, except for the invalid transitions, whose probabilities are set to zero. For the case when the phonemes are obtained by automatic transcription, the segments obtained with our segmentation algorithm are used to compute the initial Gaussian parameters. On the other hand, when the phonemes are defined manually, the entire sentence trajectory is divided equally into segments according to the number of phonemes in the sentence. As each segment is represented by 3-5 states, the segment is equally divided into 3-5 sub-segments, from which the Gaussian parameters are estimated to initialize each state.
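A minimal sketch of the initialization described above, using only numpy, is given below. The choice of three states per phoneme and the data layout are our own illustrative assumptions; Viterbi training and decoding themselves are not shown.

```python
# Left-to-right HMM initialization sketch (numpy only); values are illustrative.
import numpy as np

def left_to_right_transitions(n_states):
    """Equi-probable self/forward transitions; all other (invalid) transitions are zero."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5
    A[-1, -1] = 1.0
    return A

def init_state_gaussians(segment, n_states=3):
    """Divide a segment's (n, 3) observations equally among the states and fit
    one full-covariance Gaussian per state."""
    params = []
    for chunk in np.array_split(segment, n_states):
        mu = chunk.mean(axis=0)
        sigma = np.cov(chunk.T) + 1e-6 * np.eye(3)   # small ridge for stability
        params.append((mu, sigma))
    return params

def sentence_prior(n_total_states):
    """All prior mass on the first state: every sentence starts at the same position."""
    pi = np.zeros(n_total_states)
    pi[0] = 1.0
    return pi
```

A sentence model is then the concatenation of its phoneme HMMs, with the per-phoneme transition blocks chained along the diagonal of the sentence-level transition matrix.
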
5 Experiments and Results

A Polhemus FASTRACK electromagnetic motion tracking system was used for capturing the 3-D hand trajectories. It senses the position (x, y, z) coordinates and orientation (azimuth, elevation, and roll) of objects, and consists of one transmitter and four trackers. Two trackers were attached to the right and left hands, respectively, and two other trackers were attached to the waist and the head of the signer to provide reference values. The sampling rate for each tracker is 30 samples/s. In this work, we only used the right hand data for the experiments. In addition, a video camera, which was synchronized with the trackers, was used to record the frontal view of the signer signing the sentences. The video clips obtained were used to facilitate the manual segmentation and phoneme transcription procedures.

We conducted several experiments to evaluate the automatic rule-based trajectory segmentation procedure and phoneme transcription process. The evaluations are based on 25 ASL sentences signed 5-10 times by a deaf signer. Various subsets of this data were used for different experiments.

5.1 Automatic Trajectory Segmentation

To compare the performance of automatic trajectory segmentation with manual segmentation, and to have labeled training data, segment points were marked by an expert signer. In the segmentation algorithm, the initial segments were obtained from all samples of the 25 sentences by automatically locating the points of minimal velocity and maximal change of directional angle. This yielded 1996 initial segment boundary points. In order to use the rules of Table 2 for processing this initial segmentation, threshold values (T_i) are needed for the features. To obtain these thresholds, we picked two training samples each from 13 randomly picked sentences and labeled their initial segment points as true segment points or false alarms in relation to the manually marked points. The features were then extracted from this training data, and the threshold value for each feature was set by examining its distribution.

We conducted two experiments to compare results from rule-based segmentation on the initial 1996 segment points. In Experiment 1, we used the rules in Table 2, while in Experiment 2, we used only a subset of the features, viz., minvel, maxang, normvel, dirang, and the rules in Table 4.

Table 4  Simplified rules.

Rule 1: if (minvel = TRUE) and (maxang = TRUE), detection = TRUE POINT
        else, check Rule 2
Rule 2: if (normvel <= T_5 and dirang >= T_6) or (dirang >= T_7 and normvel <= T_8), detection = TRUE POINT
        else detection = FALSE ALARM

Note: T_i, i = 5, 6, 7, 8, are the same thresholds as in Table 2. The condition (minvel = FALSE) and (maxang = FALSE) cannot occur.

Table 5 summarizes the results obtained by the two experiments.

Table 5  Detection accuracy by rules.

                                                Experiment 1   Experiment 2
Total no. of points from initial segmentation       1996           1996
Manually labeled boundary points                      805            805
Detected true boundary points                    722 (89.7%)    769 (95.5%)
False alarms                                     140 (11.8%)    462 (38.8%)
Missed points                                     83 (10.3%)     36 (4.5%)

The accuracy of segment point detection in Experiment 1 was 89.7%, with 11.8% false alarms; the corresponding results for Experiment 2 were 95.5% and 38.8%, respectively. Though Experiment 2 detects the true segmentation points about 6% better than Experiment 1, it is about 27% worse in its ability to discard false alarms. We note here that we use the terms "true boundary point" and "false alarm" only in relation to points manually marked by an expert signer. Manual segmentation involves difficult judgements and guesses, and it would be optimistic to label this as ground truth.

5.2 Phoneme Transcription

For purposes of comparison, we obtained phoneme transcriptions in three different ways, and used them for recognizing signed sentences by HMMs.

In Experiment A, trajectory segmentation was done manually, and results were compared between manual and automatic phoneme transcription. For this experiment, an expert signer attempted to define the trajectory segments in one sample of each of the 25 sentences according to sign linguistics, in conjunction with the initial segments obtained at points of velocity minima and/or maxima of directional angle change. A frontal video of the signer, which was closely synchronized with the trackers, was also used to facilitate this manual segmentation process. Based on this collective information, the expert signer identified 173 segments in the 25 sentences. As a basis for comparison with the automatic phoneme transcription procedure of Section 3, the expert signer also manually transcribed these 173 trajectory segments into phonemes by visual observation. The same video clips that were used to facilitate manual segmentation were used with an ASL dictionary for manual transcription.

The 173 segments obtained represent sign information as well as movement epentheses. As movement epenthesis is a connecting segment without sign meaning, the segments corresponding to it should be represented as simply as possible. In addition, as the movement epentheses are usually signed more loosely, clustering them with the sign phonemes may cause the phoneme clusters to be less representative. Hence, the trajectory segments were first separated into movement epenthesis segments and sign segments, based on analysis of the sentence structure and the segments obtained. We then described the movement epentheses by their start and end positions and direction only, and described the sign phonemes by the features used for lines, arcs and circles as shown in Section 3. This manual approach yielded 57 phonemes, which included 33 phonemes corresponding to signs and 24 phonemes corresponding to movement epenthesis. When the same 173 segments were automatically transcribed by the procedure of Section 3, we obtained 56 clusters (phonemes), with 36 sign phonemes and 20 movement epenthesis phonemes.

The phoneme clusters obtained by both approaches were checked by plotting the trajectories of the cluster members. We observed that the clusters obtained by the automatic procedure were generally more consistent and the cluster members were closer in appearance. On the other hand, some phoneme clusters specified by manual transcription were poorly formed. This can be expected, as it is difficult to maintain consistency in manual transcription.
Also, relying on the video clips during this process could have led to errors when there were visual occlusions. In the automatic transcription process, on the other hand, the PCA process separates the segments into lines, or planar curves and circles, and each feature of these categories is individually clustered. This simplifies checking the validity of the clusters, as the number of clusters obtained for each feature is greatly reduced. Figure 7a and b show one of the clusters (phonemes) obtained by the automatic and manual phoneme transcription processes, respectively. It can be seen that the cluster formed by automatic phoneme labeling is more consistent. Another attractive benefit of automatic phoneme transcription is the significant reduction in time and labor as compared to manual transcription.

In Experiment B, the entire phoneme transcription process was automatic, using the trajectory segmentation procedure of Section 2 followed by the transcription process of Section 3. For this, we made use of the segment points obtained from all samples of all sentences from Experiment 1 to derive the final segment boundary points in each sentence for automatic transcription in a consistent manner. This was done by selecting only the consistently occurring points in all samples of a sentence to form the final segment boundary points. This yielded a total of 165 segments in the 25 sentences to be used for automatic transcription. Of these, 25 points corresponded to the starting location of each sentence, which was assumed known. Of the remaining, 128 points corresponded to manually labeled points, while 12 were false alarms.

Figure 7  Clusters obtained by the automatic (a) and manual (b) phoneme transcription processes (trajectories are normalized).

The automatic trajectory segmentation and phoneme transcription procedures make no assumptions about the movement epenthesis segments and sign segments, and hence they are not distinguished from each other. All the trajectory segments obtained are clustered according to the transcription procedure described in Section 3. The automatic transcription process based on these 165 segments yielded 58 phonemes. We then used the transcribed phonemes from Experiments A and B to recognize the sequence of phonemes in the 25 sentences using HMMs.

5.3 Recognition with HMMs

We used four samples of each of the 25 sentences to train and one sample to test the HMMs, in a full round-robin procedure, i.e. in each trial 75 sentences were used for training and 25 for testing. After decoding a test sentence, the sequence of phonemes detected in the sentence was checked for errors. Three kinds of phoneme recognition errors are possible, viz. insertion, deletion and substitution.

Table 6 compares the average number of errors in Experiment A over all the test sentences when the training sentences were labeled by phonemes transcribed manually and automatically. On the 173 segments in the 25 sentences, the average numbers of errors were 33.8 and 24.0 for sentences that were labeled manually and by automatically transcribed phonemes, respectively. It can be seen that recognition performance improves when sentences are labeled with automatically transcribed phonemes. This suggests that consistent phoneme transcription is important for better recognition performance, and that manual transcription contributes to inconsistency in grouping the segments. When segments of different kinds are grouped together, the data distribution is inconsistent, and this degrades the recognition performance of the HMMs.

Table 6  Average HMM recognition errors on 25 test sentences containing 173 segments (Experiment A). Errors are broken down into insertion, deletion and substitution; the totals were 33.8 with manual phoneme transcription and 24.0 with automatic phoneme transcription. The trajectory segmentation was manual; the comparison is between manual and automatic phoneme transcription.

We used the same experimental procedure to evaluate the recognition rate for the case when the phoneme transcription process was fully automatic, as in Experiment B. On the 165 segments obtained for the 25 sentences, the average number of errors was 18.8, as shown in Table 7, which is less than that obtained with manual segmentation in Table 6. Thus, in terms of recognition performance, phonemes transcribed by the fully automatic procedure give better results than when manual segmentation and/or manual transcription is involved. This is fortunate, since manual processing is tedious and extremely time consuming.

Table 7  Average HMM recognition errors on 25 test sentences containing 165 segments obtained by the rule-based segmentation algorithm and automatic transcription (Experiment B).

Error type     Errors in fully automatic phoneme transcription
Insertion      4.4
Deletion       8.4
Substitution   6.0
Total          18.8
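One common way to obtain insertion, deletion and substitution counts like those in Tables 6 and 7 is a minimum-edit-distance alignment between the decoded and reference phoneme sequences. The sketch below is our own illustration of such a scoring step, with made-up phoneme labels, and is not necessarily the exact procedure used in the paper.

```python
# Edit-distance scoring sketch for decoded vs. reference phoneme sequences.
import numpy as np

def error_counts(ref, hyp):
    """Return (insertions, deletions, substitutions) for hyp aligned against ref."""
    n, m = len(ref), len(hyp)
    D = np.zeros((n + 1, m + 1), dtype=int)      # D[i, j]: edit cost ref[:i] vs hyp[:j]
    D[:, 0] = np.arange(n + 1)                   # deleting all of ref[:i]
    D[0, :] = np.arange(m + 1)                   # inserting all of hyp[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            D[i, j] = min(sub, D[i - 1, j] + 1, D[i, j - 1] + 1)
    ins = dels = subs = 0
    i, j = n, m                                  # backtrack to classify the errors
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i, j] == D[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and D[i, j] == D[i, j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dels, i = dels + 1, i - 1
    return ins, dels, subs

# Example with hypothetical labels:
# error_counts(["line-down", "arc-cw", "line-up"], ["line-down", "line-up"])
# returns (0, 1, 0), i.e. one deleted phoneme.
```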

We note here that the accuracy of automatic trajectory segmentation directly affects the recognition accuracy. False alarms lead to too many segments in the trajectory, while missed points cause segments to be merged, so they may not represent the smallest units, leading to many variations in the segments. In both cases, more clusters (phonemes) will be needed to group the segments consistently. If we use more phonemes to represent a sentence, the recognition accuracy may be better, but at the same time the size and complexity of the HMM decoding network will increase. On the other hand, if we compromise by having fewer clusters (phonemes) with larger cluster variances, sentence representation may be less accurate and thus the recognition accuracy may drop. Hence, we need to segment the trajectories as accurately as possible for phoneme transcription.

6 Conclusions

We devised an automatic segmentation procedure to perform temporal segmentation of naturally signed ASL hand trajectories, and used a set of rules to detect true segmentation points and eliminate false alarms from a simple initial segmentation. We also devised an automatic phoneme transcription scheme which relies on effective feature representation. PCA is used to simplify the problem significantly by projecting 3-D hand trajectory segments to 1-D (lines) or 2-D (curves). High-level features which describe the geometry of the segments are extracted in the projected space. These feature descriptors have proven to be useful for phoneme transcription in our experiments. The experimental results show that our automatic approach is more accurate than manual trajectory segmentation and phoneme transcription, while providing significant savings on the time-consuming human labor required in the manual approach. An automatic approach will be even more important for large vocabulary systems where manual transcription is impractical. Overall, the average number of phoneme recognition errors on 25 sentences containing 165 segments, based on our automatic rule-based segmentation and phoneme transcription procedure, was 18.8 (11.4%).

The rules and thresholds used in this work are based on observation and found empirically. We are currently working on an approach to automate this procedure by using a Bayesian network. Also, in further work, we will extend the automatic segmentation and phoneme transcription approach to all the components of manual signs, for complete recognition of signed sentences.

References

1. Fry, D. B. (1959). Theoretical aspects of mechanical speech recognition. Journal of the British Institution of Radio Engineers, 19(4).
2. Denes, P. (1959). The design and operation of the mechanical speech recognizer at University College London. Journal of the British Institution of Radio Engineers, 19(4).
3. Stokoe, W. C. (1978). Sign language structure: An outline of the visual communication system of the American deaf. Studies in linguistics: Occasional papers 8. Silver Spring: Linstok.
4. Liddell, S. K., & Johnson, R. E. (1989). American sign language: The phonological base. Sign Language Studies, 64.
5. Vogler, C., & Metaxas, D. (1999). Towards scalability in ASL recognition: Breaking down sign into phonemes. In Gesture workshop. Gif-sur-Yvette, France, March.
6. Wang, C., Gao, W., & Shan, S. (2002). An approach based on phonemes to large vocabulary Chinese sign language recognition. In Proceedings of the fifth IEEE international conference on automatic face and gesture recognition. Washington, DC, USA, May.
7. Walter, M., Psarrou, A., & Gong, S. (2001). Auto clustering for unsupervised learning of atomic gesture components using minimum description length. In Proceedings of the IEEE ICCV workshop on recognition, analysis, and tracking of faces and gestures in real-time systems. Vancouver, Canada, July.
8. Bauer, B., & Kraiss, K.-F. (2001). Towards an automatic sign language recognition system using subunits. In Gesture workshop. London, UK, April.
9. Wang, C., et al. (2000). An approach to automatically extracting the basic units in Chinese sign language recognition. In Proceedings of the 5th international conference on signal processing. Beijing, China, August.
10. Fang, G., et al. (2004). A novel approach to automatically extracting basic units from Chinese sign language. In Proceedings of the 17th international conference on pattern recognition. Cambridge, UK, August.
11. Wilpon, J., & Rabiner, L. (1985). A modified k-means clustering algorithm for use in isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(3).
12. Sagawa, H., & Takeuchi, M. (2000). A method for recognizing a sequence of sign language words represented in a Japanese sign language sentence. In Proceedings of the fourth IEEE international conference on automatic face and gesture recognition. Grenoble, France, March.
13. Wang, T.-S., et al. (2001). Unsupervised analysis of human gestures. In Proceedings of the second IEEE Pacific Rim conference on multimedia: Advances in multimedia information processing. Beijing, China, October.
14. Gibet, S., & Marteau, P.-F. (2007). Approximation of curvature and velocity using adaptive sampling representations: Application to hand gesture analysis. In Gesture workshop. Lisbon, Portugal, May.
15. Rao, C., Yilmaz, A., & Shah, M. (2002). View-invariant representation and recognition of actions. International Journal of Computer Vision, 50(2).
16. Asada, H., & Brady, M. (1984). The curvature primal sketch. Technical Report 758, MIT AI memo.

17. Nam, Y., & Wohn, K. (1996). Recognition of space-time hand-gestures using hidden Markov model. In Proceedings of the ACM symposium on virtual reality software and technology. Hong Kong, July.
18. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2).
19. Bilmes, J. (2006). What HMMs can do. IEICE Transactions on Information and Systems, E89-D(3).

W. W. Kong received the B.Eng. (Honors) degree in 2000 and the M.Eng. degree in 2005, both in Electrical Engineering, from the National University of Singapore. During 2001, she was with the R&D Division of Singapore Epson Industrial Pvt. Ltd., where she worked on scanner software development and testing. Currently, she is working toward the Ph.D. degree in the Department of Electrical and Computer Engineering at the National University of Singapore. Her research interests are in human gesture understanding applications and also include human-computer interaction, machine learning, and computer vision.

Surendra Ranganath received the B.Tech. degree in Electrical Engineering from the Indian Institute of Technology (Kanpur), the M.E. degree in Electrical Communication Engineering from the Indian Institute of Science (Bangalore), and the Ph.D. degree in Electrical Engineering from the University of California (Davis). From 1982 to 1985, he was with the Applied Research Group at Tektronix, Inc., Beaverton, OR, where he worked in the area of digital video processing for enhanced and high definition TV. From 1986 to 1991, he was with the medical imaging group at Philips Laboratories, Briarcliff Manor, NY. In 1991, he joined the Department of Electrical and Computer Engineering at the National University of Singapore, where he is currently an Associate Professor. His research interests are in digital image processing, computer vision, and machine learning, with a focus on human-computer interaction and video understanding applications.


More information

Reconstructing 3D Pose and Motion from a Single Camera View

Reconstructing 3D Pose and Motion from a Single Camera View Reconstructing 3D Pose and Motion from a Single Camera View R Bowden, T A Mitchell and M Sarhadi Brunel University, Uxbridge Middlesex UB8 3PH richard.bowden@brunel.ac.uk Abstract This paper presents a

More information

A Genetic Algorithm-Evolved 3D Point Cloud Descriptor

A Genetic Algorithm-Evolved 3D Point Cloud Descriptor A Genetic Algorithm-Evolved 3D Point Cloud Descriptor Dominik Wȩgrzyn and Luís A. Alexandre IT - Instituto de Telecomunicações Dept. of Computer Science, Univ. Beira Interior, 6200-001 Covilhã, Portugal

More information

Binary Image Scanning Algorithm for Cane Segmentation

Binary Image Scanning Algorithm for Cane Segmentation Binary Image Scanning Algorithm for Cane Segmentation Ricardo D. C. Marin Department of Computer Science University Of Canterbury Canterbury, Christchurch ricardo.castanedamarin@pg.canterbury.ac.nz Tom

More information

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson Talk held by Goran Doychev Selected Topics in Information Security and

More information

Solutions to old Exam 1 problems

Solutions to old Exam 1 problems Solutions to old Exam 1 problems Hi students! I am putting this old version of my review for the first midterm review, place and time to be announced. Check for updates on the web site as to which sections

More information

Using Lexical Similarity in Handwritten Word Recognition

Using Lexical Similarity in Handwritten Word Recognition Using Lexical Similarity in Handwritten Word Recognition Jaehwa Park and Venu Govindaraju Center of Excellence for Document Analysis and Recognition (CEDAR) Department of Computer Science and Engineering

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

Understanding Purposeful Human Motion

Understanding Purposeful Human Motion M.I.T Media Laboratory Perceptual Computing Section Technical Report No. 85 Appears in Fourth IEEE International Conference on Automatic Face and Gesture Recognition Understanding Purposeful Human Motion

More information

HANDS-FREE PC CONTROL CONTROLLING OF MOUSE CURSOR USING EYE MOVEMENT

HANDS-FREE PC CONTROL CONTROLLING OF MOUSE CURSOR USING EYE MOVEMENT International Journal of Scientific and Research Publications, Volume 2, Issue 4, April 2012 1 HANDS-FREE PC CONTROL CONTROLLING OF MOUSE CURSOR USING EYE MOVEMENT Akhil Gupta, Akash Rathi, Dr. Y. Radhika

More information

Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

More information

Accelerometer Based Real-Time Gesture Recognition

Accelerometer Based Real-Time Gesture Recognition POSTER 2008, PRAGUE MAY 15 1 Accelerometer Based Real-Time Gesture Recognition Zoltán PREKOPCSÁK 1 1 Dept. of Telecomm. and Media Informatics, Budapest University of Technology and Economics, Magyar tudósok

More information

A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow

A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow , pp.233-237 http://dx.doi.org/10.14257/astl.2014.51.53 A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow Giwoo Kim 1, Hye-Youn Lim 1 and Dae-Seong Kang 1, 1 Department of electronices

More information

Simultaneous Gamma Correction and Registration in the Frequency Domain

Simultaneous Gamma Correction and Registration in the Frequency Domain Simultaneous Gamma Correction and Registration in the Frequency Domain Alexander Wong a28wong@uwaterloo.ca William Bishop wdbishop@uwaterloo.ca Department of Electrical and Computer Engineering University

More information

Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

More information

CS231M Project Report - Automated Real-Time Face Tracking and Blending

CS231M Project Report - Automated Real-Time Face Tracking and Blending CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, slee2010@stanford.edu June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android

More information

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph

More information

Tutorial on Markov Chain Monte Carlo

Tutorial on Markov Chain Monte Carlo Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,

More information

Thnkwell s Homeschool Precalculus Course Lesson Plan: 36 weeks

Thnkwell s Homeschool Precalculus Course Lesson Plan: 36 weeks Thnkwell s Homeschool Precalculus Course Lesson Plan: 36 weeks Welcome to Thinkwell s Homeschool Precalculus! We re thrilled that you ve decided to make us part of your homeschool curriculum. This lesson

More information

SOLID MECHANICS TUTORIAL MECHANISMS KINEMATICS - VELOCITY AND ACCELERATION DIAGRAMS

SOLID MECHANICS TUTORIAL MECHANISMS KINEMATICS - VELOCITY AND ACCELERATION DIAGRAMS SOLID MECHANICS TUTORIAL MECHANISMS KINEMATICS - VELOCITY AND ACCELERATION DIAGRAMS This work covers elements of the syllabus for the Engineering Council exams C105 Mechanical and Structural Engineering

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Jiří Matas. Hough Transform

Jiří Matas. Hough Transform Hough Transform Jiří Matas Center for Machine Perception Department of Cybernetics, Faculty of Electrical Engineering Czech Technical University, Prague Many slides thanks to Kristen Grauman and Bastian

More information

AUTOMATIC EVOLUTION TRACKING FOR TENNIS MATCHES USING AN HMM-BASED ARCHITECTURE

AUTOMATIC EVOLUTION TRACKING FOR TENNIS MATCHES USING AN HMM-BASED ARCHITECTURE AUTOMATIC EVOLUTION TRACKING FOR TENNIS MATCHES USING AN HMM-BASED ARCHITECTURE Ilias Kolonias, William Christmas and Josef Kittler Centre for Vision, Speech and Signal Processing University of Surrey,

More information

A Segmentation Algorithm for Zebra Finch Song at the Note Level. Ping Du and Todd W. Troyer

A Segmentation Algorithm for Zebra Finch Song at the Note Level. Ping Du and Todd W. Troyer A Segmentation Algorithm for Zebra Finch Song at the Note Level Ping Du and Todd W. Troyer Neuroscience and Cognitive Science Program, Dept. of Psychology University of Maryland, College Park, MD 20742

More information

PHYSIOLOGICALLY-BASED DETECTION OF COMPUTER GENERATED FACES IN VIDEO

PHYSIOLOGICALLY-BASED DETECTION OF COMPUTER GENERATED FACES IN VIDEO PHYSIOLOGICALLY-BASED DETECTION OF COMPUTER GENERATED FACES IN VIDEO V. Conotter, E. Bodnari, G. Boato H. Farid Department of Information Engineering and Computer Science University of Trento, Trento (ITALY)

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

How To Analyze Ball Blur On A Ball Image

How To Analyze Ball Blur On A Ball Image Single Image 3D Reconstruction of Ball Motion and Spin From Motion Blur An Experiment in Motion from Blur Giacomo Boracchi, Vincenzo Caglioti, Alessandro Giusti Objective From a single image, reconstruct:

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

COMPUTING CLOUD MOTION USING A CORRELATION RELAXATION ALGORITHM Improving Estimation by Exploiting Problem Knowledge Q. X. WU

COMPUTING CLOUD MOTION USING A CORRELATION RELAXATION ALGORITHM Improving Estimation by Exploiting Problem Knowledge Q. X. WU COMPUTING CLOUD MOTION USING A CORRELATION RELAXATION ALGORITHM Improving Estimation by Exploiting Problem Knowledge Q. X. WU Image Processing Group, Landcare Research New Zealand P.O. Box 38491, Wellington

More information

Advanced Signal Processing and Digital Noise Reduction

Advanced Signal Processing and Digital Noise Reduction Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Normalisation of 3D Face Data

Normalisation of 3D Face Data Normalisation of 3D Face Data Chris McCool, George Mamic, Clinton Fookes and Sridha Sridharan Image and Video Research Laboratory Queensland University of Technology, 2 George Street, Brisbane, Australia,

More information

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006. Principal Components Null Space Analysis for Image and Video Classification

1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006. Principal Components Null Space Analysis for Image and Video Classification 1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006 Principal Components Null Space Analysis for Image and Video Classification Namrata Vaswani, Member, IEEE, and Rama Chellappa, Fellow,

More information

An Iterative Image Registration Technique with an Application to Stereo Vision

An Iterative Image Registration Technique with an Application to Stereo Vision An Iterative Image Registration Technique with an Application to Stereo Vision Bruce D. Lucas Takeo Kanade Computer Science Department Carnegie-Mellon University Pittsburgh, Pennsylvania 15213 Abstract

More information

On Correlating Performance Metrics

On Correlating Performance Metrics On Correlating Performance Metrics Yiping Ding and Chris Thornley BMC Software, Inc. Kenneth Newman BMC Software, Inc. University of Massachusetts, Boston Performance metrics and their measurements are

More information

CHAPTER 5 PREDICTIVE MODELING STUDIES TO DETERMINE THE CONVEYING VELOCITY OF PARTS ON VIBRATORY FEEDER

CHAPTER 5 PREDICTIVE MODELING STUDIES TO DETERMINE THE CONVEYING VELOCITY OF PARTS ON VIBRATORY FEEDER 93 CHAPTER 5 PREDICTIVE MODELING STUDIES TO DETERMINE THE CONVEYING VELOCITY OF PARTS ON VIBRATORY FEEDER 5.1 INTRODUCTION The development of an active trap based feeder for handling brakeliners was discussed

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Path Tracking for a Miniature Robot

Path Tracking for a Miniature Robot Path Tracking for a Miniature Robot By Martin Lundgren Excerpt from Master s thesis 003 Supervisor: Thomas Hellström Department of Computing Science Umeå University Sweden 1 Path Tracking Path tracking

More information

Thresholding technique with adaptive window selection for uneven lighting image

Thresholding technique with adaptive window selection for uneven lighting image Pattern Recognition Letters 26 (2005) 801 808 wwwelseviercom/locate/patrec Thresholding technique with adaptive window selection for uneven lighting image Qingming Huang a, *, Wen Gao a, Wenjian Cai b

More information

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network Proceedings of the 8th WSEAS Int. Conf. on ARTIFICIAL INTELLIGENCE, KNOWLEDGE ENGINEERING & DATA BASES (AIKED '9) ISSN: 179-519 435 ISBN: 978-96-474-51-2 An Energy-Based Vehicle Tracking System using Principal

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan Handwritten Signature Verification ECE 533 Project Report by Ashish Dhawan Aditi R. Ganesan Contents 1. Abstract 3. 2. Introduction 4. 3. Approach 6. 4. Pre-processing 8. 5. Feature Extraction 9. 6. Verification

More information

Myanmar Continuous Speech Recognition System Based on DTW and HMM

Myanmar Continuous Speech Recognition System Based on DTW and HMM Myanmar Continuous Speech Recognition System Based on DTW and HMM Ingyin Khaing Department of Information and Technology University of Technology (Yatanarpon Cyber City),near Pyin Oo Lwin, Myanmar Abstract-

More information

degrees of freedom and are able to adapt to the task they are supposed to do [Gupta].

degrees of freedom and are able to adapt to the task they are supposed to do [Gupta]. 1.3 Neural Networks 19 Neural Networks are large structured systems of equations. These systems have many degrees of freedom and are able to adapt to the task they are supposed to do [Gupta]. Two very

More information

How To Segmentate An Image

How To Segmentate An Image Edge Strength Functions as Shape Priors in Image Segmentation Erkut Erdem, Aykut Erdem, and Sibel Tari Middle East Technical University, Department of Computer Engineering, Ankara, TR-06531, TURKEY, {erkut,aykut}@ceng.metu.edu.tr,

More information

Lecture L6 - Intrinsic Coordinates

Lecture L6 - Intrinsic Coordinates S. Widnall, J. Peraire 16.07 Dynamics Fall 2009 Version 2.0 Lecture L6 - Intrinsic Coordinates In lecture L4, we introduced the position, velocity and acceleration vectors and referred them to a fixed

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Bayesian Network Modeling of Hangul Characters for On-line Handwriting Recognition

Bayesian Network Modeling of Hangul Characters for On-line Handwriting Recognition Bayesian Network Modeling of Hangul haracters for On-line Handwriting Recognition Sung-ung ho and in H. Kim S Div., EES Dept., KAIS, 373- Kusong-dong, Yousong-ku, Daejon, 305-70, KOREA {sjcho, jkim}@ai.kaist.ac.kr

More information

The Visual Internet of Things System Based on Depth Camera

The Visual Internet of Things System Based on Depth Camera The Visual Internet of Things System Based on Depth Camera Xucong Zhang 1, Xiaoyun Wang and Yingmin Jia Abstract The Visual Internet of Things is an important part of information technology. It is proposed

More information