Classification of poses and movement phases

Adam Świtoński (1,2), Henryk Josiński (1,2), Karol Jedrasiak (1), Andrzej Polański (1,2), and Konrad Wojciechowski (1,2)

(1) Polish-Japanese Institute of Information Technology, Aleja Legionów, Bytom, Poland
{aswitonski, apolanski, hjosinski, kwojciechowski}@pjwstk.edu.pl
(2) Silesian University of Technology, ul. Akademicka, Gliwice, Poland
{adam.switonski, henryk.josinski, andrzej.polanski, konrad.wojciechowski}@polsl.pl

Abstract. We address the classification of motion frames representing different poses using supervised machine learning and dimensionality reduction techniques. We manually extracted motion frames from a global database, divided them into six classes, and applied classifiers to detect the pose type automatically. We used statistical Bayes, neural network, random forest and Kernel PCA classifiers over a wide range of parameters. We ran the classification on the original data frames and, additionally, on frames whose dimensionality was reduced by PCA and Kernel PCA. The results are satisfactory: in the best case, classifier efficiency reaches 100 percent.

1 Introduction

Motion databases consist of very large amounts of data. They store hundreds of motions, and each motion is a long sequence of frames, usually captured at a frequency of at least 1Hz. Motion data usually come directly from mocap devices [8] or can be estimated from static 2D images; [6] proposes such a method based on Markov chain Monte Carlo. In practice, such databases cannot be analyzed and searched manually.

One of the first tasks in automatic motion analysis could be pose detection, that is, assigning a pose type to each frame. This is useful for database search: labeled frames let us build database query criteria, for instance "find a motion in which a human is sitting". Pose identification also has medical applications.
It could be used for the automatic detection of improper poses typical of certain diseases. Finally, pose identification could support further automatic motion analysis such as segmentation: the boundaries of motion segments could be placed at the moments when the pose changes.

Comparing motion frames is not a trivial problem. A frame is described by the positions of special markers located at the joints. It can be represented directly by the position of each marker in the 3D global coordinate system, but the most commonly used representation is a kinematic chain, which has the form of a tree structure. The root object is placed at the top of the tree and is described by its position in the global coordinate system. Child objects are connected to their parents and store their transformation relative to the parent. Both formats contain exactly the same data, but the advantage of the kinematic chain is that identical poses captured in different places have almost the same numerical representation, except for the root object, whereas their representations in the 3D global coordinate system are completely different [8].

Pose similarity measures can be built on top of the frame format. The distance could be an aggregation of the distances between each pair of corresponding markers in the 3D global coordinate system, but this has the disadvantage described above, so it is not used in practice. The authors of [4] propose a 3D point cloud distance measure. First, they build point clouds for the compared frames and their temporal context. Next, they find the global transformation that matches the two clouds, and finally they calculate the sum of distances between corresponding points of the matched clouds. In [5], the clouds are built from a downsampled frame representation, which avoids focusing on pose details.

In the kinematic chain format, transformations are usually coded with unit quaternions, so the pose distance can be evaluated as a sum of quaternion distances. In [3], the frame distance is a weighted sum of quaternion distances, because the influence of a transformation on the pose depends on the joint. [7] proposes binary relational motion features as a pose description. A relational feature is enabled if given joints and bones are in a defined relation: the left knee is behind the right knee, the right ankle is higher than the left knee, and so on. A set of such features describes a pose by a binary vector, and the pose distance can be calculated as a distance metric between these descriptor vectors.
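The weighted sum of quaternion distances mentioned above can be illustrated with a minimal sketch. The text does not specify the per-joint quaternion distance or the weights, so the rotation angle between two unit quaternions is used here as an assumption, and the function names, weights and example poses are hypothetical.

```python
import math

def quaternion_distance(q1, q2):
    """Rotation angle in radians between two unit quaternions (w, x, y, z).

    Assumed metric: the text only says "quaternion distance".
    """
    # q and -q encode the same rotation, hence the absolute value.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    # Clamp to 1.0 to guard against rounding slightly above the acos domain.
    return 2.0 * math.acos(min(1.0, dot))

def pose_distance(pose_a, pose_b, weights=None):
    """Weighted sum of per-joint quaternion distances between two poses."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must describe the same joints")
    if weights is None:
        weights = [1.0] * len(pose_a)  # unweighted sum by default
    return sum(w * quaternion_distance(qa, qb)
               for w, qa, qb in zip(weights, pose_a, pose_b))

# Hypothetical two-joint poses: the second joint differs by a quarter turn about z.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
pose_a = [identity, identity]
pose_b = [identity, quarter_turn_z]

# Hypothetical weights: the second joint is treated as less influential.
d = pose_distance(pose_a, pose_b, weights=[2.0, 0.5])
```

With these example weights only the second joint contributes, so the distance is half of a quarter-turn angle. Changing the weights changes which joints dominate the comparison, which is the reason the weighted variant is described in the text.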
The basic problem with relational motion features is choosing a proper set of features to distinguish between different kinds of poses. It is very difficult to prepare a single set of features applicable to the recognition of every kind of pose. Features are usually dedicated to specialized detections and, because they are relatively easy to interpret, they are prepared by medical experts who know the meaning of given joint and bone dependencies. Large feature vectors can be generated from the generic feature set proposed in [7], but because it is difficult to point out the significant features, this leads to long pose descriptions and redundant data.

The problem of recognizing a pose type is much more general than evaluating the similarity of two poses. A single pose type can be represented by different, dissimilar poses. First, a pose type has different phases; for example, jumping can be divided into starting, flying and landing phases. Second, the same pose type can have different characters: a fast run and a slow run generate different pose frames. Each pose is represented by the locations of tens of markers, so manually discovering the dependencies within each pose and finding numerical boundaries of given pose representations is almost impossible. For this reason we decided to use machine learning techniques, which are able to explore data, find dependencies and generalize knowledge. We tested supervised learning with a pose distance metric based on the tree-like representation. We also tested linear and nonlinear dimensionality reduction methods to shorten the pose description.

2 Pose Database

We prepared the pose data from the Carnegie Mellon University Motion Capture Database [2]. We analyzed motion clips and selected pose frames of six pose types: climbing, jumping, running, sitting, standing and walking. Each pose type contains a wide range of instances, taken from different movements and different movement phases. Finally, we labeled the pose frames. Example poses are shown in Fig. 1.

Fig. 1. Randomly selected poses from the prepared test database. Successive rows represent the climbing, jumping, running, sitting, standing and walking pose types.

Each pose is described by six root attributes, giving the location and orientation in the global coordinate system, and 56 relative attributes describing twenty-six body parts in a tree-like structure. The number of values describing a given body part depends on its degrees of freedom. In the preprocessing step we removed the root attributes, to prevent the classifiers from learning the pose type from the location and orientation in the global coordinate system. The data originate from tens of different motion clips, and pose instances are usually located in different places, which could make it easier for a classifier to learn from the frame location instead of the actual pose state. At the current stage we have decided not to add pose dynamics attributes such as velocities and accelerations.
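The preprocessing step described above can be sketched as follows. The attribute counts come from the text; the ordering of values within a frame is not stated, so placing the six root attributes first is an assumption, and the function name and example frames are hypothetical.

```python
ROOT_ATTRIBUTES = 6       # root location and orientation (count from the text)
RELATIVE_ATTRIBUTES = 56  # values relative to parent body parts (count from the text)

def strip_root(frame):
    """Return only the relative attributes of one pose frame.

    Assumes the root attributes are stored first in the frame vector.
    """
    expected = ROOT_ATTRIBUTES + RELATIVE_ATTRIBUTES
    if len(frame) != expected:
        raise ValueError(f"expected {expected} values, got {len(frame)}")
    return list(frame[ROOT_ATTRIBUTES:])

# Hypothetical frames: the same pose captured at two different places.
pose_values = [0.1 * i for i in range(RELATIVE_ATTRIBUTES)]
frame_here = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] + pose_values
frame_there = [5.0, 1.0, -3.0, 90.0, 0.0, 0.0] + pose_values

# After stripping, the two frames are indistinguishable to a classifier.
same_after_stripping = strip_root(frame_here) == strip_root(frame_there)
```

The two example frames differ only in their root values, so a classifier trained on the stripped vectors cannot use capture location as a shortcut.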

To reduce the computational complexity of the machine learning methods, we prepared a test set with only 2 randomly selected pose frames.

3 Classification

First, we tested supervised learning methods on the raw data containing all 56 relative attributes. We used cross-validation to split our test set into train and test parts and focused on classifier efficiency, meaning the percentage of correctly classified poses of the test sets. We chose the following classifiers: Naive Bayes [1] with normal and kernel-based density estimators, knn [1] with the number of analyzed nearest neighbors ranging from 1 to 1, Random Forest [1] with various numbers of features, and MultiLayer Perceptron [1] with various numbers of hidden layer neurons and epochs and several learning rates.

Fig. 2. Classification results. Panels: Naive Bayes (normal and kernel density estimation), knn (by k), Random Forest (by number of features), MultiLayer Perceptron (by hidden layer neurons).

The efficiencies of all classifiers are over 95 percent, and in the best case, for the neural network classifier, efficiency reaches 100 percent. All results are presented in Fig. 2. Naive Bayes achieved the worst efficiencies, with a significant difference between the normal and the kernel-based density estimators. The advantage of the kernel-based estimator probably means that the assumption of normally distributed pose attributes is not fully accurate; on the other hand, an efficiency of almost 95 percent does not rule out normal distributions.

The knn classifier achieved very good results. Efficiency is inversely related to the number of analyzed nearest neighbors: the more neighbors, the worse the results. This is probably due to the nature of our dataset, in which a few atypical regions of the feature space contain only a weak representation of frames of a given pose type. The nearest neighbor classifier fits the train dataset best regardless of that representation. In spite of that, the efficiency of the 1NN classifier is still acceptable and only a little worse than that of 1NN. There is a slight influence of the number of features on the random forest classifier, but the differences are small and all results are satisfactory.

Overall, the best results are achieved with the neural network classifier. The results improve with the complexity of the network: the greater the complexity, the better the results, although even five hidden layer neurons give an excellent efficiency of over 99 percent. We think this is due to the weak representation mentioned above, which can be approximated better by more complicated networks.

4 Dimensionality reduction

We applied dimensionality reduction methods to shorten the pose descriptions. We then tested the supervised learning methods on the reduced feature space and compared the results with those for the raw data. We used and compared linear Principal Components Analysis [1] and nonlinear Kernel Principal Components Analysis [9], which also let us test the nonlinearity of the feature space. We chose the radial kernel function $K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$ [1], with different sigma values and the Euclidean metric calculated on the normalized and on the raw feature space.

Fig. 3. PCA variance cover (horizontal axis: number of components).

The variance cover of the PCA components shows that there is no short description that stores most of the dataset variance. Three components cover only 26 percent of the global variance, and 90 percent requires more than twenty components. In Fig. 4 we visualize the first three components of PCA and Kernel PCA in the reduced 3D feature space, and in Tab. 1 we present example confusion matrices for this 3D PCA feature space, obtained with the Naive Bayes and Random Forest classifiers. For Kernel PCA we used the kernel function parameters for which we obtained the best classification results.

Fig. 4. Reduced feature space (panels: PCA components, Kernel PCA components). Poses: blue-sitting, red-standing, green-jumping, yellow-climbing, black-walking, cyan-running.

Table 1. Example confusion matrices. 3D PCA feature space, Naive Bayes and Random Forest classifiers. Rows: sitting, standing, jumping, climbing, walking, running; columns: sit, sta, jum, cli, wal, run.

General boundaries between pose classes are noticeable, but there is no accurate, simple distinction between them. Standing and walking poses in particular are mixed together, because slow walking contains phases that look very similar to standing and three values are insufficient to distinguish them. In Kernel PCA, some climbing poses are placed far from the remaining instances. These are probably poses with a strong forward lean, which produce large values of the distance metric to the remaining poses and so affect the kernel function values.

Fig. 5 shows aggregated classification results obtained for the PCA and Kernel PCA reduced feature spaces. We chose the results for the best parameters of each classifier and, in the case of Kernel PCA, for the best pair of classifier and kernel function parameter sets. The results resemble those for the raw feature space. Naive Bayes is the worst, and the kernel density estimator is a bit better than the normal one. For knn the best is 1NN, but the variations are small. The only difference is that the multilayer perceptron is not better than the others and does not need such a complex structure to reach optimal efficiency. Overall the best is 1NN, but Random Forest and the neural network are almost the same. Acceptable results with efficiency over 90 percent need at least a three-dimensional feature space and 95 percent requires five dimensions, but an excellent 99 percent needs only seven dimensions.
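The radial kernel used for Kernel PCA in this section, evaluated on the raw and on the normalized feature space, can be sketched as below. The text does not say how the feature space was normalized, so per-attribute z-score scaling is an assumption; the function names and the toy frames are hypothetical.

```python
import math

def rbf_kernel(x, y, sigma):
    """Radial kernel exp(-||x - y||^2 / (2 * sigma^2)) with the Euclidean metric."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def zscore_normalize(frames):
    """Scale every attribute to zero mean and unit variance (assumed normalization)."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[j] for f in frames) / n for j in range(dims)]
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in frames) / n)
            for j in range(dims)]
    # Constant attributes carry no information and are mapped to 0.
    return [[(f[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(dims)] for f in frames]

def kernel_matrix(frames, sigma):
    """Pairwise kernel values, the input of a Kernel PCA eigen-decomposition."""
    return [[rbf_kernel(a, b, sigma) for b in frames] for a in frames]

# Toy frames whose second attribute has a much larger range than the first.
frames = [[0.0, 100.0], [1.0, 300.0], [2.0, 200.0]]
K_raw = kernel_matrix(frames, sigma=1.0)
K_norm = kernel_matrix(zscore_normalize(frames), sigma=1.0)
```

In this toy case the large-range attribute dominates the raw distances and drives the off-diagonal kernel values to zero for the chosen sigma, while the normalized space keeps them informative. This is one plausible reading of why the normalized metric and the sigma value interact in the reported results.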
There are no significant differences between PCA and Kernel PCA, except for the one-dimensional feature space, which favors PCA. Kernel PCA is slightly better for Naive Bayes but slightly worse for knn and Random Forest.

Fig. 5. PCA and Kernel PCA classification results (horizontal axis: number of components). Series: Naive Bayes (PCA), Naive Bayes (KPCA), knn (PCA), knn (KPCA), Random Forest (PCA), Random Forest (KPCA), MLP (PCA).

The kernel function has a great impact on the classification. In most cases the distance metric calculated on the normalized feature space is favored. There is no noticeable general dependency on the sigma parameter; it differs from case to case.

We also built a classifier based on the first components of Kernel PCA models, each trained on a dataset containing pose frames of only a single pose type. The kernel function depends on the similarity of its arguments: the more similar they are, the greater the value. Thus, the sum of kernel function values calculated against frames of the same pose type could be greater than against frames of another pose type. We decided to assign a pose to the class with the maximum value of the first Kernel PCA component trained on the instances of that class.

Fig. 6. Kernel PCA classifier results for the distance metric calculated on the raw and on the normalized feature space (horizontal axis: sigma).

We obtained over 97 percent classifier efficiency in the best case. Regardless of the sigma value, the distance metric based on the normalized feature space gives better results. The choice of the analyzed Kernel PCA component is open to debate and leaves room to improve the results.

5 Conclusion

We have evaluated supervised learning techniques for the detection of pose type, based only on the locations of body markers. We prepared a test database of 2 pose frames and six pose types. We chose four different classifiers and tested them over a wide range of parameters. The results are very promising: we obtained as much as 100 percent classifier efficiency for the multilayer perceptron with a complex structure. However, this could be an overtrained network, fitted too closely to the train dataset. Although the train and test datasets are disjoint, they come from the same database and are dependent to some degree. In fact, it is not possible to prepare a dataset that covers all possible class regions for data described by 56 attributes. A single pose type can be represented by a very large number of different frames, and some attributes may have no significance, like the position of the hands in the sitting pose.

Dimensionality reduction techniques preserve global information about the pose state. A three-dimensional feature space is sufficient to reveal approximate boundaries between pose types, but better results require more dimensions. For a ten-dimensional space the results are at the level of 99 percent, which is only slightly worse than for the full 56 attributes. Dimensionality reduction generalizes the feature space; thus, it reduces the focus on pose details and strict fitting to the train dataset. We think these results are more reliable.

Our experiments are only an introductory stage toward real applications, which are a more challenging task because of the train set representation issue mentioned above and the larger number of pose types. Our final conclusion is that supervised machine learning techniques are able to recognize pose types.
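The evaluation protocol used throughout the paper, cross-validation with efficiency measured as the percentage of correctly classified test poses, can be sketched with a nearest neighbor classifier. This is an illustration under stated assumptions, not the authors' setup: the number of folds, the random seed, the function names and the synthetic two-attribute data are all hypothetical stand-ins for the 56-attribute pose frames.

```python
import math
import random

def nearest_neighbor_label(train, query):
    """Label of the training frame closest to the query (Euclidean distance)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]

def cross_validated_efficiency(samples, folds=5, seed=0):
    """Percentage of correctly classified samples over disjoint folds."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    correct = 0
    for f in range(folds):
        # Every sample lands in exactly one test fold; train and test are disjoint.
        test = shuffled[f::folds]
        train = [s for i, s in enumerate(shuffled) if i % folds != f]
        correct += sum(nearest_neighbor_label(train, x) == y for x, y in test)
    return 100.0 * correct / len(shuffled)

# Synthetic stand-in data: two well-separated clusters with hypothetical labels.
rng = random.Random(1)
samples = (
    [([rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)], "sitting") for _ in range(20)]
    + [([rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)], "standing") for _ in range(20)]
)
efficiency = cross_validated_efficiency(samples)
```

Folds drawn from the same clips remain statistically dependent even though they are disjoint, which is the caveat raised in the conclusion. Splitting by motion clip rather than by frame would be one way to probe it.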
Acknowledgment

This paper has been supported by the European Regional Development Project "System wraz z biblioteką modułów dla zaawansowanej analizy i interaktywnej syntezy ruchu postaci ludzkiej" (a system with a library of modules for advanced analysis and interactive synthesis of human figure motion), co-financed by the European Union from the European Regional Development Fund within the Operational Programme Innovative Economy, Action 1.3, Sub-action.

References

1. Boser B. E., Guyon I. M., Vapnik V.: A training algorithm for optimal margin classifiers. Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 1992
2. Carnegie-Mellon Mocap Database
3. Johnson M.: Exploiting Quaternions to Support Expressive Interactive Character Motion. PhD thesis, Massachusetts Institute of Technology
4. Kovar L., Gleicher M.: Flexible automatic motion blending with registration curves. Proc. 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2003)
5. Kovar L., Gleicher M., Pighin F.: Motion graphs. ACM Trans. Graph. (2002)
6. Lee M. W., Cohen I.: A Model-Based Approach for Estimating Human 3D Poses in Static Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 6
7. Müller M., Röder T.: A Relational Approach to Content-based Analysis of Motion Capture Data. Vol. 36 of Computational Imaging and Vision, ch. 2
8. Röder T.: Similarity, Retrieval, and Classification of Motion Capture Data. PhD thesis, Massachusetts Institute of Technology
9. Schoelkopf B., Smola A., Mueller K.-R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Technical Report No. 44, Max-Planck-Institut fuer biologische Kybernetik
10. Witten I., Frank E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005