Classifying Manipulation Primitives from Visual Data

Sandy Huang and Dylan Hadfield-Menell

Abstract

One approach to learning from demonstrations in robotics is to use a classifier to predict whether an end effector trajectory will successfully accomplish a particular manipulation primitive. In this project, we define this classification problem and introduce modified shape context features adapted to this setting. We generate a dataset of supervised examples for a rope tying problem and explore classification performance with these features. We outperform a baseline feature representation that adapts state-of-the-art object recognition and pose estimation features.

1 Introduction

Robots often have to perform a given task in a variety of scenarios, making it impractical or even impossible to individually specify trajectories a priori. Learning from demonstrations is an approach that enables robots to generalize from demonstrations of manipulation tasks to new situations. These manipulation tasks can be modeled as sequences of manipulation primitives, where each manipulation primitive is an atomic step of the task. For example, one manipulation primitive in knot tying is moving one end of the rope over the other.

We are interested in making learning from demonstrations more robust by predicting whether a given trajectory in a scenario will successfully accomplish a particular motion primitive. Our approach centers on classifying a trajectory-scenario pair into predefined motion primitive classes. A major challenge in building such a classifier is determining a relevant feature representation of trajectory-scenario pairs. We experiment with two types of feature representations. Our baseline representation uses Viewpoint Feature Histograms (VFH) to capture a global representation of the initial scene, and uses relative changes in end effector position and rotation as a position-invariant representation of the trajectory. We also propose using features in the frame of the end effector, in particular shape context features, because they prioritize the area of the scene nearest the end effector. Since grasping and other motions of the end effector are essential characteristics of manipulation primitives, it is reasonable to place more emphasis on similarity in areas close to the end effector.

1.1 Problem Statement

We assume that there exists a library of predefined manipulation primitives. Given the point cloud of an initial scene and an end effector trajectory, in the form of a position and rotation for a sequence of time steps, we would like to predict which manipulation primitive, if any, this trajectory will perform successfully in this scene.
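To make the inputs and output concrete, the following is a minimal sketch of the data layout we assume for one labelled example; the field names and types are ours, chosen for illustration rather than prescribed by any particular interface.

from dataclasses import dataclass
import numpy as np

@dataclass
class LabeledExample:
    cloud: np.ndarray       # (N, 3) points of the initial scene
    trajectory: np.ndarray  # (T, 4, 4) gripper pose (rotation + translation) per time step
    label: int              # index into the primitive library; -1 for a failed attempt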
2 Feature Design

In this section, we begin with a discussion of the characteristics of good features for this problem. We then describe the features we propose and motivate how they match these characteristics. We follow that with a description of the baseline features.

In designing features for this task, there is significant overlap with the tasks of object recognition and pose estimation. Most of this overlap stems from the fact that we are interested in deformable object manipulation. Many manipulation primitives will only be successful for a certain set of configurations of the object. For example, the final step of tying an overhand knot is only possible if the rope is already in a particular state with overlapping loops. When incorporating the trajectory, we need to pay attention to the relative pose of the gripper with respect to the object.

Good features for this task should be invariant to Euclidean transformations of both the object and the trajectory. Furthermore, we would like features to represent the object at different levels of granularity depending on the manipulation trajectory. While the success of a manipulation is likely related to the particular configuration of the parts of the object that are close to the gripper, changes in the object configuration farther away have less impact on the manipulation, so we would like to be robust to these changes.

2.1 Trajectory Shape Context Features

The features we developed for this task are variations of 3D shape context features [1]. These are local descriptors that provide a highly detailed description of an object's local geometry and a coarse global representation of the object's overall shape. This is done by placing a local polar coordinate frame at a basis point and computing a histogram of the distances and angles to the rest of the points on the object.

One way to apply this to manipulation would be to use shape context features to describe the object and then incorporate trajectories on top of that. We go a step further: in selecting basis points for computing shape context features, we use the intermediate points along the trajectory. This enables us to capture the overall shape of the object we are manipulating as well as its position relative to the initial object configuration. Another positive aspect of these features is that they are more sensitive to changes in parts of the object that are close to our trajectory. This lets us be robust to changes that are far from the actual manipulation but sensitive to local geometry.

In practice, computing these features is done in a few steps. Our trajectories are represented as sequences of homogeneous transformation matrices that correspond to a rotation and translation of the gripper with respect to a fixed point on our robot. For a particular gripper pose, with transformation matrix $H$, we first rotate and translate the point cloud to be in the reference frame of the gripper at that time. This corresponds to multiplying the point cloud by $H^{-1}$. Then we convert our points to polar coordinates and count weighted occurrences of points. The weight we use is the same as in the original 3D shape context: $w_p = \frac{1}{d_p \sqrt[3]{v_b}}$, where $d_p$ is the local point density around $p$ and $v_b$ is the volume of the bin we are placing $p$ into. In doing this binning, we are able to take advantage of one simplification: we have a well-specified coordinate frame, so we do not need (and in fact do not want) to be invariant to rotations about the azimuth direction.
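As an illustration, the following is a minimal numpy sketch of this computation for a single gripper pose. The bin counts, density radius, and function name are our own assumptions, and a practical implementation would replace the naive O(N^2) density estimate with a spatial index.

import numpy as np

def shape_context_at_pose(cloud, H, r_edges, n_theta=8, n_phi=8, rho_radius=0.02):
    # Express the scene in the gripper frame at this pose: p' = H^-1 p
    pts = np.c_[cloud, np.ones(len(cloud))] @ np.linalg.inv(H).T
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    r = np.sqrt(x * x + y * y + z * z)
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-12), -1.0, 1.0))  # inclination, [0, pi]
    phi = np.arctan2(y, x)                                           # azimuth, (-pi, pi]
    # Bin indices; azimuth bins stay distinct because the gripper frame is well specified
    ri = np.searchsorted(r_edges, r, side='right') - 1
    ti = np.minimum((theta / np.pi * n_theta).astype(int), n_theta - 1)
    fi = np.minimum(((phi + np.pi) / (2 * np.pi) * n_phi).astype(int), n_phi - 1)
    # Local point density d_p: neighbour count within rho_radius (naive O(N^2))
    dens = (np.linalg.norm(cloud[:, None] - cloud[None], axis=-1) < rho_radius).sum(axis=1)
    hist = np.zeros((len(r_edges) - 1, n_theta, n_phi))
    d_phi = 2.0 * np.pi / n_phi
    for k in range(len(cloud)):
        if ri[k] < 0 or ri[k] >= hist.shape[0]:
            continue  # point falls outside the radial shells
        th_lo, th_hi = ti[k] * np.pi / n_theta, (ti[k] + 1) * np.pi / n_theta
        # Exact volume of this spherical-sector bin
        v_b = (r_edges[ri[k] + 1] ** 3 - r_edges[ri[k]] ** 3) / 3.0 \
              * (np.cos(th_lo) - np.cos(th_hi)) * d_phi
        # Weight from the original 3D shape context: w_p = 1 / (d_p * cbrt(v_b))
        hist[ri[k], ti[k], fi[k]] += 1.0 / (dens[k] * v_b ** (1.0 / 3.0))
    return hist.ravel()

The per-pose histograms are then concatenated over the sub-sampled poses of the trajectory to form the final feature vector.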
2.2 Baseline Features

For our baseline, we combine Viewpoint Feature Histograms (VFH) [2] with a reasonable representation of the trajectory. These features make use of point cloud normals, so our first step is to estimate the normal direction at each of our points. Because we downsample the point cloud, we expect this estimation process to be robust to noise. We use the normal estimation utilities included in the Point Cloud Library, which take a search radius as an argument and estimate each normal via PCA on the local covariance matrix [3].

Once we have normals, VFH builds a representation of the view of the object by looking at the relative pitch, roll, and yaw of the surface normals with respect to a vector from the camera center to the centroid of the object. A global shape descriptor is then built from the relative pitch, roll, and yaw between each surface normal in a patch and the vector from the center of the patch to the camera center. These features have been very successful in object recognition and pose estimation for robotics applications.

To include information about the trajectory, we use the relative change in end effector position across several time steps. These differences are computed in the reference frame of the initial end effector pose to be robust to rotations. Finally, we include the distance from this initial position to the centroid of the object. This gives us a representation that is invariant to joint rotations of the object and trajectory, although it does not take into account the relation of the object's shape to the end effector.
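As a sketch of the trajectory half of this baseline (the VFH half comes from the Point Cloud Library and is omitted), the following shows one way to assemble these features; the function name and the choice of time steps are illustrative assumptions.

import numpy as np

def baseline_trajectory_features(poses, cloud, steps=(0, 5, 10, 15)):
    """Relative end effector displacements in the initial gripper frame,
    plus the distance from the initial position to the object centroid.
    poses: (T, 4, 4) gripper poses; steps: illustrative time step indices."""
    H0_inv = np.linalg.inv(poses[0])
    feats = []
    for t in steps[1:]:
        rel = H0_inv @ poses[t]      # pose at time t, expressed in the initial frame
        feats.extend(rel[:3, 3])     # translation component of the relative pose
    centroid = cloud.mean(axis=0)
    feats.append(np.linalg.norm(poses[0][:3, 3] - centroid))
    return np.asarray(feats)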
3 Results

Our experiments deal with the manipulation task of tying a knot in a rope. We were able to simulate end effector trajectories in arbitrary scenarios, and we used this to construct a labelled dataset containing 297 simulated steps of knot tying. Of the 297 examples, 203 fell under one of the six motion primitives in Figure 1, and the remaining 94 were failed attempts at one of these six motion primitives. We defined these six manipulation primitives manually, by observing each simulated trajectory and labeling it as a new manipulation primitive if it did not resemble any of the existing ones. Each item in our dataset consists of a point cloud of the initial scene, the end effector position and rotation at each time step of the trajectory, and the corresponding manipulation primitive label.

We use an SVM as our classifier, implemented with the LIBSVM library [4]. We ran experiments in several settings to test different hyper-parameter values and holdout sizes, using both 90-10 cross validation and leave-one-out cross validation (LOOCV) (Table 1). For 90-10 cross validation, we train the SVM on a random 90% of the data, test it on the remaining 10%, and average the results across 20 runs. Table 1 shows the error rate for both sets of features, as well as the error rate from always guessing the most likely class. Varying the hyper-parameter C significantly impacts the error rate; C = 0.01 works best for both 90-10 cross validation and LOOCV.

4 Limitations and Future Work

When standardizing the length of trajectories, sub-selecting at regular intervals may miss key characteristics of the trajectory, especially those that cause the trajectory to fail at accomplishing its desired manipulation primitive. For example, sub-selecting may skip over small jerks in the trajectory or the moment when the end effector grasps the object. An alternative approach would be to use max or average pooling to aggregate information between the sub-sampled trajectory points (see the sketch at the end of this section).

Another issue with our approach stems from the fact that our feature descriptors are very high dimensional. In order to have good resolution for the shape context features, we need over a thousand bins per time step. Given that our dataset has only 297 examples, there is a serious danger of overfitting. For our experiments, we mitigated this by varying hyper-parameters and using a linear classifier. With more data, we would be able to relax this and potentially loosen our regularization.

Generating a large dataset of labelled examples in this setting is a difficult challenge. Unlike standard classification, we cannot use crowd-sourced solutions: classifying these scenarios as a human requires knowledge of the task being performed and a somewhat sophisticated, task-specific concept of failure. As such, we think this is a good candidate for semi-supervised learning.
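The following is a minimal sketch contrasting the two standardization options discussed above, applied to the (T, 3) sequence of end effector positions. Pooling rotations properly would require, e.g., quaternion averaging, which we omit here; the function names are ours.

import numpy as np

def subsample(positions, k):
    """Current approach: pick k positions at regular intervals."""
    idx = np.linspace(0, len(positions) - 1, k).astype(int)
    return positions[idx]

def average_pool(positions, k):
    """Alternative: average over k equal segments, so brief events (e.g. a
    small jerk) still influence the result; seg.max(axis=0) gives max pooling."""
    return np.stack([seg.mean(axis=0) for seg in np.array_split(positions, k)])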
References

[1] Andrea Frome, Daniel Huber, Ravi Kolluri, Thomas Bülow, and Jitendra Malik. Recognizing objects in range data using regional point descriptors. In Computer Vision - ECCV 2004, pages 224-237. Springer, 2004.

[2] Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, and John Hsu. Fast 3D recognition and pose using the Viewpoint Feature Histogram. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2155-2162. IEEE, 2010.

[3] Radu Bogdan Rusu. Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Computer Science Department, Technische Universität München, Germany, October 2009.

[4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

5 Tables and Figures

Trial                  Max Category   VFH     Trajectory Shape Context
90-10% CV; C = 0.1     68.4%          24.6%   24.6%
90-10% CV; C = 0.01    68.4%          18.6%   17.5%
90-10% CV; C = 0.001   68.4%          20.8%   20.0%
LOOCV; C = 1           68.4%          25.6%   25.5%
LOOCV; C = 0.01        68.4%          18.8%   18.8%

Table 1: Error rates for the various classification approaches. VFH and Trajectory Shape Context use an SVM; different trials correspond to different experimental setups. Max Category is the error rate that results from always guessing the most common category. Both feature sets show a significant improvement over Max Category, with the best overall results achieved by Trajectory Shape Context. The sensitivity to the hyper-parameter C suggests potential over-fitting, which is a concern given the limited amount of data we have.

Figure 1: Manipulation primitives in rope tying. We manually identified six unique manipulation primitives in our robot's approach to tying knots. Each number in the figure denotes the motion (blue arrow) corresponding to one of these manipulation primitives. For the sixth manipulation primitive, the robot uses both end effectors to grasp the rope at each blue arrow and pull in those directions.