Classifying Manipulation Primitives from Visual Data




Sandy Huang and Dylan Hadfield-Menell

Abstract

One approach to learning from demonstrations in robotics is to use a classifier to predict whether an end effector trajectory will successfully accomplish a particular manipulation primitive. In this project, we define this classification problem and introduce modified shape context features adapted to this setting. We generate a dataset of supervised examples for a rope tying problem and explore classification performance with these features. We outperform a baseline feature representation that adapts state-of-the-art object recognition and pose estimation features.

1 Introduction

Robots often have to perform a given task in a variety of scenarios, making it impractical or even impossible to specify trajectories individually a priori. Learning from demonstrations is an approach that enables robots to generalize from demonstrations of manipulation tasks to new situations. These manipulation tasks can be modeled as sequences of manipulation primitives, where each primitive is an atomic step of the task. For example, one manipulation primitive when tying a knot is moving one end of the rope over the other.

We are interested in making learning from demonstrations more robust by predicting whether a given trajectory in a scenario will successfully accomplish a particular motion primitive. Our approach centers on classifying a trajectory-scenario pair into predefined motion primitive classes.

A major challenge in building such a classifier is determining a relevant feature representation of trajectory-scenario pairs. We experiment with two types of feature representations. Our baseline representation uses Viewpoint Feature Histograms (VFH) to capture a global representation of the initial scene, and uses relative changes in end effector position and rotation as a position-invariant representation of the trajectory. We also propose using features in the frame of the end effector, in particular shape context features, because they prioritize the area of the scene nearest the end effector. Since grasping and other motions of the end effector are essential characteristics of manipulation primitives, it is reasonable to place more emphasis on similarity in areas close to the end effector.

1.1 Problem Statement

We assume that there exists a library of predefined manipulation primitives. Given the point cloud of an initial scene and an end effector trajectory in the form of a position and rotation for a sequence of time steps, we would like to predict which manipulation primitive, if any, this trajectory will perform successfully in this scene.
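For concreteness, the inputs and output of the problem can be written down as a minimal sketch; the class and function names below are illustrative, not from our implementation:

    # A minimal sketch of the classification problem's data.
    # Names are illustrative only.
    from dataclasses import dataclass
    from typing import Optional
    import numpy as np

    @dataclass
    class TrajectoryScenario:
        cloud: np.ndarray       # (N, 3) point cloud of the initial scene
        positions: np.ndarray   # (T, 3) end effector position per time step
        rotations: np.ndarray   # (T, 3, 3) end effector rotation per time step

    def classify(pair: TrajectoryScenario) -> Optional[int]:
        """Return the index of the predicted manipulation primitive from
        the library, or None if the trajectory accomplishes none of them."""
        raise NotImplementedError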

2 Feature Design

In this section, we begin with a discussion of the characteristics of good features for this problem. We then describe the features we propose and motivate how they match these characteristics, followed by a description of the baseline features.

In designing features for this task, there is significant overlap with object recognition and pose estimation. Most of this overlap stems from the fact that we are interested in deformable object manipulation. Many manipulation primitives will only be successful for a certain set of configurations of the object. For example, the final step of tying an overhand knot is only possible if the rope is already in a particular state with overlapping loops. When incorporating the trajectory, we need to pay attention to the relative pose of the gripper with respect to the object.

Good features for this task should be invariant to Euclidean transformations of both the object and the trajectory. Furthermore, we would like features to represent the object at different levels of granularity depending on the manipulation trajectory. While the success of a manipulation is likely related to the particular configuration of the parts of the object that are close to the gripper, changes in object configuration farther away have less impact on the manipulation, so we would like to be robust to those changes.

2.1 Trajectory Shape Context Features

The features we developed for this task are variations of 3D shape context features [1]. These are local descriptors that provide a highly detailed description of an object's local geometry and a coarse global representation of the object's overall shape. This is done by placing a local polar coordinate frame around a basis point and computing a histogram of the distances and angles to the rest of the points on the object.

One way to apply this to manipulation would be to use shape context features to describe the object and then incorporate trajectories on top of that. We go a step further: in selecting basis points for computing shape context features, we use the intermediate points along the trajectory. This enables us to capture the overall shape of the object we are manipulating as well as its position relative to the initial object configuration. Another positive aspect of these features is that they are more sensitive to changes in parts of the object that are close to our trajectory. This lets us be robust to changes that are far from the actual manipulation but sensitive to local geometry.

In practice, computing these features takes a few steps. Our trajectories are represented as sequences of homography matrices that correspond to a rotation and translation of the gripper with respect to a fixed point on our robot. For a particular gripper pose with homography matrix $H$, we first rotate and translate the point cloud into the reference frame of the gripper at that time, which corresponds to multiplying the point cloud by $H^{-1}$. Then we convert our points to polar coordinates and count weighted occurrences of points. The weight we use is the same as in the original 3D shape context: $w_p = \frac{1}{d_p \sqrt[3]{v_b}}$, where $d_p$ is the local point density around $p$ and $v_b$ is the volume of the bin we are placing $p$ into. In doing this binning, we are able to take advantage of one simplification: we have a well-specified coordinate frame, and so do not need to be (in fact, do not want to be) invariant to rotations about the azimuth direction.
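This recipe is straightforward to sketch in numpy. The sketch below fixes several details the text leaves open: radial bin edges supplied by the caller (e.g. log-spaced), 12 azimuth and 6 elevation bins, and a fixed-radius ball count as the density estimate $d_p$. It is an illustration of the recipe above, not our implementation:

    import numpy as np

    def shape_context_at_pose(cloud_h, H, r_edges, n_az=12, n_el=6, delta=0.02):
        """Bin the scene points in the gripper frame given by pose H.
        cloud_h: (N, 4) homogeneous points; H: (4, 4) gripper pose;
        r_edges: radial bin edges; delta: density-estimation radius."""
        pts = (np.linalg.inv(H) @ cloud_h.T).T[:, :3]   # cloud in gripper frame
        r = np.linalg.norm(pts, axis=1)
        az = np.arctan2(pts[:, 1], pts[:, 0])           # azimuth in [-pi, pi]
        el = np.arccos(np.clip(pts[:, 2] / np.maximum(r, 1e-9), -1.0, 1.0))

        # d_p: local point density, counted in a ball of radius delta.
        # O(N^2) memory, fine for a sketch; the count includes the point
        # itself, so it is at least 1.
        pairwise = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        d_p = (pairwise < delta).sum(axis=1)

        hist = np.zeros((len(r_edges) - 1, n_az, n_el))
        i_r = np.digitize(r, r_edges) - 1
        i_az = np.minimum((az + np.pi) / (2 * np.pi) * n_az, n_az - 1).astype(int)
        i_el = np.minimum(el / np.pi * n_el, n_el - 1).astype(int)
        for p in range(len(pts)):
            if 0 <= i_r[p] < hist.shape[0]:
                # volume of the spherical-shell sector this point falls into
                r_lo, r_hi = r_edges[i_r[p]], r_edges[i_r[p] + 1]
                el_lo = i_el[p] * np.pi / n_el
                v_b = ((r_hi**3 - r_lo**3) / 3.0) * (2 * np.pi / n_az) * \
                      (np.cos(el_lo) - np.cos(el_lo + np.pi / n_el))
                # w_p = 1 / (d_p * cbrt(v_b)), as in the original 3D shape context
                hist[i_r[p], i_az[p], i_el[p]] += 1.0 / (d_p[p] * np.cbrt(v_b))
        return hist.ravel()

A descriptor for a full trajectory would concatenate one such histogram per downsampled gripper pose.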

2.2 Baseline Features

For our baseline, we combined Viewpoint Feature Histograms (VFH) [2] with a reasonable representation of the trajectory. These features make use of point cloud normals, so our first step is to estimate the normal direction at each point. Because we downsample the point cloud, we hope this estimation process will be robust to noise. We use the normal estimation utilities included in the Point Cloud Library, which take a radius as an argument and use PCA on a local covariance matrix to perform the estimation [3].

Once we have normals, VFH builds a representation of the view of the object by looking at the relative pitch, roll, and yaw of the surface normals with respect to a vector from the camera center to the object's centroid. A global shape descriptor is then built from the relative pitch, roll, and yaw between the vector from the center of a patch to the camera center and each surface normal in the patch. These features have been very successful in object recognition and pose estimation for robotics applications.

To include information about the trajectory, we use the relative change in end effector position across several time steps. These differences are computed in the reference frame of the initial end effector position, to be robust to rotations. Finally, we include the distance from this initial position to the centroid of the object. This gives us a representation that is invariant to joint rotations of the object and trajectory, although it does not take into account the relation of the object's shape to the end effector.
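Since the trajectory component is specified only loosely above, the following sketch fixes one plausible reading: differences are taken from the initial position and rotated into the initial end effector frame. The VFH portion of the baseline comes from the Point Cloud Library and is not reproduced here:

    import numpy as np

    def baseline_trajectory_features(positions, rotations, cloud):
        """positions: (T, 3); rotations: (T, 3, 3); cloud: (N, 3).
        Assumes displacements from the initial pose, expressed in the
        initial end effector frame (one reading of the text above)."""
        R0, p0 = rotations[0], positions[0]
        deltas = (positions[1:] - p0) @ R0   # i.e. R0^T v for each displacement v
        centroid_dist = np.linalg.norm(cloud.mean(axis=0) - p0)
        return np.concatenate([deltas.ravel(), [centroid_dist]])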
3 Results

Our experiments deal with the manipulation task of tying a knot in a rope. We were able to simulate end effector trajectories in arbitrary scenarios, and used this to construct a labelled dataset containing 297 simulated steps of knot tying. Of the 297 examples, 203 fell under one of the six motion primitives in Figure 1, and the remaining 94 were failed attempts at one of these six. We defined the six manipulation primitives manually, by observing each simulated trajectory and labeling it as a new manipulation primitive if it did not resemble any of the existing ones. Each item in our dataset consists of a point cloud of the initial scene, the end effector position and rotation for each time step in the trajectory, and the corresponding manipulation primitive label.

We use an SVM as our classifier, implemented with the LIBSVM library [4]. We ran experiments in several settings to test different hyper-parameter values and holdout sizes, using both 90-10 cross validation and leave-one-out cross validation (LOOCV); see Table 1. For 90-10 cross validation, we train the SVM on a random 90% of the data, test it on the remaining 10%, and average the results across 20 runs. Table 1 shows the error rate for both sets of features, as well as the error rate from guessing the most likely class. Varying the hyper-parameter C significantly impacts the error rate; a choice of C = 0.01 seems to work best for both 90-10% cross validation and LOOCV.
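The evaluation loop is easy to reproduce in outline. This sketch uses scikit-learn's libsvm-backed SVC as a stand-in for calling LIBSVM directly, with illustrative names X (feature matrix) and y (primitive labels):

    from sklearn.svm import SVC
    from sklearn.model_selection import ShuffleSplit, LeaveOneOut, cross_val_score

    def evaluate(X, y, C):
        svm = SVC(C=C, kernel='linear')   # linear kernel, as in our experiments
        # 90-10 cross validation, averaged over 20 random splits
        cv_90_10 = ShuffleSplit(n_splits=20, test_size=0.1, random_state=0)
        err_90_10 = 1.0 - cross_val_score(svm, X, y, cv=cv_90_10).mean()
        # leave-one-out cross validation
        err_loo = 1.0 - cross_val_score(svm, X, y, cv=LeaveOneOut()).mean()
        return err_90_10, err_loo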
4 Limitations and Future Work

When standardizing the length of trajectories, sub-selecting poses at regular intervals may miss key characteristics of the trajectory, especially those that cause it to fail at accomplishing its intended manipulation primitive. For example, sub-selecting may skip over small jerks in the trajectory, or the moment when the end effector grasps the object. An alternative approach would be to use max or average pooling to interpolate between the sub-sampled trajectory points; a sketch of this idea follows.
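A minimal sketch of the average-pooling alternative, for positions only (pooling rotations would additionally require averaging on SO(3), e.g. via quaternions):

    import numpy as np

    def average_pool_trajectory(positions, t_out):
        """positions: (T, 3) -> (t_out, 3) by mean-pooling contiguous chunks,
        so brief events still influence the nearest pooled waypoint."""
        chunks = np.array_split(positions, t_out)
        return np.stack([c.mean(axis=0) for c in chunks])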
Another issue with our approach stems from the fact that our feature descriptors are very high dimensional. To have good resolution in the shape context features, we need over a thousand bins per time step. Given that our dataset has roughly 300 examples, there is a serious danger of overfitting. For our experiments we mitigated this by varying hyper-parameters and using a linear classifier; with more data we could relax this and loosen our regularization.

Generating a large dataset of labelled examples in this scenario is a difficult challenge. Unlike standard classification problems, we cannot use crowd-sourced labeling, because classifying these scenarios (as a human) requires knowledge of the task being performed and a somewhat sophisticated, task-specific concept of failure. As such, we think this problem is a good candidate for semi-supervised learning.

References

[1] Andrea Frome, Daniel Huber, Ravi Kolluri, Thomas Bülow, and Jitendra Malik. Recognizing objects in range data using regional point descriptors. In Computer Vision - ECCV 2004, pages 224-237. Springer, 2004.

[2] Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, and John Hsu. Fast 3D recognition and pose using the Viewpoint Feature Histogram. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2155-2162. IEEE, 2010.

[3] Radu Bogdan Rusu. Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Computer Science Department, Technische Universitaet Muenchen, Germany, October 2009.

[4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
5 Tables and Figures

Table 1: Error rates for various classification approaches. VFH and Trajectory Shape Context use an SVM; the different trials correspond to different experimental setups. Max Category shows the error rate that results from always guessing the most common category. Both feature sets show a significant improvement over Max Category, with the best overall results achieved by trajectory shape context. The sensitivity to the hyper-parameter indicates the potential for over-fitting, which is an issue given the limited amount of data we have.

    Trial                   Max Category   VFH     Trajectory Shape Context
    90-10% CV; C = 0.1      68.4%          24.6%   24.6%
    90-10% CV; C = 0.01     68.4%          18.6%   17.5%
    90-10% CV; C = 0.001    68.4%          20.8%   20.0%
    LOOCV;     C = 1        68.4%          25.6%   25.5%
    LOOCV;     C = 0.01     68.4%          18.8%   18.8%

Figure 1: Manipulation primitives in rope tying. We manually identified six unique manipulation primitives in our robot's approach to tying knots. Each number in the figure denotes the motion (blue arrow) corresponding to one of these manipulation primitives. For the sixth primitive, the robot uses both end effectors to grab the rope at the two blue arrows and pulls in those directions.