CS229 Project Final Report
Sign Language Gesture Recognition with Unsupervised Feature Learning


Justin K. Chen, Debabrata Sengupta, Rukmani Ravi Sundaram

1. Introduction

The problem we are investigating is sign language recognition through unsupervised feature learning. Recognizing sign language is an interesting machine learning problem in its own right, and a working system would be extremely useful for deaf people interacting with people who do not understand American Sign Language (ASL). Our approach was to first create a dataset showing the different hand gestures of the ASL alphabet that we wish to classify. The next step was to segment out only the hand region from each image and then use this data for unsupervised feature learning with an autoencoder, followed by training a softmax classifier to decide which letter is being displayed. Our final dataset consists of ten letters of the ASL alphabet, namely a, b, c, d, e, f, g, h, i and l. The block diagram in Figure 1 summarizes our approach.

Fig. 1: Block diagram summarizing our approach for sign language recognition

2. Methodology

2.1 Data collection and hand segmentation

The data we used was collected with a Microsoft Kinect 3D depth camera. Videos were taken of the test subject's hands while forming sign language letters, and frames showing individual letters were extracted. For segmenting out only the hand, we tried several approaches, described below.

The first approach modeled skin color with a 2D Gaussian and used this fitted Gaussian to estimate the likelihood of a given color pixel being skin. We first collected skin patches from 40 random images from the internet; each skin patch was a contiguous rectangular skin area. The patches were collected from people of different ethnicities so that our model could correctly predict skin areas over a wide variation of skin color. We first normalized each color as r = R/(R+G+B), b = B/(R+G+B), ignoring the g component since it is linearly dependent on the other two. We then estimated the mean and covariance matrix of the 2D Gaussian (with r, b as the axes) as µ = E(x) and C = E[(x − µ)(x − µ)^T], where x is the matrix whose rows are the r and b values of the pixels.

Fig 2: Histogram of the color distribution for skin patches and the corresponding Gaussian model fit to it

With this fitted skin color model, we computed the likelihood of skin for every pixel of a given test image and then thresholded the likelihood to classify each pixel as skin or non-skin. However, this approach did not give particularly good results and failed to detect dimly illuminated parts of skin. The results of this algorithm are shown in Fig 3(b).

The second approach is motivated by [1], in which the authors first transform the image from the RGB space to the YIQ and YUV color spaces. They then compute the parameter θ = tan⁻¹(V/U) and combine it with the parameter I to define the region to which skin pixels belong. Specifically, the authors label as skin all pixels with 30 < I < 100 and 105° < θ < 150°. For our experiments, we tweaked these thresholds slightly and found the results significantly better than those of the Gaussian model. This might be for two reasons: (i) our Gaussian model was trained on data samples of insufficient variety and hence was inadequate for detecting skin pixels of darker shades; (ii) fitting the model in the RGB space performs poorly, as RGB does not capture the hue and saturation information of each pixel separately. The results of this second approach are shown in Fig 3(c).

After detecting the skin regions, the next task was to filter out only the hand region and eliminate the face and other background pixels that might have been detected. For this, we used the depth information obtained from the Kinect.
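Our implementation is in MATLAB; purely as an illustration, the following NumPy sketch shows the two skin-detection ideas described above. The color-space coefficients are standard textbook values, and the thresholds (including the default arguments) are assumptions rather than the exact values we tuned.

# Sketch of the two skin-detection approaches described above (not the
# original MATLAB code): a 2D Gaussian model in normalized (r, b) space,
# and the YIQ/YUV threshold rule of [1]. Thresholds are illustrative.
import numpy as np

def fit_rb_gaussian(skin_pixels_rgb):
    """Fit mean and covariance of normalized (r, b) values of skin pixels."""
    rgb = skin_pixels_rgb.astype(float)
    s = rgb.sum(axis=1, keepdims=True) + 1e-8
    x = np.stack([rgb[:, 0] / s[:, 0], rgb[:, 2] / s[:, 0]], axis=1)  # (r, b)
    mu = x.mean(axis=0)
    C = np.cov(x, rowvar=False)
    return mu, C

def skin_likelihood_rb(image_rgb, mu, C):
    """Per-pixel Gaussian likelihood of being skin in (r, b) space."""
    img = image_rgb.astype(float)
    s = img.sum(axis=2) + 1e-8
    x = np.stack([img[..., 0] / s, img[..., 2] / s], axis=-1) - mu
    Cinv = np.linalg.inv(C)
    mahal = np.einsum('...i,ij,...j->...', x, Cinv, x)
    return np.exp(-0.5 * mahal)  # unnormalized likelihood suffices for thresholding

def skin_mask_yiq_yuv(image_rgb, i_lo=30, i_hi=100, th_lo=105, th_hi=150):
    """Skin mask from I (YIQ) and theta = atan2(V, U) (YUV), as in [1]."""
    img = image_rgb.astype(float)
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    I = 0.596 * R - 0.274 * G - 0.322 * B      # I channel of YIQ
    U = -0.147 * R - 0.289 * G + 0.436 * B     # U channel of YUV
    V = 0.615 * R - 0.515 * G - 0.100 * B      # V channel of YUV
    theta = np.degrees(np.arctan2(V, U))
    return (I > i_lo) & (I < i_hi) & (theta > th_lo) & (theta < th_hi)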

The Kinect returns a grayscale depth image in which objects closer to the camera have lower intensity values. We assumed that in any frame the hand was the object closest to the camera, and used this assumption to segment out only the hand. Our final dataset consists of hand gestures for ten letters of the ASL alphabet, each cropped to a 32x32 bounding box, as shown in Fig 4. We have 1200 samples per letter, for a total of 12,000 images across the ten letters.

Fig 3(a): Original image. Fig 3(b): Skin detected using the Gaussian model. Fig 3(c): Skin detected using the YIQ and YUV color spaces.

Fig 4: Samples of each letter (a through i, and l) from our final dataset

2.2 Unsupervised feature learning and classification

The extracted hand images were fed into an autoencoder in order to learn a set of unsupervised features. We used 600 images of each letter (a total of 6,000 images) as training samples for the sparse autoencoder. Each input image from the segmentation block is 32x32 pixels, so the sparse autoencoder has an input layer of 32x32 = 1024 nodes and one hidden layer of 100 nodes. We used L-BFGS to optimize the cost function, running it for about 400 iterations to obtain estimates of the weights. A visualization of the learned weights of the autoencoder is shown in Fig 5; it is evident from this figure that the autoencoder learns a set of edge-like features.

Fig 5: Visualization of the learned weights of the autoencoder
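A minimal sketch of the closest-object assumption is given below, assuming a depth frame aligned with the skin mask; the depth margin and the choice to resize the binary hand mask to 32x32 are illustrative assumptions, not values from our implementation.

# Illustrative sketch of hand segmentation from a Kinect depth frame:
# keep skin pixels within a small depth margin of the nearest object, then
# crop the bounding box of that region and resize it to 32x32.
import numpy as np
from scipy.ndimage import zoom

def crop_hand(depth, skin_mask, margin=40, out_size=32):
    """depth: 2D array (lower = closer); skin_mask: boolean skin map."""
    candidates = depth.astype(float).copy()
    candidates[~skin_mask] = np.inf           # only consider skin pixels
    nearest = candidates.min()
    hand = candidates < nearest + margin      # pixels near the closest object
    ys, xs = np.nonzero(hand)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    patch = hand[y0:y1, x0:x1].astype(float)
    return zoom(patch, (out_size / patch.shape[0], out_size / patch.shape[1]))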
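Again as an illustration only (the actual system was written in MATLAB), a minimal NumPy sketch of a sparse autoencoder of the kind described here, trained with L-BFGS, could look as follows; the weight-decay, sparsity target and sparsity-penalty values (lam, rho, beta) are assumed, not the ones used in the project.

# Minimal sparse-autoencoder sketch (1024 visible units, 100 hidden units,
# trained with L-BFGS). Hyperparameters are assumed values.
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_cost(theta, X, n_vis, n_hid, lam=1e-4, rho=0.05, beta=3.0):
    """X: n_vis x m matrix of training images (columns are examples)."""
    m = X.shape[1]
    W1 = theta[:n_hid * n_vis].reshape(n_hid, n_vis)
    W2 = theta[n_hid * n_vis:2 * n_hid * n_vis].reshape(n_vis, n_hid)
    b1 = theta[2 * n_hid * n_vis:2 * n_hid * n_vis + n_hid]
    b2 = theta[2 * n_hid * n_vis + n_hid:]
    a2 = sigmoid(W1 @ X + b1[:, None])                 # hidden activations
    a3 = sigmoid(W2 @ a2 + b2[:, None])                # reconstruction
    rho_hat = a2.mean(axis=1)                          # average hidden activation
    kl = np.sum(rho * np.log(rho / rho_hat) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    cost = (0.5 / m) * np.sum((a3 - X) ** 2) \
         + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2)) + beta * kl
    d3 = (a3 - X) * a3 * (1 - a3)                      # output-layer delta
    sparse_term = beta * (-(rho / rho_hat) + (1 - rho) / (1 - rho_hat))
    d2 = (W2.T @ d3 + sparse_term[:, None]) * a2 * (1 - a2)
    grad = np.concatenate([
        ((d2 @ X.T) / m + lam * W1).ravel(),
        ((d3 @ a2.T) / m + lam * W2).ravel(),
        d2.mean(axis=1), d3.mean(axis=1)])
    return cost, grad

def train_autoencoder(X, n_vis=1024, n_hid=100, maxiter=400, seed=0):
    rng = np.random.default_rng(seed)
    r = np.sqrt(6.0 / (n_vis + n_hid + 1))
    theta0 = np.concatenate([rng.uniform(-r, r, 2 * n_hid * n_vis),
                             np.zeros(n_hid + n_vis)])
    res = minimize(sae_cost, theta0, args=(X, n_vis, n_hid),
                   jac=True, method='L-BFGS-B', options={'maxiter': maxiter})
    W1 = res.x[:n_hid * n_vis].reshape(n_hid, n_vis)
    b1 = res.x[2 * n_hid * n_vis:2 * n_hid * n_vis + n_hid]
    return W1, b1   # enough to compute hidden-layer features for the classifier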

The next step was to train a softmax classifier to distinguish the 10 letters based on the features learnt by the autoencoder. The activations of the hidden layer of the autoencoder were fed into a softmax classifier, which was also trained with L-BFGS; this optimization converges after about 40 iterations. We tested the classification accuracy of the system using the remaining 600 images per letter (a total of 6,000 images) as our test set.

3. Results and Discussion

In the previous sections, we described our implementation of the hand segmentation, unsupervised feature learning and classification sub-blocks. In this section, we report the performance of our system through tables and figures. As a preliminary diagnostic, we plotted a learning curve showing the percentage training error and test error in classification as a function of the size of the training set (Fig 6).

Fig 6: Learning curve

In our milestone report, we had used 50 hidden units for our autoencoder. Analyzing the learning curve, we observed that the training error and the test error were close to each other (except for one aberration at training set size 3000), and even the training error was above 1.5%, which is somewhat high. Suspecting that we might be in the high-bias regime, we therefore increased the size of our features by raising the number of hidden units of the autoencoder to 100. Our final classification accuracy with this 100-dimensional feature vector was 98.2%. The following table summarizes our implementation details and the accuracy obtained:

Size of training set    Size of feature vector    Number of classes (letters)    Accuracy of classification (%)
1200                    100                       10                             95.62
2400                    100                       10                             98.18
3600                    100                       10                             97.47
4800                    100                       10                             97.92
6000                    100                       10                             98.20
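As a sketch of this classification stage (again not the original MATLAB code), hidden-layer activations can be computed from the learned autoencoder weights and fed to an L2-regularized softmax classifier trained with L-BFGS; the regularization strength lam is an assumed value.

# Sketch of the classification stage: compute hidden-layer features with the
# trained autoencoder weights, then fit a softmax (multinomial logistic)
# classifier on them with L-BFGS. lam is an assumed regularization strength.
import numpy as np
from scipy.optimize import minimize

def features(X, W1, b1):
    """Hidden-layer activations for inputs X (n_vis x m)."""
    return 1.0 / (1.0 + np.exp(-(W1 @ X + b1[:, None])))

def softmax_cost(theta, A, y, n_cls, lam=1e-4):
    """A: n_feat x m features; y: integer labels in [0, n_cls)."""
    n_feat, m = A.shape
    W = theta.reshape(n_cls, n_feat)
    scores = W @ A
    scores -= scores.max(axis=0, keepdims=True)        # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=0, keepdims=True)
    Y = np.zeros((n_cls, m))
    Y[y, np.arange(m)] = 1.0                            # one-hot labels
    cost = -np.sum(Y * np.log(P + 1e-12)) / m + 0.5 * lam * np.sum(W ** 2)
    grad = ((P - Y) @ A.T) / m + lam * W
    return cost, grad.ravel()

def train_softmax(A, y, n_cls=10, maxiter=100):
    theta0 = 0.005 * np.random.randn(n_cls * A.shape[0])
    res = minimize(softmax_cost, theta0, args=(A, y, n_cls),
                   jac=True, method='L-BFGS-B', options={'maxiter': maxiter})
    return res.x.reshape(n_cls, A.shape[0])

def predict(A, W):
    return np.argmax(W @ A, axis=0)

With hypothetical arrays X_test and y_test, the test accuracy would then simply be np.mean(predict(features(X_test, W1, b1), W) == y_test) on the held-out images.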

As a finishing step, we created a real-time implementation of the entire system: for hand gestures made in front of the Kinect connected to our computer, the system directly displays the image captured by the Kinect, the segmented hand gesture, and the output of our classifier, which is one of the ten letters in our dataset. Evaluation takes less than 2 seconds per frame. Figure 7 shows examples from our real-time implementation and the results obtained.

Fig 7: Examples of predictions from our live demo

4. Concluding Remarks and Future Work

In this project, we have implemented an automatic, real-time sign language gesture recognition system using tools learnt in CS229. Although our system works quite well, as demonstrated by the tables and images above, there is still considerable scope for future work. Possible extensions include covering all letters of the ASL alphabet as well as non-alphabet gestures. Having used MATLAB as the implementation platform, we also believe the speed of the real-time system could be improved by reimplementing it in C. The framework of this project can further be extended to other applications, such as controlling robot navigation with hand gestures.

5. Acknowledgements

We would like to thank Prof. Andrew Ng and the TAs of CS229 for the valuable advice they provided throughout the project.

References

[1] Xiaolong Teng, Biani Wu, Weiwei Yu and Chongqing Liu. A hand gesture recognition system based on local linear embedding. April 2005.
[2] D. Metaxas. Sign language and human activity recognition. CVPR Workshop on Gesture Recognition, June 2011.
[3] S. Sarkar, B. Loeding, R. Yang, S. Nayak and A. Parashar. Segmentation-robust representations, matching, and modeling for sign language recognition. CVPR Workshop on Gesture Recognition, June 2011.

Appendix

This project was done in conjunction with Justin Chen's CS231A Computer Vision project.