Research Statement: Towards Detailed Recognition of Visual Categories


As humans, we have a remarkable ability to perceive the world around us in minute detail purely from the light that is reflected off it: we can estimate material and metric properties of objects, localize people in images, describe what they are doing, and even identify them! Automatic methods for such detailed analysis of images are essential for most human-centric applications, such as image search, the organization of personal photo collections, and large-scale analysis of the content of media collections for market research, advertisement, and social studies. For example, in order to shop for shoes in an online catalogue, a system should be able to understand the style of a shoe, the length of its heel, or the shininess of its material. In order to support visual demographic analysis for advertisement, a system should be able not only to identify the people in a scene, but also to understand what kind (style and brand) of clothes they are wearing, whether they have any accessories, and so on. Despite several successes, such detailed recognition is beyond the reach of current state-of-the-art computer vision systems. This is a challenging task, and to make progress we have to advance on many fronts: human-computer interaction, to enable the construction of datasets and benchmarks for evaluating various recognition tasks; computational models that can enable such rich recognition; the study of natural language, as a means to extract the semantics of human understanding exposed to us via people's descriptions of their visual experiences; and machine learning algorithms that can learn from large amounts of data in a computationally efficient manner. My research touches upon several of these aspects of the problem, including:

Supervised methods for detailed visual recognition. Together with collaborators, we have developed a novel representation called poselets that uses more detailed supervision during learning than traditional approaches and provides a basis for building high-level recognition systems. Our methods are currently the state of the art for person detection [1], segmentation [3], and the estimation of fine-grained properties such as pose, action [12], gender, clothing style, and other attributes [2].

Machine learning for computer vision. This includes learning algorithms that offer better tradeoffs between the representational power and the efficiency of a classifier, leading to orders-of-magnitude savings in memory and training time on large-scale data [9]; efficient classification techniques that are exponentially faster for classifiers commonly used in computer vision [10]; techniques to speed up part-based object detection in images [9]; and methods for efficiently combining information from high-level and low-level image analysis for semantic segmentation [14].

Collecting annotations via crowdsourcing.1 This includes software tools [5] to collect and curate annotations, enabling richer models and benchmarks for image understanding [2, 4, 12], as well as intuitive interfaces for collecting annotations that minimize annotator bias [6, 13].

Models that combine language and vision, for analyzing videos and transcripts to automatically tag faces [8], and for modeling human responses in image-description tasks to discover discriminative properties of objects [6].

The explosion of imagery on the web, due to the cheap availability of sensors, computation, and storage, combined with the ability to collect large amounts of images labeled and verified by real people via crowdsourcing, is enabling a new direction of computer vision research: building richer models for detailed visual recognition. This presents new scientific challenges that must be solved to enable the next generation of applications in security, surveillance, robotics, and human-computer interaction. Below I present an overview of the two main themes of my research, which stem from interacting with humans and with large datasets.

1 Services such as Amazon Mechanical Turk: https://www.mturk.com

Machine learning for visual recognition

Computer vision provides a rich source of problems that can often best be tackled using statistical methods that learn from data. However, as more and more training data becomes available, training and evaluating such systems becomes a bottleneck. This is especially true for object detectors, as they face the complexity of searching over the many locations in an image that score well according to a classifier. Traditionally, this limited the choice of classifiers to boosted decision trees or linear SVMs because of their classification efficiency, even though linear SVMs are not the most accurate option for many image classification tasks. My work on efficiently evaluating additive kernel SVMs has made a large class of kernels commonly used in computer vision, including the spatial pyramid match, the pyramid match, and various χ2 kernels, very efficient: the classification time and memory complexity are the same as those of a linear SVM [10, 11]. This result has had a significant impact (cited over 295 times) in the computer vision community and has led to many state-of-the-art systems (including some of our own), such as:

- Improvements over the state-of-the-art pedestrian detector of Dalal and Triggs on the INRIA pedestrian dataset, by allowing non-linear classifiers for detection [10].

- Speedups of over two orders of magnitude for the state-of-the-art image classification method based on spatial pyramid matching on the Caltech-101/256 datasets [10], the most widely used datasets at the time of publication.

- Several top-performing methods on some of the most challenging benchmarks in computer vision and multimedia for image classification and detection, such as the PASCAL VOC, ImageNet, and TRECVID datasets, which use our efficient classification algorithm to evaluate their classifiers. These methods combine several features corresponding to shape, color, and texture cues with additive kernel SVMs.
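The key observation behind this speedup is that the decision function of an additive kernel SVM decomposes into one independent one-dimensional function per feature dimension; each can be tabulated once on a grid and then evaluated by interpolation, so classification costs O(d), as for a linear SVM, rather than O(d × number of support vectors). The following is a minimal sketch for the intersection (histogram) kernel; the helper names and data layout are illustrative, not those of the released implementation:

```python
import numpy as np

def build_tables(sv, coef, bins=100, xmax=1.0):
    """Precompute per-dimension lookup tables for an intersection-kernel SVM.

    sv   : (n_sv, d) support vectors
    coef : (n_sv,) dual coefficients alpha_j * y_j
    The decision function f(x) = sum_j coef_j * sum_i min(x_i, sv_ji)
    decomposes into d independent 1-D functions h_i(s), tabulated on a grid.
    """
    grid = np.linspace(0.0, xmax, bins)                     # (bins,)
    # h_i(s) = sum_j coef_j * min(s, sv_ji) for every grid value s
    tables = np.einsum('j,bji->bi', coef,
                       np.minimum(grid[:, None, None], sv[None, :, :]))
    return grid, tables                                     # tables: (bins, d)

def decision(x, grid, tables, b=0.0):
    """Evaluate the SVM in O(d) time via per-dimension linear interpolation."""
    idx = np.searchsorted(grid, x).clip(1, len(grid) - 1)
    lo, hi = grid[idx - 1], grid[idx]
    w = (x - lo) / (hi - lo)
    cols = np.arange(len(x))
    h = (1 - w) * tables[idx - 1, cols] + w * tables[idx, cols]
    return float(h.sum()) + b
```

Because h_i for the intersection kernel is piecewise linear with kinks only at support-vector values, a modest grid already approximates the exact decision value closely, and the tables can be quantized further to trade accuracy for memory.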
Although testing speed and accuracy are important factors for many applications, training time can sometimes dictate the choice of classifier. Our analysis has also led to the development of new algorithms for efficiently training non-linear classifiers commonly used in computer vision [9, 7]. For example, an additive kernel SVM classifier that previously took several hours to train on a standard detection dataset can be trained in a few seconds using our method.

One theme of my research is building computationally efficient models for learning and inference for problems that arise in computer vision. For example, in [9] we propose a method to improve the accuracy and speed of object localization in images by casting a voting-based method (the Hough transform) as a learning problem. This led to the best results on the ETHZ shape dataset and for person detection on the PASCAL VOC 2008 dataset.

[Figure: spatial weights of features in Hough voting]

In addition to efficient models, I have looked at ways to model user responses in complex image annotation tasks in order to extract useful information. Issues of language and pragmatics can often lead to unintended biases. For example, part annotations for visual categories such as sofas can be noisy, since every annotator has a slightly different interpretation of what a handrest is. In these settings, models that treat the true annotation as a latent variable can be more robust. Some of my research aims to build models that can use such noisy supervision during learning [13]. Another recent project aims to analyze human responses in description tasks to discover directions of visual variability in images [6]. Future directions of research involve a tighter integration of human input and learning in order to accomplish various computer vision tasks.

[Figure: topics corresponding to parts and attributes]
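To make the voting step concrete, here is a simplified sketch of a weighted Hough transform for localization: each detected local feature casts votes for candidate object centers via offsets remembered from training, and casting this as a learning problem amounts to replacing the uniform vote weights of the classic transform with discriminatively trained per-codeword weights. The data layout and names below are illustrative, not those of the method in [9]:

```python
import numpy as np

def hough_votes(features, offsets, weights, img_shape):
    """Accumulate weighted votes for object centers.

    features  : list of (codeword_id, (x, y)) local features detected in the image
    offsets   : dict codeword_id -> list of (dx, dy) object-center offsets
                observed for that codeword on training data
    weights   : dict codeword_id -> learned discriminative weight
                (setting every weight to 1.0 recovers the classic Hough transform)
    img_shape : (height, width) of the accumulator
    Returns the accumulator array; its peaks are candidate object centers.
    """
    acc = np.zeros(img_shape)
    for cid, (x, y) in features:
        w = weights.get(cid, 0.0)
        for dx, dy in offsets.get(cid, []):
            cx, cy = int(x + dx), int(y + dy)
            if 0 <= cx < img_shape[1] and 0 <= cy < img_shape[0]:
                acc[cy, cx] += w
    return acc
```

Non-maximum suppression over the accumulator then yields detections, and the learned weights suppress codewords whose votes fire equally often on background clutter.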

Detailed recognition of visual categories

Although current computer vision systems have been very successful at detecting faces, performing optical character recognition, and estimating the 3D structure of a scene from large collections of photos, they are still far from the human ability to perform detailed image understanding. Two challenges in building such systems are: (1) the design of models that can learn both from rich annotations and from large amounts of unlabeled data; and (2) the collection of such annotations in a cost- and time-effective manner. I will briefly discuss our attempts at solving these problems.

[Figures 1-7: top poselets for the person category; person and part localization; segmentation; action recognition; 3D pose estimation; attribute recognition; informative parts for gender recognition]

Studying the role of supervision in computer vision

Typical computer vision systems involve a search over a large space spanned by the choice of features, learning methods, models, and so on, so a black-box approach to learning can be unproductive. Rich supervision can enable feedback and an informed search over these choices by allowing intermediate tasks that can be tested and improved in isolation. These tasks can also provide a basis for building mid-level representations, between pixels and high-level semantics, that enable a variety of high-level recognition tasks. Our work on poselets [1] tries to achieve this for object recognition by using rich annotations, somewhat contrary to the favored approach of using large amounts of unsupervised or semi-supervised data. Poselets are easy-to-detect patterns in images that have a semantic meaning (Figure 1), unlike corners or parallel lines. They may correspond to faces or entire bodies, but are not restricted to anatomical parts; for example, a poselet can also refer to a pattern that is the conjunction of the head and a shoulder.

Additional supervision, in the form of the locations of various joints of the human body, enables us to discover these parts. It provides an underlying representation of images on which we have built state-of-the-art systems for person detection [1], segmentation [3], pose estimation, action classification [12], and attribute recognition [2] (Figures 2-6). More supervision can also make learning easier and more interpretable. Our state-of-the-art human attribute recognition system uses a feed-forward, neural-network-like architecture in which every layer is supervised [2] (Figure 6). Such architectures can help answer questions like "what features are most useful for discriminating hair styles?" or "what parts are most revealing for gender recognition?" (Figure 7), thus providing directions for focusing future research.

Supervision also arises naturally in an interactive system. Some of my research has explored ways to efficiently combine top-down information provided by users with bottom-up information from low-level image features for semantic segmentation [14]. Future work involves building models for analyzing diverse visual categories, such as architectural scenes and textures, and ways of combining segmentation and detection approaches for better visual recognition.
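To illustrate how joint-location supervision can seed a part, one can search the training set for examples whose local keypoint configuration matches a seed patch under the best similarity transform, then train a classifier on the matching image patches. A minimal sketch using an orthogonal-Procrustes distance; the helper names are hypothetical, and the published poselet training procedure differs in its details:

```python
import numpy as np

def config_distance(kp_a, kp_b):
    """Distance between two keypoint configurations after the best
    least-squares alignment (translation, scale, rotation; this
    simplified version also permits reflection).

    kp_a, kp_b : (k, 2) arrays of corresponding keypoints (e.g. shoulders,
    hips) visible in both examples. Lower distance = more similar pose.
    """
    a = kp_a - kp_a.mean(axis=0)          # remove translation
    b = kp_b - kp_b.mean(axis=0)
    a = a / np.linalg.norm(a)             # remove scale
    b = b / np.linalg.norm(b)
    u, _, vt = np.linalg.svd(a.T @ b)     # optimal rotation (orthogonal Procrustes)
    r = u @ vt
    return float(np.linalg.norm(a @ r - b))

def seed_poselet(seed_kp, all_kp, n=50):
    """Return indices of the n training examples whose keypoint configuration
    is closest to the seed; their image patches would then be used to train
    one poselet classifier."""
    d = [config_distance(seed_kp, kp) for kp in all_kp]
    return np.argsort(d)[:n]
```

The resulting cluster of pose-consistent patches is what makes a poselet "easy to detect": its appearance is far more uniform than that of the category as a whole.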

Crowdsourcing for computer vision

I have developed software tools [5] to collect annotations at a large scale, which has propelled research on supervised methods for object detection such as poselets [1]. These tools have also enabled datasets for evaluating semantic contour detection [4], attribute recognition [2], and action classification and pose estimation [12]. Two of my ongoing research projects aim to build better user interfaces for obtaining supervision to bootstrap hard vision problems. The first is a way of collecting annotations that enables part and layout discovery while avoiding having to decide on the semantics of the parts ahead of time; it is based on collecting annotations via pairwise correspondence between instances of a category [13]. The second is a novel interface that forces annotators to describe objects in more detail than they normally would. Exploiting the same idea of paired comparison, the interface randomly pairs each image with another from the same category and asks people to describe the differences; the responses can then be analyzed to discover a lexicon of parts, modifiers, and their relations [6]. Future work involves extending these ideas to discover attributes for a taxonomy of object categories in a coarse-to-fine manner, and analyzing descriptions of images on the internet to discover visual attributes.

[Figures: annotation via pairwise correspondence; discovering a lexicon of attributes]

Conclusions and outreach

The task of detailed and fine-grained visual recognition fascinates me, and successful methods for it could tremendously improve how humans interact with machines when navigating large amounts of visual data. I am interested both in the computer vision and machine learning problems that arise from this and in the aspects that touch upon areas such as psychophysics, cognitive science, language processing, data visualization, and HCI. As a community we have only just started to look at this problem, and very few benchmarks exist.

At a recently held six-week workshop at the CLSP center at Johns Hopkins University2 that I helped organize (along with researchers from Oxford, JHU, École Centrale, and UC Berkeley), we started to study the role of supervision in building better and more interpretable models for detailed object detection. To encourage research in this area, we have collected a large dataset of several thousand airplanes, completely annotated with attributes, segmentations, and part bounding boxes, which can provide a test bed for such research. We have also started a similar analysis for texture, to build models that describe texture in detail. We plan to release this dataset and a set of benchmarks to the community soon, as well as to extend our annotations to other categories.

[Figure: detailed annotations of aeroplanes]

I collaborate with researchers from around the world (Oxford, École Centrale, UC Berkeley, Stony Brook, UCSD, and TTI Chicago) and have worked with many others. I have organized a tutorial at ECCV 2012 (on additive kernels and explicit embeddings for large-scale vision), released code for much of my research, and contributed to building datasets and benchmarks for evaluating computer vision algorithms. I have also maintained my ties to the Indian research community, where I come from, by attending the workshops and vision conferences held there each year. I am putting together a team to organize tutorials at ICVGIP 2012, being held in Bombay, India this December.

2 http://www.clsp.jhu.edu/workshops/archive/ws-12/groups/tduosn/

Bibliography

[1] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting People using Mutually Consistent Poselet Activations. In European Conference on Computer Vision (ECCV), 2010.
[2] L. Bourdev, S. Maji, and J. Malik. Describing People: A Poselet-Based Approach to Attribute Classification. In International Conference on Computer Vision (ICCV), 2011.
[3] T. Brox, L. Bourdev, S. Maji, and J. Malik. Object Segmentation by Alignment of Poselet Activations to Image Contours. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[4] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic Contours from Inverse Detectors. In International Conference on Computer Vision (ICCV), 2011.
[5] S. Maji. Large Scale Image Annotations on Amazon Mechanical Turk. Technical Report UCB/EECS-2011-79, EECS Department, University of California, Berkeley, July 2011.
[6] S. Maji. Discovering a Lexicon of Parts and Attributes. In Second International Workshop on Parts and Attributes, ECCV, 2012.
[7] S. Maji. Linearized Smooth Additive Classifiers. In Workshop on Web-scale Vision and Social Media, ECCV, 2012.
[8] S. Maji and R. Bajcsy. Fast Unsupervised Alignment of Video and Text for Indexing/Names and Faces. In Workshop on Multimedia Information Retrieval on The Many Faces of Multimedia Semantics. ACM, 2007.
[9] S. Maji and A. Berg. Max-Margin Additive Classifiers for Detection. In International Conference on Computer Vision (ICCV), 2009.
[10] S. Maji, A. Berg, and J. Malik. Classification using Intersection Kernel Support Vector Machines is Efficient. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[11] S. Maji, A. Berg, and J. Malik. Efficient Classification for Additive Kernel SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), to appear in the January 2013 edition.
[12] S. Maji, L. Bourdev, and J. Malik. Action Recognition from a Distributed Representation of Pose and Appearance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[13] S. Maji and G. Shakhnarovich. Part Annotations via Pairwise Correspondence. In 4th Workshop on Human Computation, AAAI, 2012.
[14] S. Maji, N. Vishnoi, and J. Malik. Biased Normalized Cuts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

A complete list of my publications can be found on my website: http://ttic.uchicago.edu/~smaji. Citation counts are according to Google Scholar as of November 3, 2012.