Speaker: Prof. Mubarak Shah, University of Central Florida
Title: Representing Human Actions as Motion Patterns
Abstract: Automatic analysis of videos is one of the most challenging problems in computer vision. In this talk I will introduce the problem of action, event, and activity representation and recognition from video sequences. I will begin with a brief overview of a few interesting methods for this problem, including trajectory-, volume-, and local interest point-based representations. The main part of the talk will focus on a newly developed framework for the discovery and statistical representation of motion patterns in videos, which can act as primitive, atomic actions. These action primitives are employed as a generalizable representation of articulated human actions, gestures, and facial expressions. The motion primitives are learned by hierarchical clustering of observed optical flow in a four-dimensional space of spatial position and motion flow, and a sequence of these primitives can be represented as a simple string, a histogram, or a hidden Markov model (a brief illustrative sketch of this pipeline follows this entry). I will then describe methods to extend the motion-pattern estimation framework to the problem of multi-agent activity recognition. First, I will discuss transformation-invariant matching of motion patterns in order to recognize simple events in surveillance scenarios. I will end the talk by presenting a framework in which a motion pattern represents the behavior of a single agent, while a multi-agent activity takes the form of a graph, which can be compared to other activity graphs by attributed inexact graph matching. This method is applied to the problem of recognizing American football plays.
Bio: Dr. Mubarak Shah, Agere Chair Professor of Computer Science, is the founding director of the Computer Vision Lab at the University of Central Florida (UCF). He is a co-author of three books (Motion-Based Recognition (1997), Video Registration (2003), and Automated Multi-Camera Surveillance: Algorithms and Practice (2008)), all published by Springer. He has published extensively on topics related to visual surveillance, tracking, human activity and action recognition, object detection and categorization, shape from shading, geo-registration, visual crowd analysis, etc. Dr. Shah is a fellow of IEEE, IAPR, AAAS, and SPIE. In 2006, he was awarded the Pegasus Professor award, the highest award at UCF, given to a faculty member who has made a significant impact on the university. He is an ACM Distinguished Speaker. He was an IEEE Distinguished Visitor speaker for 1997-2000, and received the IEEE Outstanding Engineering Educator Award in 1997. He received the Harris Corporation's Engineering Achievement Award in 1999; the TOKTEN awards from UNDP in 1995, 1997, and 2000; the SANA award in 2007; and an honorable mention for the ICCV 2005 "Where Am I?" Challenge Problem, and was nominated for the best paper award at the ACM Multimedia Conference in 2005 and 2010. At UCF he received the Scholarship of Teaching and Learning (SoTL) award in 2011; the College of Engineering and Computer Science Advisory Board award for faculty excellence in 2011; Teaching Incentive Program awards in 1995 and 2003; Research Incentive Awards in 2003 and 2009; Millionaires' Club awards in 2005, 2006, 2009, 2010, and 2011; and the University Distinguished Researcher award in 2007 and 2012. He is an editor of the international book series on Video Computing, editor-in-chief of the Machine Vision and Applications journal, and an associate editor of the ACM Computing Surveys journal. He was an associate editor of the IEEE Transactions on PAMI, and a guest editor of the special issue of the International Journal of Computer Vision on Video Computing. He was a program co-chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2008.
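The motion-primitive pipeline described in the abstract above can be sketched in a few lines of Python. This is only a rough illustration under assumed parameters (sampling step, flow-magnitude threshold, number of primitives), not Prof. Shah's implementation: it clusters optical-flow samples in the joint 4-D (x, y, u, v) space of pixel position and flow, then summarizes a clip as a histogram of primitive labels.

import cv2
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def flow_samples(gray_frames, step=8):
    """Collect (x, y, u, v) samples from dense optical flow between consecutive grayscale frames."""
    samples = []
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        for y in range(0, flow.shape[0], step):
            for x in range(0, flow.shape[1], step):
                u, v = flow[y, x]
                if u * u + v * v > 1.0:  # keep only pixels that actually move
                    samples.append((x, y, u, v))
    return np.asarray(samples, dtype=np.float32)

def learn_primitives(samples, n_primitives=20):
    """Hierarchically cluster the 4-D samples into primitive motion patterns.
    Agglomerative clustering is quadratic in the sample count, so subsample first."""
    return AgglomerativeClustering(n_clusters=n_primitives).fit_predict(samples)

def primitive_histogram(labels, n_primitives=20):
    """Summarize a clip as a normalized histogram over primitive labels."""
    hist = np.bincount(labels, minlength=n_primitives).astype(np.float32)
    return hist / max(hist.sum(), 1.0)

A per-frame sequence of primitive labels could equally serve as a simple string for matching or as the observation sequence of a hidden Markov model, as the abstract suggests.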
Speaker: Prof. Irfan Essa, Georgia Tech, prof.irfanessa.com
Title: Extracting Content and Context from Video
Abstract: In this talk, I will describe various efforts aimed at extracting context and content from video. I will highlight some of our recent work on extracting spatio-temporal features and the related saliency information from video, which can be used to detect and localize regions of interest. Then I will describe approaches that use structured and unstructured representations to recognize complex and extended-time actions. I will also discuss the need for unsupervised activity discovery and the detection of anomalous activities in videos. I will show a variety of examples, including online videos, mobile videos, surveillance and home-monitoring video, and sports videos. Finally, I will pose a series of questions and make observations about how we need to extend our current paradigms of video understanding to go beyond local spatio-temporal features and standard time-series and bag-of-words models.
Bio: Irfan Essa is a Professor in the School of Interactive Computing (IC) of the College of Computing (CoC), Georgia Institute of Technology (GA Tech), in Atlanta, Georgia, USA. At GA Tech, he is primarily affiliated with two interdepartmental centers: the Robotics & Intelligent Machines (RIM@GT) Center and the GVU Center. He founded the Computational Perception Laboratory (CPL) at GA Tech in 1996, which he now co-directs with four other faculty members. He is interested in the analysis, interpretation, authoring, and synthesis of video, with the goals of building aware environments, supporting healthy living, recognizing and modeling human behaviors, empowering humans to interact effectively with each other, with media, and with technologies, and developing dynamic and generative representations of time-varying streams. He has published over 150 scholarly articles in leading journals and conference venues on these topics. For further information, see his website at http://prof.irfanessa.com
Speaker: Dr. Apostol (Paul) Natsev, Google
Title: Machine Perception for Content Discovery at YouTube
Abstract: YouTube's mission is for YOU to discover and shape the world through video. At the heart of this mission is content discovery: the problem of finding interesting content relevant to a given topic or user. This problem is particularly challenging given the variety and volume of YouTube videos: one hour of video is uploaded to YouTube every second (that's more than ten years' worth of content every day). In this talk, I will give an overview of some work in the machine perception department at Google Research aimed at improving content discovery at YouTube. Specifically, I will present several case studies of applying machine perception and machine learning at YouTube scale to tackle problems such as automatically identifying and labeling celebrities and tourist landmarks in video, tagging videos with large unconstrained vocabularies, discovering musical or comedy talent on YouTube, and using gamification to crowdsource video discovery.
Bio: Apostol (Paul) Natsev received M.S. and Ph.D. degrees in computer science from Duke University, Durham, NC, in 1997 and 2001, respectively. He is currently a Software Engineer and Manager in the Video Content Analysis Group at Google Research, Mountain View, CA. Previously, he was a Research Staff Member at IBM Research, Hawthorne, NY, from 2001 to 2011, and Manager of its Multimedia Research Group from 2007 to 2011. Dr. Natsev's research agenda is to advance the science and practice of systems that enable users to manage and search vast repositories of unstructured multimedia content. His research interests span image and video analysis and retrieval, computer vision, and large-scale machine learning.
Speaker: Dr. Anthony Hoogs, Kitware
Title: Action and Activity Recognition: Scaling Across Domains
Abstract: Over the past 10 years, the vision community has achieved significant breakthroughs in action, event, and activity recognition. We have solved the fundamental problems posed by the Weizmann and KTH datasets, and are making substantial improvements each year on less constrained datasets such as UCF YouTube Sports and Hollywood Human Actions (HOHA). Because these videos were not filmed by vision researchers but compiled from the web and movie archives, they exhibit real-world conditions and complexity. More recently, the TRECVID Multimedia Event Detection competition was conducted on a very large collection of web videos showing complex events such as a wedding, changing a tire, and doing a woodworking project. With hundreds of exemplars of each event, plus thousands of videos of random events, this collection is the largest publicly available web-video dataset today. Surprisingly, initial event-detection accuracy on this dataset exceeded expectations, with Pd > 25% at a false-alarm rate < 5%. Apparently, the problem was easier than expected. In the related domain of video surveillance, the most extensive datasets are demonstrating the opposite effect. Released last year, the VIRAT Video Dataset [Oh et al., CVPR 2011] has 11 scenes, 8.5 hours of video in total, 11 annotated event types, and annotated bounding boxes on all movers. Initial performance on this dataset, using the same algorithms that do so well on HOHA and UCF, is much worse: at Pd = 25%, precision < 1% (a short worked example of what such numbers imply follows this entry). Similarly poor results have been observed on the TRECVID Surveillance Event Detection dataset, which has similar content but more limited scene variety. Why is the seemingly less complex domain of surveillance more difficult than highly complex web videos? In this talk I will describe the methods we've used to achieve the stated levels of performance on these datasets, and present reasons why surveillance video appears to be the more difficult case.
Bio: Dr. Hoogs founded and directs the computer vision group at Kitware, Inc., which currently has more than 30 members, half with PhDs. Over the past 20 years, he has supervised and performed research in various areas of computer vision, including event, activity, and behavior recognition; motion-pattern learning and anomaly detection; tracking; content-based retrieval; and segmentation. At Kitware he has led large, collaborative projects in video analysis involving universities, companies, and government institutions. He has published more than 60 papers in computer vision, pattern recognition, artificial intelligence, and remote sensing, and regularly serves as a program committee member and/or area chair for major vision conferences. Dr. Hoogs received his Ph.D. in Computer and Information Science from the University of Pennsylvania, his M.S. from the University of Illinois at Urbana-Champaign, and his B.A. magna cum laude from Amherst College.
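To see how the detection statistics quoted in the abstract above can coexist, here is a small worked example in Python. The instance counts are assumptions chosen purely for illustration (they are not from the talk); the point is that a fixed detection rate (Pd) and false-alarm rate yield very different precisions depending on how many negative instances the detector must sift through.

def precision(pd, far, n_pos, n_neg):
    """Precision implied by detection rate pd and false-alarm rate far
    over n_pos positive and n_neg negative instances."""
    tp = pd * n_pos   # true detections
    fp = far * n_neg  # false alarms
    return tp / (tp + fp)

# Web-video setting: hundreds of exemplars per event vs. thousands of random videos.
print(precision(pd=0.25, far=0.05, n_pos=500, n_neg=5000))      # ~0.33

# Surveillance setting: the same rates, but a handful of true events against
# a vast number of candidate spatio-temporal windows.
print(precision(pd=0.25, far=0.05, n_pos=50, n_neg=1000000))    # ~0.00025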
Speaker: Dr. John Smith, IBM
Title: TBD
Bio: Dr. John Smith is a senior manager of the Intelligent Information Management Department at the IBM T. J. Watson Research Center. He leads a research department addressing technical challenges in database systems and information management; his team includes the Database Research Group and the Intelligent Information Analysis Group. In addition to his managerial responsibilities, Dr. Smith currently serves as Chair of the Data Management research area at Watson and as IBM Research Campus Relationship Manager for Columbia University. From 2001 to 2004, Dr. Smith served as Chair of the ISO/IEC JTC1/SC29 WG11 Moving Picture Experts Group (MPEG) Multimedia Description Schemes group, with responsibilities in the development of the MPEG-7 and MPEG-21 standards. He also served as co-project editor for the following parts of the MPEG-7 standard: "MPEG-7 Multimedia Description Schemes," "MPEG-7 Conformance," "MPEG-7 Extraction and Use," and "MPEG-7 Schema Definition." Dr. Smith also serves on the Advisory Committee for the NIST TREC Video Retrieval Evaluation. He received his M.Phil. and Ph.D. degrees in Electrical Engineering from Columbia University in 1994 and 1997, respectively.