How does the Kinect work? John MacCormick



Xbox demo

Laptop demo

The Kinect uses structured light and machine learning. Inferring body position is a two-stage process: first compute a depth map (using structured light), then infer body position (using machine learning). The results are great! The system uses many college-level math concepts, and demonstrates the remarkable advances in computer vision over the last 20 years.

Inferring body position is a two-stage process: first compute a depth map, then infer body position

Stage 1: The depth map is constructed by analyzing a speckle pattern of infrared laser light. [Figure: Kinect hardware, showing the 3D depth sensors, RGB camera, multi-array mic, and motorized tilt; copied from Kinect for Windows SDK Quickstarts]

Stage 1: The depth map is constructed by analyzing a speckle pattern of infrared laser light. Remarks: Microsoft licensed this technology from a company called PrimeSense. The depth computation is all done by the PrimeSense hardware built into the Kinect. Details are not publicly available; this description is speculative (based mostly on PrimeSense patent applications) and may be wrong.

The technique of analyzing a known pattern is called structured light

Structured light general principle: project a known pattern onto the scene and infer depth from the deformation of that pattern Zhang et al, 3DPVT (2002)

The Kinect uses infrared laser light, with a speckle pattern Shpunt et al, PrimeSense patent application US 2008/0106746

Stage 1: The depth map is constructed by analyzing a speckle pattern of infrared laser light. The Kinect uses an infrared projector and sensor; it does not use its RGB camera for depth computation. The technique of analyzing a known pattern is called structured light. The Kinect combines structured light with two classic computer vision techniques: depth from focus and depth from stereo.

Depth from focus uses the principle that stuff that is more blurry is further away Watanabe and Nayar, IJCV 27(3), 1998

Depth from focus uses the principle that stuff that is more blurry is further away. The Kinect dramatically improves the accuracy of traditional depth from focus: it uses a special (astigmatic) lens with different focal lengths in the x- and y-directions. A projected circle then becomes an ellipse whose orientation depends on depth.

The astigmatic lens causes a projected circle to become an ellipse whose orientation depends on depth Freedman et al, PrimeSense patent application US 2010/0290698
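To make the astigmatic-lens idea concrete, here is a speculative toy model (invented for this writeup, and not based on any published PrimeSense design; every constant below is made up). The point is only the principle: blur along each axis grows with distance from that axis's focal depth, so the spot's elongation direction encodes depth.

```python
# Speculative toy model of the astigmatic lens (invented for illustration;
# the real PrimeSense optics are not public). The lens has different focal
# depths in x and y, so blur along each axis grows with distance from that
# axis's focus, turning a projected circle into a depth-dependent ellipse.

def ellipse_axes(depth, focus_x=1.0, focus_y=2.0, spot=1.0):
    """Semi-axes of the blurred spot: base size plus per-axis defocus blur."""
    return (spot + abs(depth - focus_x),  # semi-axis along x
            spot + abs(depth - focus_y))  # semi-axis along y

def orientation(depth):
    """Which way the ellipse is elongated; this reveals (coarse) depth."""
    ax, ay = ellipse_axes(depth)
    return "horizontal" if ax > ay else "vertical"

# A spot nearer than both focal depths stretches one way, a far spot the other:
near, far = orientation(0.5), orientation(3.0)  # "vertical", "horizontal"
```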

Depth from stereo uses parallax: if you look at the scene from another angle, stuff that is close gets shifted to the side more than stuff that is far away. The Kinect analyzes the shift of the speckle pattern by projecting from one location and observing from another.

Depth from stereo uses parallax Isard and MacCormick, ACCV (2006)
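The parallax principle reduces to the classic pinhole-stereo relation depth = focal_length × baseline / disparity. The sketch below is a generic illustration of that relation, not Kinect's actual calibration; the focal length and projector-to-sensor baseline are invented numbers.

```python
# Generic stereo-parallax sketch (not Kinect's actual calibration): a
# speckle projected from one location and observed from another shifts
# sideways by a disparity inversely proportional to its depth.

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole-stereo relation: depth = focal_length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Invented numbers: a 580-pixel focal length and a 7.5 cm baseline.
# A speckle shifted 58 pixels is much closer than one shifted 10 pixels:
near = depth_from_disparity(58, focal_px=580, baseline_m=0.075)  # 0.75 m
far = depth_from_disparity(10, focal_px=580, baseline_m=0.075)   # 4.35 m
```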

Inferring body position is a two-stage process: first compute a depth map, then infer body position. Stage 1: The depth map is constructed by analyzing a speckle pattern of infrared laser light. Stage 2: Body parts are inferred using a randomized decision forest, learned from over 1 million training examples. Note: details of Stage 2 were published in Shotton et al (CVPR 2011).

Inferring body position is a two-stage process: first compute a depth map, then infer body position

Stage 2 has 2 substages (use intermediate body parts representation) Shotton et al, CVPR(2011)

Stage 2.1 starts with 100,000 depth images with known skeletons (from a motion capture system) Shotton et al, CVPR(2011)

For each real image, render dozens more using computer graphics techniques. Use computer graphics to render all sequences for 15 different body types, and vary several other parameters. Thus obtain over a million training examples.

For each real image, render dozens more using computer graphics techniques Shotton et al, CVPR(2011)

Stage 2.1 transforms the depth image to a body part image. Start with 100,000 depth images with known skeletons (from a motion capture system). For each real image, render dozens more using computer graphics techniques. Learn a randomized decision forest mapping depth images to body parts.

A randomized decision forest is a more sophisticated version of the classic decision tree

A decision tree is like a pre-planned game of twenty questions Ntoulas et al, WWW (2006)

What kind of questions can the Kinect ask in its twenty questions? Simplified version: is the pixel at that offset in the background? Real version: how does the (normalized) depth at that pixel compare to the depth at this pixel? [see Shotton et al, equation 1] Shotton et al, CVPR (2011)
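As a hedged sketch, the "real version" of a question is the depth-comparison feature of Shotton et al's equation 1, f(I, x) = d(x + u/d(x)) − d(x + v/d(x)): the offsets u and v are divided by the depth at x so the probes scale with the person's apparent size. The toy 4×4 depth image and offsets below are invented for illustration.

```python
# Sketch of the depth-comparison feature (Shotton et al, equation 1):
# compare the depth at two offsets around pixel x, with the offsets
# divided by d(x) so the probes scale with apparent body size.
# The tiny "depth image" below is invented for illustration.

LARGE_DEPTH = 10.0  # out-of-image probes read as far background

def depth_feature(depth, x, u, v):
    """f(I, x) = d(x + u/d(x)) - d(x + v/d(x)) on a 2-D depth array."""
    row0, col0 = x
    d0 = depth[row0][col0]
    def probe(offset):
        r = int(round(row0 + offset[0] / d0))
        c = int(round(col0 + offset[1] / d0))
        if 0 <= r < len(depth) and 0 <= c < len(depth[0]):
            return depth[r][c]
        return LARGE_DEPTH
    return probe(u) - probe(v)

depth = [[2.0, 2.0, 8.0, 8.0]] * 4  # a surface at 2 m, background at 8 m
f = depth_feature(depth, (1, 1), u=(0, 2), v=(0, -2))  # probe right vs left
```

A large positive or negative response signals a depth edge near x (such as a silhouette boundary), which is exactly the kind of cheap, flexible cue the decision trees learn to exploit.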

To learn a decision tree, you choose as the next question the one that is most useful on (the relevant part of) the training data. E.g. for the umbrella decision tree: is "is it raining?" or "is it cloudy?" the more useful question? In practice, useful = information gain G (which is derived from the entropy H). Shotton et al, CVPR (2011)
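In symbols, the gain of a question that splits the examples Q into left and right subsets is G = H(Q) − Σ_s (|Q_s|/|Q|) H(Q_s), where H is the Shannon entropy of the class labels. A minimal sketch (the toy body-part labels are invented):

```python
# Minimal sketch of choosing questions by information gain:
# G = H(Q) - sum over subsets s of (|Q_s|/|Q|) * H(Q_s),
# where H is the Shannon entropy of the class labels.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction from a question splitting labels into left/right."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

labels = ["head"] * 4 + ["hand"] * 4      # invented toy labels
perfect = information_gain(labels, labels[:4], labels[4:])     # 1.0 bit
useless = information_gain(labels, labels[::2], labels[1::2])  # 0.0 bits
```

A question that cleanly separates heads from hands gains a full bit; a question uncorrelated with the labels gains nothing, so the learner prefers the first.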

The Kinect actually uses a randomized decision forest. Randomized: there are too many possible questions, so use a random selection of 2000 questions each time. Forest: learn multiple trees and add their outputs to classify; the outputs are actually probability distributions, not single decisions.

Kinect actually uses a randomized decision forest Shotton et al, CVPR(2011)
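The "add the outputs" step can be sketched as follows; the three per-tree distributions are invented stand-ins for what the learned trees would emit at one pixel.

```python
# Sketch of the "forest" part: each tree outputs a probability
# distribution over body parts for a pixel, and the forest averages
# the distributions rather than taking a single hard decision.
# The three per-tree distributions below are invented.

def forest_predict(tree_outputs):
    """Average the per-tree class distributions into one forest distribution."""
    n = len(tree_outputs)
    parts = tree_outputs[0].keys()
    return {part: sum(dist[part] for dist in tree_outputs) / n
            for part in parts}

trees = [
    {"head": 0.7, "hand": 0.3},
    {"head": 0.5, "hand": 0.5},
    {"head": 0.6, "hand": 0.4},
]
combined = forest_predict(trees)  # roughly {"head": 0.6, "hand": 0.4}
```

Averaging smooths out individual trees' mistakes, which is why a forest of noisy, randomized trees outperforms any single tree.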

Learning the Kinect decision forest requires 24,000 CPU-hours, but takes only a day using hundreds of computers simultaneously. "To keep the training times down we employ a distributed implementation. Training 3 trees to depth 20 from 1 million images takes about a day on a 1000 core cluster." Shotton et al, CVPR (2011)

Stage 2 has 2 substages (use intermediate body parts representation) Shotton et al, CVPR(2011)

Stage 2.2 transforms the body part image into a skeleton. The mean shift algorithm is used to robustly compute modes of probability distributions. Mean shift is simple, fast, and effective.

Intuitive description (animated over several slides): place a region of interest over a distribution of identical billiard balls, compute its center of mass, and shift the region along the mean shift vector; repeat until the region settles on the densest region. Slide taken from Ukrainitz & Sarel
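In one dimension the whole algorithm is a few lines. This is generic mean shift with a flat kernel, not the Xbox team's implementation; the points and bandwidth are invented.

```python
# Generic 1-D mean shift with a flat (uniform) kernel: repeatedly move a
# window to the center of mass of the points it contains, until it stops
# moving, i.e. it has found a mode (densest region). Points are invented.

def mean_shift(points, start, bandwidth=1.0, tol=1e-6, max_iter=100):
    """Follow the mean-shift vector from `start` to the nearest density mode."""
    x = start
    for _ in range(max_iter):
        window = [p for p in points if abs(p - x) <= bandwidth]
        center_of_mass = sum(window) / len(window)
        if abs(center_of_mass - x) < tol:  # mean-shift vector ~ 0: converged
            break
        x = center_of_mass
    return x

points = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
mode = mean_shift(points, start=4.5)  # settles on the nearby cluster, ~5.15
```

In Stage 2.2 the same idea runs in 3-D: each mode of a body part's probability mass becomes a joint proposal for the skeleton.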

The Kinect uses structured light and machine learning. Inferring body position is a two-stage process: first compute a depth map (using structured light), then infer body position (using machine learning). The results are great! The system uses many college-level math concepts, and demonstrates the remarkable advances in computer vision over the last 20 years.

The results are great! But how can we make the results even better?

How can we make the results even better? Use temporal information: the Xbox team wrote a tracking algorithm that does this. Reject bad skeletons and accept good ones: this is what I worked on last summer. Of course there are plenty of other ways (next summer???)

Last summer (1)

Last summer (2): Use the same ground truth dataset to infer skeleton likelihoods [figure: example skeletons classified as likely / unlikely]

Next summer:???? (working with Andrew Fitzgibbon at MSR Cambridge)

The Kinect uses structured light and machine learning. Inferring body position is a two-stage process: first compute a depth map (using structured light), then infer body position (using machine learning). The results are great! The system uses many college-level math concepts, and demonstrates the remarkable advances in computer vision over the last 20 years.

What kind of math is used in this stuff? PROBABILITY and STATISTICS, MULTIVARIABLE CALCULUS, LINEAR ALGEBRA. Also: complex analysis, combinatorics, graph algorithms, geometry, differential equations, topology.

Personal reflection: computer vision has come a long way since I started working on it in 1996. Key enablers of this progress: use of probabilistic and statistical models; use of machine learning; vastly faster hardware; parallel and distributed implementations (clusters, multicore, GPU); cascades of highly flexible, simple features (not detailed models).

The Kinect uses structured light and machine learning