Real-Time Whiteboard Capture and Processing Using a Video Camera for Teleconferencing




Li-Wei He & Zhengyou Zhang
Microsoft Research, One Microsoft Way, Redmond, WA, USA
{lhe,zhang}@microsoft.com

Microsoft Research Technical Report: MSR-TR-2004-91

Short version published in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), Vol. 2, pp. 1113-1116, March 18-23, 2005, Philadelphia. Full version to appear in IEEE Transactions on Multimedia with the slightly modified title "Real-Time Whiteboard Capture and Processing Using a Video Camera for Remote Collaboration".

ABSTRACT

This paper describes our recently developed system which captures pen strokes on whiteboards in real time using an off-the-shelf video camera. Unlike many existing tools, our system does not instrument the pens or the whiteboard. It analyzes the sequence of captured video images in real time, classifies the pixels into whiteboard background, pen strokes and foreground objects (e.g., people in front of the whiteboard), and extracts newly written pen strokes. This allows us to transmit whiteboard contents using very low bandwidth to remote meeting participants. Combined with other teleconferencing tools such as voice conference and application sharing, our system becomes a powerful tool to share ideas during online meetings.

Keywords: Whiteboard capture, annotation, meeting archiving, video analysis, audio/video indexing, image classification, corporate knowledge base, digital library.

1 INTRODUCTION

The Real-Time Whiteboard Capture System (RTWCS) is a part of our Distributed Meetings project, which aims at dramatically improving knowledge workers' meeting experience using ubiquitous computing technologies. The work presented in this paper focuses on the particular meeting scenarios that use the whiteboard heavily, such as brainstorming sessions, lectures, project planning meetings, and patent disclosures. In those sessions, a whiteboard is indispensable. It provides a large shared space for the participants to focus their attention and express their ideas spontaneously.
It is not only effective but also economical and easy to use: all you need is a flat board and several dry-ink pens.

While whiteboard sessions are frequent for knowledge workers, they are not perfect. The content on the board is hard to archive or share with others who are not present in the session. People are often busy copying the whiteboard content to their notepads when they should spend time sharing and absorbing ideas. Sometimes they put a "Do Not Erase" sign on the whiteboard and hope to come back and deal with it later. In many cases, they forget or the content is accidentally erased by other people. Furthermore, meeting participants who are on a conference call from remote locations are not able to see the whiteboard content as the local participants do. In order to enable this, the meeting sites often must be linked with expensive video conferencing equipment. Such equipment includes a pan-tilt-zoom camera which can be controlled by the remote participants. It is still not always satisfactory because of viewing angle, lighting variation, and image resolution, not to mention the lack of effective archiving and indexing of whiteboard contents. Other equipment requires instrumentation either in the pens, such as Mimio from Virtual Ink, or on the whiteboard, such as SMARTBoard from SmartTech.

Our system was designed with two purposes: 1) to relieve meeting participants of the mundane task of note taking by capturing whiteboard content automatically; 2) to communicate the whiteboard content to the remote meeting

participants in real time using a fraction of the bandwidth required if video conferencing equipment is used.

To the best of our knowledge, all existing systems that capture whiteboard content in real time require instrumentation either in the pens or on the whiteboard. Our system allows the user to write freely on any existing whiteboard surface using any pen. To achieve this, our system uses an off-the-shelf high-resolution video camera which captures images of the whiteboard at 7.5 Hz. From the input video sequence, our algorithm separates people in the foreground from the whiteboard background and extracts the pen strokes as they are deposited on the whiteboard. To save bandwidth, only newly written pen strokes are compressed and sent to the remote participants. Furthermore, the images are white-balanced and color-enhanced for greater compression rate and better viewing experience than the original video.

There are a number of advantages in using a high-resolution video camera over the sensing mechanism of pen devices or electronic whiteboards: 1) Not requiring special pens and erasers makes the interaction much more natural. 2) Since the camera takes images of the whiteboard directly, there is no mis-registration of the pen strokes. 3) As long as the users turn on the system before erasing, the content will be preserved. 4) Images captured with a camera provide much more contextual information, such as who was writing and which topic was being discussed (usually by hand pointing).

As an extension to our system, we also applied our video algorithm to capture handwriting annotations on printed documents in real time, a common scenario in teleconferencing. As will be described, we found only a minor algorithmic change was needed.

The paper is organized as follows: Section 2 discusses related works. Section 3 explains the design choices that we made and presents the architecture of our system. Section 4 explains the technical challenges we encountered while building the system. We discuss the limitations of our system in Section 5 and conclude in Section 6.
2 RELATED WORKS

2.1 Whiteboard Capture Devices

Many technologies exist to capture the whiteboard content automatically. One of the earliest, the whiteboard copier, is a special whiteboard with a built-in copier. With a click of a button, the whiteboard content is scanned and printed. Once the whiteboard content is on paper, it can be photocopied, faxed, put away in the file cabinet, or scanned into digital form. Recent technologies all attempt to capture the whiteboard into digital form from the start. They generally fall into two categories:

2.1.1 Image Capture Devices

The devices in the first category capture images of the whiteboard directly. NTSC-resolution video cameras are often used because of their low cost. Since they usually do not have enough resolution for a typical conference room size whiteboard, several video frames must be stitched together to create a single whiteboard image. The ZombieBoard system [10], deployed internally at Xerox PARC, uses a Pan-Tilt video camera to scan the whiteboard. The Hawkeye system from SmartTech opts for a three-camera array that takes images simultaneously. Another device in this category is the digital still camera. As high resolution digital cameras get cheaper, taking snapshots of the board with a digital camera becomes a popular choice. To clean up the results, people use software (e.g. Adobe Photoshop or Polyvision's Whiteboard Photo) to crop the non-whiteboard region and color-balance the images.

The disadvantages of the image capture devices are as follows: 1) They capture the whiteboard content one snapshot at a time, so users have to make a conscious decision to take a snapshot of the whiteboard. 2) There is usually a lag between writing on the board and taking a snapshot. Using these devices in real-time teleconferencing scenarios is not very natural or convenient.

2.1.2 Pen Tracking Devices

Devices in the second category track the location of the pen at high frequency and infer the content of the whiteboard from the history of the pen coordinates. Mimio by Virtual Ink is one of the best systems in this category.
Mimio is an add-on device attached to the side of a conventional whiteboard and uses special adaptors for dry-ink pens and eraser. The adapted pen emits ultrasonic pulses when pressed against the board. The two receivers at the add-on device use the difference in time-of-arrival of the audio pulses to triangulate the pen coordinates. Since the history of the pen coordinates is captured, the content on the whiteboard can be reconstructed in real time. And because the content is captured in vector form, it can be transmitted and archived with low bandwidth and storage requirements.

Electronic whiteboards also use pen tracking technology. They go one step further by making the whiteboard an interactive device. For example, the SMARTBoard from SmartTech is essentially a computer with a giant touch-sensitive monitor. The user writes on the monitor with a special stylus which is tracked by the computer. The computer renders the strokes on the screen wherever the stylus touches the screen, as if the ink were deposited by the stylus. Because the strokes are computer generated, they can be edited, re-flowed, and animated.

However, the pen-tracking devices have several disadvantages:

1. They require instrumentation either to the pens and erasers or to the surface that they are writing on. For example, Mimio uses special adapters for dry-ink pens, which make them much thicker and harder to press. Electronic whiteboards are not even compatible with the existing whiteboards. They use touch screens as their writing surfaces, which limits their install base due to high cost and small size.

2. If the system is not turned on or the user writes or erases without using the special pens or erasers, the content cannot be recovered by the device. Many people like to use their fingers to correct small mistakes on the whiteboard. This common behavior causes extra strokes to appear on the captured content.

3. There is usually minor imprecision in the tracked pen coordinates, which tends to accumulate and cause mis-registrations among the neighboring strokes. Furthermore, it does not allow multiple users to write on the whiteboard simultaneously. The image capture devices do not have this problem since they work in a What You See Is What You Get (WYSIWYG) manner.

2.2 Multimedia Recording Systems

A lot of research has been done on the capture, integration, and access of the multimedia experience. People have developed techniques and systems that use handwritten notes, whiteboard content, slides, or manual annotations to index the recorded video and audio for easy access [1,2,4,6,7,8,9,10,11,12,13,14].

Inspired by those systems, we also developed a Whiteboard Capture System (WCS) about two years ago [15]. The goal of that project was to build a whiteboard capture system that combines the benefits of both image capture devices and pen tracking devices. What differentiates our WCS from other systems described in Section 2.1.1 is that we take pictures of the whiteboard continuously instead of occasional snapshots. Thus the entire history of changes to the whiteboard content is captured. If the camera and the whiteboard remain static, robust vision algorithms exist to remove the redundancy in multiple stroke images and to detect the time when the strokes first appear. Compact representation and availability of time stamps are the key benefits of the pen tracking devices.

We chose a Canon G2 digital camera as the input device for its high resolution (4 MP) and the availability of a software development kit, which allows us to control the camera from the PC. Because of the available high resolution, we do not need complex camera control such as in the ZombieBoard system [10].
However, because the camera is connected to the host PC via low bandwidth USB 1.1, the frame rate is limited to one frame every 5 seconds. At such a low frame rate, we made no attempt to use it as a real-time conferencing tool. Our algorithm was designed to analyze and browse offline meeting recordings. From the input image sequence, we compute a set of key frames that captures the history of the content on the whiteboard and the time stamps associated with each pen stroke. A key frame contains all the visual content before a major erasure. This information can then be used as a visual index to the audio meeting recording (see Figure 1 for such a browsing interface).

Figure 1: Browsing interface (labeled regions: Current Strokes, Raw Image, VCR & Timeline Control, Key Frame Thumbnails, Future Strokes). Each key frame image represents the whiteboard content of a key moment in the recording. The main window shows a composition of the raw image from the camera and the current key frame image. The pen strokes that the participants have already written are in solid color, while those to be written in the future (Future Strokes) are shown in ghost-like style. The user can click on any particular stroke (both past and future) to jump to the point of the meeting when that stroke was written, and listen to what people are talking about.

3 SYSTEM DESIGN

The Real-Time Whiteboard Capture System (RTWCS) described in this paper is a real-time extension to our previous WCS system.

3.1 Capture Device

Since building the WCS two years ago, there have been tremendous advances in digital imaging hardware. One notable example is the availability of inexpensive high resolution video cameras and high-speed USB 2.0 connection. For example, with an Aplux MU2 video camera connected to any PC with a USB 2.0 port, we are able to capture 1.3 megapixel images at 7.5 Hz. The resolution of each video frame is 1280 pixels by 1028 pixels, equivalent to 18 dpi for a 6-foot by 4-foot board. At 7.5 Hz, the whiteboard content can be captured in near real time, good enough to use in teleconferences.
For our particular scenario, it is a perfect compromise between the NTSC video camera and the high-resolution still image camera. Because of the available high resolution, we do not need complex mechanical camera controls such as in the ZombieBoard system [10].

3.2 Capture Interface Requirements

Like WCS, our system does not require people to move out of the camera's field of view during capture as long as they do not block the same portion of the whiteboard during the whole meeting. Unlike WCS, our system does not need special installation or calibration. Sitting on its built-in stand, the video camera can be placed anywhere that has a steady and clear view of the whiteboard. It can be moved occasionally during the meeting. After each move, it will automatically and quickly find the whiteboard region again (see Section 4.1.2). This improvement made our system much more portable and easier to use than the WCS.

Although the camera can be placed anywhere, the intended capture area should occupy as much of the video frame as possible in order to maximize the available image resolution. For better image quality, it is also better to place the camera right in front of the whiteboard, so that the whole board stays within the depth of field of the lens and remains in focus.

3.3 Automatic Camera Exposure Adjustment

The camera exposure parameter is kept constant. If the light setting does not change, the color of the whiteboard background should stay constant in a sequence. We wrote a

routine that automatically sets the exposure to minimize the number of saturated pixels (i.e. brightness level is 0 or 255). This routine is run once when the system is started and triggered to run again whenever a global change is detected (see Section 4.1.2).

4 TECHNICAL DETAILS

The input to RTWCS is a sequence of video images (see Figure 2). We need to analyze the image sequence in order to separate the whiteboard background from the person in the foreground and to extract the new pen strokes as they appear on the whiteboard. As mentioned earlier, there are a number of advantages in using a high-resolution video camera over the sensing mechanism of devices like Mimio or electronic whiteboards. However, our system has a set of unique technical challenges: 1) The whiteboard background color cannot be pre-calibrated (e.g. by taking a picture of a blank whiteboard) because each indoor room has several light settings that may vary from session to session, and the lighting condition of a room open to daylight is influenced by the weather and the direction of the sun; 2) Frequently, people move between the camera and the whiteboard, and these foreground objects occlude some portion of the whiteboard and cast shadows on it. Within a sequence, there may be no single frame that is completely unoccluded. We need to deal with these problems in order to extract the new pen strokes.

Figure 2: Selected frames from an input image sequence. The sequence lasts 82 seconds.

4.1 Image Sequence Analysis

Since the person who is writing on the board is in the line of sight between the camera and the whiteboard, he/she often occludes some part of the whiteboard. We need to segment the images into foreground objects and whiteboard. For that, we rely on two primary heuristics: 1) Since the camera and the whiteboard are stationary, the whiteboard background cells are stationary throughout the sequence until the camera is moved; 2) Although sometimes foreground objects (e.g., a person standing in front of the whiteboard) occlude the whiteboard, the pixels that belong to the whiteboard background are typically the majority.
Our algorithms exploit these heuristics extensively. We apply several strategies in our analysis to make the algorithm efficient enough to run in real time.

First, rather than analyzing the images at pixel level, we divide each video frame into rectangular cells to lower the computational cost. The cell size is roughly the same as the expected size of a single character on the board (16 by 16 pixels in our implementation). The cell grid divides each frame in the input sequence into individual cell images, which are the basic unit in our analysis.

Figure 3: The image sequence analysis process.

Second, our analyzer is structured as a pipeline of six analysis procedures (see Figure 3). If a cell image does not meet the condition in a particular procedure, it will not be further processed by the subsequent procedures in the pipeline. Therefore, many cell images do not go through all six procedures. At the end, only a small number of cell images containing the newly appeared pen strokes come out of the analyzer. The six procedures are:

1. Change detector: it determines if the cell images have changed since the last frame.
2. Color estimator: it computes the background color of the cell images, that is, the color of the blank whiteboard.
3. Background modeler: this is a dynamic procedure that updates the whiteboard background model by integrating the results computed from the previous procedure, which may have missing parts due to occlusion by foreground objects.
4. Cell classifier: it classifies the cell images into foreground or whiteboard cells.
5. Stroke extractor: it extracts the newly appeared strokes.
6. Color enhancer: it enhances the color of those extracted strokes.

The third strategy is specific to the video camera that we use in our system. The Aplux MU2 allows the video frames to be directly accessed in Bayer format, which is the single channel raw image captured by the CMOS sensor. In general, a demosaicing algorithm is run on the raw image to produce an RGB color image.
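To make the cell grid and the raw Bayer layout concrete, the following minimal sketch tiles a frame into 16 by 16 cells and splits one raw cell into its color channels. It is an illustration only: the RGGB mosaic order and the function names `cell_grid` and `split_bayer_cell` are assumptions made for the example, since the sensor's actual pattern is not stated here.

```python
CELL = 16  # cell edge in pixels, as in the paper


def cell_grid(width, height, cell=CELL):
    """Yield the (x0, y0) top-left corner of every full cell in a frame."""
    for y0 in range(0, height - cell + 1, cell):
        for x0 in range(0, width - cell + 1, cell):
            yield x0, y0


def split_bayer_cell(cell_rows):
    """Split a raw cell (a list of rows) into per-channel sample lists.

    Assumes an RGGB mosaic: even rows are R G R G ..., odd rows G B G B ...
    """
    channels = {"R": [], "G": [], "B": []}
    for y, row in enumerate(cell_rows):
        for x, value in enumerate(row):
            if y % 2 == 0:
                key = "R" if x % 2 == 0 else "G"
            else:
                key = "G" if x % 2 == 0 else "B"
            channels[key].append(value)
    return channels


# A synthetic raw cell, just to count samples per channel.
raw_cell = [[(x + y) % 256 for x in range(CELL)] for y in range(CELL)]
counts = {k: len(v) for k, v in split_bayer_cell(raw_cell).items()}

# One raw value per pixel versus three values per pixel after demosaicing.
raw_samples = CELL * CELL
rgb_samples = 3 * CELL * CELL
saving = 1 - raw_samples / rgb_samples

# Number of full cells in a 1280 by 1028 frame.
n_cells = sum(1 for _ in cell_grid(1280, 1028))
```

Under these assumptions a cell holds 128 green, 64 red and 64 blue samples, and a raw cell carries one third of the values of its RGB counterpart.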
By processing the cell images in raw Bayer space instead of RGB space, delaying demosaicing until the final step, and running it only on the cells containing new strokes, we save memory and processing by at least 66%. An additional benefit is that we can obtain a higher quality RGB image at the end by using

a more sophisticated demosaicing algorithm than the one built into the camera driver.

4.1.1 Analysis State

The analysis algorithm keeps some state as it processes the input video frames: 1) The last video frame it has processed; 2) The age of each cell image in the last frame. The age is defined to be the number of frames that the cell image remains unchanged; 3) Cell images with whiteboard content that have been detected so far; 4) The whiteboard background model (see Section 4.1.4 for details).

4.1.2 Assigning Age to Cells

We first assign an age to each cell image. To determine whether a cell image has changed, it is compared against the image of the same cell (i.e., the cell in the same location) in the previous frame using a modified Normalized Cross-Correlation (NCC) algorithm. Note that the NCC is applied to the images in the Bayer space.

Consider two cell images $I$ and $I'$. Let $\bar{I}$ and $\bar{I}'$ be their mean colors and $\sigma$ and $\sigma'$ be their standard deviations. The normalized cross-correlation score is given by

$$c = \frac{1}{N\sigma\sigma'} \sum_{i} (I_i - \bar{I})(I'_i - \bar{I}'),$$

where the summation is over every pixel $i$ and $N$ is the total number of pixels. The score ranges from -1, for two images not similar at all, to 1, for two identical images. Since this score is computed after the subtraction of the mean color, it may still give a high value even if two images have very different mean colors. So we have an additional test on the mean color difference based on the Mahalanobis distance [3], which is given by

$$d = \frac{|\bar{I} - \bar{I}'|}{\sqrt{\sigma^2 + \sigma'^2}}.$$

In summary, two cell images $I$ and $I'$ are considered to be identical and thus should be put into the same group if and only if $d < T_d$ and $c > T_c$. In our implementation, $T_d = 2$ and $T_c = 0.707$.

If the comparison indicates a cell image has changed from the previous frame, its age is set to 1. Otherwise, it is incremented by 1. At each frame, all the cells that have been stationary for more than the age threshold (4 frames in our system, about 0.5 second at 7.5 Hz) are considered to be background candidates and fed to the Whiteboard Color Model Update module.
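The change test and the age bookkeeping described above can be sketched as follows. This is a minimal illustration rather than the system's code: cells are flattened lists of samples from one channel, the form of $d$ (difference of means over the root of the summed variances) is an assumed reading of the distance formula, and `NOISE_FLOOR` together with the special handling of textureless cells are assumptions added to keep the ratios well defined.

```python
import math

T_D = 2.0           # threshold on the mean-color distance d (paper value)
T_C = 0.707         # threshold on the correlation score c (paper value)
AGE_THRESHOLD = 4   # frames a cell must stay unchanged (paper value)
NOISE_FLOOR = 1.0   # assumed minimum std; not a value from the paper


def mean_std(values):
    """Population mean and standard deviation of a flat list of samples."""
    n = len(values)
    mean = sum(values) / n
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def cells_identical(a, b):
    """True if two equally sized, flattened cell images pass both tests."""
    mean_a, std_a = mean_std(a)
    mean_b, std_b = mean_std(b)
    flat = std_a < NOISE_FLOOR and std_b < NOISE_FLOOR
    std_a, std_b = max(std_a, NOISE_FLOOR), max(std_b, NOISE_FLOOR)
    d = abs(mean_a - mean_b) / math.sqrt(std_a ** 2 + std_b ** 2)
    if flat:
        # Assumption: for textureless cells the correlation is only noise,
        # so the decision rests on the mean test alone.
        return d < T_D
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    c = cov / (std_a * std_b)
    return d < T_D and c > T_C


def update_age(prev_cell, cur_cell, age):
    """Reset the age to 1 on change, otherwise increment it."""
    return age + 1 if cells_identical(prev_cell, cur_cell) else 1


def is_background_candidate(age):
    """Cells unchanged for more than the threshold feed the color model."""
    return age > AGE_THRESHOLD


# A cell with a dark stroke, the same cell one level brighter, and a cell
# where the stroke sits elsewhere.
stroke = [50] * 64 + [200] * 192
brighter = [v + 1 for v in stroke]
moved = [200] * 192 + [50] * 64
same = cells_identical(stroke, brighter)
changed = not cells_identical(stroke, moved)
```

A uniform brightness shift of one level leaves the cell "unchanged", while moving the stroke within the cell drives the correlation below $T_c$ and resets the age.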
If the age is not greater than the age threshold, the cell image is not processed further during this frame. The age threshold is a trade-off between the output delay and analysis accuracy.

We also compute the percentage of the cell images that have changed. If the change is more than 90%, we assume something drastic and global has happened since the last frame (e.g. the light setting has changed, the camera has moved, etc.). In such an event, all state is re-initialized and the exposure calibration routine is called. Other more localized changes (e.g. people moving across, gradual change in sunlight) are handled dynamically by the Whiteboard Color Model Update Module.

4.1.3 Computing the Background Color

To classify cells, we need to know for each cell what the whiteboard background color is (i.e. the color of the whiteboard itself without anything written on it). The whiteboard background color is also used in color-enhancing the extracted cell images (see Section 4.1.7), so it needs to be estimated accurately to ensure the quality.

Since the ink absorbs the incident light, the luminance of the whiteboard pixels is higher than that of the pen stroke pixels. The whiteboard color within the cell is therefore the color with the highest luminance. In practice, the colors of the pixels in the top 10th percentile are averaged in order to reduce the error introduced by sensor noise. Hence, the color of each cell is computed by first sorting the pixels of the same color channel (128 green, 64 blue and 64 red values in a 16x16 cell image in Bayer space) and then averaging the values in the top 10th percentile of each channel.

4.1.4 Updating the Whiteboard Color Model

The color computed in the previous section gives a good estimate of the whiteboard color for cells containing some whiteboard background. However, it gives the wrong color when the cells contain only foreground or pen strokes (1st image in Figure 4). We have to identify those cells to prevent them from contaminating the whiteboard color model.

Figure 4: 1) Colors of the cell images.
Note that the strokes on the whiteboard are removed by our background color estimation algorithm. 2) Colors of the cell images that go into the Update Module. Note the black regions contain the cell colors that are filtered out by both the change detector and the color estimator. 3) Integrated whiteboard color model.

We use a least-median-squares algorithm, which fits a global plane over the colors and throws away the cells that contain outlier colors (see Appendix for details). The remaining cells are considered as background cells and their colors are used to update the whiteboard background (2nd image in Figure 4).

We then use a Kalman filter to dynamically incorporate the background colors computed from the current frame into the existing whiteboard background color model. The state for the cell is its color $C$, together with variance $P$ representing the uncertainty. $P$ is initially set to $\infty$ to indicate no observation is available. The update is done in two steps:

Integrate. Let $O$ be the color of the cell computed from the current frame. There is also an uncertainty, $Q$, associated with $O$. In our current system, it can only be one of two values: $\infty$ if the cell color is an outlier, 4 otherwise (i.e., the standard deviation is equal to 2 intensity levels). Considering possible lighting variation during the time

elapsed since the last frame, the uncertainty $P$ is first increased by $\Delta$ (4 in our system, equivalent to a standard deviation of 2). $C$ and $P$ are then updated according to the classic Kalman filter formula:

$$K = \frac{P}{P + Q}, \qquad C \leftarrow C + K\,(O - C), \qquad P \leftarrow (1 - K)\,P.$$

Propagate. In order to fill the holes created by the cells that are occluded by foreground objects and to ensure the color model is smooth, the cell colors are propagated to the neighboring cells. Each cell incorporates the states of its 4 neighbors, indexed by $n$, according to the following:

$$C \leftarrow \frac{C P^{-1} + \frac{1}{16} \sum_{n} C_n (P_n + \Delta)^{-1}}{P^{-1} + \frac{1}{16} \sum_{n} (P_n + \Delta)^{-1}}, \qquad P \leftarrow \left( P^{-1} + \frac{1}{16} \sum_{n} (P_n + \Delta)^{-1} \right)^{-1}.$$

Note that we increase the uncertainty of its neighbors by $\Delta$ (4 in our system) to allow color variation. A hole of size N generally takes N/2 frames to get filled. Since the uncertainty in the cells with filled values is much larger than in the ones with observed values (due to the added $\Delta$), the filled values are quickly supplanted by the observed values once they become available. An example of an integrated whiteboard color is the 3rd image in Figure 4. Note that the bookshelf area on the left side of the image is never filled.

4.1.5 Classifying Cells

This step is to determine whether a cell image is a foreground object or the whiteboard. We perform this at two levels: individual and neighborhood.

At the individual cell level, given a good whiteboard color model, we simply compute the Euclidean distance between the background color of the cell image (computed in Section 4.1.3) and the color of the corresponding cell location in the whiteboard background model. If the difference exceeds a threshold (4 brightness levels in our system), the cell image is classified as a foreground object.

However, more accurate results can be achieved by utilizing the spatial relationship among the cell groups. The basic observation is that foreground cells should not appear isolated spatially, since a person usually blocks a continuous region of the whiteboard. So at the neighborhood level, we perform two filtering operations on every frame. First, we identify isolated foreground cells and reclassify them as whiteboard.
This operation corrects the misclassification of the cells that are entirely filled with strokes. Second, we reclassify whiteboard cells which are immediately connected to some foreground cells as foreground cells. One main purpose of the second operation is to handle the cells at the boundaries of the foreground object. Notice that if such a cell contains strokes, the second operation would incorrectly classify this cell as a foreground object. It will be correctly reclassified as whiteboard once the foreground object moves away. Extending the foreground object boundary delays the recognition of strokes by a few frames, but it prevents some parts of the foreground object from being classified as strokes, which is a far worse situation.

Figure 5: Samples of the classification results. The images above correspond to the images in Figure 2. The classified foreground region is tinted in purple and is larger than the actual foreground object due to motion, shadow, and spatial filtering.

4.1.6 Extracting New Strokes

The cells classified as foreground are not further processed. For cells classified as whiteboard, we check whether there is a whiteboard cell already existing in the same cell location in the output depository. If not, the cell is a new whiteboard cell. If a whiteboard cell does exist, we still need to check whether the existing cell and the current cell image are the same, using the same image difference algorithm as in Section 4.1.2. If they are different, the user has probably erased the whiteboard and/or written something new, and therefore the whiteboard cell in the output depository is replaced by the current cell image. Periodically (every 30 seconds), we update all existing whiteboard cells with current cell images to account for possible lighting variations.

4.1.7 Color-enhancing the Stroke Images

At this stage, the newly extracted cell images are finally converted from raw Bayer images into RGB images. The demosaicing algorithm used by the Aplux camera driver is not documented. Instead of the built-in one, we use the demosaicing algorithm proposed in [5], which handles edges much better.
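Looking back at the cell classification of Section 4.1.5, the individual test and the two neighborhood filters can be sketched as follows. This is only a minimal illustration, not the system's actual code: the function name `classify_cells`, the array layout (one color per cell), and the use of an 8-connected neighborhood are assumptions, since the text does not say which connectivity defines "isolated" or "immediately connected". The threshold of 4 brightness levels is taken from the text.

```python
import numpy as np


def classify_cells(cell_colors, background_model, threshold=4.0):
    """Return a boolean mask that is True where a cell is foreground.

    cell_colors, background_model: arrays of shape (rows, cols, channels)
    holding one color per cell (hypothetical layout).
    """
    # Individual level: Euclidean distance between the cell color and
    # the whiteboard background model at the same cell location.
    diff = cell_colors.astype(float) - background_model.astype(float)
    foreground = np.linalg.norm(diff, axis=2) > threshold

    def foreground_neighbors(mask):
        # Count foreground cells among the 8 neighbors of every cell
        # (8-connectivity is an assumption of this sketch).
        rows, cols = mask.shape
        padded = np.pad(mask.astype(int), 1)
        total = np.zeros((rows, cols), dtype=int)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                total += padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
        return total

    # Filter 1: isolated foreground cells are reclassified as whiteboard.
    foreground = foreground & (foreground_neighbors(foreground) > 0)

    # Filter 2: whiteboard cells touching a foreground cell become
    # foreground, which extends the boundary of the foreground object.
    foreground = foreground | (foreground_neighbors(foreground) > 0)
    return foreground
```

Applying the first filter before the second matters in this sketch: a lone misclassified stroke cell is removed before the boundary extension could grow it into a larger foreground patch.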
After demosaicing, the images still look color-shifted and noisy. They need to be white-balanced and color-enhanced. The process consists of two steps:

1. Make the background uniformly white and increase color saturation of the pen strokes. For each cell, the whiteboard color computed in Section 4.1.3, $I_w$, is used to scale the color of each pixel in the cell:

\[ I_{out} = \min\left(255,\ \frac{I_{in}}{I_w} \cdot 255\right) \]

2. Reduce image noise. We remap the value of each color channel of each pixel in the key frames according to an S-shaped curve. Figure 3 is an example of such color enhancement.

4.2 Teleconferencing Experience

Figure 6: Real-time whiteboard system inside the Windows Messenger.

To test our system in a real teleconference setting, we adapted our system to be a plug-in to the Whiteboard applet of the Microsoft Windows Messenger. The Whiteboard applet allows the users at two ends of a Windows Messenger session to share a digital whiteboard. The user at one end can paste images or draw geometric shapes and the user at the other end can see the same change almost instantaneously. Usually, the user draws objects with the mouse, which is very cumbersome. With our system, the user can write on a real whiteboard instead.

Our system uses 45-50% of the processing power of a 2.4 GHz Pentium 4 PC. Once launched, our system initializes in about 5 seconds, which includes the time to do exposure calibration, initialize the whiteboard color model, and capture the content already existing on the whiteboard.

The changes to the whiteboard content are automatically detected by our system and incrementally piped to the Whiteboard applet as small cell image blocks. The Whiteboard applet is responsible for compressing and synchronizing the digital whiteboard content shared with the remote meeting participant. The remote participant can add annotations on top of the whiteboard image using the mouse. When used with other Windows Messenger tools, such as voice conferencing and application sharing, whiteboard sharing becomes a very useful tool for communicating ideas.

The delay between a stroke appearing in the input video and showing up on the local Whiteboard applet is about 1 second. Network transport takes an additional 0.5 second or more, depending on the distance between the two ends. Because the resulting image contains only a uniform background and a handful of colors, the required communication bandwidth after compression is proportional to the amount of content that the user produces.
Using GIF compression, a reasonably full whiteboard image at 1.3 MP takes about 200K bytes (Figure 3 takes 70K bytes). After the initial image capture, the whiteboard updates take 50-100 bytes per cell. Since usually only a handful of cells are changing at a time when the whiteboard is in use, the sustained network bandwidth requirement is far below that of video conferencing solutions, which makes the system suitable even for use over a dial-up network.

The reader is encouraged to watch the accompanying video that captures the working of our RTWCS. On the left is the interface which allows us to choose which camera, if there are several, and which demosaicing algorithm (we have implemented several) to use. It also shows a downsampled live video. On the right is the digital whiteboard application shared with the remote participant, i.e., the remote participant sees exactly what is shown in this window. As can be observed, the person in front of the whiteboard is filtered out. The new strokes show up in the digital whiteboard with a very short delay (about 1 second). Note that the order in which new strokes appear in the digital whiteboard is not necessarily the same as the order in which they were written. The former depends on when the camera sees the strokes. Because of white-balancing and color enhancement, the quality of the whiteboard contents that the remote participant sees is obviously much better than that of the video.

4.3 Capturing Printed Documents and Annotations

We also tried to use our system to capture handwritten annotations on printed documents in real time, a common scenario in teleconferencing when participants need to share paper documents. We built a gooseneck support so the camera can be pointed downward securely. When capturing 8.5" x 11" documents, we found that the document image is legible down to 6pt fonts.

The software works in general without modification, except for one problem. We discovered that when annotating paper documents, the user tends to make small movements to the paper. This behavior causes a reset to be triggered quite often.
Each such reset causes an additional 5-second delay before the annotation appears. To overcome this problem, we added an efficient homography-based image matching algorithm to align each input video frame to the first frame, which is equivalent to motion-stabilizing the input video. This modification removes most of the resets and makes the system much more usable. This functionality is also demonstrated in the accompanying video.

5 LIMITATIONS OF OUR SYSTEM

Under some very special circumstances, our system may fail to produce the desired results. For example, if a person stands perfectly still in front of the whiteboard for an extended period, our system would not be able to determine that it is a person. The cells covered by the person would be treated as either strokes or whiteboard, depending on the textures of his or her clothing.

If a region of the whiteboard is never exposed to the camera, our system will obviously not be able to figure out the content in that region. This situation actually happens quite often even for local meeting participants. In that case, the users, remote or local, can ask the presenter to move away temporarily when their views of the whiteboard are blocked. (Remote participants have an advantage over the local ones when the presenter occludes a region already seen by the camera. They can see the contents behind the presenter in their digital whiteboard.)

6 CONCLUDING REMARKS

Meetings constitute a large part of knowledge workers' working time. Making more efficient use of this time translates to a big increase in their productivity. The work presented in this paper, the Real Time Whiteboard Capture System, allows the users to share ideas on a whiteboard in a variety of teleconference scenarios: brainstorming sessions, lectures, project planning meetings, patent disclosures, etc. Compared to video conferencing solutions, our system takes only a fraction of their bandwidth and is suitable even on dial-up networks. Compared to other whiteboard capture technologies, the users of our system can write naturally using a regular board and pens.

With devices like electronic whiteboards and Tablet PCs, the users are promised the ability to write freely in an all-electronic medium. But as the cost of those devices is still quite high, we believe that the combination of the omnipresent whiteboard and a low-cost video camera will be a very promising solution for the foreseeable future.

Appendix: Plane-based whiteboard color estimation

We only consider one component of the color image, but the technique described below applies to all components (R, G, B, or Y). Each cell $i$ is defined by its image coordinates $(x_i, y_i)$. Its color is designated by $z_i$ ($z$ = R, G, B, or Y). The color is computed as described in Section 4.1.3, and is therefore noisy and even erroneous. From our experience with the meeting rooms in our company, the color on the whiteboard, although not constant, varies regularly. It is usually much brighter in the upper part and becomes darker toward the lower part, or is much brighter in one of the upper corners and becomes darker toward the opposite lower corner. This is because the lights are installed against the ceiling.
Therefore, for a local region (7x7 cells in our case), the color can be fit accurately by a plane; for the whole image, a plane fitting is still very reasonable, and provides a robust indication of whether a cell color is an outlier.

A plane can be represented by $ax + by + c - z = 0$. We are given a set of 3D points $\{(x_i, y_i, z_i) \mid i = 1, \ldots, n\}$ with noise only in $z_i$. The plane parameters $p = [a, b, c]^T$ can be estimated by minimizing the following objective function:

\[ F = \sum_i f_i^2, \qquad \text{where } f_i = a x_i + b y_i + c - z_i . \]

The least-squares solution is given by

\[ p = (A^T A)^{-1} A^T z, \qquad \text{where } A = \begin{bmatrix} x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots \\ x_n & y_n & 1 \end{bmatrix} \text{ and } z = [z_1, \ldots, z_n]^T . \]

Once the plane parameters are determined, the color of the cell $i$ is replaced by $\hat{z}_i = a x_i + b y_i + c$.

The least-squares technique is not robust to erroneous data (outliers). As mentioned earlier, the whiteboard color we initially computed does contain outliers. In order to detect and reject outliers, we use a robust technique to fit a plane to the whole whiteboard image. We use the least-median-squares [8], a very robust technique that is able to tolerate nearly half of the data being outliers. The idea is to estimate the parameters by minimizing the median, rather than the sum, of the squared errors, i.e.,

\[ \min_p \ \operatorname*{median}_i \ f_i^2 . \]

We first draw $m$ random subsamples of 3 points (3 is the minimum number to define a plane). Each subsample gives an estimate of the plane. The number $m$ should be large enough such that the probability that at least one of the $m$ subsamples is good is close to 1, say 99%. If we assume that half of the data could be outliers, then $m = 35$; therefore the random sampling can be done very efficiently. For each subsample, we compute the plane parameters and the median of the squared errors $f_i^2$ over the whole set of data points. We retain the plane parameters that give the minimum median of the squared errors, denoted by $M$. We then compute the so-called robust standard deviation $\sigma = 1.4826 \sqrt{M}$ (the coefficient is used to achieve the same efficiency when no outliers are present). A point $i$ is considered to be an outlier and discarded if its error $|f_i| > 2.5\sigma$.
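As a rough illustration of the robust fit described in this appendix, the sketch below draws random 3-point subsamples, keeps the plane with the smallest median squared error, rejects outliers at $2.5\sigma$, and then performs the closing least-squares refit on the remaining points. It is a minimal sketch rather than the system's implementation: the function name `fit_plane_lmeds`, the fixed random seed, and the handling of degenerate (collinear) subsamples are assumptions, and the square root in the robust standard deviation is assumed so that $\sigma$ has the same units as the error. The default of 35 draws follows the text; it agrees with the usual estimate $m = \log(1 - 0.99) / \log(1 - 0.5^3) \approx 34.5$.

```python
import numpy as np


def fit_plane_lmeds(x, y, z, num_samples=35, seed=0):
    """Least-median-of-squares fit of the plane z = a*x + b*y + c.

    Returns (p, inliers, z_clean): the refined parameters [a, b, c],
    a boolean inlier mask, and the colors with outliers replaced by
    the plane value.
    """
    rng = np.random.default_rng(seed)
    n = len(z)
    A = np.column_stack([x, y, np.ones(n)])

    best_p, best_median = None, np.inf
    for _ in range(num_samples):
        idx = rng.choice(n, size=3, replace=False)
        try:
            # Three points define a plane exactly.
            p = np.linalg.solve(A[idx], z[idx])
        except np.linalg.LinAlgError:
            continue  # collinear sample, no unique plane
        median = np.median((A @ p - z) ** 2)
        if median < best_median:
            best_p, best_median = p, median
    if best_p is None:
        raise ValueError("all subsamples were degenerate")

    # Robust standard deviation from the minimum median M
    # (square root assumed; see the lead-in).
    sigma = 1.4826 * np.sqrt(best_median)
    inliers = np.abs(A @ best_p - z) <= 2.5 * sigma

    # Final least-squares fit on the retained (good) points only.
    p, *_ = np.linalg.lstsq(A[inliers], z[inliers], rcond=None)

    # Outlier cells take the color predicted by the plane.
    z_clean = np.where(inliers, z, A @ p)
    return p, inliers, z_clean
```

The function would be called once per color component, with `x` and `y` holding the cell coordinates and `z` the per-cell color estimates.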
Finally, a plane is fit to the good points using the least-squares technique described earlier. The color of an outlier cell $i$ is replaced by $\hat{z}_i = a x_i + b y_i + c$.

REFERENCES

1. Abowd, G. D., Atkeson, C. G., Brotherton, J. A., Enqvist, T., Gulley, P. & Lemon, J. Investigating the capture, integration and access problem of ubiquitous computing in an educational setting. In Proceedings of CHI '98, pp. 440-447, May 1998.
2. Chiu, P., Kapuskar, A., Reitmeier, S. & Wilcox, L. NoteLook: Taking notes in meetings with digital video and ink. Proceedings of ACM Multimedia '99. ACM, New York, pp. 149-158.
3. Duda, R. O., Hart, P. E. & Stork, D. G. Pattern Classification, Second Edition. John Wiley & Sons, New York, 2001.
4. Ju, S. X., Black, M. J., Minneman, S. & Kimber, D. Analysis of Gesture and Action in Technical Talks for Video Indexing. In IEEE Trans. on Circuits and Systems for Video Technology.
5. Malvar, H. S., He, L. & Cutler, R. High-Quality Linear Interpolation for Demosaicing of Bayer-Patterned Color Images. ICASSP 2004.
6. Moran, T. P., Palen, L., Harrison, S., Chiu, P., Kimber, D., Minneman, S., Melle, W. v. & Zellweger, P. "I'll Get That Off the Audio": A Case Study of Salvaging Multimedia Meeting Records. In Proceedings of CHI '97, Atlanta, GA, 1997.
7. Pedersen, E., McCall, K., Moran, T. P. & Halasz, F. Tivoli: An electronic whiteboard for informal workgroup meetings. Proceedings of INTERCHI '93, pp. 391-398.
8. Rousseeuw, P. & Leroy, A. Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987.
9. Stifelman, L. J., Arons, B., Schmandt, C. & Hulteen, E. A. VoiceNotes: A Speech Interface for a Hand-Held Voice Notetaker. Proc. INTERCHI '93 (Amsterdam, 1993), ACM.
10. Saund, E. Image Mosaicing and a Diagrammatic User Interface for an Office Whiteboard Scanner. Technical Report, Xerox Palo Alto Research Center, 1999.
11. Weber, K. & Poon, A. Marquee: A tool for real-time video logging. Proceedings of CHI '94, pp. 58-64.
12. Wilcox, L. D., Schilit, B. N. & Sawhney, N. Dynomite: A dynamically organized ink and audio notebook. Proceedings of CHI '97, pp. 186-193.
13. Whittaker, S., Hyland, P. & Wiley, M. Filochat: Handwritten notes provide access to recorded conversations. Proceedings of CHI '94, pp. 271-276.
14. Wolf, C., Rhyne, J. & Briggs, L. Communication and information retrieval with a pen-based meeting support tool. Proceedings of CSCW '92, pp. 322-329.
15. He, L.-w., Liu, Z. & Zhang, Z. Why Take Notes? Use the Whiteboard Capture System. ICASSP 2003.