Multi-camera scheduling for video production

Fahad Daniyal and Andrea Cavallaro
Queen Mary University of London, Mile End Road, E1 4NS London, United Kingdom
Email: {fahad.daniyal, andrea.cavallaro}@eecs.qmul.ac.uk

Abstract

We present a novel algorithm for automated video production based on content ranking. The proposed algorithm generates videos by performing camera selection while minimizing the number of inter-camera switches. We model the problem as a finite horizon Partially Observable Markov Decision Process over temporal windows and we use a multivariate Gaussian distribution to represent the content-quality score for each camera. The performance of the proposed approach is demonstrated on a multi-camera setup of fixed cameras with partially overlapping fields of view. Subjective experiments based on the Turing test confirmed the quality of the automatically produced videos. The proposed approach is also compared with recent methods based on Recursive Decision and on Dynamic Bayesian Networks, and its results outperform both methods.

Keywords: Best-view selection; Feature analysis; Content ranking; Autonomous video production; Camera scheduling.

1 Introduction

The selection of the camera that best describes a dynamic scene is an important problem in multi-camera networks. This selection is useful for automated video production and for highlights generation. The main challenges in addressing this problem are the effective analysis of the video content and the identification of the best view to satisfy the objectives of the task at hand. These objectives can be selecting the camera capturing the maximum number of people, or offering the best view of a specific person or an activity. Camera selection can be seen as an optimization [10] or as a scheduling [4] problem, where the goal is to maximize the visibility of the features or events of interest while minimizing inter-camera switching. Camera selection methods involve a content analysis stage that assigns a score to each camera based on its Quality of View (QoV). Scoring can be based on a single feature [5] or a combination of features [9].
Naive methods for best-view selection based on QoV usually perform poorly, as they generally produce frequent view changes [10]. To mitigate this problem, reward and cost functions associated with camera switching have been introduced [6, 11]. The reward can be expressed in terms of feature observability and smoothness of the final output video, and a cost is incurred whenever the selected view is switched. In general, best-view selection requires knowledge of the selected camera and the QoV over a finite time horizon [5]. Moreover, the camera selection strategy should be able to predict the time intervals during which features or objects of interest would be most visible [2]. For this reason, an efficient camera selection strategy should take into account past as well as future (or predicted) information.

The works in [2, 5] use a scheduling interval to observe targets for a certain minimal duration. This approach does not scale to large camera networks where multiple cameras may be simultaneously competing for the best view. In [2], Time Dependent Orienteering (TDO), target motion, position, target birth and deadline are used to trigger a pan-tilt-zoom camera that captures targets in the scene. The cost of the system is associated with the number of targets not captured. The scheduling strategy is the Kinetic Traveling Salesperson Problem with deadlines. A schedule to observe targets is chosen that minimizes the path cost in terms of TDO. This work does not consider target occlusions and does not predict the best time intervals to capture images. In [10] a cost function is proposed that depends on QoV using features such as object size, pose and orientation. While this approach minimizes frequent switches, it tends not to select the best view for highly dynamic scenes, as demonstrated in Sec. 3.

Scheduling strategies based on queue processing techniques have also been used for camera selection [19, 14]. In this case views compete to be selected and are assigned priorities based on their features. In [19], when more than one person is in the view of a fixed camera, an active camera focuses on the closest target.
The performance of the system does not scale with the number of objects as the camera switches from target to target. The work in [14] uses a Weighted Round Robin technique for scheduling each target that enters the monitored area, but no penalties are assigned to frequent camera switching. Greedy scheduling policies have also been used for camera scheduling [4]. In these methods targets are treated as network packets and routing approaches based on techniques such as First Come First Served (FCFS), Earliest Deadline First (EDF) and Current Minloss Throughput Optimal (CMTO) are used. These approaches do not include the transition cost for the camera that is associated with target swaps. Moreover, all these approaches assume that the dynamics of the observed site remain constant when a certain person is viewed and,
while minimizing the switches, they do not quantify the loss of information in the views that are not selected.

Figure 1: Block diagram of the proposed approach. For each camera $C_1, \ldots, C_N$, camera feature extraction comprises motion detection, object detection, object score estimation and event detection; the resulting features feed the content ranking and camera scheduling stage.

In this paper we model the view-selection problem as a decision process during which information is only partially visible. In particular we use a Partially Observable Markov Decision Process (POMDP), where the process measured by the cameras (e.g., object size, location or scene activity) is a Markov process and the sensor scheduling is based on recursively estimating and updating the belief state, the sensor-scheduling actions, and the posterior distribution of the process given the history of the sensor measurements. We represent the process dynamics and measurements as linear Gaussian state-space models and track the belief using the Bayes rule. The reward for camera selection is modeled as a function of a content-quality score and the related camera switching. This reward modeling allows the proposed approach to control the number of camera switches, thus enabling the generation of pleasant videos. The proposed approach is tested using both objective and subjective evaluations on a real multi-camera setup with partially overlapping fields of view.

The paper is organized as follows. In Sec. 2 we present the proposed approach for feature-based camera selection. Experimental results on a real multi-camera basketball dataset are evaluated and discussed in Sec. 3. Finally, conclusions are drawn in Sec. 4.

2 Proposed approach

Let a site be monitored by a set of $N$ cameras $C = \{C_1, \ldots, C_N\}$, with $N \geq 2$. Dynamic best-view selection can be regarded as a three-stage problem (see Fig. 1). The first stage focuses on the extraction over time $t$, for each view $i$, (1) of a feature vector for each object $j$ within the view, $\psi_t^{ij}$, and (2) of features associated with the entire camera view, $\psi_t^i$.
The selection of the features depends on the task at hand. In the second stage, a QoV score, $\rho_t^i$, is computed for each camera $C_i$ at each time $t$ based on the object features $\psi_t^{i,o}$ (i.e., all $\psi_t^{ij} : j = 1, \ldots, J_t^i$, where $J_t^i$ is the number of objects in the view of $C_i$ at time $t$) and the camera features $\psi_t^i$. $\rho_t^i$ can then be represented as a measure of feature visibility:

    $\rho_t^i = M(\psi_t^{i,o}, \psi_t^i)$,    (1)

where $M(\cdot)$ generates the QoV $\rho_t^i$ given the two feature vectors. In the third stage, a best-camera selection mechanism is constructed as a function of time $t$ such that the best trade-off between best-camera selection and number of switches is found.

Footnote 1: An example can be seen at http://www.eecs.qmul.ac.uk/~andrea/view-selection.html

2.1 Camera selection

Let the selected camera at time $t$ be represented by an $N$-dimensional vector $\Omega_t^i = (c_t^1, \ldots, c_t^N)$ which has 1 only at index $i$ and is 0 elsewhere. The best view can be selected for each $t$ as

    $i^* = \arg\max_{i = 1, \ldots, N} (\rho_t^1, \rho_t^2, \ldots, \rho_t^N)$.    (2)

However, in such a selection the number of switches is not constrained, thus generating unpleasant videos. To solve this problem, fixed constraints such as a minimum scheduling period can be introduced [2]. However, in realistic scenarios such a constraint may cause loss of information, as sudden and important changes in video content from one view will not be catered for. To this end it is preferable to set this period such that the selected camera is the best camera most of the time [10]. To constrain the number of inter-camera switches, past information [10] as well as future information [2] can be used. Moreover, the best-camera selection process should take into account the currently selected view. This dynamic modeling of the multi-camera system can be done by modeling the state of each camera using a random variable at each point of time. The instantaneous snapshot of such random variables, at each $t$, describes the state of our multi-camera system at $t$. To this end we model the camera selection problem as a Markovian Decision Process (MDP) with partially observable states (POMDP), where each decision (action) takes our system
to the next state. POMDP implies control over the states while having partial observability, i.e. no information about the future states [8]. Within the POMDP framework, for best-view selection we first map the observation and the selected camera to associate a utility to the system; then an optimal policy is defined that maximizes this utility. The maximization of the utility relates to the best-view selection. A POMDP can be defined by the influence diagrams shown in Fig. 2. Let the state of a camera $C_i$ at time $t$ be represented as $s_t^i \in \mathbb{R}^+$, where the state space for the POMDP is $S = [0, 1]$. Thus the state for the multi-camera system at time $t$ can be expressed as

    $s_t = (\rho_t^1, \ldots, \rho_t^N) \in (\mathbb{R}^+)^N$.    (3)

Let the action space be represented by $C$ and the action at any time be represented by the camera transition $c_t^j \to c_{t+1}^i$. Then the reward $u(s_t, c_t^i)$ of selecting a camera $c_t^i \in C$, given the state $s_t$, can be represented by the one-step reward function

    $u(s_t, c_t^i) = \alpha \rho_t^i + (1 - \alpha) \vartheta_t^i$,    (4)

where $\alpha \in [0, 1]$ is a scaling factor and $\vartheta_t^i \in \{0, 1\}$ is defined based on the previously selected camera:

    $\vartheta_t^i = 1$ if $c_{t-1}^i = 1$, and $\vartheta_t^i = 0$ otherwise.    (5)

It should be noted that if $\alpha = 1$, $u(s_t, c_t^n) = \rho_t^n$, thus converting this utility into the quality score (Eq. 10). Hence the system will select only the best camera over a temporal window without introducing any smoothing. The one-step cost function described in Eq. 4 is an integrated metric that accounts for both camera switching and the observability of features given by the accumulated quality score at each time $k$.

The state space of a POMDP is continuous and estimating the current state from such a large state space is computationally intractable. Thus approximate solutions for state estimation are formulated [6]. These solutions assume that the space $S$ is quantized with a factor $g$ such that the quantized state is represented as $s^d = g \cdot s$, where $g = (g_1, g_2, \ldots, g_S)$ and $g_{k+1} > g_k$, with $k \in [1, S]$. For clarity, we will drop the superscript $d$ from $s^d$ and refer to this discrete state as $s$.
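As a minimal sketch of the one-step reward of Eqs. 4 and 5, the snippet below scores every camera from a vector of QoV values and the index of the previously selected camera, then picks the camera with the highest reward. The function names, the example QoV values and the greedy (no look-ahead) selection are assumptions made for illustration; the method described in the text optimizes this reward through a POMDP policy rather than greedily.

```python
import numpy as np

def one_step_reward(rho, prev_cam, alpha=0.75):
    """Eq. 4: u = alpha * rho_i + (1 - alpha) * theta_i, for every camera i."""
    rho = np.asarray(rho, dtype=float)
    theta = np.zeros_like(rho)
    theta[prev_cam] = 1.0  # Eq. 5: 1 only for the previously selected camera
    return alpha * rho + (1.0 - alpha) * theta

def greedy_select(rho, prev_cam, alpha=0.75):
    """Hypothetical helper: camera with the highest one-step reward."""
    return int(np.argmax(one_step_reward(rho, prev_cam, alpha)))

rho = [0.50, 0.60, 0.20]          # illustrative QoV scores for three cameras
print(greedy_select(rho, prev_cam=0, alpha=1.0))   # pure QoV ranking, as in Eq. 2
print(greedy_select(rho, prev_cam=0, alpha=0.75))  # staying on camera 0 is rewarded
```

With $\alpha = 1$ the choice reduces to the arg max of Eq. 2, while a smaller $\alpha$ keeps the current camera unless another view is clearly better.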
The solution to the POMDP is a policy that can be represented as $\pi = \{\mu(p(s_t \mid I_t))\}$ such that, for each $t$, $\mu(p(s_t \mid I_t))$ is a state feedback map that specifies an action $c_t^i \to c_{t+1}^j$ on $C$ depending on the belief state probability $p(s_t \mid I_t)$. A graphical representation is shown in Fig. 3, where the posterior probability distribution of the state $s_t$ is conditioned on the observable history $I_t$ such that

    $I_t := p_0$ if $t = 0$, and $I_t := (p_0, \varpi_0, \ldots, \varpi_{t-1})$ otherwise.    (6)

Here $\varpi_t = (\Omega_t^i, (\psi_t^1, \ldots, \psi_t^N))$, where $p_0$ is the initial probability distribution and $\psi_t^i \in \Psi$ is the observation from $C_i$, drawn from the observation space $\Psi$, given by the observation equation as

    $\psi_t^i = h(s_t^i, w_t)$,    (7)
    $\psi_t = (\psi_t^1, \ldots, \psi_t^N)$,    (8)

where $h$ represents the observation map and $w_t$ represents the randomness in the observations at time $t$. We assume that $w_t$ is an independent and identically distributed (iid) random variable with zero-mean Gaussian distribution.

Figure 2: Influence diagram describing a POMDP model. Rectangles correspond to decision nodes (actions), circles to random variables (states) and triangles to reward nodes. Links represent the dependencies among the components. $s_t$, $\Omega_t^i$, $\psi_t$ and $u(\cdot)$ denote the state, action, observation and reward at time $t$. Information states ($I_t$ and $I_{t+1}$) are represented by double-circled nodes. (a) Note that an action at time $t$ depends on past observations and actions, not on the states. (b) An action choice (rectangle) depends on the current information state only.

The sequence of states within the POMDP is then generated such that at time $t = 0$ the system starts at an initial unobservable state $s_0$ with the given initial distribution $p_0$. If at any time $t$ the system is in state $s_t \in S$, taking an action $c_{t-1}^j \to c_t^i$ (selecting the camera $C_i$ given that the camera $C_j$ was selected at the previous time instance $t-1$) takes the system to the next state $s_{t+1} \in S$ and an immediate reward $u(s_{t+1}, \Omega_t^i)$ is achieved. This state transition is governed by the state transition equation

    $s_{t+1} = f(s_t, \Omega_t^i, v_t)$,    (9)

where $f$ and $v_t$ represent the state dynamics and the randomness in the state transitions, respectively.
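To make the roles of Eqs. 7 and 9 concrete, the following sketch simulates one possible linear Gaussian state-space model for the per-camera QoV states. The random-walk dynamics, the identity observation map, the noise levels `q` and `r`, and the function name are all assumptions chosen for illustration; the text does not specify the actual forms of $f$ and $h$.

```python
import numpy as np

def simulate_qov(s0, steps, q=0.05, r=0.02, seed=0):
    """Simulate s_{t+1} = f(s_t, v_t) and psi_t = h(s_t, w_t) with Gaussian noise.

    Assumed forms: f is a random walk clipped to [0, 1], h is the identity
    plus zero-mean iid Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(s0, dtype=float)
    states, observations = [], []
    for _ in range(steps):
        v = rng.normal(0.0, q, size=s.shape)  # state noise v_t
        s = np.clip(s + v, 0.0, 1.0)          # keep the state inside S = [0, 1]
        w = rng.normal(0.0, r, size=s.shape)  # observation noise w_t
        states.append(s.copy())
        observations.append(s + w)            # noisy observation of the state
    return np.array(states), np.array(observations)

states, observations = simulate_qov([0.2, 0.5, 0.8], steps=10)
print(states.shape, observations.shape)
```

Each row of `states` plays the role of $s_t$ for three hypothetical cameras, and the matching row of `observations` plays the role of $\psi_t$.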
Since the state equation $s_t$ is composed of two segments, the state dynamics (Eq. 9) can be decomposed as $f(s_t, \Omega_t^i, v_t) = [f^s(s_t, v_t), f^c(\Omega_t^i)]$. All the components of $f^c(\Omega_t^i)$ are 0 except the $i$-th component, which corresponds to the selected camera $C_i$. The specific form of $f^s$ represents the model for the QoV evolution, which we approximate with a Gaussian distribution [9] as

    $\rho_t^i = \mathcal{N}(\mu^i, \Sigma^i, \psi_t^i)$,    (10)

where $\mu^i$ and $\Sigma^i$ are the mean and the covariance of the Gaussian model for $C_i$. Please note that the belief state probability $p(s_t \mid I_t)$, i.e., the probability of being in state $s_t$, is the posterior probability
distribution of state $s_t$ conditioned on the observable history $I_t$. Then the estimated belief state probability $s_{t+1}$, given $s_t$, after selecting camera $C_i$ and observing $\psi_t$, is given by the Bayes rule as

    $s_{t+1} = \eta \, p(\psi_t \mid s_t, c_t^i) \sum_{s_t \in S} p(s_t \mid \psi_t, c_t^i) \, p(s_t \mid I_t)$,    (11)

where $\eta = 1 / p(\psi_t \mid p(s_t \mid I_t), c_t^i)$ is a normalizing constant.

Figure 3: Belief state distribution for three consecutive time steps. Please note that $I_t(c_t^n, c_{t+1}^m)$ signifies the observable history $I_t$ given $c_t^n = 1$ and $c_{t+1}^m = 1$.

The next step calculates the optimal value $\mu^*(p(s_t \mid I_t))$ and the optimal policy $\pi^*$ that constructs the value-to-action mapping

    $\pi^* : \mu^*(p(s_t \mid I_t)) \to C$.    (12)

These can be estimated using the Bellman equations [1]:

    $\mu^*(p(s_t \mid I_t)) = \max_{c_t^i \in C} \Big[ \sum_{s_t \in S} u(s_t, \Omega_t^i) \, p(s_t \mid I_t) + \gamma \sum_{\psi_t \in \Psi} p(\psi_t \mid p(s_t \mid I_t), c_t^i) \, \mu^*(p(s_{t+1} \mid I_{t+1})) \Big]$,    (13)

where $\gamma \in [0, 1]$ is a discount factor, and the corresponding optimal policy selects the value-maximizing action as

    $\pi^*(p(s_t \mid I_t)) = \arg\max_{c_t^i \in C} \Big[ \sum_{s_t \in S} u(s_t, \Omega_t^i) \, p(s_t \mid I_t) + \gamma \sum_{\psi_t \in \Psi} p(\psi_t \mid p(s_t \mid I_t), c_t^i) \, \mu^*(p(s_{t+1} \mid I_{t+1})) \Big]$.    (14)

The optimal value function in Eq. 14 or its approximation can be computed using the value iteration algorithm [3]. As demonstrated in [8], the optimal value function $\mu^*$ can be determined within a finite horizon by performing a sequence of value-iteration steps, assuming that the sequence of estimates converges to the unique fixed-point solution. To this end we need to rewrite Eq. 11 in the value-function mapping form. Let the real-valued bounded functions $\mu$ be such that the value function mapping $H$ for all information states can be written as $\pi = H\mu$, and the value mapping function $H$ can be written as

    $(H\mu)(p(s_t \mid I_t)) = \max_{C_i \in C} h(p(s_t \mid I_t), \Omega_t^i, \mu)$,    (15)

where $H$ is an isotone mapping and the value functions are estimated at each iteration as

    $h(p(s_t \mid I_t), \Omega_t^i, \mu) = \sum_{s_t \in S} u(s_t, \Omega_t^i) \, p(s_t \mid I_t) + \gamma \sum_{\psi_t \in \Psi} p(\psi_t \mid p(s_t \mid I_t), c_t^i) \, \mu(p(s_{t+1} \mid I_{t+1}))$.    (16)

The error in the belief state is estimated using the error between the estimated and the observed belief state:

    $g(s_t, \Omega_t^i) = E[\| s_t - \hat{s}_t \|^2] + (1 - u(s_t, \Omega_t^i))$.    (17)

Ideally this should continue until $g(s_t, \Omega_t^i) = 0$.
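The Bayes-rule update of Eq. 11 can be illustrated on a small discrete state space. The sketch below implements the textbook POMDP belief update for a fixed action, which may differ in detail from the exact factorization used above; the matrices `T` and `O`, their values and the function name are hypothetical and serve only as an example.

```python
import numpy as np

def belief_update(belief, T, O, obs):
    """Standard discrete belief update for one chosen action.

    belief : prior over states, shape (S,)
    T      : T[s, s2] = p(s2 | s, action), shape (S, S)
    O      : O[s2, o] = p(o | s2, action), shape (S, num_observations)
    obs    : index of the received observation
    """
    predicted = T.T @ belief              # prediction through the dynamics
    unnormalized = O[:, obs] * predicted  # weight by observation likelihood
    return unnormalized / unnormalized.sum()  # eta normalizes to sum 1

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])
prior = np.array([0.5, 0.5])
posterior = belief_update(prior, T, O, obs=1)
print(posterior)
```

Observation 1 is more likely in state 1 under the assumed `O`, so the posterior mass shifts towards that state.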
However, in practice, we stop the iteration well before it reaches the limit solution ($10^{-5}$). Finally, camera selection is performed using the belief-to-action mapping of Eq. 14.

2.2 Quality of View

The Quality of View (QoV) is computed at each time $t$ using information related to the amount of activity, the number of visible objects, the visible events and the accumulated object score. The frame score is dependent on the application at hand. While keeping most of the descriptions generic, we will focus in this section on the coverage of team sport events and, in particular, on basketball.

Let $d_t^i$ represent the binary mask for camera $C_i$ encoding the position of the pixels that changed their intensity due to motion. In this implementation we use the color-based change detector presented in [7]. The amount of activity, $|d_t^i|$, observed in a view at time $t$ is thus based on the amount of non-zero motion observed in a frame. The observation vector is $\psi_t = (\psi_t^1, \ldots, \psi_t^N)$, where each $\psi_t^i$ is constructed as

    $\psi_t^i = (J_t^i, |d_t^i|, E_t^i, \Theta_t^i)$,    (18)

where $J_t^i$ is the number of objects in the view of camera $C_i$, $E_t^i = \sum_{j=1}^{J_t^i} \varepsilon_t^{ij}$ is the sum of the individual object scores $\varepsilon_t^{ij} : j = 1, \ldots, J_t^i$, and $\Theta_t^i$ is the total event score in the view of the $i$-th camera at time $t$.

The number of objects $J_t^i$ is computed based on a multi-level homography of $d_t^i$ that projects the moving objects from each view to a common ground plane [7]. The homography is constructed by labeling associated sets of points across camera views and on a virtual top view. By construction, all the points from $d_t^i$ that are labeled as 1 because of the presence of a target in a particular plane project to the corresponding top-view position. Each object is assigned an object score $\varepsilon_t^{ij}$, which is indicative of the importance of the object within the scene and is based on its size, location and proximity.
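The construction of the per-camera observation vector of Eq. 18 can be sketched as follows. This is an illustrative simplification: the number of objects is taken to be the number of supplied object scores, whereas the text obtains it from a multi-level homography, and the function name and example inputs are assumptions.

```python
import numpy as np

def camera_observation(motion_mask, object_scores, event_score):
    """Build psi = (J, |d|, E, Theta) for one camera at one time instant."""
    num_objects = len(object_scores)               # J: objects in the view (assumed given)
    activity = int(np.count_nonzero(motion_mask))  # |d|: number of changed pixels
    accumulated = float(np.sum(object_scores))     # E: sum of object scores
    return np.array([num_objects, activity, accumulated, event_score], dtype=float)

mask = np.array([[0, 1, 1],
                 [0, 0, 1]])                       # toy binary change mask
psi = camera_observation(mask, object_scores=[0.4, 0.6], event_score=0.5)
print(psi)
```

In a real system the activity term would probably be normalized by the image area so that cameras with different resolutions remain comparable; that choice is left open here.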
Figure 4: Effect of size and location on $\varepsilon_t^{ij}$: change in an object score when it moves from a region of lower interest to a region of higher interest (left to right, object location); change in object score due to size (top to bottom, object size).

Figure 5: Example of object scores based on the proximity of players to the object of interest (i.e., the ball shown with the blue circle).

The size score $s_t^{ij}$ of the $j$-th object in the $i$-th camera is calculated as

    $s_t^{ij} = \dfrac{w_t^{ij} \cdot h_t^{ij}}{A^i \, w_t^{ij} + h_t^{ij}}$,    (19)

where $A^i$ is the imaging area of camera $C_i$, and $w_t^{ij}$ and $h_t^{ij}$ are the width and height of the $j$-th object, respectively, at time $t$ when viewed from $C_i$.

The proximity of the targets to certain objects or areas in the scene is also taken into account for calculating the object score $\varepsilon_t^{ij}$. The authors in [5, 4] consider the distance between the current location of the target being observed and the exit points located in the scene to estimate the deadline (i.e., the approximate time before the target leaves the scene). Similarly, in sports scenarios the ball is the object of attention and needs to be in the selected view most of the time. Moreover, objects near the ball are of higher interest. The distance of each $j$-th object $x_t^{ij}$ in each camera $C_i$ at time $t$ from the point of interest is calculated and is used as a proximity $R_t^{ij}$. The lower the value of $R_t^{ij}$, the more significant the object in the scene.

For generating the location score, the site is divided into $K$ non-overlapping regions and each region is assigned a region score $\gamma_k \in [0, 1]$, where $\gamma_k = 1$ represents the region of maximum significance. In basketball scenarios, these could be the regions near the basket (see Fig. 7, where regions of high interest are shown in green). Based on its location in the image plane, each object is assigned a region score $\gamma_t^{ij}$ at time $t$. The object observation vector for the $j$-th object in $C_i$ at time $t$ is then constructed as

    $\psi_t^{ij} = (s_t^{ij}, \gamma_t^{ij}, R_t^{ij})$,    (20)

and the object score $\varepsilon_t^{ij}$ is then modeled as a multivariate Gaussian distribution

    $\varepsilon_t^{ij} = \mathcal{N}(\mu_o^i, \Sigma_o^i, \psi_t^{ij})$,    (21)

with mean $\mu_o^i$ and covariance $\Sigma_o^i$.
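One way to turn the Gaussian model of Eq. 21 into a score is to evaluate an unnormalized Gaussian kernel on the feature vector, so that the score equals 1 at the mean and decays with the Mahalanobis distance. Whether the original method normalizes the density in this way is not stated, so the peak-at-one normalization, the function name and the example mean and covariance below are assumptions.

```python
import numpy as np

def gaussian_score(psi, mean, cov):
    """Unnormalized Gaussian kernel exp(-0.5 * d^2), with d the Mahalanobis distance."""
    diff = np.asarray(psi, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ x = diff instead of inverting cov explicitly.
    maha_sq = diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.exp(-0.5 * maha_sq))

mean = [0.5, 1.0, 0.0]   # hypothetical "ideal" (size, region score, proximity)
cov = np.eye(3)          # hypothetical covariance learned from training frames
print(gaussian_score([0.5, 1.0, 0.0], mean, cov))  # feature vector at the mean
print(gaussian_score([1.5, 1.0, 0.0], mean, cov))  # one unit away along the size axis
```

Because each feature is scaled by its own variance through the covariance matrix, features with different ranges contribute comparably to the score.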
The motivation for using a multivariate Gaussian as opposed to a linear fusion of features [5] is to normalize each feature individually in its own feature space. Moreover, such fusion of features allows the extension of the feature vector to suit a specific task, for instance visibility of faces, team information or an object-associated event score. These local features provide information about the interesting objects inside each camera view.

Figure 4 shows the values of $\varepsilon_t^{ij}$ for an object according to its distance from the region of interest (marked in green in Fig. 7) in a camera view that observes the object from a distance (top row) as compared to a close-up view of the same object (bottom row) in another camera view. When the object moves closer to the region of interest there is an increase in the significance of the object from 0.42 to 0.57 and then to 0.63 (left to right) as the object approaches the point of interest (the basket). For the camera with a larger version of the same object, $\varepsilon_t^{ij}$ goes from 0.53 to 0.63 and then to 0.78. Because of this, the object instance in the bottom-right part of Fig. 4 will have a higher score (larger size and within the area of interest) and the object instance in the top-left will have the lowest score (smaller size and outside the area of interest). The effect of proximity is shown in Fig. 5. As mentioned earlier, the objects closer to the object of interest are more important than other objects; hence the object closest to the ball (shown with a blue circle) will have a higher score (0.87) as compared to the other objects (0.83 and 0.79).

The visibility of events occurring in the site causes certain views to be more significant than others. Let us assume that there are $L$ possible events which can happen for a multi-camera setup. Based on the significance of each event, it is manually assigned a score $\theta_l$ indexed by $l = 1, \ldots, L$. The total event score $\Theta_t^i$ for $C_i$ at time $t$ is given as

    $\Theta_t^i = \sum_{l=1}^{L} \theta_l$.    (22)

The events which are not observed by the camera at time $t$ are assigned a 0 score.
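The event score of Eq. 22 amounts to summing the manually assigned scores of the events that are visible in a view. The short sketch below makes the zero contribution of unobserved events explicit; the function name, the event scores and the visibility flags are hypothetical values used only for illustration.

```python
def total_event_score(event_scores, observed):
    """Eq. 22: sum theta_l over all events; unobserved events contribute 0."""
    return sum(theta for theta, seen in zip(event_scores, observed) if seen)

theta = [0.9, 0.5, 0.2]         # hypothetical scores for L = 3 event types
visible = [True, False, True]   # which events this camera observes at time t
print(total_event_score(theta, visible))
```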
In a basketball scenario, an attempt on basket is identified using motion vectors and the contextual information associated with a view. We consider the region in the vicinity of the basket, and when the overall magnitude of the motion vectors in this basket region is larger than a pre-learned threshold, this is considered to be an
event. A sample output for the QoV score of camera $C_1$ is shown in Fig. 6. Starting from a near-zero score for an almost empty frame (Fig. 6 (a)), as players start entering the view of the camera, the score starts increasing (Fig. 6 (b)) and reaches a higher value when multiple players are within the view (Fig. 6 (c)). The score reaches its maximum when an attempt-on-basket event is detected (Fig. 6 (d)). After the attempt on basket, players start moving outside the field of view, thus leading to a decrease in the score (Fig. 6 (e)).

Figure 6: Quality score (vertical axis: score; horizontal axis: frame) based on the Gaussian distribution model for $C_1$ from frame 1000 to frame 2000. Sample images at frame (a) 40, (b) 32, (c) 1500, (d) 62, and (e) 1736.

3 Results

3.1 Experimental setup

We test the performance of the proposed method on a basketball match monitored by 5 cameras with partially overlapping fields of view (Fig. 7). The data consist of approximately 15 minutes (22475 frames at 25 fps) of recording for each camera. We used 500 frames per camera for training the system. The regions of interest are defined as the areas bounded by the three-point line (Fig. 7, shown in green). The value of $\alpha$ was selected to be 0.75. The point of interest is the ball and its location was marked manually. Alternative approaches exist to detect the ball automatically [13]; however, they work only on portions of a scene and are not reliable for the entire match. The camera selection reference video ($\Upsilon_{gt}$) was generated by non-professional users for approximately 4 minutes of the video, and the mode at each time was taken as the reference ground truth for the selected camera.
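Taking the mode of several manual selections at each time instant can be sketched as below. The data layout (one list of camera indices per annotator), the function name and the tie-breaking rule (lowest camera index wins) are assumptions; the text does not say how ties were resolved.

```python
from collections import Counter

def reference_selection(annotations):
    """Per-frame mode over annotators; ties go to the lowest camera index (assumed)."""
    reference = []
    for frame_votes in zip(*annotations):  # votes of all annotators for one frame
        counts = Counter(frame_votes)
        best = max(counts.values())
        reference.append(min(cam for cam, n in counts.items() if n == best))
    return reference

annotations = [
    [1, 1, 2],  # annotator A: selected camera for frames 0..2
    [1, 2, 2],  # annotator B
    [3, 2, 4],  # annotator C
]
print(reference_selection(annotations))
```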
To evaluate the effectiveness of the proposed camera selection strategy, here referred to as $\Upsilon_{util}$, we compare it with three alternative scheduling strategies: the maximum-rank $\Upsilon_{max}$ (selecting the camera with the highest value of $\rho_t^i$), the recursive decision on a group-of-frames $\Upsilon_{gof}$, implemented via Dynamic Programming (DP) [10], and the Dynamic Bayesian Network (DBN) based approach $\Upsilon_{dbn}$. Please note that [6] uses only past information and does not take into account the future (predicted) values that humans use when watching a video. The automatically generated video using the proposed approach and its comparison with various camera selection strategies can be seen at http://www.eecs.qmul.ac.uk/~andrea/view-selection.html

Figure 7: Camera layout with the field of view of each camera $C_i$ highlighted. The regions of high importance are shown in solid (green) color.

3.2 Analysis

The performance of each method for camera selection is compared to $\Upsilon_{gt}$. The overlap between $\Upsilon_{gt}$ and the results of the methods under analysis is shown in Table 1 as a function of the selected features. It can be seen that the choice of features has an impact on the overlap score. For instance, when we use only the number of objects, $J_t^i$, in the view of each camera ($F_1$), there is a very small overlap. Such values are due to the camera layout (Fig. 7), as $C_2$ observes the whole basketball court and thus it is selected most of the time. Here, $\Upsilon_{max}$ outperforms the three other methods that penalize camera switching. In comparison $F_2$, which only uses the amount of motion $|d_t^i|$, has
a larger overlap for all methods, as the amount of motion can be treated as an indicator of the significance of the content. However, as discussed earlier, it does not compensate for events that do not considerably affect the motion in the view (for instance the ball being thrown to the basket). Moreover, it is very sensitive to noise due to illumination changes and using it alone is not always appropriate. Similarly, the use of event information $\Theta_t^i$ alone, $F_4$, is not a reliable feature as events may be very sparse in time and can thus lead to decision points that are far apart in time. In $F_3$, the accumulated object score $E_t^i$, which includes the object size and location information, has generally a larger overlap than $F_1$. However, these features are local, depend on the objects only and do not take into account any event information. From $F_5$ to $F_{10}$, we couple features. When $|d_t^i|$ is used ($F_5$, $F_8$ and $F_9$) a larger overlap is achieved.

Figure 8: Selection output for different methods for camera selection. (Key. Black: $\Upsilon_{gof}$; brown: $\Upsilon_{max}$; blue: $\Upsilon_{gt}$; red: $\Upsilon_{dbn}$; pink: $\Upsilon_{util}$)

Table 1: Comparison of camera selection approaches on different feature sets. The numbers represent the % of frames selected by a specific approach on a particular feature vector composition that overlaps with the reference selection $\Upsilon_{gt}$. (Key. $\Upsilon_{max}$: maximum-score-based; $\Upsilon_{gof}$: DP-based [10]; $\Upsilon_{dbn}$: DBN-based [6]; $\Upsilon_{util}$: proposed method; $F_1$ to $F_{15}$: feature vector compositions built from $J_t^i$: number of objects; $|d_t^i|$: amount of motion; $E_t^i$: accumulated object score; $\Theta_t^i$: event score)

    Feature set   Υmax     Υgof     Υdbn     Υutil
    F1            6.24     4.49     3.39     4.39
    F2            68.4     70.7     68.37    70.2
    F3            30.65    24.26    28.4     29.46
    F4            28.4     7.89     28.37    32.3
    F5            83.7     78.3     80.7     8.3
    F6            4.38     40.52    38.39    48.23
    F7            36.6     27.26    30.39    30.07
    F8            73.65    83.72    79.37    78.80
    F9            84.7     74.46    74.37    78.29
    F10           48.43    47.93    53.06    44.09
    F11           8.65     83.72    88.42    88.97
    F12           83.7     82.4     8.34     83.85
    F13           86.23    74.59    86.53    90.27
    F14           49.52    45.97    48.49    5.73
    F15           88.3     83.80    9.35     95.42
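The overlap figures discussed here are percentages of frames on which a method picks the same camera as the reference selection. A minimal sketch of that computation follows; the function name and the toy sequences are illustrative and not taken from the experiments.

```python
def overlap_percentage(selected, reference):
    """Percentage of frames where the selected camera equals the reference camera."""
    if len(selected) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(selected, reference))
    return 100.0 * matches / len(reference)

method_output = [1, 1, 2, 2]    # hypothetical per-frame camera choices
ground_truth = [1, 2, 2, 2]
print(overlap_percentage(method_output, ground_truth))
```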
In comparison, $F_7$ has the smallest overlap as it only takes into account the number of objects and the event score. These features, as described earlier, are either too sparse to be used alone ($F_4$) or misleading, as they would favor the selection of the camera with the maximum number of objects ($J_t^i$). If we include the amount of motion along with these features, as in $F_{12}$, the overlap for all the methods is significantly increased as compared to $F_7$ and $F_2$. However, when they are included with $E_t^i$ ($F_{14}$), the increase in the overlap percentage is limited. The largest overlap is achieved when all the features are used together ($F_{15}$), where $\Upsilon_{util}$ has the largest overlap. The percentage overlap for $\Upsilon_{gof}$ is the smallest as it has the minimum number of switches but does not always select the best view (see Fig. 8, black). In comparison $\Upsilon_{max}$, although it presents more switches, still operates around the best view (see Fig. 8, brown).

Table 2: Mean error in the number of switches per second of the automatically generated videos, compared to the ground truth. Note that $\tau$ has no effect on $\Upsilon_{gof}$ as it operates on a temporal window and therefore the mean error 0.024 remains unaffected.

    τ     Υmax     Υdbn     Υgof     Υutil
    1     1.394    0.88     0.024    0.047
    5     0.404    0.83     0.024    0.047
    10    0.97     0.32     0.024    0.038
    15    0.4      0.03     0.024    0.038
    20    0.075    0.089    0.024    0.04
    25    0.06     0.066    0.024    0.009

Figure 9 shows sample results of the proposed method $\Upsilon_{util}$, the ground truth $\Upsilon_{gt}$, the state-of-the-art methods $\Upsilon_{gof}$ and $\Upsilon_{dbn}$, and the baseline method $\Upsilon_{max}$. The three frames (from each camera) are indicated by an arrow in the graph shown in Fig. 8 as (a), (b) and (c). Starting from Fig. 9 (a), most of the players are seen by $C_2$, $C_3$ and $C_4$, whereas $C_1$ sees only one player and the view from $C_5$ is empty. However, as most of the objects are on the right-hand side of the court and have relatively smaller sizes when viewed from $C_2$ and $C_3$, $C_4$ is selected by all the
selection methods and $\Upsilon_{gt}$. When the players start moving from left to right in Fig. 9 (b), $C_1$ is selected by $\Upsilon_{gt}$ for it shows a zoomed-out version of the left side of the court, allowing one to see the placement of the players as they move in its field of view. Based on $\Upsilon_{max}$, $C_1$ is indeed the best camera as it sees the maximum number of objects at a reasonable size (as compared to $C_2$). $\Upsilon_{util}$ is able to correctly select this camera, while $\Upsilon_{dbn}$, bound by the transitions allowed in the adjacency matrix (see [6]), has to switch from $C_4$ to $C_2$ and then to $C_1$, and selects $C_2$. $\Upsilon_{gof}$ selects $C_2$ as well, as it sees the entire basketball court from the side and has a higher accumulation of the object score over time. Finally, in Fig. 9 (c), players have taken up positions on the left-hand side of the court, leaving $C_4$ empty. According to $\Upsilon_{gt}$ the best camera is $C_2$, which is also selected by $\Upsilon_{gof}$. The best camera based on the QoV, as selected by $\Upsilon_{max}$, is $C_5$. Our proposed method $\Upsilon_{util}$ selects the best camera $C_5$ while $\Upsilon_{dbn}$ remains on the same camera $C_2$.

Figure 9: Comparison of selected cameras from different methods for the three time instances annotated as (a), (b) and (c) in Fig. 8. The same color coding as Fig. 8 is used to highlight the selected camera.

3.3 Comparison

To evaluate the effectiveness of the smoothing introduced by the proposed approach we compare it with the other methods in terms of mean error in the average number of switches per second. In this experiment we introduce the selection interval $\tau$ such that the decision is taken every $\tau$ frames. This results in reducing the number of switches. Table 2 shows the obtained results, where the mean error in the average switches per second for $\Upsilon_{max}$ reduces from 1.394 to 0.06 as $\tau$ increases from 1 to 25. In the case of $\Upsilon_{util}$ the error decreases from 0.047 to 0.009 only. For $\Upsilon_{dbn}$ this mean error decreases from 0.88 to 0.066. The scheduling interval has no effect on $\Upsilon_{gof}$ as it operates on a temporal window and the mean error 0.024 remains unaffected.
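The effect of a selection interval on the switching rate can be reproduced on toy data with the sketch below. It holds each decision for `tau` frames and counts camera changes per second. The helper names, the toy sequence and the frame rate are assumptions; the exact error measure behind Table 2 is not spelled out in the text and is therefore not reproduced here.

```python
def hold_selection(best_per_frame, tau):
    """Re-decide only every tau frames and hold that choice in between."""
    return [best_per_frame[(t // tau) * tau] for t in range(len(best_per_frame))]

def switches_per_second(selection, fps=25):
    """Average number of camera changes per second of video."""
    switches = sum(a != b for a, b in zip(selection, selection[1:]))
    return switches * fps / len(selection)

best = [1, 2, 2, 1, 1, 2, 3, 3]        # hypothetical frame-wise best cameras
held = hold_selection(best, tau=2)     # decisions taken at frames 0, 2, 4, 6
print(held)
print(switches_per_second(best, fps=4), switches_per_second(held, fps=4))
```

Holding decisions removes short-lived switches but also delays reactions to sudden changes, which is the trade-off a fixed interval introduces.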
This shows that the proposed approach reduces the number of switches without the need of introducing an additional parameter, which may need to be adjusted based on the dynamics of the scene. Figure 10 shows the improvement achieved via temporal smoothing using the proposed approach from frame 57 to frame 560. Figure 10 (a-e) shows 6 switches between $C_1$ and $C_5$ for $\Upsilon_{max}$. Using $\Upsilon_{dbn}$ and $\Upsilon_{util}$ there is only one switch. However, in $\Upsilon_{dbn}$ this switch occurs after 59 frames, whereas in $\Upsilon_{util}$ this switch occurs at the 48th frame. This is due to the fact
that $\Upsilon_{util}$ is able to predict the next state and is able to switch to show the best view before any information is lost. However, the ball is passed outside the view of the camera (Fig. 10 (d), row 2) when using $\Upsilon_{dbn}$.

Figure 10: Camera selection comparison of the three approaches under analysis for 2 seconds of video. Row 1: $\Upsilon_{max}$. Row 2: $\Upsilon_{dbn}$. Row 3: $\Upsilon_{util}$. (Frame numbers: (a) 57, (b) 52, (c) 530, (d) 548, and (e) 560).

3.4 Subjective testing

To evaluate the goodness of the automatically generated videos we performed a subjective test (a Turing test) using 7 videos of 5000 frames at 25 frames per second and 31 subjects. Out of these 7 videos, 3 videos (M1 - M3) were generated manually by different (non-professional) users and 4 videos were generated by using $\Upsilon_{max}$, $\Upsilon_{gof}$, $\Upsilon_{dbn}$ and the proposed method $\Upsilon_{util}$. The manually generated videos were ranked such that the total number of switches in the end video was increasing (M1 (58), M2 (63), M3 (109)). Each subject was asked to decide, for each video, whether it was generated manually (by a human) or automatically (by an algorithm). The results of this subjective evaluation are shown in Tab. 3. It can be seen that 83.87% of the subjects misidentified the video automatically generated by $\Upsilon_{util}$ as manually generated. None of the subjects selected the video generated by $\Upsilon_{max}$ as manual, whereas 93.55%, 80.65% and 61.29% of the subjects were able to correctly identify M1, M2 and M3, respectively, as manually generated videos.

3.5 Computational cost

Figure 11 shows the computational cost of the proposed algorithm, broken down for each block outlined in Fig. 1, in terms of relative execution time. The time for each block was calculated on an Intel core i5 3.33 GHz Pentium dual core using a non-optimized serial Matlab implementation. The entire process took on average 0.937 seconds per frame. Multi-layer projections and object detection took 0.567 seconds in total, with 44% of the time taken for multi-layer projections and 17% for the object detection module. Change detection (18%) and event detection using motion information (4%) cost on average 0.167 and 0.037 seconds per frame, respectively.
The object- and frame-ranking (8% and 2%, respectively) and the camera selection (7%) take only 0.159 seconds per frame.

4 Conclusions

We presented a technique for automated video production from multiple cameras, which is based on object- and frame-level feature ranking. The proposed approach estimates object visibility scores using a multivariate Gaussian distribution model and employs an optimal control policy for maximizing visibility over time, while minimizing the number of camera switches. The performance of the proposed approach was demonstrated on a multi-camera network with semi-overlapping fields of view from a real basketball match. An overlap of 95.42% with a manually generated ground truth is achieved for the best selected view at any given time. The effectiveness of the proposed approach was also validated through subjective testing with 31 people, of which 26 considered the automatically generated video via the proposed approach as good as a manually generated one. The proposed camera selection framework can be adapted to work with other feature sets and it is therefore adaptable to different application scenarios. In the future we plan to use automatic ball detection [13] and to use multiple modalities, including audio, for automated video production.
Table 3: Summary of the subjective evaluation results based on the Turing test. Turing % represents the percentage of subjects who classified the video as manually generated.

Method    Classified as manual    Classified as automatic    Turing %
M1        29                      2                          93.55
M2        25                      6                          80.65
M3        19                      12                         61.29
ϒ_max     0                       31                         0.00
ϒ_gof     16                      15                         51.61
ϒ_dbn     24                      7                          77.42
ϒ_util    26                      5                          83.87

Figure 11: Relative execution time for each module of the proposed video production approach. Multi-layer projections 44%; change detection 18%; object detection 17%; object score estimation 8%; camera selection 7%; frame score estimation 2%; event detection 2%; motion extraction 2%.

References

[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957.
[2] A. Del Bimbo and F. Pernici. Towards on-line saccade planning for high-resolution image sensing. Pattern Recognition Letters, 27(15):1826-1834, Nov. 2006.
[3] A. R. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University, 1998.
[4] C. Costello, C. Diehl, A. Banerjee, and H. Fisher. Scheduling an active camera to observe people. In Proc. of the ACM Int. Workshop on Video surveillance & sensor networks, pages 39-45, 15-16 Oct. 2004.
[5] F. Daniyal, M. Taj, and A. Cavallaro. Content-aware ranking of video segments. In Proc. of ACM/IEEE Int. Conf. on Distributed Smart Cameras, pages 1-9, 7-11 Sep. 2008.
[6] F. Daniyal, M. Taj, and A. Cavallaro. Content and task-based view selection from multiple video streams. Multimedia Tools Appl., 46(2-3):235-258, Jan. 2010.
[7] D. Delannay, N. Danhier, and C. De Vleeschouwer. Detection and recognition of sports(wo)men from multiple views. In Proc. of ACM/IEEE Int. Conf. on Distributed Smart Cameras, pages 1-9, Sep. 2009.
[8] J. Denzler and C.M. Brown. Information theoretic sensor data selection for active object recognition and state estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(2):145-157, Feb. 2002.
[9] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z. Zhou. Multi-view video summarization. IEEE Trans. on Multimedia, 12(7):717-729, Nov. 2010.
[10] H. Jiang, S. Fels, and J. J. Little. Optimizing multiple object tracking and best view video synthesis. IEEE Trans. on Multimedia, 10(6):997-1012, Oct. 2008.
[11] V. Krishnamurthy and D.V. Djonin. Structured threshold policies for dynamic sensor scheduling-a partially observed Markov decision process approach. IEEE Trans. on Signal Processing, 55(10):4938-4957, Oct. 2007.
[12] S.N. Lim, L.S. Davis, and A. Elgammal. Scalable image-based multi-camera visual surveillance system. In Proc. of IEEE Int. Conf. on Advanced Video & Signal Based Surveillance, pages 205-212, Jul. 2003.
[13] F. Poiesi, F. Daniyal, and A. Cavallaro. Detector-less ball localization using context and motion flow analysis. In Proc. of IEEE Int. Conf. on Image Processing, pages 3913-3916, 26-29 Sep. 2010.
[14] F. Z. Qureshi and D. Terzopoulos. Surveillance in virtual reality: System design and multi-camera control. In Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, pages 1-8, 17-22 Jun. 2007.
[15] M. Rezaeian. Sensor scheduling for optimal observability using estimation entropy. In IEEE Int. Workshop on Pervasive Computing and Communications, pages 307-312, 19-23 Mar. 2007.
[16] M. Spaan and P. Lima. A decision-theoretic approach to dynamic sensor selection in camera networks. In Proc. of Int. Conf. on Automated Planning and Scheduling, pages 1-8, 19-23 Sep. 2009.
[17] M. Taj, E. Maggio, and A. Cavallaro. Multi-feature graph-based object tracking. In Proc. of Classification of Events, Activities and Relationships (CLEAR) Workshop, pages 190-199, Apr. 2006.
[18] H.C. Tijms. A first course in stochastic models. John Wiley and Sons, Ltd, England, 2003.
[19] X. Zhou, R. Collins, T. Kanade, and P. Metes. A master-slave system to acquire biometric imagery of humans at distance. In Proc. of ACM SIGMM Int. Workshop on Video surveillance, pages 113-120, Nov. 2003.