Multi-Clip Video Editing from a Single Viewpoint

Vineet Gandhi
INRIA/Laboratoire Jean Kuntzmann, France
vineet.gandhi@inria.fr

Remi Ronfard
INRIA/Laboratoire Jean Kuntzmann, France
remi.ronfard@inria.fr

Michael Gleicher
University of Wisconsin-Madison, USA
gleicher@cs.wisc.edu

ABSTRACT
We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically pleasing sub-clips which can easily be edited together using off-the-shelf multi-clip video editing software. We demonstrate our approach on five video sequences of a live theatre performance by generating multiple synchronized subclips for each sequence.

Categories and Subject Descriptors
I.3.8 [Computer Graphics]: Applications; I.4.9 [Image processing and Computer Vision]: Applications

General Terms
Computer Vision, Computer Graphics

Keywords
Video Editing, Video Processing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
CVMP '14, 13-14 November 2014, London, United Kingdom
Copyright 2014 ACM 978-1-4503-3185-2/14/11...$15.00.
http://dx.doi.org/10.1145/2668904.2668936

1. INTRODUCTION
High quality video uses a variety of camera framings and movements edited together to effectively portray its content on screen. Producing such video from a live event, such as a theater performance or concert, requires source video from several cameras to capture multiple viewpoints. These individual camera videos, or rushes, are then edited together to create the final result. The requirement of a multi-camera shoot, including having multiple synchronized cameras each with a skilled operator capable of creating good framings and camera movements, makes it expensive and intrusive, and therefore impractical for many scenarios.

In this paper, we introduce an approach to create multiple synchronized videos from a live staged event that are suited for editing. Our key idea is to use a single non-moving camera that captures the entire field of view of the event and to simulate multiple cameras and their operators as a post-process, creating a synchronized set of rushes from the single source video. This strategy allows us to avoid the cost, complexity and intrusion of a multiple camera shoot.

A pan-tilt-zoom (PTZ) camera can be simulated simply by cropping and zooming sub-windows of the source frame. The challenge we are addressing here is the simulation of a competent camera operator: choosing the different virtual viewpoints such that their results are videos that are likely to be useful for editing. This means that each video must not only obey cinematic principles to be "good video" but also have properties that make it easier to edit with the other rushes.

Our solution, illustrated in Figure 1, takes as input a single master shot: a high resolution video taken from the viewpoint of a member of the audience. Because we consider staged events (such as theater performances or concerts), we can assume that this vantage point is sufficient to see all of the action (otherwise, the audience would miss it as well). Computer vision techniques are used as a pre-process to identify and track the actors. Our method then creates videos by simulating each of the cameras individually.
Each virtual camera takes a simple specification of which actors it should contain and how they should be framed (screen position and size). A novel optimization-based algorithm computes the dynamic framing for the camera, i.e. the movement of a sub-window over the master shot. The optimization considers the specification, cinematic principles of good camera movements, and requirements that the videos can be cut together.

The output of our method is a set of synchronized videos (or rushes) that provide multiple viewpoints of the staged event. These rushes can then be edited together using traditional multi-track video editing software. This allows a human editor to use their creativity, expertise in editing, and understanding of the content to create a properly edited video. Our approach automates the synthetic rush generation process, which is tedious to perform manually.

The central idea of our approach is that we can cast the problem of determining a virtual camera movement, that is, the size and position of the subregion of the master clip over time, as an optimization problem. Specifically, we cast it as a convex program, allowing for an efficient and scalable solution. The key insight is that many of the principles of good camera motion, including what camera shots are useful in editing, can be cast within this optimization framework.

Figure 1: Our method takes as input a high resolution video from a single viewpoint and outputs a set of synchronized subclips by breaking the groups into a series of smaller shots.

2. RELATED WORK
The problem of creating a multi-view edited video from limited input camera views has been considered previously in very specialized scenarios. The Virtual Videography system [4] used virtual PTZ to simulate multiple cameras from a lecture video. The MSLRCS [23] and AutoAuditorium [5] systems use a small set of fixed views including input of presentation slides. These systems achieve full automation, albeit at the expense of allowing for creativity and human input in the editing process. They are also limited to working in the more constrained environment of usually a single presenter in front of a chalkboard or a slide screen. Extending to the richer environment of multiple actors in more complex stagings would be challenging, especially as it requires a richer set of shot types to be generated. In contrast, our system addresses only the rush generation aspect, but can generate a rich range of shot types from more complex scenarios including multiple actors.

Remarkably little work has been devoted specifically to the problem of optimizing the composition of a virtual PTZ camera in post-production. Recent work has focused more on providing interactive control of the virtual PTZ camera in real time [7, 8, 9]. A method for automatic editing of basketball games was recently proposed by Carr et al. [6] employing both robotic and virtual PTZ cameras. The robotic camera follows the centroid of the detected players and then a virtual camera is employed to crop subregions of the obtained images to compensate for motor errors. Because they are targeting online applications, their method makes decisions in near real time (with a small delay), which can lead to sub-optimal solutions. To the best of our knowledge, our technique provides the first general solution to the problem of offline computation of a virtual pan-tilt-zoom camera shot given a list of visual targets over the entire duration of a recorded performance.
The question of a computational model to describe good camera movements, particularly in terms of subwindows of video, has been considered by work in video stabilization. While most video stabilization work simply aimed to remove jitter, Re-Cinematography [] formalized cinematic principles in terms of movement in the frame for computation. Grundmann et al. [3] showed that this can be formulated in an optimization framework. We build on this computational formulation of camera movement, extending it to the problem of multiple rush generation.

Our work is also related to the general problem of automatic video editing. Our main contribution in this area is concerned with generating rushes that can easily be cut together. To the best of our knowledge, this has not been addressed in previous work. Berthouzoz et al. [4] have looked at the problem of when to place cuts and transitions during dialogue scenes, based on audio analysis, assuming that the input shots are correctly framed. We solve the complementary problem of generating correctly framed input shots. Wang et al. [22] proposed an approach for automatic editing of home videos. Their method selects important parts of the scene based on detections and motion saliency and summarizes video by removing parts of the video both temporally and spatially. Our method does not remove parts of the video temporally and generates multiple shots for the entire duration of the video with controlled camera dynamics.

Arev et al. [2] looked at the problem of selecting rushes and editing them together using a large number of viewpoints taken by "social cameras" (i.e. cameras operated by the audience), possibly with reduced resolution. We are solving the opposite problem of using a single high-resolution viewpoint. Interestingly, multiple cameras focusing on the same attention point give a heuristic measure of the importance of that attention point, which can be used to automatically edit the video without deep understanding of the scene. In our case, we cannot rely on such a heuristic and instead focus on the rush generation problem, leaving the "final cut" to the user.
A general framework for automatically cropping sub-clips from a panoramic video and editing them together into a movie in real time has been proposed for the case of sports events [7, 20]. Operating in real time limits the capability of that system to tracking of simple targets. Because we are focusing on post-production, rather than live broadcast, we are able to perform tracking and recognition of multiple actors even in complex cases, which makes our method more generally applicable to rich cultural heritage content. Furthermore, our method can be used to generate arbitrary shot compositions, which better addresses the challenges of post-production.

Using computer vision to control cameras and create fully edited movies is not a new idea. Pinhanez and Bobick [9] have investigated the use of low-level motion tracking to control TV cameras, but their method is limited to closed-world domains (exemplified by cooking shows) and requires extensive domain-specific knowledge. In contrast, our approach is domain-agnostic and relies on general principles of image composition and film editing.

Our method assumes a pre-processing stage, where actors are tracked and the locations of their heads and floor projections are recovered in each frame of the input video. In this paper, we implemented a simple offline tracker based on actor specific detections [0], but any suitable tracker can be used instead, including interactive trackers introduced recently by Bailer et al. [3], which are well suited for the task of video production.

Figure 2: Input to the system are the upper body bounding boxes for each actor (shown with squares) and the points where the actors touch the stage floor (shown with cross markers).

3. VIRTUAL CAMERA SPECIFICATION
Given a single video covering the entire stage, our method allows the user to create different reframed shots or rushes with a simple description for each shot. Each reframed video is generated by computing a virtual camera trajectory over the master shot, i.e. choosing a cropped window over the master shot at each time. The core component of our approach is an optimization framework to compute the virtual camera trajectory with optimal composition, movement and cuttability. In this section, we present and explain our shot specification scheme in detail, and review some fundamental concepts in cinematography that motivate our optimization approach.

Several established video production books emphasize the role of composition and cutting in cinematography [, 6, 5, 8]. Mascelli [6] identifies five main topics in cinematography: camera angles (where to place the camera relative to the subject); continuity (making sure that actions are fully covered); cutting (making sure that transitions between cameras are possible); close-ups (making sure that important parts of actions are emphasised); and composition. Since we are working from a single camera angle (the camera position chosen for the input master shot) and the action is recorded in full continuity, camera angles and continuity are given as an input to our method. On the other hand, we need to take special care of composition and cutting when computing virtual camera movements to compensate for the constant camera angle.

3.1 Actor detection
The input to our method is a list of all actors present in the master shot and their bounding boxes. We assume they are given as (bx, by, bs, bh), where bx, by are the center coordinates of the upper body bounding box, bs is the actor's upper body size and bh is the actor's height on stage (all in pixels).
An actor's height on stage is the length from the top of the upper body bounding box to the point where the actor touches the stage floor. An example of the input bounding boxes and corresponding floor points is shown in Figure 2. For a standing posture, the stage height (bh) of an actor is approximately four times the size of the upper body bounding box (bs). But this ratio may not hold for other postures like sitting or bending, so it becomes important to take into account the floor points to compute the stage height of the actor.

3.2 Virtual camera geometry
The images of two virtual PTZ cameras with identical camera centers are related by a projective transformation (homography). In principle, a virtual PTZ camera image can therefore be specified exactly from the master camera image with four points (eight parameters). In practice, we use a rectangular cropping window with a fixed aspect ratio as a simplified model of a virtual PTZ camera. Thus our camera model is specified with just three parameters: the virtual camera center (fx, fy) and the virtual camera size fs (half of the height of the cropping window, as defined in Section 4), all in pixels, where (fx, fy, fs) lie within the space of the master image. This has the benefit that the virtual camera does not create unwanted geometric deformations in the virtual image and it preserves the resolution of the master camera image. The virtual camera image can therefore be obtained by isotropic re-sampling of the cropped image from the master shot.

3.3 Shot naming conventions
Based on common practice in video production [2], shots are specified by their subject matter and size. The subject matter for a shot is typically the list of actors who should be onscreen. Optionally, objects or stage locations can also be subject matters for a virtual camera shot, although this is not used in this paper. The size for a shot is typically defined by specifying the average screen size occupied by each actor. We use classical conventional shot size names including "long shot" (LS), "full shot" (FS) and "medium shot" (MS) as a specification. Our system lets users choose the subject matters and shot sizes for each virtual camera independently.
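The cropping-window camera model of Section 3.2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`crop_window`, `inside_master`, `zoom_factor`) are hypothetical, and it assumes that fs is half of the window height and that the aspect ratio is 16:9, as stated elsewhere in the paper.

```python
def crop_window(fx, fy, fs, aspect_w=16, aspect_h=9):
    """Return (left, top, right, bottom) of the virtual camera window.

    (fx, fy) is the window center and fs is half of the window height,
    all in master-shot pixels. The half width is (aspect_w / aspect_h) * fs.
    """
    half_w = fs * aspect_w / aspect_h
    return (fx - half_w, fy - fs, fx + half_w, fy + fs)


def inside_master(window, master_w, master_h):
    """True if the window lies entirely within the master frame."""
    left, top, right, bottom = window
    return left >= 0 and top >= 0 and right <= master_w and bottom <= master_h


def zoom_factor(fs, out_h):
    """Isotropic scale that maps the cropped window to an output of height out_h."""
    return out_h / (2.0 * fs)


# A window centered in a 1920 x 1080 master shot, half the master height tall.
window = crop_window(fx=960.0, fy=540.0, fs=270.0)
print(window)                             # (480.0, 270.0, 1440.0, 810.0)
print(inside_master(window, 1920, 1080))  # True
print(zoom_factor(270.0, 360))            # scale applied when resampling to 640 x 360
```

Because a single scale factor is applied in both directions, the resampled image has no geometric deformation, which is the property the text attributes to the simplified camera model.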
Those specifications are given for the entire duration of the master shot. For instance, the specification for a virtual camera can simply be a "full shot of actor A and actor B" or just "FS A,B". Figure 3 shows that a total of five virtual cameras can be specified using only two shot sizes and two actors.

Given the actor bounding boxes, the subject matters and shot sizes for all virtual cameras, the task is now to compute virtual camera trajectories resulting in good composition for each camera, and preserving possibilities for cutting between cameras.

3.4 Composition
As emphasized by Mascelli [6], good camera work begins with composition, which includes not just the composition of objects in a static frame, but also the composition of movements in a dynamic frame. Framing, or image placement, is the positioning of subject matter in the frame [6] and is the most important aspect of shot composition in our context. For our purpose, the most important framing principles emphasized by Mascelli are that (a) subjects should not come into contact with the image frame; (b) the bottom frame should not cut across the subject's joints (knees, waist, elbows, ankles) but should instead cut between joints; (c) subjects must be given more space in the direction they travel and the direction they look.

The subjects in our case are the actors in the shot specification. To ensure that the subjects do not come into contact with the image frame, we define an inclusion region for the given shot specification. This inclusion region is then encoded as a hard constraint in our optimization framework to make sure that the subjects are always nicely kept inside the virtual camera frame. Examples of inclusion regions with two different shot specifications are shown in Figure 4 and Figure 5 with shaded rectangles. The inclusion region is defined using four coordinates (xl, xr, yu, yb). The values xl, xr and yu denote the leftmost, rightmost and topmost coordinates of the upper body bounding boxes of the actors included in the shot. The coordinate yb is defined differently for the full shot and the medium shot.
For a single actor in a medium shot, yb is the topmost coordinate plus twice the size of its upper body bounding box. In the case of a full shot of an actor, yb is the point where the actor touches the stage floor. In case of multiple actors, the lower coordinate is computed for each actor individually and yb is taken as the maximum value among them.

Figure 3: The figure shows the possible set of framings with two actors (A,B) and two shot sizes, i.e. the medium shot (MS) and the full shot (FS). Even with a simple case of two actors a total of 6 camera choices are available including the original wide shot (WS).

Figure 4: Inclusion region for shot specification "FS A,B", i.e. the full shot of Actor A and Actor B. A full shot of two actors is the tightest window which keeps both of them entirely inside the frame.

Figure 5: Inclusion region for shot specification "MS A", i.e. the medium shot of Actor A. The boundaries of the nearest actors on the left and the right are also shown in this figure.

A shot size penalty in the optimization cost function tries to keep the virtual camera framing close to the inclusion region, maintaining a nice composition. The penalty is explained in detail in Section 4.2.

The shot size penalty and hard constraints ensure that the actors specified in the shot are nicely framed inside the virtual camera. But other actors may still come in contact with the virtual camera window. To avoid this we add another penalty term in the optimization framework which avoids chopping external actors and tries to keep them either fully outside or fully inside the virtual camera window. This is explained in detail in Section 4.6.

3.5 Cutting rules
Another important consideration in our work is to make sure that the virtual PTZ cameras produce shots that can easily be edited together. In this paper, we enforce cuttability of the shots by maintaining screen continuity of all actors and by creating only sparse virtual camera movements.

When cutting between cameras showing the same actors with different compositions, it is important to keep them in similar screen positions. To enforce such screen continuity, we give a preference for virtual shot compositions where the actors keep the screen positions from the master shot.
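One way to encode "keeping the screen position from the master shot" is a per-frame side indicator for the framed actor, in the spirit of the vector h used later in Section 4.2 (+1 if the actor is rightmost on stage, -1 if leftmost, 0 if between other actors). The sketch below is only an illustration: the function names are hypothetical, and the handling of a lone actor or of exact ties is an assumption that the paper does not specify.

```python
def side_indicator(actor_x, all_x):
    """Return +1 if the actor is rightmost on stage, -1 if leftmost, 0 otherwise.

    Assumption (not from the paper): when the actor is both leftmost and
    rightmost (alone on stage, or all positions tied), return 0 so that
    the framing stays centered.
    """
    is_leftmost = actor_x <= min(all_x)
    is_rightmost = actor_x >= max(all_x)
    if is_leftmost and is_rightmost:
        return 0
    if is_rightmost:
        return 1
    if is_leftmost:
        return -1
    return 0


def side_track(actor_track, all_tracks):
    """Per-frame side indicator for one actor, given x tracks of all actors."""
    return [
        side_indicator(actor_track[t], [track[t] for track in all_tracks])
        for t in range(len(actor_track))
    ]


# Horizontal centers (pixels) of three actors over three frames.
track_a = [100.0, 200.0, 900.0]   # A walks from stage left to stage right
track_b = [500.0, 500.0, 500.0]
track_c = [800.0, 800.0, 800.0]
tracks = [track_a, track_b, track_c]

print(side_track(track_a, tracks))  # [-1, -1, 1]
print(side_track(track_b, tracks))  # [0, 0, -1]
```

A medium shot of an actor with indicator -1 would then be composed with the actor towards the left of the frame, and with +1 towards the right.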
Figure 3 gives an example with two actors. The actor on the left is kept on the left side in the virtual camera shot composition "MS A" and the actor on the right is kept on the right side of the virtual camera shot composition "MS B".

Cutting during camera movement is difficult because the movements of the two cameras should be matched. As a result, film editors typically prefer to cut when none of the cameras are in motion. To maximize the number of opportunities for cutting, we therefore give a preference to virtual cameras with sparse pan, tilt and zoom movements. As will be explained in the next section, we enforce this preference by regularizing the first order derivative of the virtual camera coordinates in the L1-norm sense.

3.6 Camera movement
The importance of a steady camera has been highlighted by Thomson's Grammar of the shot [2]. It is of no use to prepare a well-composed shot only to have its image blurred or confused by an unstable camera. As discussed in the earlier section, a steady camera

is also beneficial for cutting. Also, the camera should not move without sufficient motivation, as it may appear puzzling to the viewer.

The goal of avoiding movement as much as possible still leaves the question of what kinds of movements to use when they are necessary. Thomson in his book [2] mentions that a good pan/tilt shot should comprise three components: a static period of the camera at the beginning, a smooth camera movement which "leads" the movement of the subject, and a static period of the camera at the end. As mentioned in the earlier section, we use L1-norm regularization over the first order derivative of the virtual camera coordinates to get the static camera behavior. In order to obtain smooth transitions between the static segments, we add an L1-norm regularization term over the third order derivative of the virtual camera coordinates. This will tend to give segments of constant acceleration and deceleration, creating the ease-in and ease-out effect. This is explained in detail in Sections 4.3 and 4.4.

An immediate problem which may arise while moving a cropping window inside the master shot is that the actual motion of the actor on stage may not be preserved inside the virtual camera. For example, an actor who is static on stage may appear to be moving or sliding inside the virtual camera frame, or an actor moving to the left on stage may appear to be moving to the right in the virtual camera frame. We introduce another penalty term in the optimization framework to preserve the apparent motion of the actors. This is explained in detail in Section 4.5.

4. OPTIMIZATION
In this section we show how the different cinematographic principles explained in the previous section are defined as different penalties or constraints and are combined in a single convex cost function which can be efficiently minimized to obtain the virtual camera trajectory for a given shot specification. We first summarize the notation and then explain each term of the cost function in detail.

Notation: The algorithm takes as input the bounding boxes $(bx_t^m, by_t^m, bs_t^m, bh_t^m)$ for each actor ($m = [1 : M]$) and time $t$.
The algorithm also takes as input the inclusion region $\{xl_t, xr_t, yu_t, yb_t\}$ and the external actor boundaries $\{xl'_t, xr'_t, xl''_t, xr''_t\}$, which are derived using the actor tracks and the shot specification. The algorithm outputs a cropping window $\xi = \{fx_t, fy_t, fs_t\}$ for each frame ($t = [1 : N]$), where $(fx_t, fy_t)$ are the coordinates of the center and $fs_t$ is the size, i.e. half of the height of the cropping window. We also define $x_t = \frac{1}{2}(xl_t + xr_t)$ as the midpoint of the left and the right coordinates of the inclusion region and $y_t = \frac{1}{2}(yu_t + yb_t)$ as the midpoint of the vertical inclusion coordinates. We define $s_t = \frac{1}{2}(yb_t - yu_t)$ as the desired size of the cropping window and $A_r$ as the required aspect ratio. The variable $fs_t$ denotes the vertical half length of the cropped window and the horizontal half length is given by $A_r fs_t$.

4.1 Inclusion constraints
We introduce two sets of hard constraints: first, that the cropping window should always lie within the master shot, and second, that the inclusion region should be enclosed within the cropping window. Hence, the leftmost coordinate of the cropping window $fx_t - A_r fs_t$ should be less than $xl_t$ and greater than zero. Similarly, the rightmost coordinate of the cropping window $fx_t + A_r fs_t$ should be greater than $xr_t$ and less than the width ($W$) of the master shot. Formally, we define the horizontal inclusion constraints as:

$$0 < fx_t - A_r fs_t \le xl_t \quad \text{and} \quad xr_t \le fx_t + A_r fs_t \le W. \tag{1}$$

Similarly, we define the vertical inclusion constraints:

$$0 < fy_t - fs_t \le yu_t \quad \text{and} \quad yb_t \le fy_t + fs_t \le H, \tag{2}$$

where $H$ is the height of the master shot.

4.2 Shot size penalty
As explained earlier, to maintain the desired composition the virtual camera cropping window should remain close to the inclusion region. So, we want $fx_t$ to be close to the midpoint of the left and right coordinates of the inclusion region. Similarly, we want $fy_t$ to be close to the midpoint of the top and the bottom coordinates of the inclusion region. Also, we want $fs_t$ to be close to the half height $s_t$ of the inclusion region. Any deviation from the desired position and size is penalized using a data term:

$$D(\xi) = \frac{1}{2} \sum_{t=1}^{N} \left( (fx_t - x_t)^2 + (fy_t - y_t)^2 + (fs_t - s_t)^2 \right). \tag{3}$$

This term by default always centers the given set of actors. This may not always be good for editing later, where an appropriate look-space is preferred. As discussed in Section 3.5, this problem can be resolved by maintaining the screen positions of the actor in the master shot. To do this, we pre-compute a vector $h$, where $h_t$ is 1 if the actor is rightmost on stage at time $t$; $-1$ if the actor is leftmost on stage; and 0 if the actor is between other actors. Now appropriate look-space can be created by modifying the term $(fx_t - x_t)$ in Equation 3 to $(fx_t + 0.7 A_r fs_t h_t - x_t)$.

4.3 First order L1-norm regularization
Simply computing compositions independently at each time step may lead to a noisy virtual camera motion. As discussed in the previous section, a steady camera behavior is necessary for a pleasant viewing experience. Also, long static camera segments are favorable for the purpose of cutting. To obtain the desired static camera behavior, we introduce an L1-norm regularization term over the first order derivative. When an L1-norm term is added to the objective to be minimized, or constrained, the solution typically has the argument of the L1-norm term sparse (i.e., with many exactly zero elements). Hence, adding an L1-norm term on the velocity will tend to give piecewise constant segments combined with fast transitions. This will filter out the noisy camera motion. The term is defined as follows:

$$L_{11}(\xi) = \sum_{t=1}^{N-1} \left( |fx_{t+1} - fx_t| + |fy_{t+1} - fy_t| + |fs_{t+1} - fs_t| \right). \tag{4}$$

This is illustrated in Figure 6 with a synthetic one dimensional signal. This signal can be interpreted as the x coordinate of the cropping window computed from the inclusion region derived from noisy actor tracks. The middle plot in Figure 6 shows the optimized signal minimizing the closeness term to the original signal (the shot size penalty in one dimension) with L1-norm regularization on the velocity term. We can observe that adding the L1-norm on velocity tends to give piecewise constant segments (with exactly zero motion). Using a more common L2 norm tends to spread the movement over many frames, leading to continual drifting motions, rather than distinct periods of zero movement.
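The contrast between the two norms can be checked on a toy example. The sketch below is not the optimization itself; it only evaluates the two velocity penalties on two hand-made one dimensional trajectories (hypothetical data) that both move the camera by 10 pixels, once in a single frame and once spread over ten frames.

```python
def velocity_l1(signal):
    """Sum of absolute first differences (the L1 penalty on velocity)."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


def velocity_l2_squared(signal):
    """Sum of squared first differences (a squared L2 penalty on velocity)."""
    return sum((b - a) ** 2 for a, b in zip(signal, signal[1:]))


def static_frames(signal):
    """Number of frame-to-frame transitions with exactly zero motion."""
    return sum(1 for a, b in zip(signal, signal[1:]) if b == a)


# Both trajectories have 20 samples and travel from 0 to 10.
step = [0.0] * 10 + [10.0] * 10
ramp = [0.0] * 5 + [float(i) for i in range(1, 10)] + [10.0] * 6

print(velocity_l1(step), velocity_l1(ramp))                  # 10.0 10.0
print(velocity_l2_squared(step), velocity_l2_squared(ramp))  # 100.0 10.0
print(static_frames(step), static_frames(ramp))              # 18 9
```

The L1 penalty charges the same amount for the same total displacement regardless of how it is distributed over time, so it does not discourage long static segments joined by short moves. The squared L2 penalty makes the abrupt move ten times more expensive than the spread-out one, which is why it favors slow drifts.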
4.4 Third order L1-norm regularization
When the camera moves it should move smoothly. The camera movement should start with a segment of constant acceleration and should end with a segment of constant deceleration. Using only the L1-norm on velocity will lead to a sudden start and stop of the camera (sharp corners in the middle plot of Figure 6). It also leads to a staircase artifact (slopes in the middle plot of Figure 6). Previous work [3] on camera stabilization has shown that a combination of first order L1-norm regularization with higher order L1-norm regularization on camera coordinates can be used in an optimization framework to obtain smooth camera trajectories with jerk free transitions between static segments. In the same spirit we introduce a third order L1-regularization term which is defined as follows:

$$L_{13}(\xi) = \sum_{t=1}^{N-3} \left( |fx_{t+3} - 3fx_{t+2} + 3fx_{t+1} - fx_t| + |fy_{t+3} - 3fy_{t+2} + 3fy_{t+1} - fy_t| + |fs_{t+3} - 3fs_{t+2} + 3fs_{t+1} - fs_t| \right). \tag{5}$$

Introducing the L1-norm on the third order derivative will give jerk free transitions at the start and stop of the camera movement, with segments of constant acceleration and deceleration. We can observe in the bottom plot of Figure 6 that using a combination of the L1-norm on both velocity and jerk gives the desired camera behavior, showing smooth transitions between long piecewise constant segments.

Figure 6: Top: Synthetic one dimensional data $x_t$, $t = [1 : N]$. Middle: Optimized signal $r$, minimizing the sum of squares closeness term to the original data with L1-norm regularization on velocity. Bottom: Optimized signal $r$, minimizing the sum of squares closeness term to the original data with L1-norm regularization on both velocity and jerk. (Three plots over the time index, from 0 to 1000.)

4.5 Apparent motion penalty
To preserve the sense of activity on stage, the actual motion of the actors should be the same as the apparent motion seen in the cropped window of the virtual camera. We introduce two different penalty terms to include this in the optimization cost function. The first term penalizes any cropping window motion if the actor included in the shot specification is static.
$$M_1(\xi) = \sum_{t=1}^{N-1} \sum_{m} \left( cx_t^m \, |fx_{t+1} - fx_t| + cy_t^m \, |fy_{t+1} - fy_t| + cs_t^m \, |fs_{t+1} - fs_t| \right). \tag{6}$$

Here $cx_t^m$, $cy_t^m$ and $cs_t^m$ are pre-stored binary vectors which take a value of 1 if the actor is static in the position and size coordinates respectively. For example, $cx_t^m$ is 1 if $|bx_{t+1}^m - bx_t^m|$ is less than a threshold, else it is 0, where $bx_t^m$ is the x-coordinate of the center of the bounding box of the given actor ($m$) at time ($t$). This penalty is added for each actor specified in the shot description. If the actor is static, a penalty equivalent to the cropping window motion is added to the cost function, else this term is zero.

The second term adds a penalty if the direction of apparent motion is not preserved:

$$M_2(\xi) = \sum_{t=1}^{N-1} \sum_{m} \Big( \max\big(0, -(\Delta bx_t^m - \Delta fx_t)\, \Delta bx_t^m\big) + \max\big(0, -(\Delta by_t^m - \Delta fy_t)\, \Delta by_t^m\big) + \max\big(0, -(\Delta bs_t^m - \Delta fs_t)\, \Delta bs_t^m\big) \Big). \tag{7}$$

Here, $\Delta bx_t^m = (bx_{t+1}^m - bx_t^m)$ gives the actual horizontal motion of the actor on stage, $\Delta fx_t = (fx_{t+1} - fx_t)$ is the horizontal motion of the virtual camera cropping window and $(\Delta bx_t^m - \Delta fx_t)$ is the apparent motion of the actor inside the virtual camera cropping window between consecutive time instants. The term $(\Delta bx_t^m - \Delta fx_t)\, \Delta bx_t^m$ is positive if the apparent direction of motion is the same as the actual direction of motion on the stage. A penalty is added if the term is negative, otherwise the penalty is zero. This is summed over the set of actors included in the shot description.

4.6 Pull-in or keep-out penalty
To avoid being cut by the frame, each actor must either be in or out of the virtual camera window. For actors included in the shot description, this is ensured by a hard constraint on the inclusion region. But other actors may still come in contact with the virtual camera frame (if they come in close vicinity of the inclusion region or cross it). So we would like to add a penalty if the rightmost coordinate of the cropping window $fx_t + A_r fs_t$ lies within the right external actor boundaries $xr'_t$ and $xr''_t$ (please refer to Figure 5). Similarly, we would like to add a penalty if the leftmost coordinate of the cropping window $fx_t - A_r fs_t$ lies within the left external actor boundaries $xl'_t$ and $xl''_t$ (please refer to Figure 5). But such a conjunction is not convex.
To approximate this within the convex framework, we use a heuristic that pre-computes binary vectors $tl$ and $tr$ which take a value of 1 if a touch event occurs from the left or the right respectively, and a value of zero otherwise. A touch event occurs if an outside actor comes in close vicinity of the inclusion region for a given shot specification. Using these two vectors, we define two separate penalty terms $E_{out}$ and $E_{in}$. The $E_{out}$ penalty is only applied when no touch event is occurring on the left or the right of the inclusion region. It is defined as follows:

$$E_{out}(\xi) = \sum_{t=1}^{N} \Big( (\neg tl_t) \max(0,\, xl'_t - fx_t + A_r fs_t) + (\neg tr_t) \max(0,\, fx_t + A_r fs_t - xr'_t) \Big). \tag{8}$$

When no touch event occurs, any instance of the cropping window frame touching the closest external actor on the left or the right

is penalized. For example, if the right edge of the cropping window $fx_t + A_r fs_t$ is greater than $xr'_t$, a penalty of $fx_t + A_r fs_t - xr'_t$ is added, otherwise the penalty is zero. Similarly, the penalty is also defined for the left edge. The no touch event in Equation 8 is defined as the logical not ($\neg$) of the left and the right touch vectors $tl$ and $tr$.

When a touch event occurs, the penalty term $E_{out}$ switches to $E_{in}$, which is defined as follows:

$$E_{in}(\xi) = \sum_{t=1}^{N} \Big( tl_t \max(0,\, fx_t - A_r fs_t - xl''_t) + tr_t \max(0,\, xr''_t - fx_t - A_r fs_t) \Big). \tag{9}$$

Here, $xl''_t$ and $xr''_t$ denote the leftmost and rightmost coordinates of the upper body bounding box of an outside actor (not included in the shot specification) touching from the left or the right side respectively. For example, if an outside actor is touching from the right, the rightmost coordinate of the cropping window $fx_t + A_r fs_t$ should be greater than the rightmost coordinate of the tracking window of the touching actor $xr''_t$, otherwise a penalty of $(xr''_t - fx_t - A_r fs_t)$ is added to the cost function.

Figure 7: Screenshot of a multiclip sequence generated using a set of four sequences in Final Cut Pro. In the middle we see the four sequences including the original master shot and three reframed sequences (MS A, MS B, FS All) which were generated using the proposed method. On the right, we see the edited sequence.

4.7 Energy minimization
Overall, the problem of finding the virtual camera trajectory given the actor bounding boxes and the shot specification can be summarized as the problem of minimizing a convex cost function with linear constraints, which is defined as follows:

$$\begin{aligned} \underset{fx_t,\, fy_t,\, fs_t}{\text{minimize}} \quad & D(\xi) + \lambda_1 L_{11}(\xi) + \lambda_2 L_{13}(\xi) + \lambda_3 E_{out}(\xi) + \lambda_4 E_{in}(\xi) + \lambda_5 M_1(\xi) + \lambda_6 M_2(\xi) \\ \text{subject to} \quad & 0 < fx_t - A_r fs_t \le xl_t, \quad xr_t \le fx_t + A_r fs_t \le W, \\ & 0 < fy_t - fs_t \le yu_t, \quad yb_t \le fy_t + fs_t \le H, \quad t = 1, \ldots, N. \end{aligned} \tag{10}$$

Here, $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$ and $\lambda_6$ are parameters. They can be adjusted to control the amount of regularization and the weight of each penalty term. In this paper, we use only two parameters with ($\lambda_1 = \lambda_2$) and ($\lambda_3 = \lambda_4 = \lambda_5 = \lambda_6$), giving a similar preference to each penalty term.
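As a small illustration of the pull-in or keep-out terms (Equations 8 and 9) that enter the cost above, the sketch below evaluates both hinge penalties for a single frame. It is not the authors' code: the function and argument names are hypothetical, "near" denotes the edge of an external actor's box closest to the inclusion region and "far" the opposite edge, and the aspect ratio is set to 2.0 only to keep the arithmetic easy to follow.

```python
def window_edges(fx, fs, aspect):
    """Left and right edges of the cropping window (fs is the half height)."""
    half_w = aspect * fs
    return fx - half_w, fx + half_w


def keep_out(fx, fs, aspect, touch_left, touch_right, xl_near, xr_near):
    """Per-frame keep-out penalty: active on a side only without a touch event."""
    left, right = window_edges(fx, fs, aspect)
    penalty = 0.0
    if not touch_left:
        penalty += max(0.0, xl_near - left)    # window reaches into the left actor
    if not touch_right:
        penalty += max(0.0, right - xr_near)   # window reaches into the right actor
    return penalty


def pull_in(fx, fs, aspect, touch_left, touch_right, xl_far, xr_far):
    """Per-frame pull-in penalty: active on a side only during a touch event."""
    left, right = window_edges(fx, fs, aspect)
    penalty = 0.0
    if touch_left:
        penalty += max(0.0, left - xl_far)     # left actor not fully inside
    if touch_right:
        penalty += max(0.0, xr_far - right)    # right actor not fully inside
    return penalty


# External actors occupy [100, 180] on the left and [680, 760] on the right.
# A window with fx=500, fs=100 spans [300, 700] and cuts the right actor.
print(keep_out(500.0, 100.0, 2.0, False, False, 180.0, 680.0))  # 20.0
print(pull_in(500.0, 100.0, 2.0, False, True, 100.0, 760.0))    # 60.0
print(pull_in(500.0, 130.0, 2.0, False, True, 100.0, 760.0))    # 0.0
```

Both penalties are maxima of zero and a function that is linear in the window parameters, so each is convex for fixed touch flags. Pre-computing the flags is what keeps the overall problem convex.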
These weights can be adjusted in special cases where a higher preference is required for a specific penalty term. One major advantage of our method is that any standard off-the-shelf convex optimization toolbox can be used to solve Equation 10. In our case we use cvx [12].

5. RESULTS

We present results on five different sequences from Arthur Miller's play Death of a Salesman. The sequences were recorded during rehearsals at Célestins, Théâtre de Lyon. Each of these sequences was recorded from the same viewpoint in Full HD (1920 × 1080). The sequences were chosen from scenes with two, three and four actors to demonstrate the versatility of our approach. For each of these master shots, we generate a variety of reframed sequences with different shot specifications. The reframed sequences are generated with a resolution of (640 × 360), maintaining the original 16:9 aspect ratio. These generated sequences can be directly imported and edited in standard video editing software as a multi-clip. Figure 7 shows an example of a multi-clip sequence consisting of the original sequence (master shot) and the three reframed sequences generated using our method. All the original videos and generated rushes are available online (https://team.inria.fr/imagine/vgandhi/cvmp_2014/).

Qualitative evaluation. The results on two different sequences are shown in Figure 8 and Figure 9. Each figure shows a few selected keyframes from the original video and the corresponding frames from the virtual camera sequences generated using our method. A plot of the horizontal position of the virtual camera trajectory against time is shown for each of the generated sequences. The generated sequences allow the editor to highlight details which may not be so easy to notice in the original sequence. They also provide much more variety to keep the viewer interested. We now discuss the generated sequences with respect to three important aspects of cinematography.

Composition. We can observe that the virtual cameras maintain a nice composition based on the shot specification. For example, the virtual cameras "MS A" and "MS B" in Figure 8 keep a stable medium shot of both actors, preventing the actors from coming into contact with the image frame. The generated shot also preserves the screen continuity; for example, the camera "MS B" keeps actor B at 1/3 right as she is positioned on the right side of the stage. Similarly, the camera "MS A" keeps actor A at 1/3 left as he enters from the left. Another example can be seen with camera "MS B" in Figure 9, where the camera keeps the actor in the center as the actor stays between two other actors on stage. The virtual cameras also avoid cropping the actors not mentioned in the shot specification. For example, the camera "MS B" in Figure 8 pulls in actor A when he comes close to actor B at keyframe 6. A similar example can be seen with camera "FS B,C" in Figure 9, which maintains a tight full shot of actors B and C but pulls in actor A when that actor comes close to the camera frame.

Figure 8: Reframing results on a sequence with two actors (A,B). The top row shows a set of selected keyframes from the original video. The corresponding keyframes from the three different virtual camera sequences are shown below. The three reframed sequences include the medium shot of each actor (MS A, MS B) and a full shot of both actors (FS All). A plot of the horizontal position of the virtual camera trajectory against time is shown for each of the three reframed sequences. The position of the keyframes on the plot is marked with red dots.

Camera motion. The plots in Figure 8 and Figure 9 show that the virtual camera path smoothly transitions between long static segments. Observe how the virtual camera remains static for long periods between keyframes 4 and 5 and between keyframes 6 and 7 in Figure 8, as the actors do not move significantly. When the camera moves, it moves smoothly, preserving the apparent motion of the actors on stage. For example, observe how the camera "MS A" in Figure 8 moves to the right as actor A enters the stage between keyframes 3 and 4.

Cuttability.
Good composition, screen continuity and long static cameras in the generated virtual camera sequences provide the editor with plenty of choices to cut. For example, the editor can switch among all four possibilities (including the original) at keyframes 4 and 5 in Figure 8. Similarly, the editor can switch among all five options at keyframe 1 in Figure 9. In a few scenarios the generated virtual cameras may not be cuttable; for example, cutting between cameras "MS A" and "MS B" at keyframe 6 in Figure 8 is not possible because it would create a jump cut. This happens because, due to the pull-in event, both cameras end up framing the same actors with slightly different compositions. In some cases, the virtual camera framing comes too close to the framing of the original master shot, and cutting between them may lead to a jump cut. An example of this can be seen in keyframe 3 of camera "FS All" in Figure 8.

Figure 9: Reframing results on a sequence with three actors (A,B,C). Selected keyframes from the original sequence and four virtual camera sequences are shown in this figure. The four virtual camera sequences include the full shot of all three actors (FS All), a full shot of two actors (FS B,C) and medium shots of two of the actors (MS A, MS B). A plot of the horizontal position of the virtual camera trajectory against time is shown for each of the four reframed sequences. The position of the selected keyframes on the plot is marked with red dots.

6. LIMITATIONS AND FUTURE WORK

Currently in our system, optimization is performed separately for each given shot specification. This may lead to jump cuts in a few cases, as discussed in the previous section. In future work, we plan to perform a joint optimization for the set of given shot specifications. The proposed work focuses on framing actors present on stage but does not allow objects to be included in the shot specification. In future work, we plan to integrate some simple objects in the shot naming conventions using standard object detectors. The Full HD master shots used in the experiments in this paper did not provide us enough resolution to go closer than medium shots. But the method can be easily applied to master shots with higher resolutions (4K or 6K), which will allow extending the range of shots to medium close-ups (MCU) and close-ups (CU). The reframed rushes obtained from our method are automatically annotated with actor and camera movements, which makes them suitable for automatic editing. In future work we plan to investigate the problem of automatic camera selection given the rushes.

7. CONCLUSION

We have presented a system which can generate multiple reframed sequences from a single viewpoint, taking into consideration the composition, camera movement and cutting aspects of cinematography. We have cast the problem of rush generation as a convex minimization problem and demonstrated qualitatively correct results in a variety of situations. To our knowledge this is the first time that the problem of rush generation has been addressed and validated experimentally. In effect, our method provides a cost-effective solution for multi-clip video editing from a single viewpoint.

8. ACKNOWLEDGMENT

This work was supported by Region Rhone Alpes (project Scenoptique) and ERC advanced grant EXPRESSIVE. We thank Claudia Stavisky and the complete cast and crew of Death of a Salesman for letting us record their rehearsals at Célestins, Théâtre de Lyon. We thank Hélène Chambon and Auxane Dutronc for their support, advice and expertise.
Part of this research was supported by French ANR research project Spectacle en ligne(s).

9. REFERENCES

[1] John Alton. Painting With Light. University of California Press, 1949.
[2] Ido Arev, Hyun Soo Park, Yaser Sheikh, Jessica K. Hodgins, and Ariel Shamir. Automatic editing of footage from multiple social cameras. ACM Transactions on Graphics (SIGGRAPH), 2014.
[3] Christian Bailer, Alain Pagani, and Didier Stricker. A user supported tracking framework for interactive video production. In Proceedings of the 10th European Conference on Visual Media Production, 2013.
[4] Floraine Berthouzoz, Wilmot Li, and Maneesh Agrawala. Tools for placing cuts and transitions in interview video. ACM Transactions on Graphics (SIGGRAPH), 2012.
[5] Michael Bianchi. Automatic video production of lectures using an intelligent and aware environment. In Proceedings of the 3rd international conference on Mobile and ubiquitous multimedia, 2004.
[6] Peter Carr, Michael Mistry, and Iain Matthews. Hybrid robotic/virtual pan-tilt-zom cameras for autonomous event recording. In Proceedings of the 21st ACM International Conference on Multimedia, 2013.
[7] F. Daniyal and A. Cavallaro. Multi-camera scheduling for video production. In Visual Media Production (CVMP), 2011 Conference for, 2011.
[8] Vamsidhar Reddy Gaddam, Ragnar Langseth, Sigurd Ljødal, Pierre Gurdjos, Vincent Charvillat, Carsten Griwodz, and Pål Halvorsen. Interactive zoom and panning from live panoramic video. In Proceedings of Network and Operating System Support on Digital Audio and Video Workshop, NOSSDAV '14, 2013.
[9] Vamsidhar Reddy Gaddam, Ragnar Langseth, Håkon Kvale Stensland, Pierre Gurdjos, Vincent Charvillat, Carsten Griwodz, Dag Johansen, and Pål Halvorsen. Be your own cameraman: Real-time support for zooming and panning into stored and live panoramic video. In Proceedings of the 5th ACM Multimedia Systems Conference, MMSys '14, 2014.
[10] Vineet Gandhi and Remi Ronfard. Detecting and Naming Actors in Movies using Generative Appearance Models. In Computer Vision and Pattern Recognition, 2013.
[11] Michael Gleicher and Feng Liu. Re-cinematography: Improving the camerawork of casual video. ACM Transactions on Multimedia Computing Communications and Applications (TOMCCAP), 5(1):1-28, 2008.
[12] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
[13] M. Grundmann, V. Kwatra, and I. Essa. Auto-directed video stabilization with robust l1 optimal camera paths. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[14] Rachel Heck, Michael Wallick, and Michael Gleicher. Virtual videography. ACM Trans. Multimedia Comput. Commun. Appl., 3(1), February 2007.
[15] Andrew Laszlo and Andrew Quicke. Every Frame a Rembrandt: Art and Practice of Cinematography. Focal Press, 2000.
[16] Joseph Mascelli. The Five C's of Cinematography: Motion Picture Filming Techniques. Silman-James Press, 1965.
[17] Aditya Mavlankar and Bernd Girod. Video streaming with interactive pan/tilt/zoom. In High-Quality Visual Experience, Signals and Communication Technology, pages 431-455, 2010.
[18] Gustavo Mercado. The Filmmaker's Eye: Learning (and Breaking) the Rules of Cinematic Composition. Focal Press, 2010.
[19] Claudio S. Pinhanez and Aaron F. Bobick. Intelligent studios modeling space and action to control tv cameras. Applied Artificial Intelligence, 11(4):285-305, 1997.
[20] O. Schreer, I. Feldmann, C. Weissig, P. Kauff, and R. Schafer. Ultrahigh-resolution panoramic imaging for format-agnostic video production. Proceedings of the IEEE, 101(1):99-114, January 2013.
[21] Roy Thompson and Christopher J. Bowen. Grammar of the shot. Focal Press, 2009.
[22] Tinghuai Wang, A. Mansfield, Rui Hu, and J.P. Collomosse. An evolutionary approach to automatic video editing. In Visual Media Production, 2009. CVMP '09. Conference for, pages 127-134, Nov 2009.
[23] Cha Zhang, Yong Rui, Jim Crawford, and Li-Wei He. An automated end-to-end lecture capture and broadcasting system. ACM Trans. Multimedia Comput. Commun. Appl., 2008.