Orthogonally modeling video structuration and annotation: exploiting the concept of granularity

M. Dumas, R. Lozano, M.-C. Fauvet, H. Martin and P.-C. Scholl
LSR-IMAG, University of Grenoble
B.P. 72, 38402 St-Martin d'Hères, France
Marlon.Dumas@imag.fr

Abstract

In this paper, we exploit the concept of granularity to design a video metadata model that addresses both logical structuration and content annotation in an orthogonal way. We then show that, thanks to this orthogonality, the proposed model properly captures the interactions between these two aspects, in the sense that annotations may be independently attached to any level of video structuration. In other words, an annotation attached to a scene is not treated in the same way as an annotation attached to every frame of that scene. We also investigate the implications of this orthogonality regarding querying and composition.

Introduction

For a long time, videos were managed by specific environments due to their huge size and demanding hardware requirements. Improvements in data compression, continuously increasing network transfer rates and processing capacity, and advances in operating systems now allow conventional computing systems to handle videos, thereby opening application areas such as video-on-demand, video conferencing and home video editing. The specificities of these applications have been addressed from several perspectives by a large body of research works, prototypes and commercial tools (Elmagarmid et al., 1997). Many of these efforts have focused on performance issues related to storage, real-time transfer in distributed environments, quality of service, synchronization, etc. However, as the needs of these applications evolve, other issues such as content-based retrieval, summary extraction, interactive editing and browsing become increasingly important. In particular, advances in automatic indexing techniques, together with video metadata standardization efforts such as MPEG-7, have accentuated the need for high-level representation models for video annotations. In this setting, we advocate that database technology is likely to play an increasingly important role in video management, provided that it offers high-level concepts and interfaces for storing, retrieving and editing videos and their associated metadata. Our work is intended to be a contribution in this direction.

Copyright (c) 2000, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
* Supported by SFERE-CONACYT, Mexico.

Video modeling in the context of databases entails numerous issues which may be addressed from a temporal representation viewpoint. In this paper we focus on two of them, namely structuration (also known as segmentation) and annotation (also known as indexing). One of the main originalities of our approach is that it allows annotations to be independently attached to any level of video structuration. This feature considerably increases the expressive power of the resulting model. Indeed, in most previous proposals (Oomoto and Tanaka, 1993; Gandhi and Robertson, 1994; Adali et al., 1996; Li et al., 1997), annotations may only be attached to the video frames and not to its shots or its scenes. This prevents the user from expressing facts which are true of a given scene without being true of each frame in that scene. For instance, stating that in a given scene two characters talk to each other is something that is true at the granularity of the scene, without being true of each frame in that scene.
Similarly, given a tennis match video structured into two levels respectively modeling the games and the sets of the match, stating that the score of a set is 6-4 does not entail anything about the score of a particular game in that set.

We also investigate the issues of metadata querying and composition in the context of our model. To address the first of these issues, we propose a set of algebraic operators on video annotations. Some of these operators allow the user to compare annotations attached to different levels of structuration. To address video composition on the other hand, we extend the semantics of some previously proposed composition operators, so as to propagate the structuration and the annotations of the argument videos into the resulting ones. Unlike previous proposals, the videos obtained through these composition operators therefore inherit some metadata from their sources.

For presentation purposes, we adopt throughout the paper the terminology and notations of the ODMG object database standard (Cattell et al., 2000). However, since the concepts and operators that we introduce are encapsulated into abstract datatypes, the proposed approach may be accommodated into other data models.

The rest of the paper is structured as follows. First, we describe our approach to model video structuration and annotation. Next, we discuss the implications of our modeling choices regarding querying and composition. Lastly, we
compare our proposal with previous ones and provide some concluding remarks.

Video structuration

Video structuration is the process of grouping the frames of a video according to their semantical correlations. For instance, a video is usually structured into shots, scenes, and video sequences. The structure of a video is analogous to the table of contents of a text document. As such, it can be used as a guide for browsing, as the entry point for building video abstracts, or as a means to retrieve the context of a particular piece of video.

There is a strong similarity between the structuration of a video and the structuration of time in calendars. Accordingly, we propose a video structuration model based on the concept of granularity. This concept has been extensively studied in several contexts such as temporal reasoning in artificial intelligence (Hobbs, 1985) and temporal databases (Wang et al., 1995). The definition of granularity that we adopt is a generalization of the one developed in this latter work. It is important to note that in this work, the concept of granularity is merely used to model hierarchical structuration, and not as an absolute temporal metric system. Indeed, when time is structured into regular granularities such as minutes and hours, these granularities may then be used to express distances between time points.

The basic concept of the video data model that we consider is that of time-line. At an abstract level, a time-line TL is a pair (D, ≤TL), made up of a finite set of time-marks D and a binary relation ≤TL over this set defining a total linear order. Time-lines model independent time-coordinate systems, with respect to which data and metadata may be temporally anchored. Each video has its own time-line, which captures the temporal relations between its elementary components (i.e. its frames).

A granularity over a time-line TL is defined as a partition of TL into convex sets: each of these convex sets is then seen as an atomic granule, and every time-mark of the time-line is approximated by the granule which contains it. A granularity is intrinsically attached to a unique time-line. The function that maps a granularity into a time-line is subsequently denoted TimeLine. Thus, the expression TimeLine(G) denotes the time-line that granularity G partitions.

A hierarchical structure is defined over the set of granularities of a time-line through the definition of the finer-than relation. A granularity G1 is finer-than another granularity G2 (noted G1 ≼ G2) if TimeLine(G1) = TimeLine(G2), and if each granule of G2 can be mapped to the set of consecutive granules of G1 that it contains. Given two granularities G1 and G2 such that G1 ≼ G2, two mapping functions are defined: one for expanding a granule of G2 into the set of granules of G1 it contains (noted expand), and the other for approximating a granule of G1 by the granule of G2 which contains it (noted approx):

  for every granule g2 of G2: expand(g2, G1) = { g1 ∈ G1 | g1 ⊆ g2 }
  for every granule g1 of G1: approx(g1, G2) = the unique g2 ∈ G2 such that g1 ⊆ g2

Any level of structuration of a video is naturally modeled by a granularity over its time-line. Indeed, the set of shots of a video may be seen as a set of disjoint intervals of time-marks covering the whole time-line of the video. Notice that it is also possible to view the set of frames as a degenerate partitioning of a video time-line, in which each partition is composed of a single time-mark. The advantage of this modeling is that the notion of frame is seen as a structuration level of the video in the same way as shots or scenes. This allows annotations to be attached to shots or scenes in the same way as they are attached to frames, as detailed in the next section.

The following ODMG interfaces show how the concept of granularity can be used to model a classical structuration schema, in which videos are decomposed into sequences, scenes and shots.
  interface TimeLine; /* not detailed here */

  interface Granularity {
    TimeLine TimeLine();
    bool finerthan(Granularity other);
  };

  interface Video {
    attribute TimeLine VTL;
    attribute Granularity sequences;
    attribute Granularity scenes;
    attribute Granularity shots;
    attribute Granularity frames;
    /* Other attributes are described in the next section. */
  };

For a given object V whose class implements the above Video interface, the following constraints apply:

- V.frames = MinGranularity(V.VTL), where MinGranularity denotes the function retrieving the finest granularity of a time-line (i.e. the granularity whose granules are all singletons). Such a granularity is usually called the chronon of a time-line in the temporal database literature (Tansel et al., 1993).
- TimeLine(V.sequences) = TimeLine(V.scenes) = TimeLine(V.shots) = V.VTL
- V.shots ≼ V.scenes ≼ V.sequences

The above Video interface may be specialized through inheritance, so as to handle situations where other structuration levels are needed. For instance, in the context of a video about a tennis match, structuration levels such as point, game and set may be introduced. Notice that two structuration levels within a single video are not necessarily comparable with respect to the finer-than relation. For instance, consider a video about a tennis match possessing two structuration levels, namely point and shot. If a granule of the shot granularity overlaps two granules of the point granularity (without containing either of them completely), then neither point ≼ shot nor shot ≼ point holds.
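As an aside, the granularity machinery described above can be pictured with a small executable sketch. The following Python fragment is purely illustrative: the names (TimeLine, Granularity, finer_than, expand, approx) and the set-based representation of granules are our own assumptions, not the prototype's actual implementation. It mirrors the definitions of this section: a granularity partitions a time-line into convex granules, finer-than is granule inclusion, and expand/approx are the two mapping functions.

  # Illustrative sketch only: all names are hypothetical and merely mirror
  # the concepts defined in the text.

  class TimeLine:
      def __init__(self, n_marks):
          self.marks = list(range(n_marks))  # totally ordered time-marks

  class Granularity:
      def __init__(self, timeline, bounds):
          # bounds: one (first_mark, last_mark) pair per convex granule,
          # covering the whole time-line without gaps or overlaps.
          self.timeline = timeline
          self.granules = [set(range(lo, hi + 1)) for lo, hi in bounds]

      def finer_than(self, other):
          # G1 finer-than G2 iff both partition the same time-line and every
          # granule of G1 is included in some granule of G2.
          return (self.timeline is other.timeline and
                  all(any(g1 <= g2 for g2 in other.granules) for g1 in self.granules))

  def expand(g2_index, coarse, fine):
      """Granules of the finer granularity contained in a coarse granule."""
      g2 = coarse.granules[g2_index]
      return [i for i, g1 in enumerate(fine.granules) if g1 <= g2]

  def approx(g1_index, fine, coarse):
      """The unique coarse granule containing a given fine granule."""
      g1 = fine.granules[g1_index]
      return next(i for i, g2 in enumerate(coarse.granules) if g1 <= g2)

  # A 9-frame video structured into frames, shots and scenes.
  tl = TimeLine(9)
  frames = Granularity(tl, [(i, i) for i in range(9)])   # chronon: singletons
  shots  = Granularity(tl, [(0, 2), (3, 5), (6, 8)])
  scenes = Granularity(tl, [(0, 5), (6, 8)])
  assert frames.finer_than(shots) and shots.finer_than(scenes)
  assert expand(0, scenes, shots) == [0, 1]   # scene 0 expands into shots 0 and 1
  assert approx(1, shots, scenes) == 0        # shot 1 is approximated by scene 0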
Video annotation

Given the current state of image processing technology, it is not reasonable in the general case to dynamically extract semantical information from a video during query evaluation over a video database. Instead, in order to formulate content-based queries over videos, their semantical content must be previously described as annotations. Annotations are generally stored separately from the raw video data. This approach is quite natural, since annotations are normally only needed during video querying(1), while access to raw video data is only required during video playing. In addition, this approach allows the raw video data to be shared among several virtual videos, without necessarily sharing all the corresponding annotations (Lozano and Martin, 1998).

(1) Closed captions are an exception, since they are annotations that must be displayed during video playing.

Our approach to video annotation is based upon the concepts of instant and sequence. An instant is an approximation of a connected region of a time-line by a granule. It is represented by a pair made up of a natural number (the position of the denoted granule) and a granularity. There is a subtle difference between an instant and a granule: a granule is simply a set of time-marks, while an instant is a reference to a granule, not the set of time-marks itself. In particular, stating that a fact is true at an instant (representing for instance a particular scene) does not mean that this information holds at each time-mark contained in the granule referenced by this instant.

A sequence is abstractly defined as a mapping from a set of instants defined over a single granularity, to a set of objects sharing a common structure. An example of a sequence is:

  s1 -> <main character: Linda, main action: cook>
  s2 -> <main character: Linda, main action: cook>
  s3 -> <main character: Linda, main action: cook>
  s6 -> <main character: Joe, main action: eat>
  s7 -> <main character: Joe, main action: eat>
  s8 -> <main character: Joe, main action: eat>

where s1, s2, s3, s6, s7, s8 denote instants at the same granularity (e.g. the set of scenes of a video), and <att name: value, att name: value> denotes an object.

The above concepts are embedded into two ADTs respectively named Instant and Sequence. The latter is parameterized by a type modeling the set of possible values taken by the sequence at a given instant. In the sequel, we will use the notation Sequence<T> to denote the instantiation of the type Sequence with a type T. Since sequences are modeled as instances of an ADT, the user does not need to worry about the internal representation of a sequence in order to reason about them. For instance, the above sequence could be internally represented in the following way:

  [s1..s3] -> <main character: Linda, main action: cook>
  [s6..s8] -> <main character: Joe, main action: eat>

where [s1..s3] and [s6..s8] denote intervals.

To achieve orthogonality between video structuration and annotation, we attach to each video as many sequences as there are levels of structuration defined over it. Each of these sequences has its own granularity, and it groups all the annotations attached to a given level of structuration, as depicted in Figure 1. The following ODMG interface (which completes the one given previously) summarizes the above discussion.

  interface Video {
    /* attributes described in the previous section go here */
    attribute Sequence<Object> sequencesAnnotations;
    attribute Sequence<Object> scenesAnnotations;
    attribute Sequence<Object> shotsAnnotations;
    attribute Sequence<Object> framesAnnotations;
  };

For a given object V whose class implements the above interface, the following constraints apply:

- Granularity(V.sequencesAnnotations) = V.sequences, where Granularity(SQ) stands for the granularity of the domain of sequence SQ.
- Granularity(V.scenesAnnotations) = V.scenes
- Granularity(V.shotsAnnotations) = V.shots
- Granularity(V.framesAnnotations) = V.frames

As stated before, the above interface is not intended to be used as is. Instead, the user may specialize it to account for particular kinds of videos.
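Before turning to such specializations, the behaviour of the Sequence ADT can be pictured with a small sketch. The Python fragment below is only an illustration: the class and method names are hypothetical, and the run-length encoding is one possible internal representation akin to the interval-based one mentioned above, not the ADT actually implemented in our prototype.

  # Illustrative sketch of the Sequence ADT; names are assumptions.

  class Instant:
      def __init__(self, position, granularity):
          self.position = position          # index of the referenced granule
          self.granularity = granularity    # e.g. the scenes of a given video

  class Sequence:
      """Mapping from instants of one granularity to annotation objects."""
      def __init__(self, granularity, runs):
          # runs: (first_position, last_position, value) triples, i.e. intervals
          # of consecutive instants sharing the same annotation value.
          self.granularity = granularity
          self.runs = runs

      def value_at(self, instant):
          assert instant.granularity == self.granularity
          for lo, hi, value in self.runs:
              if lo <= instant.position <= hi:
                  return value
          return None                       # no annotation at that instant

  # The scene-level example above: Linda cooks in scenes 1-3, Joe eats in 6-8.
  scenes_annotations = Sequence("scenes", [
      (1, 3, {"main character": "Linda", "main action": "cook"}),
      (6, 8, {"main character": "Joe",   "main action": "eat"}),
  ])
  print(scenes_annotations.value_at(Instant(2, "scenes")))   # Linda cooks

The Movie class below then specializes the generic Video interface with concrete annotation types.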
For example, consider a database managing a collection of movies identified by their title. The frames of each movie are annotated with the names of the actors appearing on them, while the scenes are annotated with the location where they take place. The following specification captures this situation, by introducing a class Movie which implements the interface Video. Notice that the attribute framesAnnotations inherited from this interface is specialized to account for the structure of the annotations managed in this context. A similar remark applies to scenesAnnotations.

  class Movie : Video (extent TheMovies, key title) {
    attribute string title;
    attribute Sequence<Location> scenesAnnotations;
    attribute Sequence<Set<string>> framesAnnotations;
  };

Querying multi-level annotated videos

A set of algebraic operators is defined on sequences. These operators are divided into five groups: pointwise mapping, join, restriction, partitioning and splitting. The first three correspond to the classical projection, join and selection operators of the relational algebra, while the latter two are proper to sequences. In fact, partitioning and splitting are tightly connected to two important characteristics of sequences: granularity and order. In the sequel, we focus on the restriction and the partitioning operators. For details on the other operators, the reader may refer to (Dumas et al., 1999).

There are two restriction operators on sequences. The first one (noted during) restricts the domain of a sequence to the instants lying in a given set of instants. The second one (noted when) restricts a sequence to those instants at which its value satisfies a given predicate. Such a predicate is given as a boolean function whose parameter denotes a value of the sequence's range. Concretely, the expression SQ as x when P(x) denotes the sequence obtained by restricting sequence SQ to those instants where its value satisfies predicate P (variable x stands for the value of the sequence at a given instant).
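A minimal sketch of these two restriction operators, assuming for brevity that a sequence is flattened into a mapping from instant positions (all at the same granularity) to values, could look as follows in Python; the function names simply mirror the operators of our OQL extension and are not part of it.

  # Minimal sketch: a sequence is taken here to be a dict {position: value};
  # this flat representation is an assumption made only for the example.

  def during(seq, instants):
      """Restrict the domain of a sequence to a given set of instants."""
      return {i: v for i, v in seq.items() if i in instants}

  def when(seq, predicate):
      """Keep only the instants at which the value satisfies the predicate."""
      return {i: v for i, v in seq.items() if predicate(v)}

  def duration(seq):
      """Cardinality of the domain of a sequence (used by query Q2 below)."""
      return len(seq)

  # Frame-level annotations: sets of actor names, as in the Movie class above.
  frames_annotations = {0: {"Ed"}, 1: {"Ed", "Tom"}, 2: {"Tom"}, 3: {"Al", "Hans"}}
  first_two = during(frames_annotations, {0, 1})             # frames 0 and 1 only
  with_tom  = when(frames_annotations, lambda actors: "Tom" in actors)
  print(duration(with_tom))                                  # 2 frames show Tom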
[Figure 1 (not reproduced) shows a video structured into two video sequences (vs1, vs2), five scenes (sc1-sc5), nine shots (sh1-sh9) and nine frames (f1-f9), together with the annotations attached to each level, e.g. <action: meeting>, <place: pub>, <actors: {Ed, Tom}>, <subtitle: - Hi Tom - Hi>.]

Figure 1: Sequences of annotations attached to a video. Video sequences (the coarsest units of structuration) are annotated with action descriptors, scenes with places, shots with closed captions, and frames with the character names.

The following two queries illustrate the above restriction operators.

Q1: Restriction (based on the domain). Retrieve the annotations attached to the frames appearing within the first 20 seconds of the movie entitled "Hopes" (assuming a constant presentation rate of 30 frames/second).

  select F.framesAnnotations during [0 @ F.frames .. (20 * 30) @ F.frames]
  from TheMovies as F
  where F.title = "Hopes"
  /* N @ G denotes the instant of granularity G whose position is N. The expression
     [I1..I2] denotes the closed interval of instants comprised between I1 and I2. */

Q2: Restriction (based on the range). In which movies is John present for more than 15 minutes (assuming the same presentation rate as above)?

  select F
  from TheMovies as F
  where duration(F.framesAnnotations as anActorSet when "John" in anActorSet) >= (15 * 60 * 30)
  /* The operator duration retrieves the cardinality of the domain of a sequence. */

Notice that in both of these queries, we assumed a constant presentation rate when converting a number of seconds into a number of frames. However, virtual videos may involve several raw videos possibly recorded (and therefore presented) under different frame rates. In these situations, the conversion function between seconds and frames becomes far more complex. To our knowledge, this problem has not been addressed by any of the existing video data models. Indeed, the models which offer conversion functions between metric temporal values (e.g. durations expressed in terms of seconds) and frame numbers assume a constant presentation rate (see for instance (Dionisio and Cardenas, 1998)). We believe that this is an interesting perspective to our work.

The partitioning operator, namely partitioned by, accounts for granularity change (zoom-out). More specifically, SQ partitioned by G2, SQ being at granularity G1 (G1 ≼ G2), makes a partition of SQ according to granularity G2. The result is a sequence, at granularity G2, of sequences at granularity G1, such that the value of the main sequence at any instant I (at granularity G2) is the restriction of SQ to the interval expand(I, G1). This is illustrated in Figure 2.

Since in our model annotations may be independently attached to each level of video structuration, it is necessary to provide a mechanism for switching between these levels. The partitioned by operator provides this mechanism in our OQL extension.

Q3: Granularity change: sequence partitioning. Retrieve the scenes of the movie entitled "Freedom" in which John is in at least half of the frames of the scene.

  select F.framesAnnotations partitioned by F.scenes as Scene
         when (duration(Scene as anActorSet when "John" in anActorSet) >= duration(Scene) / 2)
  from TheMovies as F
  where F.title = "Freedom"
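The effect of partitioned by can likewise be sketched in Python. In the fragment below, the expansion from scenes to frames is supplied explicitly as a mapping (playing the role of expand), and all names and the toy data are illustrative rather than part of the actual operator implementation; the final selection reproduces the criterion of query Q3.

  # Illustrative sketch of the "partitioned by" operator. A sequence is again a
  # dict {position: value} at a single (fine) granularity; the expansion maps
  # each coarse position to the list of fine positions it covers.

  def partitioned_by(seq, expansion):
      """Zoom out: for each coarse instant, the restriction of seq to it."""
      return {coarse: {fine: seq[fine] for fine in fines if fine in seq}
              for coarse, fines in expansion.items()}

  # Toy frame-level actor annotations and a frames-to-scenes expansion.
  frame_actors = {1: {"John"}, 2: {"John"}, 3: set(), 4: {"John"}, 5: set(), 6: set()}
  scenes_expansion = {1: [1, 2, 3], 2: [4, 5, 6]}            # scene -> its frames

  by_scene = partitioned_by(frame_actors, scenes_expansion)
  # Scenes where John appears in at least half of the frames (criterion of Q3):
  selected = [s for s, frames in by_scene.items()
              if sum("John" in a for a in frames.values()) >= len(frames) / 2]
  print(selected)                                            # [1]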
[Figure 2 (not reproduced) shows a sequence S at granularity frames and the result S' of S partitioned by scenes:

  S = [<f1, 1>, <f2, 1>, <f3, 1>, ..., <f1000, 2>, <f1001, 3>, <f1002, 3>, ..., <f2550, 2>, <f2551, 2>, <f2552, 2>]
  S partitioned by scenes = [<s1, [<f1, 1>, <f2, 1>, <f3, 1>, ..., <f1000, 2>]>,
                             <s2, [<f1001, 3>, <f1002, 3>, ..., <f2550, 2>, <f2551, 2>, <f2552, 2>]>]

  Assumptions: Granularity(S) = frames; frames finer-than scenes; expand(s1, frames) = [f1..f1000]; expand(s2, frames) = [f1001..f2552]. Post-condition: Granularity(S') = scenes.]

Figure 2: Partitioning operator. Instants at granularity frames (resp. scenes) are named f1, f2, etc. (resp. s1, s2, etc.).

In combination with the temporal join operator on sequences (not described here), the above partitioning operator may be used to express queries which simultaneously involve annotations at the granularity of the scene and at the granularity of the frame, as in: "Retrieve those scenes in the film Freedom which take place in Paris, and in which John is present in at least half of the frames of the scene."

Composing multi-level annotated videos

Most existing video data models (Hjelsvold et al., 1995; Weiss et al., 1995) provide operators for generating videos by extracting and composing fragments of existing ones. In (Lozano and Martin, 1998) for instance, we studied five such operators: extraction, concatenation, union, intersection and difference. However, none of these works considers the issue of propagating the metadata contained in the original videos when generating new ones through the above composition operators. As a result, the videos obtained through composition have no structuration, and some of the annotations attached to the original videos are not inherited by the generated ones. In (Dumas et al., 1999), we investigate this latter issue and extend the semantics of the five composition operators proposed in (Lozano and Martin, 1998), so as to preserve the structuration and annotations of the argument videos during composition. Figures 3 and 4 sketch the way in which this metadata propagation is carried out in the case of the extraction and the concatenation operators. The other three operators (union, intersection and difference) may be expressed in terms of these two.

Extraction. The extraction operator (noted Extract) takes as parameters a video and a set of instants denoting video frames, and generates a new video containing exactly these frames. The derivation of the structuration of the resulting video is essentially carried out through an operator Restrict(G, IS), which builds a new granularity by restricting granularity G to the granules referenced within the set of instants IS. For instance:

  Restrict({{f1, f2, f3}, {f4, f5, f6}, {f7, f8, f9}}, {f1, f3, f7}) = {{f'1, f'2}, {f'3}}

which means that the resulting granularity has two granules: the first one is composed of two time-marks (namely f'1 and f'2), which correspond to frames f1 and f3 in the original video, and the second granule contains a single time-mark (namely f'3), which corresponds to frame f7 in the original video.

[Figure 3 (not reproduced): a video V with its scenes, shots and frames f1-f9, and the video Extract(V, {f2, f3, f4, f8, f9}) derived from it, showing how the scenes, shots and frames granularities and their annotations are restricted and renumbered.]

Figure 3: Propagation of the structuration and annotations during video extraction. A, B and C stand for annotations attached to the shots, while a, b and c are annotations attached to the frames.

The derivation of the annotations, on the other hand, is performed in two steps. First, each of the sequences of annotations of the argument video is restricted to the set of frames given as a parameter to Extract. This first step is carried out through the operator during on sequences.
Then, the resulting sequences of annotations are transformed into equivalent sequences over the granularities generated by the Restrict operator. This second step is carried out using an operator called Compact. Intuitively, this operator maps a sequence with a non-convex domain into one with a convex domain. For example:

  Compact([<3, v1>, <5, v2>, <6, v2>, <9, v1>]) = [<1, v1>, <2, v2>, <3, v2>, <4, v1>]
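The two derivation steps can be summarized by the following illustrative Python fragment, in which restrict and compact are simplified stand-ins for the Restrict and Compact operators; representing granularities as plain lists of frame numbers is an assumption made only for the sketch.

  # Illustrative sketch of metadata propagation during extraction. A granularity
  # is a list of granules (lists of frame numbers); an annotation sequence is a
  # dict {frame number: value}.

  def restrict(granularity, kept_frames):
      """Keep, in each granule, only the frames retained by the extraction;
      drop granules that become empty (counterpart of Restrict(G, IS))."""
      restricted = [[f for f in granule if f in kept_frames] for granule in granularity]
      return [granule for granule in restricted if granule]

  def compact(seq):
      """Map a sequence with a non-convex domain onto consecutive positions 1..n,
      preserving the order of the instants (counterpart of Compact)."""
      return {new + 1: seq[old] for new, old in enumerate(sorted(seq))}

  shots = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
  print(restrict(shots, {1, 3, 7}))           # [[1, 3], [7]]: two granules survive
  print(compact({3: "v1", 5: "v2", 6: "v2", 9: "v1"}))
  # {1: 'v1', 2: 'v2', 3: 'v2', 4: 'v1'}, as in the Compact example above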
[Figure 4 (not reproduced): two videos V1 and V2, each with its scenes, shots and frames and their annotations, and the video obtained by concatenating them, in which the scenes, shots and frames of V2 are renumbered after those of V1.]

Figure 4: Propagation of the structuration and annotations during video concatenation.

Notice that the extraction operator is defined in such a way that a shot is kept in the resulting video if at least one of the frames within this shot is kept (and likewise for the other structuration levels). In some situations, this may lead to semantical problems. For instance, if the annotation attached to a scene is "Linda walks", and all the frames in this scene where Linda walks are taken off during the extraction process, then the resulting video has a scene whose annotation (i.e. "Linda walks") is inconsistent with respect to its contents. We believe that it is up to the user to cope with these situations. In other words, the user must make sure that the annotations derived by the Extract operator are consistent with the video contents, and correct them if needed.

Concatenation. The concatenation operator (noted ⊕) takes as parameters two videos and simply concatenates them. The structuration and the annotations of the resulting video are obtained by simply concatenating the granularities and sequences of the two argument videos. During this process the frames, shots and scenes of the second argument are renumbered. For instance, in the concatenation depicted in Figure 4, shot number 1 of video V2 becomes shot number 3 in video V1 ⊕ V2.

Related works

There is a wealth of works dealing with semi-structured formats for representing semantical video contents (i.e. annotations). The results of these works are currently being exploited by ongoing standardization efforts, such as MPEG-7 (Nack and Lindsay, 1999). In this approach, annotations are represented as segment descriptors, i.e. hierarchically structured entities attached to a particular segment of a document, or possibly to an entire document. MPEG-7 is intended to be flexible, in the sense that user-defined annotation types are allowed in addition to the ones provided by the standard. Our proposal is complementary to the above one, since it is not intended to define a low-level format for video storage, but rather a high-level data model for browsing, editing, composing and querying videos using the functionalities of an object DBMS.

The idea of using DBMS functionalities to store, browse, and query video contents is not novel: it has been applied in many prototype systems and data model proposals (Elmagarmid et al., 1997). The innovation of our approach lies in the orthogonality with which the different aspects of video modeling are tackled. Indeed, in our proposal annotations may be independently attached to each level of the video structuration, whereas in most of the existing video data models, e.g. OVID (Oomoto and Tanaka, 1993), AVIS (Adali et al., 1996) and CVOT (Li et al., 1997), annotations may only be attached to the frames(2). This approach prevents the user from expressing facts which are true of a given scene without being true of each frame in that scene. For instance, suppose that the user wishes to encode into a set of annotations the actions that take place in a movie. A straightforward labeling of the frames, as proposed in the above models, forces the user to attach an action to each frame. However, an action, such as two characters having a discussion, is not an instantaneous event whose truth can be associated to a particular image.
It is rather the outcome of the concatenation of atomic events such as two characters looking at each other, intermittently talking about a common subject, with some kind of pitch and voice intensity, etc. In other words, it is a fact whose truth can only be attached to a video segment with some semantic unit, i.e. a scene.

(2) In fact, an annotation in these proposals can be attached to an interval of frames, but the underlying semantics is the same as if the annotation was attached to each frame in that interval.

(Hjelsvold et al., 1995) is perhaps one of the closest works to ours. This paper describes a framework for modeling
videos based on the concept of frame stream. As in AVIS and CVOT, annotations are attached to intervals of frames, but the underlying semantics is the same as if the annotations were directly attached to the frames. Temporal queries are formulated by using comparison operators on the lower and upper bounds of these intervals, and not through high-level algebraic operators as in our proposal. Several composition operators are studied, but the authors neglect the issue of propagating the annotations of the primitive videos when generating new ones through composition. This latter remark also applies to other similar works such as (Weiss et al., 1995).

In some respects, the work reported here is close to that of (Seshadri et al., 1996), which proposes an ADT-driven model for sequences and an associated query language. However, this latter work focuses on discretely-varying data such as financial time-series, and the authors do not discuss how it may be extended to deal with stepwisely varying data such as video annotations.

Conclusion

We have presented a model for video metadata which properly captures the orthogonality between logical structuration and annotation within a video, and we have investigated the impacts of this orthogonality regarding video querying and composition. The concept of granularity plays a key role in our proposal. On the one hand, it is used to model the logical structuration of a video. On the other hand, it provides a foundation for expressing queries involving videos annotated at multiple levels of structuration.

The above model has been implemented as a prototype on top of the object-oriented DBMS O2. In its current stage, this prototype is composed of a library of classes implementing the abstract datatypes discussed throughout the paper, and of four user interfaces: a schema definition interface, a query language preprocessor, a video editing user interface and a video player. (Dumas et al., 1999) provides details on this implementation.

The work developed in this paper illustrates how temporal granularity may be applied to design simple yet powerful video metadata models. We are convinced that other similar experiences may be carried out using the concept of spatial granularity. Indeed, spatial granularities could be used to address several issues that arise when modeling the relationships within and among the images composing a video.

References

Adali, S., Candan, K. S., Chen, S., Erol, K., and Subrahmanian, V. (1996). The Advanced Video Information System: data structures and query processing. Multimedia Systems, 4:172-186.

Cattell, R., et al., editors (2000). The Object Database Standard: ODMG 3.0. Morgan Kaufmann.

Clifford, J. and Tuzhilin, A., editors (1995). Proc. of the Workshop on Recent Advances in Temporal Databases, Zurich, Switzerland. Springer Verlag.

Dionisio, J. and Cardenas, A. (1998). A unified data model for representing multimedia, time-line and simulation data. IEEE Transactions on Knowledge and Data Engineering, 10(5).

Dumas, M., Lozano, R., Fauvet, M.-C., Martin, H., and Scholl, P.-C. (1999). A sequence-based object-oriented model for video databases. Research Report RR 1024-I-LSR 12, LSR-IMAG, Grenoble (France).

Elmagarmid, A., Jiang, H., Abdelsalam, A., Joshi, A., and Ahmed, M. (1997). Video Database Systems: Issues, Products and Applications. Kluwer Academic Publishers.

Gandhi, M. and Robertson, E. (1994). A data model for audio-video data. In Proc. of the 6th Int. Conference on Management of Data (CISMOD), Bangalore (India). McGraw-Hill.

Hjelsvold, R., Midtstraum, R., and Sandsta, O. (1995). A temporal foundation of video databases. In (Clifford and Tuzhilin, 1995).

Hobbs, J. R. (1985). Granularity. In Proc. of the 9th International Joint Conference on Artificial Intelligence (IJCAI).

Li, J., Goralwalla, I., Ozsu, M., and Szafron, D. (1997).
Modeling video temporal relationships in an object database management system. In Proc. of IS&T/SPIE Int. Symposium on Electronic Imaging: Multimedia Computing and Networking, pages 80-91, San Jose, CA (USA).

Lozano, R. and Martin, H. (1998). Querying virtual videos with path and temporal expressions. In Proc. of the ACM Symposium on Applied Computing, Atlanta, Georgia (USA).

Nack, F. and Lindsay, A. T. (1999). Everything You Wanted to Know about MPEG-7. IEEE MultiMedia Magazine, 6(3-4):65-77.

Oomoto, E. and Tanaka, K. (1993). OVID: Design and implementation of a video-object database system. IEEE Transactions on Knowledge and Data Engineering, 5(4).

Seshadri, P., Livny, M., and Ramakrishnan, R. (1996). The design and implementation of a sequence database system. In Proc. of the Int. Conference on Very Large Databases (VLDB), pages 99-110, Bombay, India.

Tansel, A. U., Clifford, J., Gadia, S., Jajodia, S., Segev, A., and Snodgrass, R., editors (1993). Temporal Databases. The Benjamin/Cummings Publishing Company.

Wang, X. S., Jajodia, S., and Subrahmanian, V. (1995). Temporal modules: an approach toward federated temporal databases. Information Systems, 82.

Weiss, R., Duda, A., and Gifford, D. (1995). Composition and search with a video algebra. IEEE Multimedia, 2(1):12-25.