The VGIST Multi-Modal Data Management and Query System

Dinesh Govindaraju
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213, USA
dineshg@cmu.edu

Manuela Veloso
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213, USA
veloso@cs.cmu.edu

Paul Rybski
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213, USA
prybski@cs.cmu.edu

ABSTRACT
This paper presents the VGIST data management and query system, which allows for the storage and retrieval of multi-modal data gathered in relation to a video segment. The system consists of a logging format which provides a central mechanism for storing and archiving semantic information about the video annotated by various sources. The VGIST system also includes a scripting query language which allows users to execute customized queries on the annotated data to carry out higher level inference and to retrieve specific segments of video.

General Terms
Management, Human Factors, Standardization.

Keywords
TBD

1. INTRODUCTION
As the costs of storage media and personal video recording equipment decrease with time, the use of video as an archival tool has become increasingly common. The video medium provides users with a rich and immersive source of data about past events, entities and environments, and presents this information in an engaging manner. The image sequence presented in a video segment can also be used as a base of reference, providing a convenient temporal framework to which we can attach other semantic information inferred about the event being recorded. This semantic information can be gathered from external modular annotation sources which, by processing the information contained in individual image frames, can provide higher level inferences. Examples of semantic information generated by external annotation modules that can further describe a video segment include the positions of tracked faces over time and recognized speech. Describing video semantically in this way opens up the possibility of integrating video-centric information to provide a heightened ability to reason about the entities and events presented in the sequence of images. This integration and storage also helps users analyze video data in the future and provides convenient access to the wealth of information contained in past videos.

2. PROBLEM DESCRIPTION
Unfortunately, the majority of recorded video that we see today does not come tagged with semantic data which we can quickly search and run higher level inferences on. As a result, when video is used as a medium to describe past events, we find it limited: it cannot convey the captured information as richly as the events actually occurred. The most probable reason for this is that there is no commonly used, standardized logging mechanism with which to store and meaningfully describe what is being shown in a particular video. Given that we can potentially collect a large amount of semantic information about a particular sequence of images by using external annotation modules, efficiently integrating and managing this abundant data is a non-trivial task. As such, any logging tool used must allow for a simple, module-independent common reference format with which to store data from various annotators. The logging format must also be easily extensible so as to accommodate information from new annotation modules as they are created.
Importing data from a new annotation source should therefore not require so much work as to render the process impractical. Given that any tool created to help describe video data functions primarily to aid users in managing their data, the tool should also be able to visually represent video information in a way that is easy for the user to understand. In particular, the tool should allow the user to clearly see how the annotated semantic information relates back to the actual frames of video which it describes. Since video data is often kept to archive information in case it is needed in the future, any system which consolidates and manages the multi-modal information gathered about the recorded event must also enable the user to query and retrieve specific segments of video. Ideally, the querying tool should be customizable, extensible, and allow for the design of small user-centric query components which can then be re-used in structuring more complex queries.

3. SELECTED APPROACH
In working towards the aim of providing such a tool, we have devised the VGIST data management and query system to provide a frame-based mechanism for storing and querying semantic information related to a segment of video. This storage is done in terms of higher level annotations added to each frame of video data. A per-frame baseline resolution was chosen for the log format as this allows annotator output to be aligned to the frames of the image sequence found in the video.
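Conceptually, this frame-aligned view means that every annotation, whichever module produced it, is keyed by the frame index it describes. The short Python sketch below is our own illustration of that idea (the class, method and attribute names are not part of VGIST):

    # Illustrative sketch of frame-aligned annotation storage (not the VGIST implementation;
    # class and method names are our own).
    from collections import defaultdict

    class FrameAlignedAnnotations:
        """Holds immutable attributes once and mutable attributes per frame index."""
        def __init__(self):
            self.immutable = {}               # (entity, name) -> value
            self.mutable = defaultdict(dict)  # frame index -> {(entity, name): value}

        def set_immutable(self, entity, name, value):
            self.immutable[(entity, name)] = value

        def set_mutable(self, frame, entity, name, value):
            self.mutable[frame][(entity, name)] = value

    # Two different annotation modules contribute attributes for the same frame.
    log = FrameAlignedAnnotations()
    log.set_immutable("person0", "name", "Dinesh")
    log.set_mutable(1, "person0", "face_position", (120, 80))
    log.set_mutable(1, "person0", "speech", "hello")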
The VGIST system includes the VGIST graphical user interface, which encapsulates the logging and querying functionality, as shown in Fig. 1.

Figure 1. VGIST System Overview (video, user annotations and automated annotations such as the face analyzer and speech analyzer feed the VGIST log; the VGIST GUI runs queries and displays frames)

Logging via the VGIST log format consists of storing semantic information about the video in the form of name-value attribute pairs. Log attributes are annotations which are added either by users or by external modules, and are either immutable, in which case they do not change throughout the video (such as a person's name), or mutable, in which case they are specified for each frame of video (such as a person's physical location). The query system then acts as a filter which, when run, returns the subset of video frames that satisfy some query request.

3.1 VGIST GUI
The VGIST graphical user interface allows the system to play back the video sequence while overlaying annotated information which is pulled from the underlying VGIST log file, as per Fig. 2.

Figure 2. VGIST GUI frame with overlaid annotations

3.2 VGIST LOG FORMAT
The VGIST log format provides a mechanism by which information from multiple annotators can be incorporated into a common log file on a frame by frame basis. The log file itself was chosen to be a plain ASCII text file so as to provide a convenient means of viewing the file on various platforms. The VGIST log file contains information in the VGIST log format, which consists of blocks of text separated by an empty line. Each log file contains one meta text block, one immutable text block and a number of mutable text blocks (one for each frame), which specify typing information, describe immutable attributes and describe mutable attributes respectively. Each line in a text block is made up of at least four ASCII character strings, none containing spaces or tab characters. The layout of the text blocks in each VGIST log file is shown in Fig. 3.

  meta text block
  immutable text block
  frame 1 mutable block
  frame 2 mutable block
  ...
Figure 3. Layout of VGIST log file
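To make this block layout concrete, the following short Python sketch (our own illustration, not part of the VGIST distribution; it assumes blocks are separated by a single blank line, with the meta block first and the immutable block second, as described above) splits a log file into its meta block, immutable block and per-frame mutable blocks:

    # Illustrative splitter for the block layout of Fig. 3.
    def split_blocks(text):
        """Return (meta_block, immutable_block, mutable_blocks) as lists of token lists."""
        blocks = [b for b in text.strip().split("\n\n") if b.strip()]
        parsed = [[line.split() for line in b.splitlines() if line.strip()] for b in blocks]
        return parsed[0], parsed[1], parsed[2:]

    # "meeting.vgist" is a hypothetical file name used only for this example.
    meta_block, immutable_block, mutable_blocks = split_blocks(open("meeting.vgist").read())
    # mutable_blocks[i] then holds the name-value pairs annotated on frame i + 1.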
3.2.1 Meta Text Block
The meta text block is the first block found at the top of the VGIST log file and provides a means to explicitly declare the types of certain variables found later on in the file. Each line of the meta text block consists of four literals; the corresponding grammar is specified in Table 1. Non-italicized text appearing in Table 1, as well as in subsequent tables concerning language grammar, denotes text as it will appear in the text file, while italicized text denotes terms in the grammar. The term name (a name literal) refers to a string of ASCII characters which does not contain space or tab characters.

Table 1. Grammar of line in meta text block
  meta category name type
  meta ::= meta
  category ::= immutable | mutable
  name ::= name
  type ::= int | float

The category and name terms describe the attribute for which the typing rule is being specified. The category term specifies whether the attribute is immutable or mutable, and the name term specifies the attribute's name, which will be used to identify it subsequently in the log file. Given that the log file comprises only text, there is a strict parsing hierarchy to determine the type of the value of each variable. For example, if the parser encounters a name-value pair in the immutable text block that assigns the value "123" to the immutable attribute named "myattr", this value is automatically taken to be of type int. The meta text block then allows the user to explicitly fix the type of variables encountered later on in the text file. For example, if the line shown in Table 2 is included in the meta text block, then when the value "123" is subsequently encountered for the attribute "myattr", the value will be cast as a float.

Table 2. Line in meta text block specifying type of immutable attribute "myattr"
  meta immutable myattr float

3.2.2 Immutable Text Block
The immutable text block immediately follows the meta text block and contains name-value pairs of immutable attributes in the meeting. Each line in the immutable text block adheres to the grammar shown in Table 3.

Table 3. Grammar of line in immutable text block
  entity immutable name value
  entity ::= name | meeting
  immutable ::= immutable
  name ::= name
  value ::= name | name value

Each line in the immutable text block describes either an attribute relating to a person or object in the meeting, or some aspect of the meeting itself; the entity term is named accordingly. The value term contained in each line of the immutable text block is either one name literal or a set of name literals separated by spaces. For example, the line shown in Table 4 would indicate that the immutable attribute referred to by the name "age", belonging to the entity "Alan", is assigned the value 8.

Table 4. Line in immutable text block specifying "age" of "Alan"
  Alan immutable age 8

3.2.3 Mutable Text Block
A set of mutable text blocks immediately follows the immutable text block and contains name-value pairs of mutable attributes regarding the video segment. Each block in this set is assigned sequentially to each frame of video as per Fig. 3, resulting in the number of mutable text blocks being equal to the number of frames in the video sequence. Each line in a mutable text block adheres to the grammar shown in Table 5.

Table 5. Grammar of line in mutable text block
  entity mutable name value
  entity ::= name | meeting
  mutable ::= mutable
  name ::= name
  value ::= name | name value

Each line in the mutable text block describes either an attribute relating to a person or object in the meeting or some aspect of the meeting itself, and is therefore functionally similar to a line of text in an immutable text block. Likewise, the entity term is named accordingly. The value term contained in each line of the mutable text block is either one name literal or a set of name literals separated by spaces. For example, the line shown in Table 6 would indicate that, in this frame, the mutable attribute referred to by the name "action", belonging to the entity "Alan", is assigned the value "SIT".

Table 6. Line in mutable text block specifying "action" of "Alan"
  Alan mutable action SIT
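The typing rules can be summarized compactly. The sketch below is our own illustration (the paper only states that an unadorned value such as "123" defaults to int; the fallback for non-numeric values is an assumption on our part) of how a meta declaration could override the default parsing of a value:

    # Illustrative value typing: a meta declaration overrides the default parsing
    # hierarchy (assumption: values that do not parse as numbers remain name literals).
    def parse_value(raw, declared_type=None):
        if declared_type == "float":
            return float(raw)
        if declared_type == "int":
            return int(raw)
        try:
            return int(raw)   # e.g. "123" is automatically taken to be an int
        except ValueError:
            return raw        # otherwise treated as a name literal

    meta = {("immutable", "myattr"): "float"}                # from: meta immutable myattr float
    parse_value("123", meta.get(("immutable", "myattr")))    # -> 123.0
    parse_value("123")                                       # -> 123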
3.3 VGIST QUERY LANGUAGE
Once the various attributes of a meeting have been integrated into a VGIST log file using the VGIST logging format, queries can be run to extract information about the meeting. Queries are run via scripts which are structured using the VGIST query language and consist of variable declarations, function definitions and query execution statements. The query execution engine operates by applying a set of selection criteria to each frame of video and outputting those frames whose mutable attributes satisfy the selection criteria. The query engine also maintains a set of environment variables which include both the immutable attribute name-value pairs and any user-defined variables introduced at run time via the query script. These environment variables then allow the user to augment the use of defined functions in query execution, as we will see later on.
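Viewed abstractly, query execution is a filter over the per-frame mutable blocks. The sketch below is our own rendering of that idea, not the VGIST engine itself; it returns the indices of the frames that satisfy every selection criterion:

    # Illustrative frame filter: each predicate sees one frame's mutable attributes
    # together with the environment variables (immutable attributes plus user variables).
    def run_query(mutable_frames, env, predicates):
        selected = []
        for frame_index, attrs in enumerate(mutable_frames, start=1):
            if all(pred(attrs, env) for pred in predicates):
                selected.append(frame_index)
        return selected

    # Example predicate corresponding to: person0 mutable action == SIT
    is_sitting = lambda attrs, env: attrs.get(("person0", "action")) == "SIT"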
3.3.1 Variable Declaration
The variable declaration statement introduces the notion of user-defined variables to the VGIST query language and provides a convenient way for the user to specify values which can only be determined at run time. The grammar of a variable declaration statement is shown in Table 7.

Table 7. Grammar of variable declaration statement
  variable name value
  variable ::= VARIABLE
  name ::= name
  value ::= name | name value

For example, after declaring the variable "fps" using the statement shown in Table 8, the user can define functions in terms of this variable and simply change the hard-coded value to the desired one when the query is executed.

Table 8. Example of variable declaration
  VARIABLE fps 32.0

3.3.2 Function Definition
The function definition statement introduces the concept of simple functions to the VGIST query language and allows the user to describe the form of a functional component which can be re-used during subsequent query execution. The grammar of a function definition statement is shown in Table 9.

Table 9. Grammar of function definition statement
  define function as entity category name conditional value_term
  define ::= DEFINE
  function ::= name
  as ::= AS
  entity ::= name | meeting | env_var
  category ::= immutable | mutable | env_var
  name ::= name | env_var
  conditional ::= > | < | >= | <= | == | !=
  value_term ::= name_term | numerical_term
  name_term ::= name | env_var | parameter
  numerical_term ::= numerical_term operator numerical_term | (numerical_term) | number | env_var | parameter
  operator ::= * | / | + | -
  env_var ::= %name%
  parameter ::= ?number?
  number ::= (numerical integer)

The env_var term denotes a reference to an environment variable with the same name as the name literal found in the env_var term; its value is looked up at run time prior to query execution. The parameter term denotes a reference to a parameter passed into the function when it is called at run time; the numerical integer found in the parameter term refers to the index of the parameter whose value will be used. The conditional term evaluates the value of the attribute referenced by the entity, category and name terms against the value referenced by the value_term term. This evaluation is done on name literals if the conditional term is "==" or "!=", the value of the referenced attribute is a name literal, and the value_term term is a name_term; otherwise the evaluation is numerical.
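A defined function is, in effect, a condition template: %name% references are resolved from the environment and ?N? references from the call-site parameters before the condition is evaluated. The sketch below is our own illustration of that substitution step (not code from the paper), expanding a call such as get_sec(3):

    import re

    # Illustrative expansion of env vars (%name%) and parameters (?N?) in a function body.
    def expand(body, params, env):
        body = re.sub(r"\?(\d+)\?", lambda m: str(params[int(m.group(1)) - 1]), body)
        body = re.sub(r"%(\w+)%", lambda m: str(env[m.group(1)]), body)
        return body

    expand("meeting mutable timestamp <= (?1? * %fps%)", [3], {"fps": 1.0})
    # -> "meeting mutable timestamp <= (3 * 1.0)"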
3.3.3 Query Execution
The query execution statement initiates the execution of a query and adheres to the grammar shown in Table 10.

Table 10. Grammar of query execution statement
  select output_term where_clause
  select ::= SELECT
  output_term ::= frames | timestamp
  where_clause ::= WHERE functional | where_clause WHERE functional
  functional ::= defined_function | direct_function
  defined_function ::= name(parameter_list) | name()
  parameter_list ::= parameter | parameter,parameter_list
  parameter ::= name | number | env_var
  direct_function ::= entity category name conditional value_term
  entity ::= name | meeting | env_var
  category ::= immutable | mutable | env_var
  name ::= name | env_var
  conditional ::= > | < | >= | <= | == | !=
  value_term ::= value_term operator value_term | (value_term) | name | number | env_var
  operator ::= * | / + | -
  env_var ::= %name%
  number ::= (numerical integer)

The env_var, name and conditional terms operate in an identical manner to those in the grammar of function definition statements found in Table 9. The defined_function term allows the user to reuse a function that was previously defined using a function definition statement. Query execution works on a particular video log by running through the mutable attribute set of each frame (corresponding to each mutable text block) and checking which frames satisfy the set of conditions specified in the query execution statement. It should also be noted that multiple where_clause terms can be compounded to check each frame against more than one condition statement. The desired output (specified by the output_term term) is then generated by the execution engine.

4. USAGE EXAMPLE
Fig. 4 shows the VGIST log file of a recorded meeting in which each mutable text block describes the attribute "timestamp", which corresponds to the frame index, as well as an attribute describing the action of the entity "person0".

  meta mutable timestamp float

  meeting immutable date 20050601
  person0 immutable name Dinesh

  meeting mutable timestamp 1
  person0 mutable action SIT

  meeting mutable timestamp 2
  person0 mutable action SIT

  meeting mutable timestamp 3
  person0 mutable action STAND
  ...
Figure 4. Sample VGIST log file

In the VGIST query script shown in Fig. 5, an environment variable named "fps" is defined to indicate the frame rate of the video. It is then used in the definition of the function "get_sec", whose purpose is to check whether the considered frame occurs within a particular number of seconds since the video began (passed in as a function parameter). The function "get_action" is then defined and serves to check whether the entity specified by the first input parameter is engaged in the action specified by the second input parameter. Both functions are then used when executing the query, which seeks frames that occur within the first 3 seconds of the video and in which the entity "person0" is engaged in the activity "SIT".

  VARIABLE fps 1.0
  DEFINE get_sec AS meeting mutable timestamp <= (?1? * %fps%)
  DEFINE get_action AS ?1? mutable action == ?2?
  SELECT frames WHERE get_sec(3) WHERE get_action(person0,SIT)
Figure 5. Sample VGIST query script

In this example, when the query has finished executing, it outputs the first two frames of video.
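The grammar-level description above can also be read procedurally. The sketch below is our own Python rendering of the example (with frame data corresponding to the sample log of Fig. 4); it is not the published engine, but it reproduces the result of the query in Fig. 5:

    # Frame data corresponding to the sample log of Fig. 4; evaluation logic is our own sketch.
    env = {"fps": 1.0}
    frames = [                                   # one dict per mutable text block
        {("meeting", "timestamp"): 1.0, ("person0", "action"): "SIT"},
        {("meeting", "timestamp"): 2.0, ("person0", "action"): "SIT"},
        {("meeting", "timestamp"): 3.0, ("person0", "action"): "STAND"},
    ]

    def get_sec(attrs, seconds):            # DEFINE get_sec AS meeting mutable timestamp <= (?1? * %fps%)
        return attrs[("meeting", "timestamp")] <= seconds * env["fps"]

    def get_action(attrs, entity, action):  # DEFINE get_action AS ?1? mutable action == ?2?
        return attrs[(entity, "action")] == action

    # SELECT frames WHERE get_sec(3) WHERE get_action(person0,SIT)
    selected = [i + 1 for i, attrs in enumerate(frames)
                if get_sec(attrs, 3) and get_action(attrs, "person0", "SIT")]
    # selected == [1, 2]: the first two frames, matching the stated output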
5. RELATED WORK
TBD

6. CONCLUSION AND FUTURE WORK
In this paper we have presented the VGIST system for managing and retrieving multi-modal semantic data gathered in relation to a video segment. The system employs a text-based logging format which is easy to view and modify across platforms and which makes it convenient to augment the logs with information from new annotators. The VGIST system also incorporates a customizable, script-based query language which, by operating on VGIST log files, allows for higher level queries involving inference on annotations from multiple sources. A possible improvement to the logging format is to introduce statements which allow for retroactively modifying attributes found in earlier frames. This would enable log files to be generated by annotation systems on the fly and would allow for the retroactive correction of incorrect annotations. Another possible improvement involves augmenting the VGIST query language by adding a term to the grammar of the execution statement which operates as a disjunctive conditional clause (in addition to the conjunctive where_clause term). This addition would diversify the types of queries the system can handle and make it more versatile.