Voice Driven Animation System

Zhijin Wang
Department of Computer Science
University of British Columbia

Abstract

The goal of this term project is to develop a voice driven animation system that takes human voice commands and generates the desired character animation based on motion capture data. In this report, the idea behind our system is first introduced, followed by a review of related background. We then describe the Microsoft Speech API, which serves as the voice recognition engine in our system. Finally, some details of our implementation are explained and results are given at the end.

1. Introduction

In a traditional animation system, the animator must use the mouse and keyboard to specify the path along which the character will move and the action that the character will perform during the movement. This kind of interaction is not very effective, because clicking on buttons or typing on the keyboard distracts the animator, who is trying to focus on creating the animation. To improve the interaction, we can borrow an idea from filmmaking, where the director uses his voice to tell the actor what to do before a scene is shot, and the actor then performs the action exactly as he was told. This is how we came up with the idea of using the human voice as a medium for a better animation system interface.

2. Background

In 1986, Dr. Jakob Nielsen asked a group of 57 IT professionals to predict the greatest changes in user interfaces by the year 2000. The top five answers are listed in Table 1.

Table 1: User Interface Predictions [1]
While Graphical User Interfaces (GUIs) have clearly been the winner since that time, Voice User Interfaces (VUIs) certainly failed to reach the level of adoption that IT professionals expected. The key issue in interaction design, and the main determinant of usability, is what the user says to the interface. Whether it is said by speaking or by typing at the keyboard matters less to most users. Thus, having voice interfaces will not necessarily free us from the most substantial part of user interface design: determining the structure of the dialogue, what commands or features are available, how users are to specify what they want, and how the computer is to communicate feedback. All that voice does is allow the commands and feedback to be spoken rather than written. [2]

Voice interfaces have their greatest potential in cases where it is problematic to rely on the traditional keyboard-mouse-monitor combination: [1]

- Users with disabilities that prevent them from using a mouse and/or keyboard, or that prevent them from seeing the pictures on the screen.
- Users, with or without disabilities, whose hands and eyes are occupied with other tasks, for example while driving a car or repairing a complex piece of equipment.
- Users who do not have access to a keyboard and/or a monitor, for instance users accessing a system through a payphone.

So it is not that voice is useless; rather, it is often a secondary interaction mode when additional media are available. In our system, in addition to using voice commands to select different actions of the character, the user still uses mouse clicks to specify the location at which the action takes place. The combination of multiple input media provides better interaction for most users.

As for the voice recognition engine, there are two mainstream products on the current market. One is the IBM ViaVoice Dictation SDK, which is based on many years of development by IBM. The other is the Microsoft Speech API, also known as SAPI. Each SDK has its own unique features, and it is hard to say which one is better. However, considering the functionality we are going to use and their relative costs, we decided to use the Microsoft Speech API as the voice recognition engine in this project.

3. Microsoft Speech API

3.1. API Overview

The Microsoft Speech API (SAPI) provides a high-level interface between an application and speech engines. SAPI implements all the low-level details needed to control and manage the real-time operations of various speech engines, thereby dramatically reducing the code overhead required for an application to use speech recognition and text-to-speech, and making speech technology more accessible and robust for a wide range of applications.

The two basic types of SAPI engines are text-to-speech (TTS) systems and speech recognizers. TTS systems synthesize text strings and files into spoken audio using synthetic voices. Speech recognizers convert human speech audio into readable text strings and files.
Figure 1: SAPI Engines Layout [3]

For this project, we only use the speech recognition engine to retrieve voice commands from the user. There are two types of utterances that the speech engine can recognize. The first is dictation, in which the speech engine tries to recognize whatever the user says into the microphone. The recognition rate of dictation is usually very low, because without any given context the computer does not know what to expect from the user's speech. The other type is command and control grammar, in which the user tells the engine in advance what voice commands he will probably say, and the speech engine then tries to match his speech against one of those commands at run time. This is the approach we take, because it has a much higher recognition rate than dictation.

3.2. Context-Free Grammar

The command and control features of SAPI 5 are implemented as context-free grammars (CFGs). A CFG is a structure that defines a specific set of words and the combinations of these words that can be used. In basic terms, a CFG defines the sentences that are valid, and in SAPI 5, the sentences that are valid for recognition by a speech recognition (SR) engine. The CFG format in SAPI 5 defines the structure of grammars and grammar rules using Extensible Markup Language (XML). The XML format is an "expert only readable" declaration of a grammar that a speech application uses to accomplish the following:

- Improve recognition accuracy by restricting and indicating to an engine which words it should expect.
- Improve maintainability of textual grammars, by providing constructs for reusable text components (internal and external rule references), phrase lists, and string and numeric identifiers.
- Improve translation of recognized speech into application actions. This is made easier by providing "semantic tags" (property name and value associations) for words and phrases declared inside the grammar.

The CFG/Grammar compiler transforms the XML tags defining the grammar elements into a binary format used by SAPI 5-compliant SR engines. This compilation can be performed either before or during application run time. Since our system does not need to modify the grammar at run time, the compiled binary grammar is loaded statically before the application runs, as sketched below.
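The following is a minimal sketch of how such a static load might look using the SAPI 5 C++ interfaces together with ATL's CComPtr. The function name LoadCommandGrammar, the file name "commands.cfg" and the grammar ID are our own placeholders, and error handling is reduced to bare HRESULT checks.

// Sketch: create a shared recognizer, a recognition context, and statically
// load a compiled command-and-control grammar. Names are placeholders.
#include <sapi.h>
#include <atlbase.h>

HRESULT LoadCommandGrammar(CComPtr<ISpRecoContext>& pContext,
                           CComPtr<ISpRecoGrammar>& pGrammar)
{
    CComPtr<ISpRecognizer> pRecognizer;
    HRESULT hr = pRecognizer.CoCreateInstance(CLSID_SpSharedRecognizer);
    if (FAILED(hr)) return hr;

    // One recognition context for the application.
    hr = pRecognizer->CreateRecoContext(&pContext);
    if (FAILED(hr)) return hr;

    // Load the binary grammar compiled from our XML rules and activate all
    // top-level rules (e.g. VID_TurnCommand).
    hr = pContext->CreateGrammar(1, &pGrammar);
    if (FAILED(hr)) return hr;
    hr = pGrammar->LoadCmdFromFile(L"commands.cfg", SPLO_STATIC);
    if (FAILED(hr)) return hr;
    return pGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);
}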
4. Development of the Voice Driven Animation System

4.1. Design of Grammar Rules

Phase one of the project is designing grammar rules for the voice commands. Here is an example of one of the grammar rules used in our system:

<RULE ID="VID_TurnCommand" TOPLEVEL="ACTIVE">
    <P>turn</P>
    <RULEREF REFID="VID_Direction" PROPID="VID_Direction"/>
    <O>by</O>
    <O><RULEREF REFID="VID_Degree" PROPID="VID_Degree"/></O>
    <O>degrees</O>
</RULE>

<RULE ID="VID_Direction">
    <L PROPID="VID_Direction">
        <P VAL="VID_Left">left</P>
        <P VAL="VID_Right">right</P>
        <P VAL="VID_Around">around</P>
    </L>
</RULE>

<RULE ID="VID_Degree">
    <L PROPID="VID_Degree">
        <P VAL="VID_Ten">ten</P>
        <P VAL="VID_Twenty">twenty</P>
        <P VAL="VID_Thirty">thirty</P>
        <P VAL="VID_Forty">forty</P>
        <P VAL="VID_Fifty">fifty</P>
        <P VAL="VID_Sixty">sixty</P>
        <P VAL="VID_Seventy">seventy</P>
        <P VAL="VID_Eighty">eighty</P>
        <P VAL="VID_Ninety">ninety</P>
    </L>
</RULE>

According to this grammar rule, if the user says "turn right by 70 degrees", the speech engine will indicate to the application that the rule named "VID_TurnCommand" has been recognized, with the property of the child rule VID_Direction being "right" and the property of the child rule VID_Degree being "seventy". We have also performed basic testing of the grammar rules we have written, using the grammar compiler and tester provided with the SDK toolkit. All of the grammar rules are recognized from the user's speech very well.
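As an illustration of how the application side can read this information back, here is a hedged sketch that pulls the recognized rule name and its semantic properties out of a SAPI recognition result. The function name HandleTurnCommand is ours, and the exact shape of the property tree (siblings vs. children, pszValue vs. vValue) depends on how the grammar is declared, so the mapping below is only indicative.

// Sketch: extract the matched rule and its semantic properties from a
// recognition result obtained on a SPEI_RECOGNITION event.
#include <sapi.h>
#include <string>

void HandleTurnCommand(ISpRecoResult* pResult)
{
    SPPHRASE* pPhrase = NULL;
    if (FAILED(pResult->GetPhrase(&pPhrase)) || !pPhrase)
        return;

    // Top-level rule that matched, e.g. L"VID_TurnCommand".
    std::wstring rule = pPhrase->Rule.pszName ? pPhrase->Rule.pszName : L"";

    if (rule == L"VID_TurnCommand")
    {
        std::wstring direction, degree;
        // Walk the semantic properties filled in from the PROPID/VAL tags.
        // Depending on the grammar, values may also hang off p->pFirstChild
        // or arrive in p->vValue instead of p->pszValue.
        for (const SPPHRASEPROPERTY* p = pPhrase->pProperties; p; p = p->pNextSibling)
        {
            std::wstring name  = p->pszName  ? p->pszName  : L"";
            std::wstring value = p->pszValue ? p->pszValue : L"";
            if (name == L"VID_Direction")   direction = value;  // e.g. "right"
            else if (name == L"VID_Degree") degree    = value;  // e.g. "seventy"
        }
        // direction/degree would now be forwarded to the animation layer.
    }
    ::CoTaskMemFree(pPhrase);
}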
4.2. Integration of the Speech Engine

Phase two of the project is integrating the speech engine into the system, so that we can use the recognized information to generate the desired animation. Before making the system more complicated, we first tested it with a simple object and simple movements, i.e., using voice commands to drive a ball from one place to another. Here are some snapshots of the running application:

Figure 2: A Simple Voice Driven Application

The blue ball represents the subject, and the red balls represent the destinations that the subject must pass through in the same order as they were created. The locations of the red balls are specified by the user's mouse clicks. If the user says "move to here, to here, to here" while making the mouse clicks, the application will recognize the voice command and, once the clicking has ended, start to move the blue ball towards those red balls. When the subject passes through a destination, the red ball disappears to show that it has been reached, and the subject then heads straight to the next destination, until all the red balls have been reached, as shown in the images above.

Although this application may seem simple, it demonstrates that the speech recognition engine has been successfully integrated into the Windows program and that the two work seamlessly together. This gives us confidence that we can build a more complex system on top of the speech engine.
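The per-frame logic behind this demo can be summarized by the following sketch of the destination queue. The types and names (Vec2, Ball, speed) are our own; the real demo is a Windows application, so only the update step is shown.

// Sketch: the blue ball walks through the queued red-ball destinations in
// the order they were clicked, removing each one as it is reached.
#include <cmath>
#include <deque>

struct Vec2 { float x, y; };

struct Ball
{
    Vec2 pos;                       // current position of the blue ball
    std::deque<Vec2> destinations;  // red balls, in click order

    void Update(float speed)        // called once per frame
    {
        if (destinations.empty()) return;
        Vec2 target = destinations.front();
        float dx = target.x - pos.x, dy = target.y - pos.y;
        float dist = std::sqrt(dx * dx + dy * dy);
        if (dist <= speed) {
            pos = target;               // destination reached:
            destinations.pop_front();   // the red ball disappears
        } else {
            pos.x += speed * dx / dist; // head straight toward the
            pos.y += speed * dy / dist; // current destination
        }
    }
};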
4.3. Combination with Motion Capture Data

Now that the speech engine is working properly, we can combine voice recognition with motion capture data to generate the animation of a character driven by voice commands. Here is a snapshot of the interface of our system:

Figure 3: Voice Driven Animation System

The user can speak a limited set of voice commands to make the character walk in different styles, such as "walk fast" or "walk slow backwards", and can use the "faster" or "slower" commands to control the speed of the walking motion. Besides these, the system also supports directional control: if the user says "turn left (by) sixty (degrees)", the character will make a left turn of sixty degrees. The brackets enclosing "by" and "degrees" mean that these two words are optional, i.e., the system will recognize the voice command whether or not those two words are said.

When the system starts, a single walk cycle of the character is loaded from the motion capture data, and the system replays this walk cycle at different speeds, applying different translations and orientations to the character according to the user's voice commands. As for changes of walking direction, in order to make the rotation smoother, we linearly interpolate the rotation angle, changing the orientation of the character by 10 degrees in each successive frame until the desired rotation is reached. In this way the turning action looks more natural than a straight cut to the new walking direction.
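The incremental turning can be sketched as follows; the names (StepYaw, kTurnStepDeg, currentYaw, targetYaw) are ours and angles are taken in degrees.

// Sketch: instead of snapping to the new heading, the character's yaw moves
// at most 10 degrees per frame toward the commanded target.
#include <cmath>

const float kTurnStepDeg = 10.0f;   // orientation change per frame

// Advance the character's yaw one frame toward the commanded heading and
// return the new yaw.
float StepYaw(float currentYaw, float targetYaw)
{
    // Signed shortest angular difference, wrapped into [-180, 180).
    float diff = std::fmod(targetYaw - currentYaw + 540.0f, 360.0f) - 180.0f;
    if (std::fabs(diff) <= kTurnStepDeg)
        return targetYaw;                       // desired rotation reached
    return currentYaw + (diff > 0.0f ? kTurnStepDeg : -kTurnStepDeg);
}

A command such as "turn left by sixty degrees" would then simply set targetYaw sixty degrees away from the current heading and let this step run once per frame until the turn completes.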
5. Conclusion and Future Work

In this project we have successfully developed a voice driven animation system that allows the user to control the movements of a simple object or a character using voice commands. It is a new and more efficient type of interaction, since the user can simply use natural speech as the input medium rather than typing on the keyboard or clicking buttons in the interface.

Since much of the effort in this project went into exploring the Microsoft Speech API and integrating the speech engine into the animation system, only a few features are provided and only a few motions can be generated by the system at the moment. However, the system can easily be expanded to support more kinds of operations, such as blending and transitioning between different motion clips using motion graph techniques, and obstacle avoidance using motion planning algorithms. There is still a lot of work to be done in this area.

References

[1] Nielsen, J. Will Voice Interfaces Replace Screens? IBM DeveloperWorks, 1999.
[2] Apaydin, O. Networked Humanoid Animation Driven by Human Voice Using Extensible 3D (X3D), H-Anim and JAVA Speech Open Standards. Thesis, Naval Postgraduate School, 2002.
[3] Microsoft Speech SDK 5.1 Documentation, 2004.