DESIGNING MPEG-4 FACIAL ANIMATION TABLES FOR WEB APPLICATIONS

STEPHANE GARCHERY, NADIA MAGNENAT-THALMANN
MIRALab - University of Geneva
22 Rue Général Dufour, CH-1211 GENEVA 4, SWITZERLAND
Web: http://www.miralab.unige.ch
{garchery,thalmann}@miralab.unige.ch

Abstract: The Internet today already uses a lot of text, pictures, videos, animations, etc. to attract the user's attention. Considerable advancement in this direction can be achieved by improving interactivity with the user. In this article, we present different methods for designing facial models and for making them interactive based on MPEG-4 facial animation parameters. We explain the integration of the web version of our MPEG-4 facial animation system in an applet, towards making web pages not only attractive but also interactive.

Keywords: facial animation, MPEG-4, face cloning, Internet application

1 Introduction

Facial modeling and animation is an important research topic in computer graphics. During the last twenty years, a lot of research work has been done on this topic, yet it remains a challenging task to create and animate very realistic faces. The impact of previous and ongoing research has been felt in many applications, such as games, web-based 3D animations, 3D animation movies, etc. Different approaches have been taken, such as pre-calculated animation with very realistic results used for animated film productions, and real-time animation for interactive applications. Correspondingly, the animation techniques vary from keyframe animation, where an experienced animator sets each frame with artistic precision, to algorithmic parameterized mesh deformation. Current developments are towards having real-time facial animation on the web while maintaining the quality and beauty of the animation. With the recent development of Internet services, new applications for virtual human technologies can be identified. Talking virtual characters can be useful when integrated into web sites to provide new services such as navigation aids, salespersons, hosts, presenters, etc. In this article, we explain a technique for constructing an interactive talking head based on the MPEG-4 facial animation standard for a web-based application.
The first part of this paper explains some details of the MPEG-4 standard for facial animation (FAP, FAT, etc.) that are used in our web application. In section 3, we describe the cloning system based on two photographs and an automatic method for extracting Facial Animation Tables. We also explain a method for constructing a facial model and the corresponding FAT that involves the skills of an experienced animator. In section 4, we describe how the facial animation engine is adapted to a web-based application and the implementation issues related to speed, realism and interactivity. Some results of our Internet-based MPEG-4 player are also presented. We give an example of the current implementation and different possible uses, and discuss future work in the last section.

2 Face Animation Using MPEG-4 Facial Animation Tables

2.1 MPEG-4 Facial Animation Standard

The ISO/IEC JTC1/SC29/WG11 (Moving Pictures Expert Group, MPEG) has formulated the MPEG-4 standard [1]. SNHC (Synthetic Natural Hybrid Coding), a subgroup of MPEG-4, has devised an efficient coding method for graphics models and the compressed transmission of their animation parameters specific to the model type. For synthetic faces, the Facial Animation Parameters (FAP) are designed to encode the animation of faces, reproducing expressions, emotions and speech pronunciation. The 68 parameters are categorized into 10 different groups related to parts of the face. FAPs represent a complete set of basic facial actions, and therefore allow the representation of most natural facial expressions. The parameter set contains two high-level parameters, the viseme and the expression. The viseme parameter allows the rendering of visemes on the face without the need to express them in terms of other parameters, or to enhance the result of other parameters, ensuring the correct rendering of visemes. Only 14 static visemes that are clearly distinguishable are included in the standard set. In order to allow for coarticulation of speech and mouth movement, transitions from one viseme to the next are defined by blending the two visemes with a weighting factor. Similarly, the expression parameter allows the definition of high-level facial expressions. The 6 high-level expressions are joy, sadness, anger, fear, disgust and surprise. In contrast to visemes, facial expressions are animated with a value defining the excitation of the expression. Two high-level facial expressions can be blended with a weighting factor. Since the FAPs are required to animate faces of different sizes and proportions, the FAP values are defined in Face Animation Parameter Units (FAPU). The FAPU are computed from spatial distances between major facial features on the model in its neutral state. It must be noted that the standard does not specify any particular
way of achieving facial mesh deformation given a FAP. Implementation details such as the resolution of the mesh, the deformation algorithm, rendering, etc. are left to the developer of the MPEG-4 facial animation system. Since the standard does not put emphasis on the details of the animation procedure, there should be a way of specifying precisely how a FAP is interpreted, as the need may be. Hence, the standard also specifies the use of a Facial Animation Table (FAT) to determine which vertices are affected by a particular FAP and how. Experience shows that MPEG-4 feature points by themselves do not provide sufficient quality of face reproduction for all applications. Therefore the Facial Animation Table model is very useful, guaranteeing not only the precise shape of the face, but also the exact reproduction of animation. A classic MPEG-4 deformation engine is based on complex algorithms; on the contrary, an MPEG-4 FAT engine is based on simple interpolation functions. See the MPEG-4 specification for details on this approach [1]. For this reason, we use this method for a web-based application with high artistic precision.

2.2 Facial Animation Tables

The Facial Animation Tables (FATs) define how a model is spatially deformed as a function of the amplitude of the FAPs. Each facial model is represented by a set of vertices and associated information about the triangles joining them. This information is called an IndexedFaceSet in a geometric representation. A model can be composed of more than one IndexedFaceSet depending upon the textures and topology. For each IndexedFaceSet and for each FAP, the FAT defines which vertices are deformed and how. We can consider two different cases of FAP:
- If a FAP causes a transformation like rotation, translation or scale, a Transform node can describe this animation.
- If a FAP, like open jaw, causes flexible deformation of the facial mesh, the FAT defines the displacements of the IndexedFaceSet vertices. These displacements are based on piece-wise linear motion trajectories (Figure 1).
Syntactically, the FAT contains different fields. The intervalborder field specifies the interval borders for the piece-wise linear approximation in increasing order. The coordindex field contains a list of vertices that are affected by the current deformation. The Coordinate field defines the intensity and the direction of the displacement for each vertex mentioned in the coordindex field. Thus, there must be exactly (num(intervalborder) - 1) * num(coordindex) values in the Coordinate field.

Vertex    | 1st interval [I1, I2] | 2nd interval [I2, I3]
Index 1   | displacement D11      | displacement D12
Index 2   | displacement D21      | displacement D22

Table 1: exactly one interval border Ik must have the value 0.
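To make the use of such a table concrete, the following minimal Java sketch evaluates the piece-wise linear displacement of one affected vertex for a given FAP amplitude. The class and method names are hypothetical, and the displacement values are read in a simplified way as per-FAPU motion within each interval; the normative formula is given in the MPEG-4 specification [1].

    // Sketch of a FAT entry for one FAP acting on one IndexedFaceSet (illustrative names).
    class FapDeformationTable {
        float[] intervalBorder;   // interval borders, increasing order, one of them is 0
        int[]   coordIndex;       // indices of the affected vertices
        // displacement[v][k] = motion of vertex coordIndex[v] inside the k-th interval,
        // so there are (intervalBorder.length - 1) * coordIndex.length vectors in total
        float[][][] displacement; // [vertex][interval][x, y, z]

        // Piece-wise linear displacement of the v-th listed vertex for a FAP amplitude
        // (assumed to be already expressed in FAPU).
        float[] displacementOf(int v, float fap) {
            float[] d = new float[3];
            float lo0 = Math.min(0f, fap);            // trajectory goes from 0 to fap
            float hi0 = Math.max(0f, fap);
            float sign = (fap >= 0f) ? 1f : -1f;
            for (int k = 0; k + 1 < intervalBorder.length; k++) {
                float lo = Math.max(lo0, intervalBorder[k]);
                float hi = Math.min(hi0, intervalBorder[k + 1]);
                if (hi <= lo) continue;               // interval not crossed by [0, fap]
                float span = hi - lo;
                for (int c = 0; c < 3; c++) {
                    d[c] += sign * span * displacement[v][k][c];
                }
            }
            return d;
        }
    }

For the table above, intervalBorder would hold {I1, I2, I3}, and displacement would hold D11, D12 for the first listed vertex and D21, D22 for the second.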
During animation, when the animation engine interprets a set of FAPs for the current frame, it affects one or more IndexedFaceSets of the face model. The animation engine piece-wise linearly approximates the motion trajectory of each vertex of the affected IndexedFaceSet by using the appropriate table, such as the one shown in Table 1.

Figure 1: An arbitrary motion trajectory is approximated as a piece-wise linear one.

3 Face Modeling and FAT construction

The construction of the 3D models representing the face, as well as the construction of the corresponding Facial Animation Tables, can be approached in various ways. First, we explain how we can build a 3D face model in a quasi-automatic way from two simple photographs. Then, we explain how we can build the FAT for the face thus constructed, in an automatic way, by using the MPEG-4 facial animation engine of MIRALab [2]. In a later sub-section, we explain another method for constructing the 3D face model and the corresponding FAT by using an experienced animator's skills. The last part explains that it is possible to use a combination of both methods for constructing the FAT, thus combining the advantages of each method.

3.1 Face Modeling

3.1.1 Face Cloning

Face cloning means making a virtual 3D face which resembles the shape of a given person [3]. In this section, we present a way to reconstruct a photo-realistic head for animation from orthogonal pictures. First, we prepare a generic head with an animation structure and two orthogonal pictures of the front and side views (Figure 2). The generic head has an efficient triangulation, with finer triangles over the highly curved and/or highly articulated regions of the face and larger triangles elsewhere. It also includes eyeballs and teeth.
Figure 2: normalization and detected features; modification of the generic head with feature points.

The main idea to get an individualized head is to detect feature points (eyes, nose, lips, and so on) on the two images and then obtain the 3D positions of the feature points to modify a generic head using a geometrical deformation. The feature detection is processed in a semi-automatic way. The user sets a few feature points (key points) and the other feature points are fitted using first a piecewise affine transformation and then snake methods. A more precise description of this method, with some anchor functionality, is given in other papers [3][4]. Texture mapping is useful not only to cover the roughly matched shape, as here the shape is obtained only by feature point matching, but also to get a more realistic, colorful face. The main idea of texture mapping is to get an image by combining the two orthogonal pictures in a proper way to get the highest resolution for the most detailed parts. The detected feature point data is used for automatic texture generation by combining the two views. Figure 3 shows several views of the final head reconstructed from the two pictures in Figure 2. Face animation is immediately possible, as it is inherited from the generic head; the last face in the figure shows an expression on the face.

Figure 3: snapshots of a reconstructed head in several views and animation on the face.
The next figure summarizes the complete process of reconstructing a face with the cloning system: feature detection on the orthogonal photographs (key features first, then the remaining features), modification of a generic model with an animation structure through DFFD coordinate calculation, texture generation and fitting, and finally facial animation driven by an expression database and extracted visemes (FAP).

Figure 5: overview of the face cloning process.

3.1.2 Face design

A face model can be developed by an animator with the help of 3D graphics modeling tools like 3D Studio Max [5], Maya [6], etc. This method does not impose any constraints on the creativity of the animator. Complete freedom to choose any form, any number of meshes, and any structure (triangular or otherwise) enables much more flexibility and richness in the content. Figure 6 shows a medium-complexity face model developed by MIRALab for web applications.
Figure 6: a sample of a medium-complexity model developed by an animator.

3.2 Automatic FAT construction

We have developed two different methods for FAT construction. The first method is based on the MIRALab MPEG-4 facial animation engine. We have developed tools based on this engine in order to be able to automatically build and export FATs. The MPEG-4 facial animation engine works with MPEG-4 compliant faces. If the facial model is not MPEG-4 compliant, we need to construct the data concerning the MPEG-4 facial feature points. This work consists of defining the correspondence between the face mesh vertices and the MPEG-4 feature points. Since the cloning method mentioned before is based on a generic mesh whose topology does not change during cloning, this data is already defined for cloned faces. Once this data is constructed, it is easy to generate the FAT using a program developed at MIRALab (Figure 7) that uses the MPEG-4 facial animation engine.

Figure 7: snapshot of the GUI for exporting a FAT using the MPEG-4 facial engine.
The process consists of initializing the facial animation engine with a face model designed by the cloning method and with the FDP data, as explained in the previous paragraph. Then, for each low-level FAP, the procedure consists of applying various FAP intensities to the facial mesh, and then comparing the position of each vertex with its neutral position (Figure 8). Thus the deformation tables are built for each FAP and for each vertex. If one needs more precision for a certain FAP, it is possible to use different intensities and different intervals (using a true piece-wise linear function) to generate the FAT. However, with more interval borders and intensities, the size of the animation data increases, and so does the computation overhead.

Figure 8: automatic process for designing a FAT with an MPEG-4 face model.

This method is able to construct a FAT very quickly (a few seconds are enough for a complete FAT design). At the same time, the animation results are very close to those obtained with the MIRALab MPEG-4 facial animation engine.

3.3 Artistic FAT construction

In order to have the freedom of using any model (not necessarily a human face), as well as to have variety in the deformations used for animation, we have developed some tools. These tools are provided to the animator to generate the artistic FAT. When an animator develops a model that will be animated by a set of FAPs, the definition of the neutral position is important, as it is used later for animation. The neutral face is defined as follows:
- The coordinate system is right-handed; head axes are parallel to the world axes
- Gaze is in the direction of the Z-axis
- All face muscles are relaxed
- Eyelids are tangent to the iris
- The pupil is one third of IRISD (cf. FAPUs)
- Lips are in contact; the line of the lips is horizontal and at the same height as the lip corners
- The mouth is closed and the upper teeth touch the lower ones
- The tongue is flat and horizontal, with the tip of the tongue touching the boundary between the upper and lower teeth

The animator needs to design only the FAPs without rotation (no head rotation, for example). The animation engine directly manages the deformations for the FAPs related to rotation. Generally, the construction of only one interval per FAP is enough for animation, but if necessary the animator can use multiple intervals. This results in a more precise and realistic animation. Once the whole set of FATs is designed, we compile the set of deformations into the FAT information. The program developed for this parses all the files and compiles the information to be used by the FAT-based engine (Figure 9). We currently use the VRML file format, but it is very easy to integrate another file format supporting IndexedFaceSet or a similar structure.

Figure 9: process for designing a FAT from an experienced animator's work.

The most important advantage of this method is that it enables controlling the facial deformation exactly in terms of the FAP intensity. We have extended this advantage to the high-level FAPs. FAPs 1 and 2 have been defined to reduce the bit rate during streaming animation by defining 14 visemes and 6 expressions (Figure 10). Indeed, a majority of facial engines use a table to convert the high-level FAPs to
a set of low-level FAPs. In the case of a FAT-based engine, the goal is to be able to design a precise deformation. Hence, we have developed FAT information especially for high-level expressions. If the model does not have these high-level FAP deformations defined in the FAT, we use the pre-defined tables to convert the high-level FAPs into a set of low-level FAPs.

Figure 10: artistic MPEG-4 FAP high-level expressions.

This method of artistic FAT construction takes longer than the previous one described in subsection 3.2, but the deformations can be managed exactly, especially for the high-level FAPs. In the next section, we describe how we can combine the advantages of both methods: artistic precision and speed of design.

3.4 FAT construction by a combination of both methods

The first method, based on the MIRALab MPEG-4 facial animation engine, can be used for any MPEG-4 compatible face model. The big advantage of this method is the speed of FAT construction. For a new model with a new geometry, we just need to construct the FDP information and then we are able to compute the FATs quickly. But it is not possible to design deformations by hand like the ones designed by an animator.
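As a minimal sketch of this automatic construction (section 3.2), the builder below applies one FAP at a chosen amplitude, compares every vertex with its neutral position, and keeps the resulting displacements. The interface and all names are purely illustrative assumptions, not the actual API of the MIRALab engine.

    // Hypothetical interface of an MPEG-4 facial animation engine used as an
    // "oracle" for automatic FAT construction (illustrative only).
    interface FaceEngine {
        float[][] neutralVertices();                    // [vertex][x, y, z]
        float[][] applyFap(int fap, float amplitude);   // deformed positions for one FAP
    }

    class AutomaticFatBuilder {
        // For one low-level FAP and one sampled amplitude, return the displacement of
        // every vertex from its neutral position (one interval of the exported FAT).
        static float[][] displacementsFor(FaceEngine engine, int fap, float amplitude) {
            float[][] neutral  = engine.neutralVertices();
            float[][] deformed = engine.applyFap(fap, amplitude);
            float[][] d = new float[neutral.length][3];
            for (int v = 0; v < neutral.length; v++) {
                for (int c = 0; c < 3; c++) {
                    d[v][c] = deformed[v][c] - neutral[v][c];
                }
            }
            return d;   // vertices with a non-zero row go into coordindex / Coordinate
        }
    }

Vertices whose displacement stays (close to) zero are left out of the coordindex field, which keeps the exported FAT small; sampling several amplitudes per FAP yields the multi-interval, truly piece-wise linear tables mentioned in section 3.2.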
On the other hand, the second method allows designing almost exact deformations, but a complete FAT design takes time.

Automatic FATs: + speed / - impossible to manage the deformation precisely (defined by the algorithmic engine)
Artistic FATs:  + realistic deformations / - slow design (66 low-level FAPs + 20 high-level FAPs need to be designed)

In order to extract the benefits of both methods, we can use the MIRALab MPEG-4 engine to construct the FAT for the low-level FAPs, and an animator to construct the visemes and expressions.

4 MPEG-4 FAT Animation for the Internet

The previous sections explained the MPEG-4 FAT definition, the construction of face models, and two different methods for designing FATs. This section explains the purpose of using the FAT: web-based MPEG-4 facial animation.

4.1 Why use FAT for web applications?

For real-time face animation on the web, as explained in [7], the following are the important requirements:
- Easy installation: as in most applications, the virtual presenter is not the most important attraction of the proposed services. We try to simplify the installation procedure. For this, we do not use any plug-in, and the complete applet development (Shout3D [8] rendering + deformation engine) is pure Java.
- Visual quality: in the case of virtual talking heads on the web, looks and realism are of utmost importance. In the case of a virtual cartoon-like character, we are freer to use artistic and exaggerated deformations. In either case the visual quality must be good.
- Fast download: to reduce the delay of downloading the model, we need to develop low-resolution, less complex models (e.g. use of compressed models from Shout3D) and compress the size of the data needed for animation (FAT). Some file download sizes are described later.
- Real-time interactivity: our virtual presenter must be able to interact in different ways depending upon the user. In this case, video streaming or pre-processing is not possible. Thus, parameterized animation of the virtual presenter is important, and we choose MPEG-4 FAPs as these parameters. One example of this interactivity is under development for the IST InterFace project [9], where a Dialogue Manager, a TTS and a Phoneme/Bookmark converter provide real-time dialogue and animation.
- Easy web integration: integration of this applet in web pages should be easy and must allow communication between the applet and the HTML pages. This communication, used to select a model and some visual parameters, uses JavaScript.

4.2 Description of the Java FAT engine characteristics

In this section we explain the characteristics of, and some choices made concerning, libfat: the MPEG-4 facial animation engine based on FATs.

4.2.1 libfat

The current implementation of the MPEG-4 facial animation player through FATs is written completely in Java and uses the Shout3D rendering engine. The player was developed in Java in order to simplify web integration (no plug-in is necessary). Shout3D provides two types of rendering engines based on the same APIs. The first is pure Java and can be used without installing a plug-in, but its performance is linked to the complexity of the models. To be able to use and animate more complex models, Shout3D also provides a plug-in rendering engine based on OpenGL. It is important to separate the implementation of libfat from the rendering engine (Figure 11). This enables us to choose any available rendering engine. Thus, the current implementation of libfat is completely independent of Shout3D. Though this meant that some of the functionalities already offered by Shout3D had to be re-implemented, it is now easy to use other rendering engines like Java3D [10], GL4Java [11], etc.

Figure 11: scheme of the libfat implementation in the MPEG-4 FAP player based on FATs.

Because we take a modular approach, the use of libfat is very simple. The first step is an initialization procedure. This process will be able to load and compile
FAT data into the engine, and provides some other information about the model, like the neutral position. During animation, for each frame, we load the current set of FAPs into libfat. Then, for each constituent of the face, the deformation of the mesh is computed by libfat using an Update function. For animation synchronization, we have two different cases:
- If the rendering is very slow, it is necessary to skip frames. libfat is able to compute the average of a set of high-level and low-level FAPs in order to avoid jerky animation.
- If the rendering is fast enough, we add a delay function to catch up with the correct time.

4.2.2 Computation time

The most important problem to be solved is to guarantee a minimal computation time for the deformation in order to leave more computation power for rendering. To ensure this, we maintain an intermediate index table in addition to the IndexedFaceSets of the facial mesh. We use a facial model with topologically separate meshes (skin, hair, eyes, etc.). It is important to know in advance which of the meshes are involved in a particular deformation. This index table, constructed during initialization, contains the information about which IndexedFaceSets are deformed by each FAP. Thus, it is possible to access any vertex rapidly and directly to compute and update its position. With this optimization, we can manage each mesh in two ways: directly from the individual vertices or from a group of vertices (sub-mesh). With this implementation of libfat, we obtain a very low computation time for the deformation part. Table 2 shows the computation times for different models. The first pair of columns (libfat only) shows the frame rates obtained for the deformation computation alone, without any rendering. The following pairs show the frame rates obtained with the Shout3D Java renderer (without plug-in) and the Shout3D OpenGL accelerated renderer.

Frame rates (frames/second):

nb polygons  nb meshes  texture | libfat only: PC1 / PC2 | soft render: PC1 / PC2 | OpenGL render: PC1 / PC2
1094         1          no      | 950.9 / 2093.3         | 20.9 / 35.2            | 31.4 / 55.2
2313         1          yes     | 590.8 / 1046.2         | 10.5 / 18.9            | 20.3 / 37.2
4076         1          yes     | 246.8 / 377.6          |  6.9 / 12.9            | 12.6 / 25.6
2409         5          yes     |  77.1 / 174.9          |  7.1 / 13.9            | 14.9 / 29.1
7001         6          yes     |  45.6 / 67.6           |  3.4 / 6.5             |  6.2 / 12.1

Table 2: performance measurements for different face models on different computer configurations. PC1 = P3 / 2x500 MHz / ELSA Gloria III, PC2 = P3 / 2x1 GHz / ELSA Gloria III.
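A simplified sketch of the per-frame use of libfat described in section 4.2.1, including the two synchronization cases, is given below. The interfaces and method names are illustrative assumptions, not the actual libfat API.

    // Illustrative per-frame loop around a FAT-based deformation engine.
    // All interfaces and method names are hypothetical; they only mirror the
    // behaviour described in section 4.2.1.
    interface FapStream { boolean hasMoreFrames(); float[] readFrame(); float[][] readFrames(int n); }
    interface FatEngine { void loadFaps(float[] faps); void loadAveragedFaps(float[][] faps); void update(); }
    interface Renderer  { void render(); }

    class AnimationLoop {
        static void run(FatEngine libfat, Renderer renderer, FapStream stream, float fps)
                throws InterruptedException {
            long framePeriod = (long) (1000f / fps);
            long start = System.currentTimeMillis();
            int frame = 0;
            while (stream.hasMoreFrames()) {
                long now = System.currentTimeMillis();
                int expected = (int) ((now - start) / framePeriod);
                if (expected > frame) {
                    // rendering is too slow: skip frames, averaging the skipped
                    // FAP sets to avoid jerky animation
                    libfat.loadAveragedFaps(stream.readFrames(expected - frame + 1));
                    frame = expected;
                } else {
                    libfat.loadFaps(stream.readFrame());
                    // rendering is fast enough: wait for the scheduled frame time
                    Thread.sleep(start + (frame + 1) * framePeriod - now);
                }
                libfat.update();     // deform every IndexedFaceSet affected by the loaded FAPs
                renderer.render();   // Shout3D, Java3D, GL4Java, ...
                frame++;
            }
        }
    }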
4.2.3 Special process for viseme/expression deformations

In most cases, an MPEG-4 animation engine uses a table for mapping high-level FAPs into a set of low-level FAPs. In order to extract the maximum advantage of the FAT and to manage precise mesh deformations (not necessarily based only on a set of low-level FAPs), we have added a special process concerning the high-level FAPs (visemes and expressions). The high-level deformations designed by the animator are performed first. Then, if necessary, low-level deformations are applied to the model as per the MPEG-4 specification.

Figure 12: example of the Joy expression defined by a set of low-level FAPs, or directly by a high-level FAP constructed by an animator.

Figure 12 shows a sample of a high-level expression designed using a set of low-level FAPs or directly by an animator.

4.2.4 Extended head rotation for models with a bust

The MPEG-4 animation stream provides 3 FAPs for head rotation. In most cases, a virtual speaker is provided with a bust. The easiest way to perform the head rotation is to define a separate mesh, topologically detached from the facial mesh. However, to improve the visual quality, the face and the neck should be constructed as a continuous mesh. In this case, we have to add some information about the neck and process the complete model for head rotations. To obtain a pleasant result, we define a set of vertices, independent of the mesh, which are influenced by the FAP head rotation parameters (Figure 13).
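As a minimal sketch of this extended head rotation, the three rotation FAPs can be applied only to the designated set of head and neck vertices. The pivot point, the composition order of the rotations and all names below are assumptions, not the exact behaviour of libfat.

    // Illustrative handling of the head rotation FAPs (48: pitch, 49: yaw, 50: roll)
    // on a continuous face/neck mesh: only a pre-selected set of vertices is rotated
    // around an assumed pivot near the base of the neck. Angles are assumed to be
    // already converted from the MPEG-4 angle units to radians.
    class HeadRotation {
        static void apply(float pitch, float yaw, float roll,
                          int[] headVertices, float[][] vertices, float[] pivot) {
            double cx = Math.cos(pitch), sx = Math.sin(pitch);  // rotation about X (pitch)
            double cy = Math.cos(yaw),   sy = Math.sin(yaw);    // rotation about Y (yaw)
            double cz = Math.cos(roll),  sz = Math.sin(roll);   // rotation about Z (roll)
            for (int i : headVertices) {
                double x = vertices[i][0] - pivot[0];
                double y = vertices[i][1] - pivot[1];
                double z = vertices[i][2] - pivot[2];
                // apply roll, then pitch, then yaw (the composition order is an assumption)
                double x1 = cz * x - sz * y,   y1 = sz * x + cz * y,   z1 = z;
                double x2 = x1,                y2 = cx * y1 - sx * z1, z2 = sx * y1 + cx * z1;
                double x3 = cy * x2 + sy * z2, y3 = y2,                z3 = -sy * x2 + cy * z2;
                vertices[i][0] = (float) (x3 + pivot[0]);
                vertices[i][1] = (float) (y3 + pivot[1]);
                vertices[i][2] = (float) (z3 + pivot[2]);
            }
        }
    }

In practice, a smooth falloff of the rotation towards the shoulders (not shown) helps to avoid a visible crease where rotated and fixed vertices meet.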
Figure 13: neutral position, FAP 48 (head pitch), FAP 49 (head yaw) and FAP 50 (head roll); results of the neck deformation.

4.2.5 Applet and file sizes

In order to show the virtual presenter, one needs to download a set of different files. For the applet, we need to download the files including the 3D rendering, and libfat for computing the mesh deformations and managing the FAP stream.
- The size of the Shout3D applet can vary between 115 kb minimum and 200 kb maximum, depending on the rendering functions used.
- The size of the class that computes the deformation (libfat) is 18 kb.
- The other class, which manages the MPEG-4 FAP stream, is 10 kb.
On average, the applet size (rendering and libfat) is around 150 kb. We also need to download the facial model and the corresponding FAT information. Table 3 describes the sizes for different face models. The size of the FAT depends on the number of vertices included in each FAP deformation. For the same model, the size of the FAT file can differ depending on the level of precision desired. This pair of files can be compressed to decrease the size. The total size for a medium-complexity model is around 200 kb.

Face complexity   FAP 1&2 designed | face model size (kb): VRML / compressed | FAT size (kb): uncompressed / compressed
1 094 polygons    no               | 58 / 12                                 | 72 / 21
2 313 polygons    no               | 118 / 30                                | 140 / 42
4 076 polygons    no               | 195 / 54                                | 312 / 94
2 409 polygons    yes              | 104 / 26                                | 137 / 35

Table 3: different file sizes for different face models.

The next figure shows a set of facial models of different complexity, developed for or adapted to libfat and the web application.
Figure 14: some snapshots of facial models able to use FAT-based deformation, built by face cloning or face design.

5 Different web applications

In this section we outline different possible integrations of facial animation based on FATs into applications. The goal of the first example application is to provide a dialogue with dialogue software, enhanced with a synthetic face and audio output. In this example, the user can converse with the dialogue software by typing text. The server includes a Dialogue Manager to generate an appropriate answer with emotions, a Text-To-Speech process to construct the audio output and the phonemes, and a Phoneme/Bookmark to FAP converter to construct the corresponding FAP stream. Our web-based facial animation module is integrated in the HTML page on the client side (Figure 14) and interprets the facial animation synchronized with the audio output.

Figure 14: overview of the complete guide system using TTS, Dialogue Manager and virtual presenter.
This example application can be used, with a corresponding knowledge database for the Dialogue Manager, as a navigation aid to help the user during website navigation, to inform the user about different subjects, to answer frequently asked questions, etc. Other applications of synthetic facial animation on the web can be a virtual presenter using a pre-processed FAP stream only, a chat application with the photo-cloned heads of the participants, etc.

6 Future work and conclusion

The next step in this work is to improve the speed of FAT construction by an animator and the speed of rendering. Another ongoing work is to integrate MPEG-4 based body animation with the facial animation. We can note that facial animation based on FATs can also easily be integrated in a stand-alone application in order to take advantage of artistic deformation and speed. We have analyzed different methods to design a facial model and the corresponding FAT. We have explained our own approach and some modifications made to improve visual quality. We conclude that a real-time facial animation engine based on FATs can be used in many applications (education, commerce, navigation aids, entertainment, etc.), and this is just the beginning of real-time interactive virtual humans in web applications.

7 Acknowledgments

This research on MPEG-4 Facial Animation Tables is partly supported by the EU IST project InterFace (IST-10036). We would also like to thank Sumedha Kshirsagar for her comments and Nicolas Erdos, who designed the face models and the corresponding FATs.

References:
1. Specification of the MPEG-4 standard, Moving Picture Experts Group, http://www.cselt.it/mpeg/
2. Sumedha Kshirsagar, Stephane Garchery, Nadia Magnenat-Thalmann, "Feature Point Based Mesh Deformation Applied to MPEG-4 Facial Animation", Deformable Avatars, Kluwer Academic Press, 2001, pp. 24-34.
3. WonSook Lee, Jin Gu and Nadia Magnenat-Thalmann, "Generating Animatable 3D Virtual Humans from Photographs", Proc. EUROGRAPHICS 2000, Computer Graphics Forum, 19(3), August 2000, Blackwell Publishers. ISSN 1067-7055.
4. Lee W. S., Kalra P., Magnenat Thalmann N., "Model Based Face Reconstruction for Animation", Proc. Multimedia Modeling (MMM) '97, Singapore, pp. 323-338, 1997.
5. Kinetix, Autodesk Inc., 3D Studio Max, http://www.discreet.com
6. Alias|Wavefront, Maya, http://www.aliaswavefront.com
7. Igor S. Pandzic, "Talking Virtual Characters for the Internet", Proc. ConTEL 2001, Zagreb, Croatia.
8. Shout3D, Eyematic Interfaces Incorporated, http://www.shout3d.com
9. The InterFace project, IST-1999-10036, http://www.ist-interface.org
10. Java 3D API, Sun, http://java.sun.com/products/java-media/3d/
11. OpenGL for Java (GL4Java), http://www.jausoft.com/products/gl4java/gl4java_main.html