IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 3, JUNE 2012

Learn2Dance: Learning Statistical Music-to-Dance Mappings for Choreography Synthesis

Ferda Ofli, Member, IEEE, Engin Erzin, Senior Member, IEEE, Yücel Yemez, Member, IEEE, and A. Murat Tekalp, Fellow, IEEE

Abstract—We propose a novel framework for learning many-to-many statistical mappings from musical measures to dance figures towards generating plausible music-driven dance choreographies. We obtain music-to-dance mappings through the use of four statistical models: 1) musical measure models, representing a many-to-one relation, each of which associates different melody patterns to a given dance figure via a hidden Markov model (HMM); 2) an exchangeable figures model, which captures the diversity in a dance performance through a one-to-many relation, extracted by unsupervised clustering of musical measure segments based on melodic similarity; 3) a figure transition model, which captures the intrinsic dependencies of dance figure sequences via an n-gram model; and 4) dance figure models, capturing the variations in the way particular dance figures are performed, by modeling the motion trajectory of each dance figure via an HMM. Based on the first three of these statistical mappings, we define a discrete HMM and synthesize alternative dance figure sequences by employing a modified Viterbi algorithm. The motion parameters of the dance figures in the synthesized choreography are then computed using the dance figure models. Finally, the generated motion parameters are animated synchronously with the musical audio using a 3-D character model. Objective and subjective evaluation results demonstrate that the proposed framework is able to produce compelling music-driven choreographies.

Index Terms—Automatic dance choreography creation, multimodal dance modeling, music-driven dance performance synthesis and animation, music-to-dance mapping, musical measure clustering.

Manuscript received December 07, 2010; revised August 25, 2011 and November 17, 2011; accepted December 12, 2011; date of publication December 23, 2011. This work was supported by TUBITAK under project EEEAG-106E201 and COST2102 action. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Daniel Gatica-Perez. F. Ofli is with the Tele-Immersion Group, Electrical Engineering and Computer Sciences Department, College of Engineering, University of California at Berkeley, Berkeley, CA, USA (e-mail: fofli@eecs.berkeley.edu). E. Erzin, Y. Yemez, and A. M. Tekalp are with the Multimedia, Vision and Graphics Laboratory, College of Engineering, Koç University, Sariyer, Istanbul 34450, Turkey (e-mail: eerzin@ku.edu.tr; yyemez@ku.edu.tr; mtekalp@ku.edu.tr).

I. INTRODUCTION

CHOREOGRAPHY is the art of arranging dance movements for performance. Choreographers tailor sequences of body movements to music in order to embody or express ideas and emotions in the form of a dance performance. Therefore, dance is closely bound to music in its structural course, artistic expression, and interpretation. Specifically, the rhythm and expression of body movements in a dance performance are in synchrony with those of the music, and hence, the metric orders in the course of music and dance structure coincide, as Reynolds states in [1].
In order to successfully establish the contextual bond as well as the structural synchrony between dance motion and the accompanying music, choreographers tend to thoughtfully design dance motion sequences for a given piece of music by utilizing a repertoire of choreographies. Based on this common practice of choreographers, our goal in this study is to build a framework for the automatic creation of dance choreographies in synchrony with the accompanying music, as if they were arranged by a choreographer, through learning many-to-many statistical mappings from music to dance. We note that the term choreography generally refers to spatial formation (circle, line, square, couples, etc.), plastic aspects of movement (types of steps, gestures, posture, grasps, etc.), and progression in space (floor patterns), whereas in this study, we use the term choreography in the sense of composition, i.e., the arrangement of the dance motion sequence.

A. Related Work

Music-driven dance animation schemes require, as a first step, structural analysis of the accompanying music signal, which includes beat and tempo tracking, measure analysis, and rhythm and melody detection. There exists extensive research in the literature on structural music analysis. Gao and Lee [2], for instance, propose an adaptive learning approach to analyze music tempo and beat based on maximum a posteriori (MAP) estimation. Ellis [3] describes a dynamic programming solution for beat tracking by finding the best-scoring set of beat times that reflect the estimated global tempo of the music. An extensive evaluation of audio beat tracking and music tempo extraction algorithms, which were included in MIREX 06, can be found in [4]. There are also some recent studies on the open problem of automatic musical meter detection [5], [6]. In the last decade, chromatic scale features have become popular in musical audio analysis, especially in music information retrieval, since they were introduced by Fujishima [7]. Lee and Slaney [8] describe a method for automatic chord recognition from audio using hidden Markov models (HMMs) through supervised learning over chroma features. Ellis and Poliner [9] propose a cross-correlation based cover song identification system with chroma features and dynamic programming beat tracking. In a very recent work, Kim et al. [10] calculate second-order statistics to form dynamic chroma feature vectors in modeling harmony structures for classical music opus identification. Human body motion analysis/synthesis, as a unimodal problem, has also been extensively studied in the literature in many different contexts.

Bregler et al. [11], for example, describe a body motion recognition approach that incorporates low-level probabilistic constraints extracted from image sequences of articulated gestures into high-level manifold and HMM-based representations. In order to synthesize data-driven body motion, Arikan and Forsyth [12] and Kovar et al. [13] propose motion graphs representing allowable transitions between poses, to identify a sequence of smoothly transitioning motion segments. Li et al. [14] segment body motions into textons, each of which is modeled by a linear dynamical system, in order to synthesize human body motion in a manner statistically similar to the original motion capture data by considering the likelihood of switching from one texton to the next. Brand and Hertzmann [15] study the motion style transfer problem, which involves intensive motion feature analysis and learning motion patterns via HMMs from a highly varied set of motion capture sequences. Min et al. [16] present a generative human motion model for the synthesis of personalized human motion styles by constructing a multilinear motion model that provides an explicit parametrized representation of human motion in terms of style and identity factors. Ruiz and Vachon [17] specifically work on dance body motion, and analyze dance figures as chains of simple steps using HMMs for automatic recognition of basic movements in contemporary dance.

A parallel track of literature can be found in the domain of speech-driven gesture synthesis and animation. The earliest works in this domain study speaker lip animation [18], [19]. In [18], Bregler et al. use morphing of mouth regions to re-sync existing footage to a new soundtrack. Chen shows in [19] that lip reading a speaker yields higher speech recognition rates and provides better synchronization of speech with lip movements for more natural lip animations. Later, a large body of work extends in the direction of synthesizing facial expressions along with lip movements to create more natural face animations [20]–[22]. Most of these studies adopt variations of hidden Markov models to represent the relationship between speech and facial gestures. The most recent studies in this domain aim at synthesizing not only facial gestures but also head, hand, and other body gestures for creating more realistic speaker animations [23]–[25]. For instance, Sargin et al. develop a framework for joint analysis of prosody and head gestures using parallel branch HMM structures to synthesize prosody-driven head gesture animations [23]. Levine et al. introduce gesture controllers for animating the body language of avatars controlled online with the prosody of the input speech by training a specialized conditional random field [25].

Designing a music-driven automatic dance animation system, on the other hand, is a relatively more recent problem involving several open research challenges. There is actually little work in the literature on multimodal dance analysis and synthesis, and most of the existing studies focus solely on the aspect of synchronization between a musical piece and the corresponding dance animation. Cardle et al. [26], for instance, synchronize motion to music by locally modifying motion parameters using perceptual music cues, whereas Lee and Lee [27] employ dynamic programming to modify the timing of both music and motion via time-scaling the music and time-warping the motion.
Synchronization-based methods cannot, however (and actually do not aim to), generate new dance motion sequences. In this sense, the works in [28]–[31] present more elaborate dance analysis and synthesis schemes, all of which follow basically the same similarity-based framework: they first investigate rhythmical and/or emotional similarities between the audio segments of a given input music signal and the available dance motion segments, and then, based on these similarities, synthesize an optimal motion sequence on a motion transition graph using dynamic programming. In our earlier work [32], we addressed the statistical learning problem in a multimodal dance analysis scheme by building a correlation model between music and dance. The correlation model was based upon the confusion matrix of co-occurring motion and music patterns extracted via unsupervised temporal segmentation, and hence, was not complex enough to handle realistic scenarios. Later, in [33], we described an automatic music-driven dance animation scheme based on supervised modeling of music and dance figures. However, the considered dance scenario was very simplistic, where a dance performance was assumed to have only a single dance figure to be synchronized with the musical beat. In the current paper, we propose a complete framework, based on [34], for modeling, analysis, annotation, and synthesis of multimodal dance performances, which can handle complex and realistic scenarios. Specifically, we focus on learning statistical mappings, which are in general many-to-many, between musical measure patterns and dance figure patterns for music-driven dance choreography animation.

B. Contributions

An open challenge in music-driven dance animation is due to the fact that, for most dance categories, such as ballroom and folk dances, the relationship between music and dance primitives usually exhibits a many-to-many mapping pattern. As discussed in the related work section, the previous methods proposed in the literature for music-driven dance animation, whether similarity-based [28]–[30] or synchronization-based [26], [27], do not address this challenge. They are all deterministic methods and do not involve any true dance learning process (other than building motion transition graphs); therefore, they cannot capture the many-to-many relationship existing between music and dance, always producing a single optimal motion sequence given the same input music signal. In this paper, we address this open challenge by modeling the many-to-many characteristics via a statistical framework. In this respect, our primary contributions are 1) choreography analysis: automatic learning of many-to-many mapping patterns from a collection of dance performances in a multimodal statistical framework, and 2) choreography synthesis: automatic synthesis of alternative dance choreographies that are coherent with a given music signal, using these many-to-many mapping patterns. For choreography analysis, we introduce two statistical models: one capturing a many-to-one and the other capturing a one-to-many mapping from musical primitives to dance primitives. The former model learns different melody patterns associated with each dance primitive. The latter model learns the group of candidate dance primitives that can be replaced with one another without causing an artifact in the choreography.

To further consolidate the coherence and the quality of the synthesized dance choreographies, we introduce a third model to capture the intrinsic dependencies of dance primitives and to preserve the implicit structure existing in the continuum of dance motion. Combining the aforementioned three models, we present a modified Viterbi algorithm to generate coherent and enriched sequences of dance primitives for plausible choreography synthesis.

The organization of the paper is as follows: Section II first gives an overview of our music-driven dance animation system, and then briefly describes the feature extraction modules. We present the proposed multimodal choreography analysis and synthesis framework, hence our primary contribution, in Section III. The problem of character animation for visualization of synthesized dance choreographies is addressed in Section IV. Section V presents the experiments and results, and finally, Section VI gives concluding remarks and discusses possible applications of the proposed framework.

II. SYSTEM OVERVIEW AND FEATURE EXTRACTION

Our music-driven dance animation scheme is musical measure based; hence, we regard musical measures as the music primitives. A measure is the smallest compositional unit of music that corresponds to a time segment, which is defined by the number of beats in a given duration. We define a dance figure as the dance motion trajectory corresponding to a single measure segment. Dance figures are taken as the dance primitives. The overall system, as depicted in Fig. 1, comprises three parts: analysis, synthesis, and animation. Audiovisual data preparation and feature extraction modules are common to both the analysis and synthesis parts.

An audiovisual dance database can be pictured as a collection of measures and dance figures aligned in two parallel streams: a music stream and a dance stream. Fig. 2 illustrates a sample music-dance stream extracted from our folk dance database. In the data preparation module, the input music stream is segmented by an expert into its units, i.e., musical measures. We use m_t to denote the measure segment at frame t. Measure segment boundaries are then used by the expert to define the motion units, i.e., dance figures. We use d_t to denote the dance figure segment corresponding to measure m_t at frame t. The expert also assigns each dance figure segment a figure label f_t to indicate the type of the dance motion. The collection of distinct figure labels forms the set of candidate dance figures, F = {f^1, ..., f^K}, where K is the number of distinct dance figure labels that exist in the audiovisual dance database. The resulting sequence of dance figure labels is regarded as the original (reference) choreography, F* = (f_1, ..., f_T), where f_t ∈ F and T is the number of musical measure segments. The feature extraction modules compute the dance motion features M_t and the music chroma features C_t for each d_t and m_t, respectively.

We assume that the relation between music and dance primitives in a dance performance has a many-to-many mapping pattern. That is, a particular dance primitive (dance figure) can be accompanied by different music primitives (measures) in a dance performance. Conversely, a particular musical measure can correspond to different dance figures.

Fig. 1. Block diagram of the overall multimodal dance performance analysis-synthesis framework.

Fig. 2. The audiovisual dance database is a collection of dance figure-musical measure pairs. Recall that a measure is the smallest compositional unit of music that corresponds to a time segment, which is defined by the number of beats in a given duration. On the other hand, we define a dance figure as the dance motion trajectory corresponding to a single measure segment. Hence, by definition, the boundaries of the dance figure segments coincide with the boundaries of the musical measure segments, which is in conformity with Reynolds' work [1].
Our choreography analysis and synthesis framework respects the many-to-many nature of the relationship between music and dance primitives by learning two separate statistical models: one capturing a many-to-one and the other capturing a one-to-many mapping from musical measures to dance figures. The former model learns different melody patterns associated with each dance figure. For this purpose, music chroma features are used to train a hidden Markov model for each dance figure label to create the set of musical measure models, Λ^m. The latter model, i.e., the model capturing a one-to-many relation from musical measures to dance figures, learns the group of candidate dance figures that can be replaced with one another without causing an artifact in the dance performance (choreography). We call such a model an exchangeable figures model, E. Music chroma features are used to cluster measure segments according to the harmonic similarity between different measure segments. Based on these measure clusters, we determine the group of dance figures that are accompanied by musical measures with similar harmonic content. We then create the exchangeable figures model based on such dance figure groups. While the former model is designed to keep the underlying correlations between musical measures and dance figures as intact as possible, the latter model is useful for allowing acceptable (or desirable) variations in the dance choreography by offering various possibilities in the choice of dance figures that reflect the diversity in a dance performance (choreography). To further consolidate the coherence and quality of the synthesized dance choreography, we introduce a third model, i.e., the figure transition model Λ^f, to capture the intrinsic dependencies of the dance figures and to preserve the implicit structure existing in the continuum of dance motion.

The figure transition model basically appraises the figure-to-figure transition relations by computing n-gram probabilities from the audiovisual dance database.

The choreography synthesis makes use of these three models, namely Λ^m, E, and Λ^f, to determine the output dance figure sequence (i.e., the choreography) F̂ = (f̂_1, ..., f̂_T), taking as input the music chroma features C_t extracted from a test music signal, where f̂_t ∈ F and T is the number of musical measure segments. Specifically, the choreography synthesis module employs a modified Viterbi decoding on a discrete HMM, which is constructed from the musical measure models Λ^m and the figure transition model Λ^f, to determine the sequence of dance figures subject to the exchangeable figures model E. In the analysis part of the animation model, the dance motion features are used to train a hidden Markov model for each dance figure label to construct the set of dance figure models Λ^d. Eventually, the body posture parameters corresponding to each dance figure in the synthesized choreography are generated using the dance figure models to animate a 3-D character.

A. Music Feature Extraction

Unlike speech, music consists of a sequence of tones whose frequencies are well defined. Moreover, musical melody is a rhythmical succession of single tones in different patterns. In this study, we model the melodic pattern in each measure segment with tone-related features using temporal statistical models, i.e., HMMs. We extract tone-related chroma features to characterize the melodic/harmonic content of music. In order to represent the chroma scale, we project the entire spectrum onto 12 bins corresponding to the 12 distinct semitones of the musical octave. Theoretically, the frequency of the n-th note in the o-th octave is defined as f(n, o) = 2^(o + n/12) f_C0, where the pitch of the C0 note is f_C0 ≈ 16.35 Hz based on Shepard's helix model in [35], n = 0, 1, ..., 11, and o = 0, 1, 2, .... In this study, we extract the chroma features of 60 semitones for o = 4, ..., 8 (over 5 octaves from the C4 note to the B8 note).

We extract chroma features similarly to the well-known mel-frequency cepstral coefficient (MFCC) computation [36] by applying cepstral analysis to the semitone spectral energies. Hence, our chroma features capture information about the fluctuations of the semitone spectral energies. We center triangular energy windows at the locations of the semitone frequencies f(n, o) at different octaves, for n = 0, ..., 11 and o = 4, ..., 8. Then, we compute the first 12 DCT coefficients of the logarithmic semitone spectral energy vector, which constitute the chromatic scale cepstral coefficient (CSCC) feature set, c_n, for music frame n. We also compute the first and second time derivatives of these 12 CSCC features using the regression formula

Δc_n = ( Σ_{τ=1}^{Θ} τ (c_{n+τ} − c_{n−τ}) ) / ( 2 Σ_{τ=1}^{Θ} τ² ),  (1)

where Θ is the regression window size and the second derivatives Δ²c_n are obtained by applying (1) to the first derivatives. The music feature vector is then formed by including the first and second time derivatives:

c^a_n = [c_n, Δc_n, Δ²c_n].  (2)

Each C_t, therefore, corresponds to the sequence of music feature vectors that fall into measure segment m_t. Specifically, C_t is a matrix of CSCC features of the form

C_t = [c^a_1, c^a_2, ..., c^a_{N_t}],  (3)

where N_t is the number of audio frames in measure segment m_t.
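The CSCC pipeline above can be summarized in a short sketch. The following Python snippet is a minimal illustration rather than the authors' implementation: the sampling rate, FFT and hop sizes, the triangle bandwidths, and the regression window Θ = 2 are assumed values chosen only for demonstration.

```python
import numpy as np
from scipy.fft import dct

SR = 16000          # assumed audio sampling rate (Hz)
N_FFT = 2048        # assumed analysis window length
HOP = 512           # assumed hop size
F_C0 = 16.352       # pitch of the C0 note (Hz)

def semitone_freqs(oct_lo=4, oct_hi=8):
    """Frequencies of the 60 semitones from C4 to B8."""
    return np.array([F_C0 * 2.0 ** (o + n / 12.0)
                     for o in range(oct_lo, oct_hi + 1) for n in range(12)])

def triangular_bank(freqs, sr=SR, n_fft=N_FFT):
    """Triangular energy windows centred at the semitone frequencies.
    Using the neighbouring semitones as triangle edges is an assumption."""
    bin_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    edges = np.concatenate(([freqs[0] / 2 ** (1 / 12)], freqs,
                            [freqs[-1] * 2 ** (1 / 12)]))
    bank = np.zeros((len(freqs), len(bin_freqs)))
    for k in range(len(freqs)):
        lo, c, hi = edges[k], edges[k + 1], edges[k + 2]
        up = (bin_freqs - lo) / (c - lo)
        down = (hi - bin_freqs) / (hi - c)
        bank[k] = np.clip(np.minimum(up, down), 0.0, None)
    return bank

def cscc(audio, n_coeff=12):
    """First 12 DCT coefficients of the log semitone spectral energies."""
    frames = np.lib.stride_tricks.sliding_window_view(audio, N_FFT)[::HOP]
    spec = np.abs(np.fft.rfft(frames * np.hanning(N_FFT), axis=1)) ** 2
    energies = spec @ triangular_bank(semitone_freqs()).T   # semitone energies
    return dct(np.log(energies + 1e-10), axis=1, norm="ortho")[:, :n_coeff]

def deltas(feat, theta=2):
    """Regression-based time derivatives, as in the delta formula (1)."""
    pad = np.pad(feat, ((theta, theta), (0, 0)), mode="edge")
    num = sum(t * (pad[theta + t:len(feat) + theta + t] -
                   pad[theta - t:len(feat) + theta - t])
              for t in range(1, theta + 1))
    return num / (2 * sum(t * t for t in range(1, theta + 1)))

def music_features(audio):
    c = cscc(audio)
    d1 = deltas(c)
    return np.hstack([c, d1, deltas(d1)])   # 36-dimensional vector per frame
```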
B. Motion Feature Extraction

We acquire multiview recordings of a dancing actor for each dance figure in the audiovisual dance database using 8 synchronized cameras. We then employ a motion capture technique for tracking the 3-D positions of the joints of the body based on the markers' 2-D projections on each camera's image plane, using the color information of the markers [33]. The resulting set of 3-D points is used to fit a skeleton structure to the 3-D motion capture data. Through this skeleton structure, we solve the necessary inverse kinematics equations and accurately calculate the set of Euler angles for each joint in its local frame, as well as the global translation and rotation of the skeleton structure, for the motion trajectory defined by the input 3-D motion capture data.

We prefer joint angles as our dance motion features due to their widespread usage in the human body motion analysis-synthesis and 3-D character animation literature. We compute 66 angular values associated with 27 key joints of the body as well as 6 values for the global rotation and translation of the body, which leads to a dance motion feature vector θ_n of dimension 72 for each dance motion frame n. However, angular features are generally discontinuous at boundary values due to their 2π-periodic nature, and this causes a problem in training statistical models to capture the temporal dynamics of a sequence of angular features. Therefore, instead of using the static set of Euler angles, we use their first and second differences, computed with the following difference equation:

Δθ_n = θ_n − θ_{n−1},  (4)

where the resulting discontinuities are eliminated by the following conditional update:

Δθ_n ← Δθ_n − 2π if Δθ_n > π;  Δθ_n ← Δθ_n + 2π if Δθ_n < −π;  Δθ_n unchanged otherwise.  (5)

Then the 144-dimensional dynamic motion feature vector is formed as

x_n = [Δθ_n, Δ²θ_n].  (6)

Hence, each M_t is a sequence of motion feature vectors that fall into dance figure segment d_t while training temporal models of motion trajectories associated with each dance figure label. That is, M_t is a matrix of body motion feature values of the form

M_t = [x_1, x_2, ..., x_{L_t}],  (7)

where L_t is the number of dance motion frames within dance motion segment d_t. We also calculate the mean trajectory μ_k for each dance figure label f^k by computing, for each motion feature, an average value over all instances (realizations) of the dance figures labeled as f^k. These mean trajectories (μ_k) are required later in choreography animation since each dance figure model captures only the temporal dynamics of the first and second differences of the Euler angles of the key joints associated with the dance figure label.
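A minimal sketch of the dynamic motion feature computation in (4)-(6) is given below. Applying the 2π wrap uniformly to all 72 components (including the global translation values) is a simplification made here for brevity, not a statement of the authors' exact implementation.

```python
import numpy as np

def wrapped_diff(x):
    """First differences with 2*pi jumps removed (the conditional update (5))."""
    d = np.diff(x, axis=0)
    d = np.where(d > np.pi, d - 2 * np.pi, d)
    d = np.where(d < -np.pi, d + 2 * np.pi, d)
    return d

def motion_features(euler):
    """144-dimensional dynamic feature vector per motion frame.
    `euler` is assumed to be an (L, 72) array of joint angles plus global
    rotation/translation values, with angles in radians."""
    d1 = wrapped_diff(euler)        # first differences, Eqs. (4)-(5)
    d2 = np.diff(d1, axis=0)        # second differences
    return np.hstack([d1[1:], d2])  # align lengths lost to differencing
```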

III. CHOREOGRAPHY MODELING

In this section, we present the proposed choreography analysis-synthesis framework. We first describe our choreography analysis procedure, which involves statistical modeling of the music-to-choreography mapping. We then explain how this statistical modeling is used for choreography synthesis.

A. Multimodal Choreography Analysis

We perform choreography analysis through three statistical models, which together define a many-to-many mapping from musical measures to dance figures. These choreography models are: 1) the musical measure models, which capture many-to-one mappings from musical measures to dance figures using HMMs; 2) the exchangeable figures model, which captures one-to-many mappings from musical measures to dance figures, and hence, represents the subjective nature of the dance choreography with possibilities in the choice of dance figures and in their organization; and 3) the figure transition model, which captures the intrinsic dependencies of dance figures. We note that these three choreography models constitute the choreography analysis block in Fig. 1.

1) Musical Measure Models (Λ^m): In a dance performance, musical measures that correspond to the same dance figure may exhibit variations and are usually a collection of different melodic patterns. That is, different melodic patterns can accompany the same dance figure, displaying a many-to-one mapping relation from musical measures to dance figures. We capture this many-to-one mapping by employing HMMs to identify and model the melodic patterns corresponding to each dance figure. Specifically, we train an HMM λ^m_k over the collection of measures co-occurring with the dance figure f^k, using the musical measure CSCC features C_t. Hence, we train an HMM for each dance figure in the dance performance. We define left-to-right HMM structures, i.e., a_ij = 0 for j < i, where a_ij is the transition probability from state i to state j. The self-transitions from state i to state i account for the differences in measure durations. Emission distributions of the chroma-based music features are modeled by Gaussian mixture density functions with diagonal covariance matrices in each state of λ^m_k. The use of Gaussian mixture densities enables us to capture the different melodic patterns that correspond to a particular dance figure. We denote the collection of musical measure models as Λ^m, i.e., Λ^m = {λ^m_1, ..., λ^m_K}. Musical measure models provide a tool to capture the many-to-one part of the many-to-many musical measure to dance figure mapping problem.
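A minimal sketch of training the per-figure musical measure models is given below, assuming the hmmlearn library. The numbers of states and mixture components are illustrative choices rather than values reported in the paper.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def left_to_right_transmat(n_states):
    """Left-to-right transition matrix with self loops and forward moves."""
    a = np.zeros((n_states, n_states))
    for i in range(n_states):
        a[i, i:min(i + 2, n_states)] = 1.0
    return a / a.sum(axis=1, keepdims=True)

def train_measure_models(chroma_by_label, n_states=5, n_mix=3):
    """chroma_by_label: {figure_label: [C_t matrices of shape (N_t, 36)]}."""
    models = {}
    for label, segments in chroma_by_label.items():
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type="diag", n_iter=50,
                     init_params="mcw", params="tmcw")
        hmm.startprob_ = np.eye(n_states)[0]            # start in the first state
        hmm.transmat_ = left_to_right_transmat(n_states)
        X = np.vstack(segments)                          # concatenate all measures
        hmm.fit(X, lengths=[len(s) for s in segments])
        models[label] = hmm
    return models

def measure_log_likelihoods(models, C_t):
    """Acoustic scores log P(C_t | lambda_k^m) for one measure segment."""
    return {label: m.score(C_t) for label, m in models.items()}
```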
2) Exchangeable Figures Model (E): In a dance performance, it is possible that several distinct dance figures can be performed equally well along with a particular musical measure pattern, exhibiting a one-to-many mapping relation from musical measures to dance figures [1]. To represent this one-to-many mapping relation, we introduce the notion of exchangeable figure groups, each containing a collection of dance figures that can be replaced with one another without causing an artifact in a dance performance.

To learn exchangeable figure groups, we cluster the measure segments in each musical piece with respect to their melodic similarities. The melodic similarity between two different measure segments m_i and m_j is computed as the local match score s_ij obtained from dynamic time warping (DTW) [37] of the chroma-based feature matrices C_i and C_j, corresponding to m_i and m_j, respectively, in the musical piece. Then, based on the melodic similarity scores between pairs of musical measure segments, we form an affinity matrix A with entries a_ij = s_ij if i ≠ j, and a_ii = 0. Finally, we apply the spectral clustering algorithm described in [38] over A to cluster the measure segments. The spectral clustering algorithm in [38] assumes that the number of clusters is known a priori and employs the k-means clustering algorithm [39]. Since we do not know the number of clusters a priori, we measure the quality of the partition in the resulting clusters using an internal index, the silhouette [40], to determine the appropriate number of clusters. The silhouette value for each point is a measure of how similar that point is to the points in its own cluster compared to the points in the other clusters, and ranges from −1 to 1. Averaging over all the silhouette values, we compute the overall quality of the clustering for a range of cluster numbers and pick the one that results in the highest silhouette value. We perform a separate clustering for each musical piece in order to increase the accuracy of musical measure clustering, since similar measure patterns are likely to occur in the same musical piece rather than spread among different musical pieces.

Once we obtain the clusters of measures in all musical pieces, we can then use all of these measure clusters to determine the exchangeable figures group E_k for each dance figure f^k by collecting the dance figure labels that co-appear with f^k in any of the resulting clusters. Note that a particular dance figure can appear in more than one musical piece (see also Fig. 5). Based on the exchangeable figure groups E_k, we define the exchangeable figures model as an indicator random variable:

E(f^j, f^k) = 1 if f^j ∈ E_k, and 0 otherwise,  (8)

where E_k is the exchangeable figure group associated with the dance figure f^k. The collection of E(·, f^k) for all dance figure labels in F gives us the exchangeable figures model E. The notion of exchangeable figures is the key to reflecting the subjective nature of the dance choreography with possibilities in the choice of dance figures and their organization throughout the choreography estimation process. The use of the exchangeable figures model allows us to create a different artistic dance performance content each time we estimate a dance choreography.
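The clustering step for a single musical piece can be sketched as follows, assuming scikit-learn for spectral clustering and silhouette scoring. Here dtw_score is a hypothetical helper standing in for the DTW local-match computation of [37], and the conversion of affinities to dissimilarities for the silhouette index is an assumption of this sketch.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def cluster_measures(chroma_segments, dtw_score, k_range=range(2, 9)):
    """Cluster the measure segments of one musical piece by melodic similarity."""
    n = len(chroma_segments)
    affinity = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            s = dtw_score(chroma_segments[i], chroma_segments[j])
            affinity[i, j] = affinity[j, i] = s
    np.fill_diagonal(affinity, 0.0)              # zero self-affinity
    distance = affinity.max() - affinity         # silhouette needs a dissimilarity
    np.fill_diagonal(distance, 0.0)

    best_labels, best_sil = None, -1.0
    for k in k_range:                            # pick k with the best silhouette
        labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                    assign_labels="kmeans").fit_predict(affinity)
        sil = silhouette_score(distance, labels, metric="precomputed")
        if sil > best_sil:
            best_labels, best_sil = labels, sil
    return best_labels

def exchangeable_groups(cluster_labels, figure_labels):
    """Figures whose measures co-appear in a cluster are exchangeable."""
    groups = {}
    for c in set(cluster_labels):
        members = {figure_labels[i] for i, ci in enumerate(cluster_labels) if ci == c}
        for f in members:
            groups.setdefault(f, set()).update(members)
    return groups
```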

3) Figure Transition Model (Λ^f): The figure transition model is built to capture the intrinsic dependencies of the dance figure sequences within the context of dance choreographies. The intrinsic dependencies of the choreography are defined with figure-to-figure transition probabilities. The figure-to-figure transition probability density functions are modeled with n-gram language models, where the probability of the dance figure at frame t given the dance figure sequence at frames t − 1, ..., t − n + 1, i.e., P(f_t | f_{t−1}, ..., f_{t−n+1}), defines the n-gram dance language model. This model provides a number of rules that specify the structure of a dance choreography. For instance, a dance figure that never appears after a particular sequence of dance figures in the training video does not appear in the synthesized choreography either. We can also enforce a dance figure to always follow a particular sequence of dance figures if it is also the case in the training video, with the help of the n-gram dance language model.

B. Multimodal Choreography Synthesis

We formulate the choreography synthesis problem as estimating a dance figure sequence from a sequence of musical measures. The core of the choreography synthesis is defined as a Viterbi decoding process on a discrete HMM, which is constructed from the musical measure models Λ^m and the figure transition model Λ^f. Furthermore, the exchangeable figures model E is utilized to introduce acceptable variations into the Viterbi decoding process to enrich the synthesized choreography. Based on this Viterbi decoding process, we define three separate choreography synthesis scenarios: 1) single best path, 2) likely path, and 3) exchangeable path, which are explained in detail in the following subsections. We note that all these choreography synthesis scenarios are multimodal, that is, they depend on the joint statistical models of music and choreography, and they have different refinements and contributions to enrich the choreography synthesis.

Besides these three synthesis scenarios, we investigate two other reference (baseline) choreography synthesis scenarios: one using only the musical measure models to map each measure segment in the test musical piece to a dance figure label (which we refer to as acoustic-only choreography), and another using only the figure transition model (which we refer to as figure-only choreography). The acoustic-only choreography corresponds to a synthesis scenario in which only the correlations between musical measures and dance figure labels are taken into account, but the correlations between consecutive figures are ignored. In contrast to the acoustic-only choreography, the figure-only choreography scenario predicts the dance figure for the next measure segment only according to the figure-to-figure transition probabilities, which are modeled as bigram probabilities of Λ^f, by discarding the correlations between musical measures and dance figures. Note that the figure-only synthesis can be regarded as a synchronization-based technique as discussed in Section I-A, such as the ones proposed in [26] and [27]. In figure-only synthesis, the dance figure sequence is generated randomly by only respecting the figure-to-figure transition probabilities to ensure visual continuity of the resulting character animation. The dance figure sequences resulting from these two baseline scenarios constitute reference choreographies that help us comparatively assess the contributions of our choreography analysis-synthesis framework.
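A minimal sketch of estimating the bigram figure-to-figure transition probabilities used by the figure transition model (and, below, as the transition distribution of the discrete HMM) follows. The add-one smoothing is an illustrative choice to avoid zero rows, not something the paper specifies.

```python
import numpy as np

def bigram_transitions(figure_sequences, labels, smoothing=1.0):
    """Estimate a row-stochastic bigram transition matrix.
    figure_sequences: list of training figure-label sequences.
    labels: ordered list of the K distinct figure labels."""
    index = {f: k for k, f in enumerate(labels)}
    counts = np.full((len(labels), len(labels)), smoothing)
    for seq in figure_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[index[prev], index[cur]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)   # each row sums to one
```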
1) Single Best Path Synthesis: We construct a discrete HMM, Λ, using the musical measure models Λ^m and the figure transition model Λ^f. In the figure transition model, the figure-to-figure transition probability distributions are computed with bigram models. This choice of model order is due to the scale of the choreography database that we use in training; a much larger database could be used to model higher-order n-gram dance language models. The discrete HMM Λ is defined with the following parameters:

T is the number of time frames (measure segments). For each time frame (measure), the choreography synthesis process outputs exactly one dance figure label. Recall that we denote the individual dance figures as f^k and the individual measures as m_t for t = 1, ..., T.

K is the number of distinct dance figure labels, i.e., the size of F = {f^1, ..., f^K}. Dance figure labels are the outputs of the process being modeled.

A is the dance figure transition probability distribution with elements defined as

a_jk = P(f_t = f^k | f_{t−1} = f^j),  (9)

where the elements a_jk are the bigram probabilities from the figure transition model and they satisfy Σ_k a_jk = 1.

B is the dance figure emission distribution for measure m_t. The elements of B are defined using the musical measure models as

b_k(C_t) = P(C_t | λ^m_k).  (10)

π is the initial dance figure distribution, where

π_k = P(f_1 = f^k).  (11)

Fig. 3. Lattice structure of the discrete HMM.

The discrete HMM constructs a lattice structure, say L, as given in Fig. 3. The proposed choreography synthesis can be formulated as finding a path through the lattice. Assuming a uniform initial figure distribution π, the single best path synthesis scenario decodes the Viterbi path along the lattice to estimate the synthesized figure sequence F̂. The Viterbi algorithm for finding the single best path synthesis can be summarized as follows:

1) Initialization:

δ_1(k) = π_k b_k(C_1), for k = 1, ..., K.  (12)

2) Recursion: For t = 2, ..., T and k = 1, ..., K,

δ_t(k) = max_j [δ_{t−1}(j) a_jk] b_k(C_t),  (13)

ψ_t(k) = argmax_j [δ_{t−1}(j) a_jk].  (14)

3) Termination:

f̂_T = argmax_k δ_T(k).  (15)

4) Path (dance figure sequence) backtracking:

f̂_t = ψ_{t+1}(f̂_{t+1}), for t = T − 1, ..., 1.  (16)

Here, δ_t(k) represents the partial likelihood score of performing the dance figure f^k at frame t, and ψ_t(k) is used to keep track of the best path when retrieving the dance figure sequence. The path decodes the resulting dance figure label sequence F̂ as the desired output choreography. Note that the resulting dance choreography is unique for the single best path synthesis scenario since it is the Viterbi path along the lattice.
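The single best path decoding can be sketched as a standard log-domain Viterbi pass over the lattice. The function below assumes precomputed acoustic scores log P(C_t | λ^m_k) (e.g., from the measure-model sketch above, arranged as a T-by-K array) and a K-by-K matrix of log bigram transition probabilities.

```python
import numpy as np

def viterbi_decode(log_b, log_a, log_pi=None):
    """Single best path over the lattice (log domain for numerical stability)."""
    T, K = log_b.shape
    if log_pi is None:                           # uniform initial figure distribution
        log_pi = np.full(K, -np.log(K))
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_b[0]                 # initialization, Eq. (12)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a   # scores[j, k] for j -> k
        psi[t] = scores.argmax(axis=0)           # backpointers, Eq. (14)
        delta[t] = scores.max(axis=0) + log_b[t] # recursion, Eq. (13)
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                # termination, Eq. (15)
    for t in range(T - 2, -1, -1):               # backtracking, Eq. (16)
        path[t] = psi[t + 1, path[t + 1]]
    return path
```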

2) Likely Path Synthesis: In the second synthesis scenario, we find a likely path along L in which we follow one of the likely partial paths in lieu of following the partial path that has the highest partial likelihood score at each time frame. The likely path synthesis is expected to create variations along the single best path synthesis, which results in an enriched set of synthesized choreography sequences with high likelihood scores. We modify the recursion step of the Viterbi algorithm to define the likely path synthesis:

Recursion: For t = 2, ..., T and k = 1, ..., K,

δ_t(k) = δ_{t−1}(j*) a_{j*k} b_k(C_t),  (17)

j* = rand(j_1, j_2),  (18)

ψ_t(k) = j*,  (19)

where j_1 and j_2 are the figure indices with the top two partial path scores δ_{t−1}(j) a_jk, and rand(·) returns randomly one of its arguments with uniform distribution. Note that j_1 corresponds to the argmax in (14). The likely path scenario is expected to synthesize different dance choreographies since it propagates randomly among the top two ranking transitions at each time frame. This intricately introduces variation into the choreography synthesis process.

3) Exchangeable Path Synthesis: In this scenario, we find an exchangeable path by letting the exchangeable figures model E replace and update the single best path. Unlike the likely path synthesis, the exchangeable path scenario introduces random variations to the single best path that respect the one-to-many mappings from musical measures to dance figures as defined in E. The exchangeable path synthesis is implemented with the following procedure:

1) Compute the single best path synthesis for a given musical measure sequence and set the measure segment index t = 1.

2) The figure f̂_t at measure segment t is replaced with another figure from its exchangeable figure group,

f̂_t ← rand_E({f^j : E(f^j, f̂_t) = 1}),  (20)

where rand_E(·) returns randomly one of its arguments according to the distribution of the acoustic scores of the dance figures.

3) The rest of the figure sequence, f̂_{t+1}, ..., f̂_T, is updated by determining a new single best path using the Viterbi algorithm.

4) Steps 2) and 3) are repeated for measure segments t = 2, ..., T.

The exchangeable path synthesis yields an alternative path by modifying the single best path in the context of the exchangeable figures model. Its key difference from the likely path is that the collection of candidate dance figures that can replace a particular dance figure in the choreography, say f̂_t, is constrained to the dance figures for which the exchangeable figures model yields 1. Hence, it is expected to introduce more acceptable variations into the synthesized choreography than the likely path.
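The exchangeable path procedure can be sketched as follows, reusing the viterbi_decode function from the earlier sketch. The boolean exchangeability matrix layout and the sampling of replacement figures in proportion to their acoustic scores are this sketch's interpretation of "according to the distribution of acoustic scores"; both are assumptions rather than quoted formulas.

```python
import numpy as np

def exchangeable_path(log_b, log_a, exch, rng=None):
    """Exchangeable path synthesis (steps 1-4 above).
    exch[j, k] is True if figure j is in the exchangeable group of figure k."""
    rng = rng or np.random.default_rng()
    T, K = log_b.shape
    path = viterbi_decode(log_b, log_a)                    # step 1: single best path
    for t in range(T):
        candidates = np.flatnonzero(exch[:, path[t]])
        if candidates.size:                                # step 2: random replacement
            w = np.exp(log_b[t, candidates] - log_b[t, candidates].max())
            path[t] = rng.choice(candidates, p=w / w.sum())
        if t + 1 < T:                                      # step 3: re-decode the suffix,
            suffix = viterbi_decode(log_b[t + 1:], log_a,  # conditioned on the chosen figure
                                    log_pi=log_a[path[t]])
            path[t + 1:] = suffix
    return path
```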
IV. ANIMATION MODELING

In this section, we address character animation of dance figures to visualize and evaluate the proposed choreography analysis and synthesis framework. First, we define a dance figure model, which captures the variations in the way particular dance figures are performed, by modeling the motion trajectory of a dance figure via an HMM. Then, we define a character animation system, which generates a sequence of dance motion features from a given sequence of dance figures, so as to animate a 3-D character model.

A. Dance Figure Models (Λ^d)

The way a dancer performs a particular dance figure may exhibit variations in time in a dance performance. Therefore, it is important to model the temporal statistics of each dance figure to capture the variations in the dance performance. Note that these models will also capture the personalized dance figure patterns of a dancer. We use the set of motion features M_t to train an HMM, λ^d_k, for each dance figure label f^k to capture the dynamic behavior of the dancing body. Since a dance figure typically contains a well-defined sequence of body movements, we employ a left-to-right HMM structure (i.e., a_ij = 0 for j < i, where a_ij is the transition probability from state i to state j in λ^d_k) to model each dance figure. Emission distributions of the motion parameters are modeled by a Gaussian density function with a full covariance matrix in each state of λ^d_k. We denote the collection of dance figure models as Λ^d, i.e., Λ^d = {λ^d_1, ..., λ^d_K}.

B. Character Animation

The synthesized choreography (i.e., F̂) specifies the label sequence of dance figures to be performed with each measure segment, whose duration is known beforehand in the proposed framework. The body posture parameters corresponding to each dance figure in the synthesized choreography are then generated such that they fit the statistical dance figure models. To generate body posture parameters using the dance figure model λ^d_k for the dance figure f^k, we first determine the number of dance motion frames N required for the given segment duration. Next, we distribute the required number of motion frames among the states of the dance figure model according to the expected state occupancy duration

d̄_j = 1 / (1 − a_jj),  (21)

where d̄_j is the expected duration in state j, a_jj is the self-state-transition probability for state j (assuming a_jj < 1), and j = 1, ..., S, with S the number of states in λ^d_k.

Fig. 5. The distribution of dance figures over musical pieces is visualized using an image plot. Columns of the plot represent the dance figures and the rows represent the musical pieces in the database (the axis labels are dropped in the figure for clarity of the representation). Consequently, each cell in the plot indicates how many times a particular dance figure is performed with the corresponding musical piece. Cells with different gray values in a column imply that the same dance figure can be accompanied by different musical pieces. Similarly, cells with different gray values in a row imply that different dance figures can be performed with the same musical piece.

Fig. 4. The plots compare a synthesized trajectory with two sample trajectories, as well as with the mean trajectory, for different motion features from different dance figures in the database. The expected state durations, all associated with the HMM structure trained for the same dance figures, are also displayed in the plot with horizontal solid lines. The corresponding angular values of each horizontal solid line are the means of the Gaussian distributions associated with the particular motion feature in each state of the HMM structure trained for the given dance figure.

In order to avoid the generation of noisy parameters, we first increase the time resolution of the dance motion by oversampling the dance motion model. That is, we generate parameters for a multiple of N, say cN, where c is an integer scale factor. Then, we generate the body motion parameters along the states of λ^d_k according to the distribution of motion frames to these states, using the corresponding Gaussian distribution at each state. To reverse the effect of oversampling, we perform a downsampling by c that eventually yields smoother state transitions, and hence, more realistic parameter generation that avoids motion jerkiness.

The dance figure models are trained over the first and second differences of the Euler angles of the joints, which are defined in Section II-B. Therefore, to obtain the final set of body posture parameters for a dance figure, we simply need to sum the generated first differences with the mean trajectory associated with f^k, i.e., μ_k. Each plot in Fig. 4 depicts a synthesized trajectory against two sample trajectories for one of the motion features from a dance figure in the database, along with the mean trajectory associated with the same motion feature from the same dance figure. The length of the horizontal solid lines represents the expected state durations (in terms of the number of frames), and the corresponding angular values represent the means of the Gaussian distributions associated with the particular motion feature in each state of the HMM structure trained for the particular dance figure. In these plots, the two sample trajectories exemplify the temporal variations between different realizations of the same dance figure. Deriving from the trained dance figure HMMs, the synthesized dance figure trajectories mimic the underlying temporal dynamics of a given dance figure. Hence, modeling the temporal variations using HMMs for each dance figure allows us to synthesize more realistic and personalized dance motion trajectories.
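A sketch of the parameter generation for a single dance figure is given below, assuming a fitted hmmlearn GaussianHMM with full covariances and a mean trajectory already resampled to the target segment length. The integer scale factor, the per-state Gaussian sampling, and the cumulative-sum combination of the generated first differences with the mean trajectory are this sketch's assumptions about details the text only summarizes.

```python
import numpy as np

def synthesize_figure(model, n_frames, mean_trajectory, scale=4, rng=None):
    """Generate joint-angle parameters for one dance figure segment.
    model: fitted hmmlearn GaussianHMM (left-to-right, full covariances).
    mean_trajectory: assumed (>= n_frames, 72) mean Euler-angle trajectory."""
    rng = rng or np.random.default_rng()
    a = np.diag(model.transmat_)
    expected = 1.0 / (1.0 - np.clip(a, 0.0, 0.999))         # Eq. (21)
    frames_per_state = np.round(scale * n_frames * expected /
                                expected.sum()).astype(int)

    # Draw dynamic features state by state at the oversampled rate.
    samples = [rng.multivariate_normal(model.means_[j], model.covars_[j], size=m)
               for j, m in enumerate(frames_per_state) if m > 0]
    dyn = np.vstack(samples)[::scale]                        # downsample back by `scale`

    # The model captures first/second differences; here the generated first
    # differences are integrated onto the figure's mean trajectory (one
    # interpretation of "summing the first differences with the mean trajectory").
    n_angles = mean_trajectory.shape[1]
    d1 = dyn[:, :n_angles]
    n = min(len(d1), n_frames)
    return mean_trajectory[:n] + np.cumsum(d1[:n], axis=0)
```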
After repeating the described procedure for each dance figure in the synthesized choreography, the body posture parameters at the dance figure boundaries are smoothed via cubic interpolation within a fixed neighborhood of each dance figure boundary in order to generate smoother figure-to-figure transitions. We note that the use of HMMs for dance figure synthesis provides us with the ability to introduce random variations in the synthesized body motion patterns for each dance figure. These variations make the synthesis results look more natural, due to the fact that humans perform slightly varying dance figures at different times for the same dance performance.

V. EXPERIMENTS AND RESULTS

We investigate the effectiveness of our choreography analysis and synthesis framework using the Turkish folk dance Kasik.¹ The Kasik database consists of 20 dance performances with 20 different musical pieces, with a total duration of 36 min. There are 31 different dance figures (i.e., K = 31) and a total of 1258 musical measure segments. Fig. 5 shows the distribution of dance figures over the different musical pieces, where each column represents a dance figure label and each row represents a musical piece. Hence, entries with different colors in a column indicate that the same figure can be performed with different melodic patterns, whereas entries with different colors in a row indicate that different dance figures can be performed with the same melodic pattern. Therefore, Fig. 5 can be seen as a means to provide evidence for our basic assumption that there is a many-to-many relationship between dance figures and musical measures.

¹Kasik means "spoon" in English. The dance is named so since the dancers clap spoons while dancing.

We follow a 5-fold cross-validation procedure in the experimental evaluations. We train musical measure models with four-fifths of the musical audio data in the analysis part and use these musical measure models in the process of choreography estimation for the remaining one-fifth of the musical audio data in the synthesis part. We repeat this procedure five times, each time using different parts of the musical audio data for training and testing. This way, we synthesize a new dance choreography for the entire musical audio data.

A. Objective Evaluation Results

We define the following four assessment levels to evaluate each dance figure label f̂_t in the synthesized figure sequence, compared to the respective figure label f_t in the original dance choreography assigned by the expert:

E0 (Exact-match): f̂_t is marked as E0 if it matches f_t.

E1 (X-match): f̂_t is marked as E1 if it does not match f_t, but it is in f_t's exchangeable figure group; i.e., E(f̂_t, f_t) = 1.

E2 (Song-match): f̂_t is marked as E2 if it neither matches f_t nor is in f_t's exchangeable figure group, but f̂_t and f_t are performed within the same musical piece.

E3 (No-match): f̂_t is marked as E3 if it is not marked as one of E0 through E2.

TABLE I. Average penalty scores (APS) of various choreography synthesis scenarios.

Fig. 6. The matrix plot demonstrates the assessment levels associated with any pair of dance figures in the database. Assessment levels are indicated with different colors. The pairs of dance figures that fall into the X-match assessment level in a row correspond to the group of exchangeable dance figures for that particular dance figure; for instance, the first row lists the dance figures that form the group of exchangeable figures for the first dance figure.

Fig. 7. Percentage of figures that fall into each assessment level for the proposed five different synthesis scenarios.

Fig. 6 displays all assessment levels associated with any possible pairing of dance figures in a single matrix. Note that the matrix in Fig. 6 defines a distance metric on dance figure pairs by mapping the four assessment levels E0 through E3 into penalty scores from 0 to 3, respectively. Hence, in this distance metric, low penalty scores indicate desirable choreography synthesis results. Average penalty scores are reported to measure the goodness (coherence) of the resulting dance choreography.

Recall that we propose three alternative choreography synthesis scenarios together with two other reference choreography synthesis techniques, as discussed in Section III-B. The average penalty scores of these five choreography synthesis scenarios are given in Table I. Furthermore, the distribution of the number of figures that fall into each assessment level for all synthesis scenarios is given in Fig. 7. The average penalty scores for the reference acoustic-only and figure-only choreography synthesis scenarios are 0.82 and 2.07, respectively. The high average penalty score of the figure-only choreography is mainly due to the randomness in this synthesis technique.
Hence, the unimodal learning phase of the figure-only choreography synthesis, which takes into account only the figure-to-figure transition probabilities, does not carry sufficient information to automatically generate dance choreographies that are similar to those in the training Kasik database. On the other hand, the acoustic-only choreography, which employs the musical measure models for synthesis, attains a lower average penalty score. We also note that the average dance figure recognition rate using the musical measure models through five-fold cross-validation is 49.05%. This indicates that the musical measure models learn the correlation between measures and figures. However, the acoustic-only choreography synthesis, which depends only on the correlation of measures and figures, fails to sustain motion continuity at figure transition boundaries. This creates visually unacceptable character animation for the acoustic-only choreography.
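The objective scoring can be made concrete with a small sketch. The level names E0-E3 follow the reconstruction used above, and exch and piece_of are assumed stand-ins for the exchangeable groups and the piece membership built during analysis.

```python
PENALTY = {"E0": 0, "E1": 1, "E2": 2, "E3": 3}

def assess(ref, hyp, exch, piece_of):
    """Assessment level of each synthesized figure label against the reference.
    exch: {label: set of exchangeable labels}; piece_of: {label: set of pieces}."""
    levels = []
    for r, h in zip(ref, hyp):
        if h == r:
            levels.append("E0")                                   # Exact-match
        elif h in exch.get(r, set()):
            levels.append("E1")                                   # X-match
        elif piece_of.get(h, set()) & piece_of.get(r, set()):
            levels.append("E2")                                   # Song-match
        else:
            levels.append("E3")                                   # No-match
    return levels

def average_penalty_score(ref, hyp, exch, piece_of):
    levels = assess(ref, hyp, exch, piece_of)
    return sum(PENALTY[l] for l in levels) / len(levels)
```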

The proposed single best path synthesis scenario refines the acoustic-only choreography with the inclusion of the figure transition model, which defines a bigram model for figure-to-figure transitions. The average penalty score of the single best path synthesis scenario is 0.56, which is the smallest average penalty score among all scenarios. This is expected, since the single best path synthesis generates the optimal Viterbi path along the multimodal lattice structure. The likely path synthesis introduces variation into the single best path synthesis, and its average penalty score increases accordingly; this increase is an indication of the variation introduced to the optimal Viterbi path. The exchangeable path synthesis refines the likely path synthesis by introducing more acceptable random variations to the single best path synthesis. Recall that the random variations of the exchangeable path synthesis depend on the exchangeable figures model. The average penalty score of the exchangeable path synthesis is 0.63, which is an improvement compared to the likely path synthesis.

In Fig. 7, we observe that, among all the assessment levels, the X-match and Song-match levels are indicators of the diversity of alternative dance figure choreographies rather than being error indicators, whereas the No-match level indicates an error in the dance choreography synthesis process. In this context, we observe that only 31% of the figure-only choreography and only 83% of the acoustic-only choreography fall into the first three assessment levels. On the other hand, using the mapping obtained by our framework increases this ratio to 94% and 92% for the single best path and the exchangeable path synthesis scenarios, respectively. The percentage drops to 80% for the likely path synthesis scenario, yet this is still a high percentage of the entire dance sequence.

B. Subjective Evaluation Results

We performed a subjective A/B comparison test using the music-driven dance animations to measure the opinions of the audience on the coherence of the synthesized dance choreographies with the accompanying music. During the test, the subjects were asked to indicate their preference for each given A/B test pair of synthesized dance animation segments on a scale of (−2; −1; 0; 1; 2), where the scale corresponds to strongly prefer A, prefer A, no preference, prefer B, and strongly prefer B, respectively. We compared dance animation segments from five different choreographies, namely, the original, single best path, likely path, exchangeable path, and figure-only choreographies. We therefore had ten possible pairings of different dance choreographies, e.g., original versus single best path, or likely path versus figure-only, etc. For each possible pair of choreographies, we used short audio segments from three different musical pieces from the Kasik database to synthesize three A/B pairs of dance animation video clips for the respective dance choreographies. This yielded a total of 30 A/B pairs of dance animation segments. We also included one A/B pair of dance animation segments pairing each choreography with itself, i.e., original versus original, etc., in order to test whether the subjects were careful enough and showed almost no preference over the five such possible self-pairings of the aforementioned choreographies.
As a result, we extracted 35 short segments from the audiovisual database, where each segment was approximately 15 s. We picked at most two non-overlapping segments from each musical piece in order to make a full coverage of the audiovisual database in the subjective A/B comparison test. The subjective tests were performed with 18 subjects.

TABLE II. Subjective A/B pair comparison test results.

The average preference scores for all comparison sets are presented in Table II. Note that the rows and the columns of Table II correspond to A and B of the A/B pairs, respectively. Also, the average preference scores that tend to favor B are given in bold to ease visual inspection. The first observation is that the animations for the original choreography and for the choreographies resulting from the proposed three synthesis scenarios (i.e., the single best path, likely path, and exchangeable path choreographies) are preferred over the animations for the figure-only choreography. We also note that the likely path and exchangeable path choreography animations are strongly preferred over the figure-only choreography animation. This observation is evidence that the audience is generally appealed by variations in the dance choreography as long as the overall choreography is coherent with the accompanying music. Hence, we observe that the likely path and the exchangeable path synthesis are the most preferable scenarios in the subjective tests, and they manage to create alternative choreographies that are coherent and appealing to the audience.

C. Discussion

The objective evaluations carried out using the assessment levels in Section V-A indicate that the most successful choreography synthesis scenarios are the single best path and the exchangeable path scenarios. We note that the exchangeable path synthesis, as well as the likely path, can be seen as variations or refinements of the single best path synthesis approach. On the other hand, according to the subjective evaluations presented in Section V-B, the most preferable scenarios are the likely path and the exchangeable path. Hence, the exchangeable path synthesis, being among the top two according to both objective and subjective evaluations, can be regarded as our best synthesis approach. Note also that the exchangeable path synthesis includes all the statistical models defined in Section III. The other two scenarios (acoustic-only and figure-only) are mainly used to demonstrate the effectiveness of the musical measure models and the figure transition model; hence, they both serve as reference synthesis methods. The acoustic-only choreography, however, depends only on the correlations between measures and figures, and therefore fails to sustain motion continuity at figure-to-figure boundaries, which is indispensable for creating realistic animations. Hence, we have chosen the figure-only synthesis as the baseline method to compare with our best result, i.e., with the exchangeable path synthesis,


More information

Separation and Classification of Harmonic Sounds for Singing Voice Detection

Separation and Classification of Harmonic Sounds for Singing Voice Detection Separation and Classification of Harmonic Sounds for Singing Voice Detection Martín Rocamora and Alvaro Pardo Institute of Electrical Engineering - School of Engineering Universidad de la República, Uruguay

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Annotated bibliographies for presentations in MUMT 611, Winter 2006

Annotated bibliographies for presentations in MUMT 611, Winter 2006 Stephen Sinclair Music Technology Area, McGill University. Montreal, Canada Annotated bibliographies for presentations in MUMT 611, Winter 2006 Presentation 4: Musical Genre Similarity Aucouturier, J.-J.

More information

Classifying Manipulation Primitives from Visual Data

Classifying Manipulation Primitives from Visual Data Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if

More information

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions. Chapter 1 Vocabulary identity - A statement that equates two equivalent expressions. verbal model- A word equation that represents a real-life problem. algebraic expression - An expression with variables.

More information

Teaching Methodology for 3D Animation

Teaching Methodology for 3D Animation Abstract The field of 3d animation has addressed design processes and work practices in the design disciplines for in recent years. There are good reasons for considering the development of systematic

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

Solving Systems of Linear Equations

Solving Systems of Linear Equations LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how

More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

Solving Simultaneous Equations and Matrices

Solving Simultaneous Equations and Matrices Solving Simultaneous Equations and Matrices The following represents a systematic investigation for the steps used to solve two simultaneous linear equations in two unknowns. The motivation for considering

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

New York State Student Learning Objective: Regents Geometry

New York State Student Learning Objective: Regents Geometry New York State Student Learning Objective: Regents Geometry All SLOs MUST include the following basic components: Population These are the students assigned to the course section(s) in this SLO all students

More information

Template-based Eye and Mouth Detection for 3D Video Conferencing

Template-based Eye and Mouth Detection for 3D Video Conferencing Template-based Eye and Mouth Detection for 3D Video Conferencing Jürgen Rurainsky and Peter Eisert Fraunhofer Institute for Telecommunications - Heinrich-Hertz-Institute, Image Processing Department, Einsteinufer

More information

What is Visualization? Information Visualization An Overview. Information Visualization. Definitions

What is Visualization? Information Visualization An Overview. Information Visualization. Definitions What is Visualization? Information Visualization An Overview Jonathan I. Maletic, Ph.D. Computer Science Kent State University Visualize/Visualization: To form a mental image or vision of [some

More information

Minnesota Academic Standards

Minnesota Academic Standards Minnesota Academic Standards Arts K-12 2008 The proposed revised standards in this document were drafted during the 2007-2008 school year. These standards are currently proceeding through the administrative

More information

Video Mining Using Combinations of Unsupervised and Supervised Learning Techniques

Video Mining Using Combinations of Unsupervised and Supervised Learning Techniques MERL A MITSUBISHI ELECTRIC RESEARCH LABORATORY http://www.merl.com Video Mining Using Combinations of Unsupervised and Supervised Learning Techniques Ajay Divakaran, Koji Miyahara, Kadir A. Peker, Regunathan

More information

with functions, expressions and equations which follow in units 3 and 4.

with functions, expressions and equations which follow in units 3 and 4. Grade 8 Overview View unit yearlong overview here The unit design was created in line with the areas of focus for grade 8 Mathematics as identified by the Common Core State Standards and the PARCC Model

More information

Event Detection in Basketball Video Using Multiple Modalities

Event Detection in Basketball Video Using Multiple Modalities Event Detection in Basketball Video Using Multiple Modalities Min Xu, Ling-Yu Duan, Changsheng Xu, *Mohan Kankanhalli, Qi Tian Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore, 119613

More information

EECS 556 Image Processing W 09. Interpolation. Interpolation techniques B splines

EECS 556 Image Processing W 09. Interpolation. Interpolation techniques B splines EECS 556 Image Processing W 09 Interpolation Interpolation techniques B splines What is image processing? Image processing is the application of 2D signal processing methods to images Image representation

More information

A Reliability Point and Kalman Filter-based Vehicle Tracking Technique

A Reliability Point and Kalman Filter-based Vehicle Tracking Technique A Reliability Point and Kalman Filter-based Vehicle Tracing Technique Soo Siang Teoh and Thomas Bräunl Abstract This paper introduces a technique for tracing the movement of vehicles in consecutive video

More information

1 Example of Time Series Analysis by SSA 1

1 Example of Time Series Analysis by SSA 1 1 Example of Time Series Analysis by SSA 1 Let us illustrate the 'Caterpillar'-SSA technique [1] by the example of time series analysis. Consider the time series FORT (monthly volumes of fortied wine sales

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel

Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 2, FEBRUARY 2002 359 Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel Lizhong Zheng, Student

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Tracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object

More information

Copyright 2011 Casa Software Ltd. www.casaxps.com. Centre of Mass

Copyright 2011 Casa Software Ltd. www.casaxps.com. Centre of Mass Centre of Mass A central theme in mathematical modelling is that of reducing complex problems to simpler, and hopefully, equivalent problems for which mathematical analysis is possible. The concept of

More information

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering

More information

Bandwidth Adaptation for MPEG-4 Video Streaming over the Internet

Bandwidth Adaptation for MPEG-4 Video Streaming over the Internet DICTA2002: Digital Image Computing Techniques and Applications, 21--22 January 2002, Melbourne, Australia Bandwidth Adaptation for MPEG-4 Video Streaming over the Internet K. Ramkishor James. P. Mammen

More information

MA 323 Geometric Modelling Course Notes: Day 02 Model Construction Problem

MA 323 Geometric Modelling Course Notes: Day 02 Model Construction Problem MA 323 Geometric Modelling Course Notes: Day 02 Model Construction Problem David L. Finn November 30th, 2004 In the next few days, we will introduce some of the basic problems in geometric modelling, and

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA Are Image Quality Metrics Adequate to Evaluate the Quality of Geometric Objects? Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA ABSTRACT

More information

Subspace Analysis and Optimization for AAM Based Face Alignment

Subspace Analysis and Optimization for AAM Based Face Alignment Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

DERIVATIVES AS MATRICES; CHAIN RULE

DERIVATIVES AS MATRICES; CHAIN RULE DERIVATIVES AS MATRICES; CHAIN RULE 1. Derivatives of Real-valued Functions Let s first consider functions f : R 2 R. Recall that if the partial derivatives of f exist at the point (x 0, y 0 ), then we

More information

AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS

AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection

More information

BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION

BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium Peter.Vanroose@esat.kuleuven.ac.be

More information

Part-Based Recognition

Part-Based Recognition Part-Based Recognition Benedict Brown CS597D, Fall 2003 Princeton University CS 597D, Part-Based Recognition p. 1/32 Introduction Many objects are made up of parts It s presumably easier to identify simple

More information

Music Genre Classification

Music Genre Classification Music Genre Classification Michael Haggblade Yang Hong Kenny Kao 1 Introduction Music classification is an interesting problem with many applications, from Drinkify (a program that generates cocktails

More information

Linear Programming. March 14, 2014

Linear Programming. March 14, 2014 Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1

More information

Making Machines Understand Facial Motion & Expressions Like Humans Do

Making Machines Understand Facial Motion & Expressions Like Humans Do Making Machines Understand Facial Motion & Expressions Like Humans Do Ana C. Andrés del Valle & Jean-Luc Dugelay Multimedia Communications Dpt. Institut Eurécom 2229 route des Crêtes. BP 193. Sophia Antipolis.

More information

Object Recognition and Template Matching

Object Recognition and Template Matching Object Recognition and Template Matching Template Matching A template is a small image (sub-image) The goal is to find occurrences of this template in a larger image That is, you want to find matches of

More information

Face Model Fitting on Low Resolution Images

Face Model Fitting on Low Resolution Images Face Model Fitting on Low Resolution Images Xiaoming Liu Peter H. Tu Frederick W. Wheeler Visualization and Computer Vision Lab General Electric Global Research Center Niskayuna, NY, 1239, USA {liux,tu,wheeler}@research.ge.com

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Numerical Field Extraction in Handwritten Incoming Mail Documents

Numerical Field Extraction in Handwritten Incoming Mail Documents Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr

More information

Facial Expression Analysis and Synthesis

Facial Expression Analysis and Synthesis 1. Research Team Facial Expression Analysis and Synthesis Project Leader: Other Faculty: Post Doc(s): Graduate Students: Undergraduate Students: Industrial Partner(s): Prof. Ulrich Neumann, IMSC and Computer

More information

MSCA 31000 Introduction to Statistical Concepts

MSCA 31000 Introduction to Statistical Concepts MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

MUSICAL INSTRUMENT FAMILY CLASSIFICATION

MUSICAL INSTRUMENT FAMILY CLASSIFICATION MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.

More information

Chapter 7 Application Protocol Reference Architecture

Chapter 7 Application Protocol Reference Architecture Application Protocol Reference Architecture Chapter 7 Application Protocol Reference Architecture This chapter proposes an alternative reference architecture for application protocols. The proposed reference

More information

JPEG Image Compression by Using DCT

JPEG Image Compression by Using DCT International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-4, Issue-4 E-ISSN: 2347-2693 JPEG Image Compression by Using DCT Sarika P. Bagal 1* and Vishal B. Raskar 2 1*

More information

Rubrics for Assessing Student Writing, Listening, and Speaking High School

Rubrics for Assessing Student Writing, Listening, and Speaking High School Rubrics for Assessing Student Writing, Listening, and Speaking High School Copyright by the McGraw-Hill Companies, Inc. All rights reserved. Permission is granted to reproduce the material contained herein

More information

Automatic parameter regulation for a tracking system with an auto-critical function

Automatic parameter regulation for a tracking system with an auto-critical function Automatic parameter regulation for a tracking system with an auto-critical function Daniela Hall INRIA Rhône-Alpes, St. Ismier, France Email: Daniela.Hall@inrialpes.fr Abstract In this article we propose

More information

Understanding Purposeful Human Motion

Understanding Purposeful Human Motion M.I.T Media Laboratory Perceptual Computing Section Technical Report No. 85 Appears in Fourth IEEE International Conference on Automatic Face and Gesture Recognition Understanding Purposeful Human Motion

More information

ANALYZING A CONDUCTORS GESTURES WITH THE WIIMOTE

ANALYZING A CONDUCTORS GESTURES WITH THE WIIMOTE ANALYZING A CONDUCTORS GESTURES WITH THE WIIMOTE ICSRiM - University of Leeds, School of Computing & School of Music, Leeds LS2 9JT, UK info@icsrim.org.uk www.icsrim.org.uk Abstract This paper presents

More information

Transmission Line and Back Loaded Horn Physics

Transmission Line and Back Loaded Horn Physics Introduction By Martin J. King, 3/29/3 Copyright 23 by Martin J. King. All Rights Reserved. In order to differentiate between a transmission line and a back loaded horn, it is really important to understand

More information

MODELING DYNAMIC PATTERNS FOR EMOTIONAL CONTENT IN MUSIC

MODELING DYNAMIC PATTERNS FOR EMOTIONAL CONTENT IN MUSIC 12th International Society for Music Information Retrieval Conference (ISMIR 2011) MODELING DYNAMIC PATTERNS FOR EMOTIONAL CONTENT IN MUSIC Yonatan Vaizman Edmond & Lily Safra Center for Brain Sciences,

More information

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability Classification of Fingerprints Sarat C. Dass Department of Statistics & Probability Fingerprint Classification Fingerprint classification is a coarse level partitioning of a fingerprint database into smaller

More information

UNIFORMLY DISCONTINUOUS GROUPS OF ISOMETRIES OF THE PLANE

UNIFORMLY DISCONTINUOUS GROUPS OF ISOMETRIES OF THE PLANE UNIFORMLY DISCONTINUOUS GROUPS OF ISOMETRIES OF THE PLANE NINA LEUNG Abstract. This paper discusses 2-dimensional locally Euclidean geometries and how these geometries can describe musical chords. Contents

More information

Big Ideas in Mathematics

Big Ideas in Mathematics Big Ideas in Mathematics which are important to all mathematics learning. (Adapted from the NCTM Curriculum Focal Points, 2006) The Mathematics Big Ideas are organized using the PA Mathematics Standards

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Video-Conferencing System

Video-Conferencing System Video-Conferencing System Evan Broder and C. Christoher Post Introductory Digital Systems Laboratory November 2, 2007 Abstract The goal of this project is to create a video/audio conferencing system. Video

More information

2015-2016 North Dakota Advanced Placement (AP) Course Codes. Computer Science Education Course Code 23580 Advanced Placement Computer Science A

2015-2016 North Dakota Advanced Placement (AP) Course Codes. Computer Science Education Course Code 23580 Advanced Placement Computer Science A 2015-2016 North Dakota Advanced Placement (AP) Course Codes Computer Science Education Course Course Name Code 23580 Advanced Placement Computer Science A 23581 Advanced Placement Computer Science AB English/Language

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Metrics on SO(3) and Inverse Kinematics

Metrics on SO(3) and Inverse Kinematics Mathematical Foundations of Computer Graphics and Vision Metrics on SO(3) and Inverse Kinematics Luca Ballan Institute of Visual Computing Optimization on Manifolds Descent approach d is a ascent direction

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Unit 9 Describing Relationships in Scatter Plots and Line Graphs

Unit 9 Describing Relationships in Scatter Plots and Line Graphs Unit 9 Describing Relationships in Scatter Plots and Line Graphs Objectives: To construct and interpret a scatter plot or line graph for two quantitative variables To recognize linear relationships, non-linear

More information

A Game of Numbers (Understanding Directivity Specifications)

A Game of Numbers (Understanding Directivity Specifications) A Game of Numbers (Understanding Directivity Specifications) José (Joe) Brusi, Brusi Acoustical Consulting Loudspeaker directivity is expressed in many different ways on specification sheets and marketing

More information

Structured Prediction Models for Chord Transcription of Music Audio

Structured Prediction Models for Chord Transcription of Music Audio Structured Prediction Models for Chord Transcription of Music Audio Adrian Weller, Daniel Ellis, Tony Jebara Columbia University, New York, NY 10027 aw2506@columbia.edu, dpwe@ee.columbia.edu, jebara@cs.columbia.edu

More information

Random Map Generator v1.0 User s Guide

Random Map Generator v1.0 User s Guide Random Map Generator v1.0 User s Guide Jonathan Teutenberg 2003 1 Map Generation Overview...4 1.1 Command Line...4 1.2 Operation Flow...4 2 Map Initialisation...5 2.1 Initialisation Parameters...5 -w xxxxxxx...5

More information

Monotonicity Hints. Abstract

Monotonicity Hints. Abstract Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

More information

PHYSIOLOGICALLY-BASED DETECTION OF COMPUTER GENERATED FACES IN VIDEO

PHYSIOLOGICALLY-BASED DETECTION OF COMPUTER GENERATED FACES IN VIDEO PHYSIOLOGICALLY-BASED DETECTION OF COMPUTER GENERATED FACES IN VIDEO V. Conotter, E. Bodnari, G. Boato H. Farid Department of Information Engineering and Computer Science University of Trento, Trento (ITALY)

More information

Probability and Random Variables. Generation of random variables (r.v.)

Probability and Random Variables. Generation of random variables (r.v.) Probability and Random Variables Method for generating random variables with a specified probability distribution function. Gaussian And Markov Processes Characterization of Stationary Random Process Linearly

More information

Efficient on-line Signature Verification System

Efficient on-line Signature Verification System International Journal of Engineering & Technology IJET-IJENS Vol:10 No:04 42 Efficient on-line Signature Verification System Dr. S.A Daramola 1 and Prof. T.S Ibiyemi 2 1 Department of Electrical and Information

More information

Mean-Shift Tracking with Random Sampling

Mean-Shift Tracking with Random Sampling 1 Mean-Shift Tracking with Random Sampling Alex Po Leung, Shaogang Gong Department of Computer Science Queen Mary, University of London, London, E1 4NS Abstract In this work, boosting the efficiency of

More information

Principal components analysis

Principal components analysis CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Problem of the Month: Cutting a Cube

Problem of the Month: Cutting a Cube Problem of the Month: The Problems of the Month (POM) are used in a variety of ways to promote problem solving and to foster the first standard of mathematical practice from the Common Core State Standards:

More information

DYNAMIC CHORD ANALYSIS FOR SYMBOLIC MUSIC

DYNAMIC CHORD ANALYSIS FOR SYMBOLIC MUSIC DYNAMIC CHORD ANALYSIS FOR SYMBOLIC MUSIC Thomas Rocher, Matthias Robine, Pierre Hanna, Robert Strandh LaBRI University of Bordeaux 1 351 avenue de la Libration F 33405 Talence, France simbals@labri.fr

More information

Solutions to Exam in Speech Signal Processing EN2300

Solutions to Exam in Speech Signal Processing EN2300 Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.

More information

Artificial Neural Network for Speech Recognition

Artificial Neural Network for Speech Recognition Artificial Neural Network for Speech Recognition Austin Marshall March 3, 2005 2nd Annual Student Research Showcase Overview Presenting an Artificial Neural Network to recognize and classify speech Spoken

More information

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89 by Joseph Collison Copyright 2000 by Joseph Collison All rights reserved Reproduction or translation of any part of this work beyond that permitted by Sections

More information