Jun Yin, Ye Wang and David Hsu ABSTRACT Prompt feedback is essentia for beginning vioin earners; however, most amateur earners can ony meet with teachers and receive feedback once or twice a week. To hep such earners, we have attempted an initia design of Digita Vioin Tutor (DVT), an integrated system that provides the much-needed feedback when human teachers are not avaiabe. DVT combines vioin audio transcription with visuaization. Our transcription method is fast, accurate, and robust again noise for vioin audio recorded in home environments. The visuaization is designed to be intuitive and easiy understandabe by peope with itte music knowedge. The different visuaization modaities video, 2D fingerboard animation, 3D avatar animation hep earners to practice and earn more effectivey. The entire system has been impemented with off-the-shef hardware and shown to be practica in home environments. In our user study, the system has received very positive evauation. Categories and Subject Descriptors H.5. [Mutimedia Information Systems] Animations H.5.5 [Sound and Music Computing] Signa anaysis, synthesis, and processing, Systems Genera Terms Agorithms, Design, Experimentation, Human Factors Keywords Music transcription, vibrato detection, score aignment, animation, vioin tuner. INTRODUCTION For beginners earning to pay a musica instrument, prompt feedback is essentia. However, constrained by time and financia resources, most amateur earners can ony afford to meet with teachers and receive feedback once or twice a week. In the rest of the time, they must practice on their own. Practice becomes a dirty word: the ack of feedback makes practice boring and Permission to make digita or hard copies of a or part of this work for persona or cassroom use is granted without fee provided that copies are not made or distributed for profit or commercia advantage and that copies bear this notice and the fu citation on the first page. To copy otherwise, or repubish, to post on servers or to redistribute to ists, requires prior specific permission and/or a fee. MM 05, November 6, 2005, Hiton, Singapore. Copyright 2005 ACM -583-893-8/04/000 $5.00. frustrating, which eads to sow progress. This issue is accentuated for chidren, who need more guidance, whie their parents, who often do not pay the instrument, ack the knowedge to hep. To address this critica issue, we have attempted the initia design of a system caed Digita Vioin Tutor (DVT) to hep beginning vioin earners, in particuar, young chidren, aong with their parents. In the absence of human teachers, DVT performs basic, yet key activities of a vioin esson, incuding heping to tune the vioin, pointing out the mistakes in earner s pay, and demonstrating the correct pay. This is achieved by combining audio, video, and 3D animation to form an effective feedback oop (Figure ). By expoiting the synergy of a muti-moda system, DVT attempts to offer beginning vioin earners a stimuating earning environment that increases their interest. (a) (c) Figure : The user interface of Digita Vioin Tutor. (a) Video of teacher s pay. (b) 3D avatar animation. (c) Visuaization of mistakes in earner s pay. (d) 2D fingerboard animation. The key design objective of DVT is effective and immediate feedback to beginning vioin earners practicing at home. To achieve this, first, we have designed a system modue for transcribing vioin audio with high accuracy and speed. It is aso robust against noise, which is common in audio recorded with ow-quaity microphones in home environments. Second, to identify mistakes in earner s pay, another system modue compares the transcribed music notes against the score or teacher s pay, and the resuts are given to the earner through effective visuaization. The visuaization of mistakes is designed to be intuitive and easiy understandabe by peope with itte music knowedge, e.g., parents who wish to hep their chidren. Our experience with the system indicates that this combination of audio anaysis and mistake visuaization is natura and effective. Finay, the entire system is impemented with off-the-shef (b) (d)
hardware basicay, a PC and a cheap microphone so that it can be easiy made avaiabe in home environments. In our user study, the system has received very positive evauation. DVT has severa advantages over existing educationa toos for beginning earners, such as audio or video recordings of teacher s pay. Most importanty, recordings do not provide any expicit feedback. Without feedback, beginning earners often ack the knowedge to find their own mistakes. Even for the simpe task of tuning the vioin, athough the traditiona aid, a tuning whiste, generates a sound at the correct pitch for each string, many beginners cannot hear the difference between the correct pitch and that produced by the string being tuned. Visuaization of the difference greaty simpifies the task. We woud ike to emphasize that DVT is not intended to repace human teachers. Instead, it fufis a critica need of beginning vioin earners by providing basic, yet effective feedback, when human teachers are not avaiabe. In this work, we have chosen vioin, because it is an important and popuar instrument; however, most of the previous work [3][0][3] is devoted to keyboard instruments, in particuar, piano. We are not aware of simiar systems for vioin. Athough we focus on vioin, the main design considerations appy to other reated string instruments as we. The paper is organized as foows. Section 2 reviews previous work. Section 3 gives an overview of DVT. Sections 4 7 describe the main system modues. Section 8 presents the system evauation based on a user study. Section 9 concudes with possibe improvements on the system in the future. 2. PREVIOUS WORK Many computer systems have been proposed to assist music education. Piano Tutor [3] is a comprehensive system for piano instruction. It performs score-tracking and error anaysis based on MIDI input and draws upon an expert system to coordinate the presentation of essons from a database. PianoFORTE [3] is another piano instruction system, which aims at enhancing the earner s abiity in music interpretation. It dispays the progression of earner s pay on a music score. This dispay format assumes that the earner is proficient in reading music notations and is thus ony suitabe for more advanced earners and not for beginners or parents who have itte music knowedge. A more recent system, Famiy Ensembe [0], tries to increase chidren s interest in practicing piano by providing duo pay. It uses score-tracking to generate accompaniment automaticay and heps a parent to pay a MIDI keyboard together with a chid. Most of the existing systems, incuding the ones described above, are devoted to keyboard instruments, e.g., piano. They usuay use a MIDI keyboard for input and circumvent the difficut probem of music transcription. Unfortunatey, this approach does not work for string instruments such as vioin: the equivaent of a MIDI keyboard does not seem to exist for vioin, especiay in a home environment. To sove this probem, our DVT system uses a fast and accurate music transcription method to dea with acoustic music input directy. In comparison with Piano Tutor, DVT is not a comprehensive instruction system. It targets beginners, especiay chidren, aong their parents, by providing effective visua feedback. The visuaization is easiy understandabe by peope who have itte music knowedge. 3. SYSTEM OVERVIEW After consuting with a professiona vioin teacher and our coaborators from our university s Conservatory of Music, we determined three main functions for the initia design of DVT: () heping to tune the vioin, (2) pointing out the mistakes in earner s pay, and (3) demonstrating the correct pay. These are deemed the most basic and most important activities during a vioin esson for beginners. To perform these functions, DVT consists of severa interconnected system modues (Figure 2): The transcriber transcribes audio input from earner s pay into a note tabe. Each tabe entry corresponds to a note with information on its pitch, oudness, onset, and duration. An entry may aso contain information on paying styes, such as vibrato, i.e., sma cycic change in pitch. The performance evauator tries to identify the mistakes in earner s pay by comparing the transcribed notes with either the score or transcribed teacher s pay. It then visuaizes the mistakes and provides feedback to the earner. The tuner uses the transcriber and simpe visuaization to hep the earner tune the vioin. The animator demonstrates the correct pay through both 2D fingerboard animation and 3D avatar animation. The animation is driven directy by a score or by a note tabe transcribed from teacher s pay. teacher s audio transcriber teachers s notes animator earner s audio transcriber earner s notes 3-D avatar animation 2-D fingerboard animation Figure 2: System diagram. tuner performance evauator teacher s or student s video DVT can aso show the teacher s demonstration video aong with the animations. The different modaities video, 2D fingerboard
animation, and 3D avatar animation hep the earner to practice and earn more effectivey. 4. TRANSCRIBER The transcriber is an essentia component of our system. It enabes DVT to dea with acoustic signas from microphones or CD recordings, rather than symboic signas from MIDI keyboards, which do not seem to exist for vioin. 4. Overview Music transcription is difficut. Despite decades of research and significant progress (e.g,. [8][9][]), reiabe transcription remains a chaenge in practica appications. Recent work on transcription focuses on transcribing commercia CD music, which is recorded professionay. However, DVT must process audio input recorded with ow-quaity microphones in home environments. Such audio input is typicay noisy. Furthermore, beginners tend to make a variety of mistakes, resuting in audio signa with irreguar patterns. A these make it chaenging to deveop a fast and accurate transcriber for our system. For our particuar appication, we have focused on three design objectives for the transcriber: accuracy, robustness against noise, and speed. Accuracy is important, because an inaccurate transcriber cannot be effective in providing feedback to earners. Robustness against noise is important, because sound recorded with ow-quaity microphones in a home environment is usuay noisy. Speed is important, because earners are unikey to be wiing to wait ong. To be usefu, the feedback must be amost instantaneous. A bock diagram of our transcriber is shown in Figure 3. The input is PCM-encoded acoustic signa, and the output is a note tabe which contains information on the pitch, oudness, onset, and duration of the notes payed. The upper path in Figure 3 shows the basic transcription process. In each time frame, the transcriber first converts the PCM signa to the frequency domain using FFT. It then maps the FFT spectrum to an energy-based semitone spectrum [5] (Figure 4) and estimates the pitch and oudness based on the harmonic structure of vioin sound. The transcriber then performs onset detection and duration estimation. An entry of the note tabe is then generated for each note detected. The ower path in Figure 3 shows the steps of our high-resoution pitch estimation in DVT. More detais of our transcriber are described in the foowing subsections. wave fie FFT spectrum semitone spectrum high-resoution pitch estimation pitch & oudness estimation vibrato detection Figure 3: Bock diagram of the DVT transcriber. onset detection note tabe 4.2 Accurate, Robust, and Fast Transcription Our transcriber uses the idea of harmonic-structure tracking, simiar to the agorithm described in [9], but extends the idea by taking advantage of specia characteristics of vioin sound. These specia characteristics are the key to soving our transcription probem accuratey and efficienty. Our transcriber makes three assumptions. First, it imits the pitch range between G3 and G6. Second, the harmonic structure of vioin sound has the property that the energy of the sound is concentrated at the fundamenta and the first five harmonics at 2, 9, 24, 28 and 3 semitones away from the fundamenta. Finay, we imit the transcription to monophonic vioin sound ony. In genera, these assumptions hod we for beginning earners. For pitch and oudness estimation, the transcriber first converts the FFT spectrum of vioin sound into the semitone spectrum. The conversion has severa advantages. In contrast to the inear scae of the FFT spectrum, the semitone spectrum is a ogarithmic scae, which is commony used in musica notation and matches cosey with human pitch perception. The conversion reduces the search space for pitch estimation, resuting in faster transcription speed. It aso increases noise robustness, because the energy in semitone bands is far more stabe than the individua magnitude in the FFT domain. After the conversion, we perform the pitch and oudness estimation in the semitone domain. Our testing shows that the semitone band energy spectrogram is robust in the presence of noise and other hostie factors, such as missing fundamenta or unstabe harmonic structure, both of which are typica for vioin sound. (a) (b) Figure 4: (a) FFT spectrum. (b) Semitone band energy spectrum. After pitch and oudness estimation, the transcriber performs onset detection. Onset detection for vioin sound is more difficut than that for other instruments (e.g., percussion instruments). The onset of vioin sound varies significanty from a soft one to a reativey hard one. To tacke this probem, we first normaize the oudness to account for variations in oudness of sound recorded in different environments and then use three criteria for onset detection. If any of the criteria is met, we consider it a vaid candidate for an onset. The first criterion detects hard onset. It estimates the derivative of the oudness function using a finite difference method. It finds an onset if the derivative is above a constant threshod. The second criterion detects soft onset, which occurs frequenty in vioin sound. It finds an onset if the oudness is above a constant threshod. The third criterion checks the oudness change in a sma surrounding window to distinguish two consecutive notes of the same pitch.
These three situations are iustrated in Figure 5. Finay, a note tabe is generated. See Tabe for an exampe. our transcriber to achieve rea-time performance. The transcription speed is shown in Figure 6. oudness 2 oudness soft onset (above a threshod) hard onset (high derivative) time Transcription Time (sec) 0 8 6 4 2 0 0.00 3.00 6.00 9.00 2.00 5.00 8.00 2.00 24.00 27.00 30.00 33.00 threshod time Music Length (sec) oudness threshod Figure 5: Three criteria for onset detection. Onset (O) Duration (D) Pitch (P) Loudness (L) 42 3 24 0.799 60 23 20.237 82 22 5 2.898 Tabe : An exampe basic note tabe. 4.3 Performance of the DVT Transcriber We tested our transcriber with 5 vioin pieces, of which 3 are professiona recordings and 2 are ordinary home recordings. Three criteria are used to evauate the accuracy of our transcriber: # of correcty detected notes P = precision = # of tota detected notes # of correcty detected notes R = reca = # of notes in the input music 2PR F = P + R 0 F 60% The test resuts (Tabe 2) show that by taking advantage of the specific characteristics of vioin sound, our transcriber performs much better than commercia software for genera-purpose music transcription. Transcriber Precision Reca F DVT 0.97 0.99 0.98 Amazing MIDI v.70 [6] 0.23 0.96 0.37 inteiscore v5. [7] 0.29 0.94 0.44 Tabe 2: Transcription accuracy. consecutive notes (high peak-vaey ratio) time Our transcriber is not ony accurate, but aso fast. The combination of searching in the semitone domain and tracking ony the first six most significant frequency components enabes Figure 6: Transcription speed. 4.4 High-Resoution Pitch Estimation and Vibrato Detection Vibrato is a distinct characteristic of vioin performance. It is not aways marked in music scores. Learners must earn by istening to teacher s pay. To hep beginners to easiy find vibrato in teacher s pay, DVT uses high-resoution pitch estimation to detect vibrato and visuaize the resuts. Vibrato is defined as sma osciations in pitch and/or oudness of a musica tone []. The osciations in pitch are usuay smaer than haf of a semitone away from the centra pitch. In order to detect the osciations and thus the vibrato, semitone eve resoution pitch estimation described in Section 4.2 is not adequate. Pitch estimation for detecting the osciations requires the resoution at the cent eve, defined as /00 of a semitone. There are severa methods to increase pitch estimation resoution, e.g., enarging FFT window and peak picking [6]. However, a arge FFT window reduces not ony the transcription speed, but aso the time resoution. Peak picking is found to be unreiabe, due to the irreguarity of vioin harmonic structure, especiay for sound recorded in a home environment. DVT uses a dynamic inear interpoation method, which offers good tradeoff between increased resoution in pitch estimation and transcription speed. Denote Y[0..4095] as the 4096-point FFT ampitude spectrum obtained during our semitone eve pitch estimation. Denote Y(c) as a function that returns the ampitude of frequency at cent c in the spectrum. If c is non-integer, Y(c) returns the inear interpoated vaue from Y[0..4095]. We estimate the cent of the pitch by the formua beow: c = arg max Y ( k i) i 6 k= c min i c For each cent from the owest to the highest, we compute the sum of the ampitudes for its fundamenta and the first five harmonics, and choose the one with argest sum as the estimated pitch. The principe of our agorithm is iustrated in Figure 7. Again, we have expoited the harmonic structure of vioin sound in our higresoution estimation. max.
ampitude FFT spectrum samping point Linear interpoating point frequency sound from a microphone and dispays it in rea time. The earner chooses a string to be tuned from the user interface and bows the corresponding string near a microphone. The tuner sampes the audio from the microphone at the rate of 22 khz and creates the corresponding spectrum using non-overapping 4096-sampe FFT in order to satisfy the rea-time requirement. The pitch is then estimated with the high-resoution method described in Section 4.4. The tuner dispays the detected pitch continuousy (Figure 9). The darker coor (bue) indicates that the pitch is too high. The ighter coor (green) indicates that the pitch is too ow. fundamenta 5 th harmonic Figure 7: Linear interpoation for the fundamenta and first fifth harmonics. The estimated cents of a frames of a note form a pitch trajectory. We then use minima-maxima detection [2] to find vibrato. In our impementation, vibrato is considered detected if the foowing two criteria are both satisfied: a minimum of three osciation cyces; a minimum average osciation ampitude. In our current system, we assume at most one vibrato for each note. This resuts in four parameters for each detected vibrato: starting frame, duration, strength (frequency shift), and vibrato frequency (the number of osciation cyces in a unit time). Our vibrato transcriber outputs a note tabe with vibrato parameters. See Tabe 3 for an exampe. DVT visuaize the notes and vibratos (Figure 8) in distinct ways, so that earners can easiy see what and how notes are payed. O D P L Vibrato Vibrato Vibrato Vibrato Start Length Strength Frequency 296 50 5 0.40000 0 0 0 0.000000 49 64 5 0.40625 0 0 0 0.000000 (a) 5. TUNER Tabe 3: An exampe vibrato note tabe. (b) Figure 8: (a) Visuaized notes. (b) Visuaized vibratos. The DVT tuner uses the transcriber and simpe visuaization to hep the earner tune the vioin. It detects the pitch of the vioin Figure 9: The tuner user interface. Our pitch visuaization differs from traditiona techniques in a simpe, but important way. Whie most traditiona techniques, which are neede-based, show ony the instantaneous estimated pitch, our visuaization shows a short history of the estimated pitch as we. This simpe feature, somewhat unexpectedy, seemed to reduce tuning time and received enthusiastic wecome in our user study. Our conjecture is that the pitch history is used by the earner to predict the future pitch and avoid overshooting the target. This is simiar to the idea of ook-ahead in contro theory. We pan to carry out controed experiments in the future to confirm this observation. 6. PERFORMANCE EVALUATOR After transcribing earner s pay into a note tabe, DVT compares it with a note tabe based on a score or one transcribed from teacher s pay, in order to find the earner s mistakes. 6. Note Aignment Our note tabe is basicay an annotated ist of notes. It is we known that simpe Eucidian distance is not usefu for comparing two ists of notes, as the earner may pay at a different starting time and with a different tempo. To find the earner s mistakes in a meaningfu way, we must first aign the notes in the two tabes first. Our note aignment is based on the dynamic programming (DP) agorithm, simiar to those used for score tracking [2][0]. It performs an initia aignment using ony pitch information and then adjusts the tempo to produce the fina aignment. In the DP agorithm, we define three types of mistakes in earner s pay:
Mismatch. The pitch of the note that the earner pays is different from that of the note that the teacher pays. Insertion. The earner pays an additiona note that shoud not be payed. Deetion. The earner misses a note that shoud be payed. Each type of mistake is given a fixed cost. Define S[..m] as the pitch ist of the earner s notes and T[..n] as the pitch ist of the teacher s notes. A DP aignment matches notes in S[..m] with those in T[..n]. Define V[i,j] as the cost for an optima aignment between S[.. and T[.., one with the minimum cost. To initiaize DP, we set V[0,0] = 0 V[ i,0] = V[ i,0] + cost( insertion) V[0, j] = V[0, j ] + cost( deetion) aignment resut shows that the earner pays an addition note, the first one and that the ast note is wrong in duration. before the aignment The DP recurrence is given by V [ i, j ] if S[ i ] = T[ j ] V[ i, j ] + cost( mismatch) if S[ i ] T[ j ] V[ i, j] = min V [ i, j] + cost( insertion) V [ i, j ] + cost( deetion) After computing V[m,n], we obtain an optima aignment between the earner s and the teacher s pitch ists, and the pitch mistakes in the earner s pay can thus be easiy identified. Next, we further aign the two note tabes using timing information to find mistakes in tempo. From the DP aignment, we obtain a ist of notes payed at the correct pitch and use the onset information of these notes as anchors for tempora aignment. Let X[..] and Y[..] be the onset ists of the correcty payed notes in the earner s and the teacher s note tabes, respectivey. We try to find the tempo ratio a and starting time shift b so that the sum of square error for onset e is minimized: e = ( Y[ ( a X[ + b)) This can be obtained by standard inear regression: a = b X[ 2 X[ X[ 2 X[ Y[ Y[ Finay, we stretch the earner s pay by a and shift it by b to produce the fina aignment. The aignment resut is visuaized in an intuitive way (Figure 0). Every note payed is marked by a short bar. The vertica position of the bar indicates the pitch. The horizonta position and the ength of the bar indicate the start time and the duration of the note, respectivey. The dark coor (bue) corresponds to teacher s pay. The medium coor (green) corresponds to earner s pay. The overapping parts, marked by a ight coor (yeow), correspond to correct pay by the earner. In our user study, we have found that this dispay format is easiy understandabe by peope who cannot read the usua music notation. Figure 0 shows an exampe. The teacher pays five notes in the E major scae with the ast one doubed in ength. The earner pays six notes of same ength. The after the aignment Figure 0: An exampe aignment resut. 6.2 Summary Evauation of Learner s Pay After the aignment, we have three aspects of information on the earner s pay: Pitch error rate, incuding mismatch, insertion and deetion. Tempo ratio b and starting time offset a. Onset offset of E [ = Y[ ( a X[ + b) for each note i. For the first criterion, we use precision, reca, and F measure to evauate the note error rate. For the second criterion, starting time offset a is not critica, because it is often due to sience at the beginning of a recording. However, tempo ratio b must be considered, because the earner shoud not pay the piece too sowy or too fast. A Gaussian function is used to score the tempo ratio. The third criterion, notes onset offset, refects the steadiness of tempo. A Gaussian function of average note square error is used. The overa performance score is defined as a weighted sum of the above three component scores. 7. ANIMATOR The DVT animator attempts to demonstrate the correct pay to the earner in an interactive and entertaining way. Traditionay, in the absence of human teachers, the earner usuay reies on video recordings, which offer no feedback and interactivity. The earner must watch the video from a fixed viewing ange, the one that is recorded, and cannot zoom in to watch more cosey, e.g., the finger movements. In comparison, the DVT animator is driven by a note tabe. The earner can choose any part of the tabe to be payed and zoom in to watch more cosey different aspects of the pay, e.g., fingering or bowing. We, however, do recognize the imitation of 3D animation. Despite the significant progress in the ast decade, fuy automatic 3D animation is sti not good enough to generate natura-ooking human motion. As a resut, DVT provides an option to show video recordings, when needed.
7. 3D Avatar Animator Whie paying, a vioinist executes intricate motions of fingers, hands, and arms. The eft hand shifts among different positions aong the neck of the vioin; the fingers press the strings and contro the pitch of the sound produced. Simutaneousy the right arm moves the bow back and forth on the strings to produce the sound. Producing this coordinated motion reaisticay is a chaenging task, due to the compexity of human anatomy. From the shouder to the tips of fingers, each imb consists of many joints with neary 30 degrees of freedom (dofs), some of which are interdependent (Figure ). To pay the desired music notes, a the joints have to move in a coordinated fashion to pace the fingertips at the intended positions on the strings. The difficuty of synthesizing human motion has attracted strong interest (see, e.g., the references in [5]). There is aso recent work on animating guitar and vioin paying. Observing the compexity of motion in paying music instruments, previous work attempts to capture this compexity using sophisticated modes such as neura networks [7] and minimization of a goba cost function [4]. In contrast, we beieve that despite the intricacy required of vioin paying, the range of motion executed is fairy restricted. A simpe procedura mode can capture the motion effectivey without much sacrifice in visua reaism. The key observation is that to a reasonabe approximation, every note has a unique posture of the eft hand and arm for producing the note. Given the pitch of a note to be payed, we can determine the string for paying the note, the finger, and the fingertip position. This one-to-one reationship is sufficient for beginners and aows us to pre-compute a database of hand postures for a the notes by soving inverse kinematics (IK) and interpoate between the pre-computed postures to synthesize the continuous motion. The bowing motion of the right arm is synthesized in a simiar way, but the arm postures are computed on the fy rather than pre-computed and stored in a database. Finay, to render the 3D avatar on the screen, the fingers, hands, and arms are a controed by a ist of joint-ange parameters on a skeeta mode. See [4] for detais. Figure : Modeing the dofs in the arm and the hand. 7.2 2D Fingerboard Animator Athough 3D avatar animation is vivid and entertaining, it is sometimes more effective to show an iconography of the fingerboard (Figure 2). At the suggestion of our coaborators from the conservatory, we have incorporated in DVT a modue for 2D fingerboard animation, which ceary demonstrates when, which string, and where on the string a finger shoud press. The animation is again driven by a note tabe. During the animation, we change the coor of the string and highight the finger position on the string for the note being payed. An optiona Suzuki band can aso be dispayed to imitate the fingerboard of vioin often used by beginners. The 2D animation and 3D animation are synchronized by a note tabe and shown simutaneousy (Figure 2). Figure 2: 3D avatar animation (eft) and 2D fingerboard animation (right). 8. SYSTEM EVALUATION DVT is impemented in Microsoft Visua C++, and OpenGL is used for 3D graphics rendering. It runs on any Win32 operating system. In a of our tests, the DVT transcriber achieves rea-time performance (Figure 6). The performance evauator takes very itte time compared with the transcriber and returns the resut amost instantaneousy. The 3D animator generates animation at the rate of 25 frames/second using an ordinary graphics card. 8. System Evauation Method We conducted a piot user study and used 2 subjects to evauate our DVT system. Chidren: we have three chid vioin beginners. The first one is 5 years od with 7 months earning experience. The second one is years od with 2.5 years experience. The third one is 7 years od with year experience. Parents: we have three parents, the first chid s mother, the second and third chid s fathers. Amateurs: we have three amateur vioin payers. They are about 22 years od with an average 2 years of vioin experience. Conservatory students: we have two vioin students from the Conservatory of Music with and 4 years experience, respectivey. Teacher: we have one vioin teacher with 8 years of teaching experience, who is currenty teaching 20-30 students. The subjects used their own acoustic vioin. Since DVT is intended for home use, we used a PC with a Pentium III processor and a cheap microphone in a ordinary room for the study. To evauate DVT, we designed a questionnaire (Tabe 4). Each subject goes through the procedure outined beow, whie the parent of the subject (if avaiabe) observes the process how his/her chid uses the system.
) We give a brief introduction of the system to the subject. 2) We ask the subject to use the tuner to tune the four strings of the vioin. 3) We pay the 3D avatar and 2D fingerboard animations of a prerecorded piece of vioin music to the subject. We demonstrate the abiity to view the 3D avatar from different anges and to zoom and shift 2D fingerboard. 4) We ask the subject to pay a piece that he ikes. DVT transcribes the pay. 5) The subject ooks at the raw transcription resut (before note aignment), in both note mode and cent mode. We ask the subject if he can find any of his own mistakes. 6) The subject chooses to pay one of three pieces, E-Major Scae, Twinke Twinke Litte Star, and Lighty Row. DVT transcribes the subject s pay and compares it with pretranscribed teacher s pay. DVT aigns the transcribed note tabes and dispays the mistakes in the subject s pay. 7) The subject pays the same piece for two or three more times and observes whether the system gives a higher score, if the subjects improves the performance. Q Is it easy to use the tuner to tune the vioin? (-very difficut...5- very easy) Q2 Do you think 3D avatar is usefu and effective? (-disagree...5- agree) Q3 Is the 2D finger board hepfu to memorize the finger position sequence? (-useess...5-very hepfu) Q4 Is DVT abe to find out the mistakes you make during practice? (-not at a...5-yes) Q5 Is the score given by performance evauator reasonabe? (- no...5-yes) Q6 Does DVT ook fun and coo to pay with? (-boring...5- fascinating) Q7 Overa, do you think DVT is usefu? (-no...5-yes) Q8 Among the 4 modues (tuner, score, finger board, avatar) in the system, which is the most important and usefu? What s their rank? Why? (expanatory) Q9 How do you wish to improve DVT? Why? (expanatory) Tabe 4: Questionnaire for the user study. 8.2 Evauation Resuts 8.2. Overa Resuts The resuts of the first seven questions are isted in Tabe 5: Subject Q Q2 Q3 Q4 Q5 Q6 Q7 Conservatory 5 2 5 4 3 4 5 Conservatory 2 4 4 5 4 3 3 4 Amateur 5 3.5 4.5 5 3 3 4 Amateur 2 5 4 5 4 3 4 Amateur 3 4 2 4 4 4 4 Chid 2 2 4 5 5 3 Parent 4 3 3 4 3 5 5 Chid 2 5 5 5 5 5 5 5 Parent 2 5 3 5 5 5 4 5 Chid 3 5 5 5 5 5 5 5 Parent 3 5 5 5 5 5 5 5 Teacher 4 4 4 4 3 4 5 Average 4.42 3.2 4.29 4.58 3.67 4.7 4.50 Tabe 5: Resut of the first seven questions. The response to Q shows that the DVT tuner is effective and easy to use. Both earners and parents who have itte music knowedge find it easy and convenient to use the tuner. Our history-based visuaization is particuary wecomed. It enabed a five years od chid, who does not know how to tune the vioin, to successfuy tune his vioin for the first time by himsef. Previousy, he reied soey on the teacher to tune the vioin once a week, during vioin essons. Therefore, in most of the time during the week between two essons, the chid practices on the untuned vioin, which adversey affects the earner s progress. DVT tuner is of great vaue in these situations. From Q4, we observe that DVT is capabe of identifying the mistakes in earner s pay. The intuitive visuaization can ceary show the note progression. By comparing the earner s pay with the teacher s, the mistakes are visuaized ceary. This provides effective feedback to the earner so that he or she can take corrective actions. This feature is particuary wecomed by parents who cannot read musica notation. Q5 gets an average evauation of 3.67. In our step 7 of the experiment, we et the subject pay the same pieces severa times. If his performance improves, the overa score given by the system does increase as we. The numerica score, however, does not seem to be very attractive to earners. Further investigation is needed to reconsider the reevance of the three criteria used for scoring and how they are combined. The response to Q2 seems to suggest that our 3D avatar animation is not as usefu as we expected. Most subjects indicated that the 3D avatar cannot demonstrate the finger positions ceary. It seems that our 3D avatar impementation needs to be improved to achieve the desired reaism for effective earning. The 2D fingerboard animation, despite its simpicity, is quite effective. It highights the finger position as we as the active string and has received very positive responses from the users. See the responses to Q3. We aso had a case that an amateur earner stopped paying vioin for severa years. With the hep of our fingerboard animation, he managed to pick up a designated piece in 30 minutes. Q6 and Q7 are overa scores. Q7 shows that the overa evauation of the DVT is we above the average of the evauations for the individua components (Q Q5). This suggests that what the DVT performs we is the most important ones for users. Our system is perceived to be fun and coo, especiay by the younger users (see responses to Q6). These resuts show that DVT can not ony provide effective feedback to improve the earning process, but aso stimuate the earner s interest in practicing and potentiay increase their average practicing time. 8.2.2 Cassified Anaysis Here we try to further anayze the reationships among the different types of earners and their opinions on the system. Most subjects think that the tuner and the mistake visuaization are usefu, regardess of their backgrounds. Younger subjects seem to ike the avatar animation more, whie oder ones consider the fingerboard animation is sufficient and more effective. Amost a the users beieve that DVT is abe to identify and visuaize the mistakes that they make in the practice.
DVT is a computer-based system. Those who do not know how to use computers, often demonstrate itte interest in DVT. In our study, the itte 5 years od chid is an exampe. However, oder chidren who ike computers or computer games are particuary interested in DVT. For conservatory students, their vioin performance eve is aready very high. As a resut, they find DVT is not particuary hepfu. They aso find vioin paying performance shoud be evauated in more ways than the three criteria in our current system. The vioin teacher and parents are very enthusiastic about DVT. They master the usage of DVT very quicky. They aso beieve that it is a very usefu too to hep their students or chidren earning to pay the vioin. As a whoe, we concude that DVT system is best suited for vioin beginners at the age of 0 or above, who know some famiiarity with computers. DVT is aso good for parents who itte knowedge in music or vioin, but wish to hep their chidren. 8.2.3 Wish List We got some vauabe feedback from the expanatory question Q9. They are grouped and isted beow. The first group of suggestions is about the graphic dispay of scores: ) To refect the pitch accuracy in the note eve score 2) To have a rea music score ike sheet music The second group of suggestions is about the avatar: ) To incude more avatar characters (such as spider man) into the system, so that the user can choose his favorite avatar Finay, we aso have another important suggestion from a parent. Most modues of the system provide visua feedback to audio inputs. However, it is aso important to provide audio feedback to the earners, to et them isten to the difference between teacher s paying and their paying. For exampe, after aigning the teacher s score and the earner s one, the system can generate MIDI notes and pay them back aong with the earner s paying. MIDI notes are very accurate in pitch. By comparing the difference in sound of MIDI notes and the actua notes the earner payed, the earner can graduay earn how to hear the difference in pitch, and thus improve the abiity to adjust the finger positions just by hearing the sound. 8.2.4 Discussion In this piot study, we took an anecdota approach, and the resuts indicate the strong potentia of DVT as a usefu too for beginning vioin earners. However, systematic controed experiments are needed to further vaidate these resuts. Such experiments are in the panning. 8.3 Comparison with Other Systems Now we compare DVT with reated and recent music educationa systems, Famiy Ensembe [0] and pianoforte [3]. Famiy Ensembe is a music edutainment system, with which a parent and his/her chid can pay ensembe on a MIDI keyboard. Using a music database and score tracking techniques, wrong notes payed by the parent can be substituted to the correct ones automaticay by the system. Thus it enabes parents with itte or no experience in paying the piano to accompany the chid during the practice. PianoFORTE is a music education system which provides visua feedback of student s mistakes. Student s performance on a MIDI keyboard is recorded. Notes, tempo, dynamics and articuation are dispayed on the computer screen with graphic notations, which enabes the student and the teacher to review the performance and point out mistakes. A detaied comparison of our system with the other two systems is isted in Tabe 6. Item DVT Famiy Ensembe PianoFORTE Input Acoustic Signa MIDI MIDI Device Needed Cheap Microphone MIDI Keyboard MIDI Keyboard Score Database Optiona Required Not Required Tuner Yes No No Score Visuaization Yes No Yes Animation Yes No No Automatic Error Yes Yes No Identification Feedback Mutimoda Feedback N/A Limited Tabe 6: System comparison. Tabe 6 shows that the input of DVT is acoustic signa, not MIDI as in the other two systems. This makes our system advantageous due to the fact that it can be extended to other acoustic instruments, especiay the strings, therefore is not restricted to keyboard instruments. Famiy Ensembe requires a score database in order to track and correct the parent s paying. These scores must be pre-stored into the database to enabe the system to work on these specific pieces. DVT does not require pre-stored database. The teacher and the earner can just record and transcribe to get an anayzabe score. However optionay, we can incude some teaching materias with standard scores and ship them together with DVT. In comparison to the other two systems, a distinct feature of DVT is its mutimoda feedback to earners. The feedback of DVT is mainy visua, which is simiar to PianoFORTE. We beieve adding audio feedback is a usefu approach as used in Famiy Ensembe. 9. CONCLUSION AND FUTURE WORK We have presented the initia design of and experience with Digita Vioin Tutor, an integrated system for heping beginning vioin earners. DVT combines fast and accurate vioin audio transcription with intuitive visuaization to provide much-needed feedback to earners, when human teachers are not avaiabe. The
entire system has been impemented with off-the-shef hardware and shown to be practica in home environments. Athough we have chosen vioin as the instrument for our system, we beieve that the main design considerations generaize to other reated string instruments. DVT can be improved in many ways. Based on our user study, we have identified two most important ones as the immediate next step. First, incorrect bowing is a common mistake among beginners. It does not ead to wrong pitch or timing, which our current transcription method focuses on. To find bowing mistakes, we pan to extend our method to detect the difference in the timbre of the sound. Second, we pan to perform poyphonic transcription in order to make the system usefu for more advanced earners. Another important issue is to dea with transcription or note aignment errors, which may ead to incorrect feedback to the earner. We beieve that it is possibe to buid statistica modes of transcription and note aignment errors. DVT can then use such modes to provide the earner a confidence estimate of the performance evauation. 0. REFERENCES [] D. Bendor, M. Sander. Time Domain Extraction of Vibrato from Monophonic Instruments. Internationa Symposium on Music Information Retrieva, 2000. [2] R. B. Dannenberg. An onine-ine agorithm for rea-time accompaniment. Proc. ICMC 984, ICMA (984), 93-98. [3] R. B. Dannenberg, M. Sanchez, A. Joseph, P Cape, R. Joseph, R. Sau. A Computer-Based Muti-Media Tutor for Beginning Piano Students. Journa of New Music Research, 9(2-3), 990. [4] G. EKoura, K. Singh. Handrix: Animating the Human Hand, ACM Symp. Computer Animation, pp. 0-9, 2003. [5] P. Faoutsos, M. van de Panne, D. Terzopouos. Composabe Controers for Physics-based Character Animation. SIGGRAPH Conference Proceedings, pp. 25-260, 200. [6] B. Kedem. Spectra anaysis and discrimination by zerocrossings. Proceedings of the IEEE, 74(): 477-493, 986. [7] J. Kim, F. Cordier, N.M. Thamann. Neura Networks Based Vioinist s Hand Animation. Computer Graphics Internationa, pp. 37-4, 2000. [8] A. P. Kapuri. Automatic Transcription of Music. Proceedings of the Stockhom Music Acoustic Conference. 2003. [9] P. M. Martins, J. S. Ferreira. PCM to MIDI Transposition. Audio Engineering Society 2 th Convention Paper, May 2002. [0] C. Oshima, K. Nishimoto, M. Suzuki. Famiy Ensembe: A Coaborative Musica Edutainment System for Chidren and Parents. ACM Mutimedia 04, 2004. [] M. Piszczaski, B.A. Gaer. Automatic Music Transcription. Computer Music Journa, (4):24-3, 977. [2] S. Rossigno, X. Rodet, J. Soumagne, J. L. Coete, and P. Depae. Automatic characterization of musica signas: feature extraction and tempora segmentation. Journa of New Music Research, 999. [3] S. W. Smoiar, J. A. Waterworth, P. R. Keock. pianoforte: A System for Piano Education Beyond Notation Literary. ACM Mutimedia 95, 995. [4] J. Yin, A. Dhanik, D. Hsu, Y. Wang. The Creation of a Music-Driven Digita Vioinist. ACM Mutimedia 04, 2004. [5] J. Yin, T. Sim, Y. Wang, A Shenoy. Music Transcription Using an Instrument Mode. IEEE Internationa Conference on Acoustics, Speech, and Signa Processing, 2005. [6] Amazing MIDI, http://www.puto.dti.ne.jp/~araki/amazingmidi/index.htm [7] inteiscore, http://www.inteiscore.ne