
engin erzin
Associate Professor, Department of Computer Engineering
Ph.D. Bilkent University
http://home.ku.edu.tr/~eerzin
eerzin@ku.edu.tr

Engin Erzin's research interests include speech processing, multimodal signal processing, pattern recognition and human-computer interfaces. Prof. Erzin is a member of the Multimedia, Vision and Graphics Laboratory (MVGL), where he takes an active part in many national and international research projects.

E. Erzin. Improving Throat Microphone Speech Recognition by Joint Analysis of Throat and Acoustic Microphone Recordings. IEEE Transactions on Audio, Speech and Language Processing, 2009.

The speech processing research area, which covers the analysis, synthesis and recognition of speech signals, plays a key role in state-of-the-art digital speech communication and multimedia services. While Internet and wireless telephony are expected to remain among the most important applications for years to come, the use of speech processing applications, such as automatic speech recognition (ASR), text-to-speech synthesis (TTS), speaker identification/verification, and emotion and mood analysis from speech, is expected to increase in multimedia-rich scenarios.

M. E. Sargın, Y. Yemez, E. Erzin, and A. M. Tekalp. Analysis of Head Gesture and Prosody Patterns for Prosody-Driven Head-Gesture Animation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Multimodal signal processing refers to the combined processing of signals from multiple modalities such as speech, still images, video, and other sources. It plays a key role in the design of future human-computer interfaces and intelligent systems, such as intelligent vehicles.
The ultimate goal of human-computer interface research is to develop a machine that can identify humans, analyze and understand them from biometric input signals, and synthesize a human-like output in response, much as in human-to-human communication. The study of relations and correlations between signals of different modalities plays an important role in the effective use of multimodal information. Prof. Erzin's active research in multimodal signal processing includes speech/speaker recognition, body motion analysis, speech-driven face gesture analysis and synthesis, speaker animation, audio-driven body animation and driver behavior modeling. More details on Prof. Erzin's research activities and current research projects are available at http://mvgl.ku.edu.tr.

U. Bağcı and E. Erzin. Automatic classification of musical genres using inter-genre similarity. IEEE Signal Processing Letters, 2007.

H. E. Çetingül, E. Erzin, Y. Yemez, and A. M. Tekalp. Multimodal speaker/speech recognition using lip motion, lip texture and audio. Signal Processing, 2006.

E. Erzin, Y. Yemez, and A. M. Tekalp. Multimodal speaker identification using an adaptive classifier cascade based on modality reliability. IEEE Transactions on Multimedia, 2005.

Multimodal recognition system
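The reliability-driven fusion idea in the last citation can be illustrated with a small sketch. This is not the published cascade; it is a minimal late-fusion example, assuming each modality outputs per-class scores, and using the margin between the best and second-best scores as a hypothetical reliability measure (all function names are illustrative):

```python
def modality_reliability(scores):
    """Reliability proxy: margin between the best and second-best class
    scores; a confident modality separates the classes well."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

def fuse(score_sets):
    """Reliability-weighted late fusion of per-modality class scores.
    score_sets maps modality name -> {class: score}."""
    weights = {m: modality_reliability(s) for m, s in score_sets.items()}
    total = sum(weights.values()) or 1.0
    classes = next(iter(score_sets.values())).keys()
    fused = {c: sum(weights[m] / total * score_sets[m][c]
                    for m in score_sets)
             for c in classes}
    return max(fused, key=fused.get)
```

Here a confident audio classifier outvotes a hesitant lip classifier; in the actual system the reliability measure and the cascade structure are learned, not hand-set.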

Graduate Students

can yagli
M.S. Koç University, 2010

Can Yagli. Artificial bandwidth extension of speech using temporal clustering. Master's thesis, Koç University, 2010.

In this thesis, we investigate the artificial bandwidth extension problem, which aims to reconstruct the missing frequency content of wideband speech from narrowband speech. To solve the problem, we utilize the well-known source-filter representation of the human voice production system.

C. Yagli and E. Erzin. Artificial bandwidth extension using linear prediction within temporal clusters. Submitted to ICASSP'11, 2011.

ferda ofli
Ph.D. Koç University, 2010
Advisors: Murat Tekalp, Yücel Yemez, Engin Erzin

Ferda Ofli. Learning Statistical Music-to-Dance Mappings for Choreography Synthesis. PhD thesis, Koç University, 2010.

We propose many-to-many statistical mappings from music measures (music segments) to dance figures (dance segments) towards generating plausible music-driven dance choreographies. We assume that dance figure boundaries (dance segment boundaries) coincide with music measure boundaries (music segment boundaries).

F. Ofli, E. Erzin, Y. Yemez, and A. M. Tekalp. Multi-modal analysis of dance performances for music-driven choreography synthesis. In ICASSP'10, Dallas, USA, 2010.

F. Ofli, E. Erzin, Y. Yemez, A. M. Tekalp, A. T. Erdem, C. Erdem, T. Abaci, and M. Ozkan. Unsupervised dance figure analysis from video for dancing avatar animation. In ICIP'08, San Diego, USA, 2008.

F. Ofli, C. Canton-Ferrer, J. Tilmanne, Y. Demir, E. Bozkurt, Y. Yemez, E. Erzin, and A. M. Tekalp. Audio-driven human body motion analysis and synthesis. In ICASSP'08, Las Vegas, USA, 2008.

F. Ofli, Y. Demir, E. Erzin, Y. Yemez, and A. M. Tekalp. Multicamera audio-visual analysis of dance figures. In IEEE Int. Conf. on Multimedia & Expo (ICME 2007), 2007.

F. Ofli, Y. Demir, C. Canton-Ferrer, J. Tilmanne, K. Balcı, E. Bozkurt, I. Kızıloğlu, Y. Yemez, E. Erzin, A. M. Tekalp, L. Akarun, and A. T. Erdem. Çok bakışlı işitsel-görsel dans verilerinin analizi ve sentezi (analysis and synthesis of multiview audio-visual dance figures). In SIU'08, Didim, Turkey, 2008.
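The source-filter analysis underlying the bandwidth-extension work above rests on linear prediction. Below is a minimal sketch of LPC via the Levinson-Durbin recursion in plain Python; it is illustrative only, since a real ABE system operates on windowed frames and maps narrowband envelopes to wideband ones through a trained (e.g., temporally clustered) codebook:

```python
def autocorr(x, lag):
    # Biased autocorrelation estimate of the frame x at the given lag.
    return sum(x[i] * x[i - lag] for i in range(lag, len(x)))

def lpc(x, order):
    """Levinson-Durbin recursion: returns the all-pole ('filter')
    coefficients [1, a1, ..., ap] and the residual ('source') energy
    for one frame of speech samples x."""
    r = [autocorr(x, k) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        a = [a[j] + k * a[i - j] for j in range(i + 1)] + [0.0] * (order - i)
        err *= (1.0 - k * k)
    return a, err
```

For a decaying exponential x[n] = 0.9^n (the impulse response of a one-pole filter), a first-order analysis recovers a coefficient close to -0.9, i.e., the model 1/(1 - 0.9 z^-1).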

elif bozkurt
M.S. Koç University, 2010

Elif Bozkurt. Emotion recognition from speech. Master's thesis, Koç University, 2010.

We present formant position based weighted Mel frequency cepstral coefficient (WMFCC) features for the emotion recognition problem and compare performance results with commonly used feature sets. Since the line spectral frequency (LSF) features are positioned close to each other around formant frequencies, we propose a normalized inverse harmonic mean function to weight critical band energies in the extraction of MFCC features.

E. Bozkurt, C. Eroglu Erdem, T. Erdem, and E. Erzin. Formant position based weighted spectral features for emotion recognition. Submitted to Speech Communication, 2010.

E. Bozkurt, E. Erzin, C. Eroglu Erdem, and T. Erdem. Improving automatic emotion recognition from speech signals. In INTERSPEECH'09, UK, 2009.

F. Ofli, Y. Demir, C. Canton-Ferrer, J. Tilmanne, K. Balcı, E. Bozkurt, I. Kızıloğlu, Y. Yemez, E. Erzin, A. M. Tekalp, L. Akarun, and A. T. Erdem. Çok bakışlı işitsel-görsel dans verilerinin analizi ve sentezi (analysis and synthesis of multiview audio-visual dance figures). In SIU'08, Didim, Turkey, 2008.

emre öztürk
M.S. Koç University, 2010

Emre Öztürk. Driver status identification from driving behavior signals. Master's thesis, Koç University, 2010.

Driving behavior signals differ in how and under which conditions the driver uses vehicle control units, such as the pedals, steering wheel, etc. In this study we investigate how driving behavior signals differ among drivers and among different driving tasks.

E. Ozturk and E. Erzin. Driving status identification under different distraction conditions from driving behaviour signals. In 4th Biennial Workshop on DSP for In-Vehicle Systems and Safety, UTD, TX, USA, 2009.
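The role of the LSF-based weighting in Bozkurt's WMFCC features can be sketched as follows. The published weighting function is not reproduced here; this is a hypothetical normalized inverse-harmonic-mean weighting that only illustrates the underlying idea: critical bands whose centers lie near clustered LSFs (i.e., near formants) should receive larger weights.

```python
def lsf_band_weights(lsfs, band_centers, eps=1e-3):
    """Illustrative weighting: for each critical band center (Hz), take
    the harmonic mean of its distances to the LSF frequencies, invert
    it, and normalize. Bands close to clustered LSFs have a small
    harmonic-mean distance and thus a large weight."""
    weights = []
    for f in band_centers:
        inv_dist_sum = sum(1.0 / (abs(f - l) + eps) for l in lsfs)
        harmonic_mean = len(lsfs) / inv_dist_sum
        weights.append(1.0 / harmonic_mean)
    total = sum(weights)
    return [w / total for w in weights]
```

With LSFs clustered at 500 and 700 Hz, a band centered at 600 Hz gets far more weight than one at 3000 Hz, which is the formant-emphasis effect the thesis exploits.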

yasemin demir
Ph.D. student at University of California, Berkeley
M.S. Koç University, 2008

Yasemin Demir. Music-driven dance synthesis by multimodal dance performance analysis. Master's thesis, Koç University, 2008.

We present a framework for evaluating the correlation between audio features and dance figures for audio-visual analysis and synthesis of dance figures. Dance figures are performed synchronously with the musical rhythm.

Y. Demir, E. Erzin, Y. Yemez, and A. M. Tekalp. Evaluation of audio features for audio-visual analysis of dance figures. In EUSIPCO'08, Lausanne, Switzerland, 2008.

F. Ofli, C. Canton-Ferrer, J. Tilmanne, Y. Demir, E. Bozkurt, Y. Yemez, E. Erzin, and A. M. Tekalp. Audio-driven human body motion analysis and synthesis. In ICASSP'08, Las Vegas, USA, 2008.

F. Ofli, Y. Demir, E. Erzin, Y. Yemez, and A. M. Tekalp. Multicamera audio-visual analysis of dance figures. In IEEE Int. Conf. on Multimedia & Expo (ICME 2007), 2007.

F. Ofli, Y. Demir, C. Canton-Ferrer, J. Tilmanne, K. Balcı, E. Bozkurt, I. Kızıloğlu, Y. Yemez, E. Erzin, A. M. Tekalp, L. Akarun, and A. T. Erdem. Çok bakışlı işitsel-görsel dans verilerinin analizi ve sentezi (analysis and synthesis of multiview audio-visual dance figures). In SIU'08, Didim, Turkey, 2008.

emre sargın
MTS at Google
Ph.D. student at University of California, Santa Barbara
M.S. Koç University, 2006
Advisors: Murat Tekalp, Yücel Yemez, Engin Erzin

Emre Sargın. Audio-visual correlation modeling for speaker identification and synthesis. Master's thesis, Koç University, 2006.

This thesis addresses two major problems of multimodal signal processing using audiovisual correlation modeling: speaker recognition and speaker synthesis. We address the first problem, i.e., audiovisual speaker recognition, within an open-set identification framework, where the audio (speech) and lip texture (intensity) modalities are fused employing a combination of early and late integration techniques.

M. E. Sargın, Y. Yemez, E. Erzin, and A. M. Tekalp. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

M. E. Sargın, Y. Yemez, and A. M. Tekalp. Audio-visual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 9(7):1396-1403, November 2007.

M. E. Sargın, Y. Yemez, E. Erzin, and A. M. Tekalp. Prosody-driven head-gesture animation. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'07), 2007.
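The audio-visual synchronization paper cited above is based on canonical correlation analysis. For scalar per-frame features, canonical correlation reduces to ordinary Pearson correlation, so the synchronization idea can be sketched as a lag search between, say, an audio-energy track and a lip-motion track (a crude stand-in for CCA, with all names hypothetical):

```python
import math

def corr(x, y):
    # Pearson correlation of two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def shifted_corr(audio, visual, d):
    # Correlate audio[t] with visual[t + d].
    if d >= 0:
        return corr(audio[:len(audio) - d], visual[d:])
    return corr(audio[-d:], visual[:d])

def best_lag(audio, visual, max_lag=5):
    """Return the lag (in frames) maximizing audio-visual correlation,
    i.e., the estimated synchronization offset."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda d: shifted_corr(audio, visual, d))
```

If the visual track is the audio track delayed by two frames, the search recovers a lag of 2; the CCA formulation generalizes this to multi-dimensional feature vectors by first projecting each modality onto maximally correlated directions.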

ulaş bağcı
Ph.D. student at University of Nottingham, UK
M.S. Koç University, 2005

Ulaş Bağcı. Boosting classifiers for automatic music genre classification. Master's thesis, Koç University, 2005.

Music genre classification is an important tool for music information retrieval systems and has been finding important applications in various media platforms. Two important problems of automatic music genre classification are feature extraction and classifier design.

U. Bağcı and E. Erzin. Automatic classification of musical genres using inter-genre similarity. IEEE Signal Processing Letters, 14(8):521-524, August 2007.

U. Bağcı and E. Erzin. Boosting classifiers for music genre classification. In 20th International Symposium on Computer and Information Sciences (ISCIS 2005), Berlin, 2005.

U. Bağcı and E. Erzin. Müzik türlerinin sınıflanmasında benzer kesişim bilgileri uygulamaları (applications of inter-genre similarity in music genre classification). In SIU 2006, Antalya, 2006.

ertan çetingül
Ph.D. student at Johns Hopkins University, Baltimore
M.S. Koç University, 2005
Advisors: Murat Tekalp, Engin Erzin, Yücel Yemez

Ertan Çetingül. Discrimination analysis of lip motion features for multimodal speaker identification and speech-reading. Master's thesis, Koç University, 2005.

In this thesis a new multimodal speaker/speech recognition system that integrates audio, lip texture, lip geometry, and lip motion modalities is presented. There have been several studies that jointly use audio, lip intensity and/or lip geometry information for speaker identification and speech recognition applications.

H. E. Çetingül, E. Erzin, Y. Yemez, and A. M. Tekalp. Multimodal speaker/speech recognition using lip motion, lip texture and audio. Signal Processing, Special Section: Multimodal Human-Computer Interfaces, 86:3549-3558, December 2006.

H. E. Çetingül, E. Erzin, Y. Yemez, and A. M. Tekalp. Discriminative analysis of lip motion features for speaker identification and speech-reading. IEEE Transactions on Image Processing, 15:2879-2891, October 2006.

H. E. Çetingül, E. Erzin, Y. Yemez, and A. M. Tekalp. Robust lip-motion features for speaker identification. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Philadelphia, March 2005.
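The inter-genre similarity idea in the Bağcı-Erzin papers above can be caricatured as a two-stage decision: a first classifier picks a genre, and samples falling near the boundary between two easily confused genres are deferred to a dedicated second-stage classifier trained on that confusable region. The sketch below uses a hypothetical nearest-centroid first stage to show only the control flow, not the published boosting method:

```python
import math

def dist(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_genre(feature, centroids, margin=0.1, refine=None):
    """First-stage nearest-centroid decision; if the two closest genre
    centroids are nearly equidistant (an 'inter-genre similarity'
    region), defer to the second-stage classifier `refine`."""
    ranked = sorted((dist(feature, c), g) for g, c in centroids.items())
    (d1, g1), (d2, g2) = ranked[0], ranked[1]
    if refine is not None and (d2 - d1) < margin * d1:
        return refine(feature, g1, g2)  # resolve the confusable pair
    return g1
```

Confident samples are decided immediately; ambiguous ones trigger the refinement step, which in the published work is a classifier trained specifically to separate the confusable genre pair.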

alper kanak
TUBITAK-UEKAE
M.S. Koç University, 2004
Advisors: Murat Tekalp, Engin Erzin, Yücel Yemez

Alper Kanak. Multimodal speaker identification with audio-video processing. Master's thesis, Koç University, 2004.

In this thesis we present a multimodal text-dependent speaker identification system. The objective is to improve the recognition performance over conventional unimodal or bimodal schemes.

A. Kanak, E. Erzin, Y. Yemez, and A. M. Tekalp. Speaker identification using multimodal audio-video processing. IEEE Int. Conf. on Image Processing, 2003.

A. Kanak, E. Erzin, Y. Yemez, and A. M. Tekalp. Joint audio-video processing for biometric speaker identification. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2003.