Availability of Artificial Voice for Measuring Objective QoS of CELP CODECs and Acoustic Echo Cancellers

Similar documents
A Comparison of Speech Coding Algorithms ADPCM vs CELP. Shannon Wichman

Analog-to-Digital Voice Encoding

QoS Mapping of VoIP Communication using Self-Organizing Neural Network

A TOOL FOR TEACHING LINEAR PREDICTIVE CODING

Subjective SNR measure for quality assessment of. speech coders \A cross language study

Introduction and Comparison of Common Videoconferencing Audio Protocols I. Digital Audio Principles

MPEG Unified Speech and Audio Coding Enabling Efficient Coding of both Speech and Music

Digital Speech Coding

Voice Communication Package v7.0 of front-end voice processing software technologies General description and technical specification

Telephone Speech Quality Standards. for. IP Phone Terminals (handsets) CES-Q September 30, 2004

White Paper. PESQ: An Introduction. Prepared by: Psytechnics Limited. 23 Museum Street Ipswich, Suffolk United Kingdom IP1 1HN

Monitoring VoIP Call Quality Using Improved Simplified E-model

Figure1. Acoustic feedback in packet based video conferencing system

White Paper. ETSI Speech Quality Test Event Calling Testing Speech Quality of a VoIP Gateway

HEAD acoustics. Standards on Audio Quality - from a system-level view. H. W. Gierlich HEAD acoustics GmbH.

Perceived Speech Quality Prediction for Voice over IP-based Networks

Voice---is analog in character and moves in the form of waves. 3-important wave-characteristics:

Audio Coding Algorithm for One-Segment Broadcasting

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Simple Voice over IP (VoIP) Implementation

Advanced Speech-Audio Processing in Mobile Phones and Hearing Aids

AC : MULTIMEDIA SYSTEMS EDUCATION INNOVATIONS I: SPEECH

Understanding the Transition From PESQ to POLQA. An Ascom Network Testing White Paper

TRAFFIC MONITORING WITH AD-HOC MICROPHONE ARRAY

Evolution from Voiceband to Broadband Internet Access

Speech Compression. 2.1 Introduction

Implementing an In-Service, Non- Intrusive Measurement Device in Telecommunication Networks Using the TMS320C31

L9: Cepstral analysis

Objective Speech Quality Measures for Internet Telephony

Audio Engineering Society. Convention Paper. Presented at the 129th Convention 2010 November 4 7 San Francisco, CA, USA

The Optimization of Parameters Configuration for AMR Codec in Mobile Networks

UNBIASED ASSESSMENT METHOD FOR VOICE COMMUNICATION IN CLOUD VOIP-TELEPHONY

Introduction to Packet Voice Technologies and VoIP

A STUDY OF ECHO IN VOIP SYSTEMS AND SYNCHRONOUS CONVERGENCE OF

Speech Signal Processing: An Overview

Tech Note. Introduction. Definition of Call Quality. Contents. Voice Quality Measurement Understanding VoIP Performance. Title Series.

Voice Quality Planning for NGN including Mobile Networks

Adjusting Voice Quality

LOW COST HARDWARE IMPLEMENTATION FOR DIGITAL HEARING AID USING

Statistical Measurement Approach for On-line Audio Quality Assessment

Voice over IP Protocols And Compression Algorithms

Ericsson T18s Voice Dialing Simulator

INTERNATIONAL TELECOMMUNICATION UNION $!4! #/--5.)#!4)/. /6%2 4(% 4%,%0(/.%.%47/2+

GSM speech coding. Wolfgang Leister Forelesning INF 5080 Vårsemester Norsk Regnesentral

Basic principles of Voice over IP

Voice Encoding Methods for Digital Wireless Communications Systems

1 Multi-channel frequency division multiplex frequency modulation (FDM-FM) emissions

MultiDSLA. Measuring Network Performance. Malden Electronics Ltd

Packet Loss Concealment Algorithm for VoIP Transmission in Unreliable Networks

Linear Predictive Coding

Application Notes. Contents. Overview. Introduction. Echo in Voice over IP Systems VoIP Performance Management

Analysis of Immunity by RF Wireless Communication Signals

VoIP Technologies Lecturer : Dr. Ala Khalifeh Lecture 4 : Voice codecs (Cont.)

A Sound Analysis and Synthesis System for Generating an Instrumental Piri Song

White Paper. Comparison between subjective listening quality and P.862 PESQ score. Prepared by: A.W. Rix Psytechnics Limited

Packet Loss Concealment of Voice-over IP Packet using Redundant Parameter Transmission Under Severe Loss Conditions

ETSI TS V1.1.1 ( )

The PESQ Algorithm as the Solution for Speech Quality Evaluation on 2.5G and 3G Networks. Technical Paper

Voice Service Quality Evaluation Techniques and the New Technology, POLQA

Establishing the Uniqueness of the Human Voice for Security Applications

MULTI-STREAM VOICE OVER IP USING PACKET PATH DIVERSITY

Application Note. Using PESQ to Test a VoIP Network. Prepared by: Psytechnics Limited. 23 Museum Street Ipswich, Suffolk United Kingdom IP1 1HN

Assessing the quality of VoIP transmission affected by playout buffer scheme and encoding scheme

Thirukkural - A Text-to-Speech Synthesis System

Lecture 1-6: Noise and Filters

Voice Encryption over GSM:

From Concept to Production in Secure Voice Communications

QMeter Tools for Quality Measurement in Telecommunication Network

Advanced Signal Processing and Digital Noise Reduction

Manual Analysis Software AFD 1201

Intelligent Learning Techniques applied to Quality Level in Voice over IP Communications

ADAPTIVE ALGORITHMS FOR ACOUSTIC ECHO CANCELLATION IN SPEECH PROCESSING

David Tipper Associate Professor Department of Information Science and Telecommunications University of Pittsburgh Slides 2.

Accuracy Analysis on Call Quality Assessments in Voice over IP

Voice Quality Evaluation and the Impact of Wireless Packet Communication Systems

Voice over IP. Overview. What is VoIP and how it works. Reduction of voice quality. Quality of Service for VoIP

QAM Demodulation. Performance Conclusion. o o o o o. (Nyquist shaping, Clock & Carrier Recovery, AGC, Adaptive Equaliser) o o. Wireless Communications

Department of Electrical and Computer Engineering Ben-Gurion University of the Negev. LAB 1 - Introduction to USRP

TCOM 370 NOTES 99-6 VOICE DIGITIZATION AND VOICE/DATA INTEGRATION

ADAPTIVE SPEECH QUALITY IN VOICE-OVER-IP COMMUNICATIONS. by Eugene Myakotnykh

Log-Likelihood Ratio-based Relay Selection Algorithm in Wireless Network

Performance Analysis of Interleaving Scheme in Wideband VoIP System under Different Strategic Conditions

ANALYZER BASICS WHAT IS AN FFT SPECTRUM ANALYZER? 2-1

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers

Emotion Detection from Speech

Audio Coding, Psycho- Accoustic model and MP3

Non-Data Aided Carrier Offset Compensation for SDR Implementation

AUDIO CODING: BASICS AND STATE OF THE ART

Voice over IP. Abdus Salam ICTP, February 2004 School on Digital Radio Communications for Research and Training in Developing Countries

School Class Monitoring System Based on Audio Signal Processing

Speech Property-Based FEC for Internet Telephony Applications *

INTER CARRIER INTERFERENCE CANCELLATION IN HIGH SPEED OFDM SYSTEM Y. Naveena *1, K. Upendra Chowdary 2

INTERNATIONAL TELECOMMUNICATION UNION

STUDY OF MUTUAL INFORMATION IN PERCEPTUAL CODING WITH APPLICATION FOR LOW BIT-RATE COMPRESSION

PHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUMBER OF REFERENCE SYMBOLS

Voice over IP in PDC Packet Data Network

Digital Transmission of Analog Data: PCM and Delta Modulation

Computer Networks 53 (2009) Contents lists available at ScienceDirect. Computer Networks. journal homepage:

Adaptive Equalization of binary encoded signals Using LMS Algorithm

Transcription:

Availability of Artificial Voice for Measuring Objective QoS of CELP CODECs and Acoustic Echo Cancellers Nobuhiko Kitawaki*, Feng Wei*, Takeshi Yamada*, and Futoshi Asano** University of Tsukuba* and AIST**, Japan 1-1-1, Tennoudai, Tsukuba-shi, 305-8573 Japan. Tel. & Fax. +81 298 53 5526, E-mail: kitawaki@is.tsukuba.ac.jp Topic: 2.1 Current standards comparison and evaluation Key Words: artificial voice, Rec. P.50, CELP CODEC, echo canceller, PSQM, PESQ, Rec. P.861, P.862 Abstract We proposed an artificial voice reflecting average characteristics of comprehensive human voice as the test signal instead of a real voice for objective quality measurement of coded speech. The artificial voice was standardized as ITU-T Recommendation P.50 in 1988. However, the performance of artificial voice was verified to the waveform codecs using an LPC cepstrum distance measure as a criterion. Recently, there has been much progress concerning speech coding techniques based on CELP (Code Excited Linear Prediction). On the other hand, there has also been much progress concerning objective quality measures for speech coding. Those are PSQM (Perceptual Speech Quality Measure) and PESQ (Perceptual Evaluation of Speech Quality) standardized as P.861 in 1996 and P.862 in 2000, respectively. This paper describes verification test results of the artificial voice P.50 by using PSQM method P.861 for objective QoS measurement of CELP codecs. To expand the application area for P.50, this paper also describes objective QoS measurements by the artificial voice P.50 for acoustic echo cancellers in hands-free communications. 1. Introduction We proposed an objective QoS measurement scheme composed of objective measure and test speech signal for voiceband codecs in 1982 [1]. However, because of less information the coded speech quality at a low-bit-rate depends on the talker. The real speech signal should be selected taking account of talker dependency. In practice, only an insufficient number of speech signals were used for the objective quality measurement. Therefore, we proposed another approach which uses an artificial voice reflecting average characteristics of comprehensive human voice as the test signal instead of a real voice shown in Fig. 1 [2]. The artificial voice was standardized as ITU-T Recommendation P.50 in 1988 [3]. However, at that time, the performance of artificial voice was only verified to the waveform codecs using an LPC cepstrum distance measure as a criterion. Recently, there has been much progress concerning speech coding techniques [4]. The CELP (Code Excited Linear Prediction) codecs have already been used in wireless personal communications and internet communications shown in Table 1, and are expected to be widely used in the third generation wireless personal multimedia communications and the broadband networks. On the other hand, there has also been much progress concerning objective quality measures for speech coding. The PSQM (Perceptual Speech Quality Measure) and PESQ (Perceptual Evaluation of Speech Quality) based on internal representations of the speech signals which make use of the psychophysical equivalents of frequency and intensity domains were standardized as Recommendations P.861 in 1996 and P.862 in 2000, respectively [5][6]. This paper describes verification test results of the artificial voice P.50 by using PSQM method P.861 for objective QoS measurement of CELP codecs. To expand the application area for P.50, this paper also describes objective QoS measurements by the artificial voice P.50 for acoustic echo cancellers in hands-free communications [7]. TEST SIGNAL Speech Artificial Voice + Transmission System - MEASURE Distance Similarity Predicted QoS MOS Fig.1 Objective QoS measuring scheme composed of an artificial voice and a criterion.

Table 1 G-series speech coding standardized in ITU. Rec. G.711 G.722 G.723.1 G.726 G.727 G.728 G.729 Coding Notes Year PCM SB-ADPCM ACELP MP-MLQ ADPCM Em-ADPCM LD-CELP CS-ACELP 64 kbit/s 7 khz, 64 kb/s dual rate 5.3 kbit/s, 6.3 kbit/s 40, 32, 24, 16 kbit/s 5, 4, 3, 2 bit/sample 16 kbit/s 8 kbit/s 2. Artificial Voice 1972 1988 1996 1984 1990 1992 1996 Figure 2 shows a generation diagram of the artificial voice recommended by P.50. The artificial voice generation method employs vector quantization to analyze a multilingual data-base composed of twenty languages and PARCOR speech synthesis. The artificial voice reflects average characteristics of comprehensive human voice: (a) long-term averaged spectra, (b) instantaneous amplitude distribution, (c) level distribution of segmental power, (d) spectral distribution of segmental power, and (e) voiced/unvoiced structure of speech waveform. Consequently, the artificial voice at most 10 seconds long can be expected to cause the same effects on objective speech quality measurement as the use of a large amount of real speech samples, and to generate the test signal by using the defined algorithm at any location. 3. Objective QoS Measurement of CELP CODECs into account the effects of languages, talkers, and tandeming of the codecs. Other factors were fixed. That is, the number of all conditions are 29. Real speech samples are same Japanese, English, and Italian as those used in the verification tests for standardizaion of P.861. Each spoken language is composed of two females and two males. Artificial voices are a female and a male. The input level to a codec was set to 26 dbov relative value to the overload level of linear PCM. We assumed there was no channel degradation between a coder and a decoder. The objective quality measure is PSQM based on internal representations of the speech signal which make use of the psychophysical equivalents of frequency and intensity domains standardized as Recommendations P.861. Amplitude Pitch Frequency T 0 F c F s 23 5 T v (c) Pitch Frequency Variation Pattern Time Glottal Excitation Random Noise Time Input (d) Short-term Power Variation Patterns - Four Units Elements- (b) Glottal Excitation Signal (a) PARCOR Synthesis Filter k 12 k 11 k 1 Z -1 Z -1 Z -1 Power Output z -1 : Unit Delay Artificial Multiplier Spectrum Voice ~ Shaping Filter Low Envelope Emphasis Voiced/ Filter Generator Unvoiced Coefficients Switching Memory Unvoiced Voiced T uv T v Time Fig. 2 Generation of the artificial voice (AV). Rec. Table 2 CODEC used in experiment. T Speech Coding 3.1 Experimental Conditions To thoroughly investigate the performance of artificial voice for CELP codecs by using PSQM measure, the PSQM values derived using the artificial voice should be compared with those of real speech for various CELP codecs. The codecs were G.729, G.726, G.728, G.711, GSM-FR, IS-54, JDC-HR, and their tandeming connections shown in Table 2. In our investigations, we used seven different waveform/celp codecs with bitrates from 5.6 to 64 kbit/s. From the viewpoints of investigating the basic performance of the artificial voice and objective quality measure, we took G.729 G.726 G.728 G.711 GSM-FR IS-54 JDC-HR 8 kb/s CS-ACELP 32 kb/s ADPCM 16 kb/s LD-CELP 64 kb/s PCM Full Rate GSM North America VSELP 5.6 kb/s PSI-CELP

3.2 Performance Index Conventionally, the performance of an artificial voice is evaluated in terms of the consistency between the objectively estimated value (PSQM) calculated by the artificial voice and that of real speech. The evaluation is done with performance indexes such as correlation coefficients and root mean square error (RMSE). 3.3 Performance Evaluation of Artificial Voice To verify the performance of artificial voice, this section compares the objective values derived from using the artificial voice with those of real speech for various CELP and waveform codecs. Table 3 shows the RMSE and correlation coefficients of PSQM by the consistency of artificial voice and real speech. Figure 3 shows an example of male voice. It was shown that the objective measurement PSQM for CELP and waveform codecs by artificial voice P.50 closely approximated that of real speech. Table 3 RMSE and correlation coefficient. Condition Talker RMSE Correlation Single CODEC Tandeming Total male female male female male female 0.1031 0.5366 0.3174 0.3681 0.3141 0.4317 0.9976 0.8965 0.9548 0.8524 0.9696 0.8695 4. Objective QoS Measurement of Acoustic Echo Cancellers 4.1 Application Expanding of Artificial Voice Performance of the echo canceller is evaluated by residual echo characteristics expressed in echo return loss enhancement (ERLE). The ERLE can be conventionally measured by putting white Gaussian noise into the echo canceller system. However, white Gaussian noise is not adequate as the test signal for measuring the performance of the echo canceller, since the performance may depend on the characteristics of input test signal, and the characteristics of the white Gaussian noise differ from those of real spoken language. Therefore, this section discusses appropriate characteristics of spoken language required for objective quality evaluation of echo cancellers to expand the application area of the artificial voice P.50. 4.2 Echo Cancellers used in Verification Test Figure 4 shows schematic diagram for measuring residual echo characteristics. Test signal is input to the telephone network system equipped with echo canceller, and emitted to the room from loudspeaker at the receiving end. Acoustic echo and ambient noise are input to the microphone at the receiving room. Adaptive filter of the echo canceller generates echo replica by estimating impulse response in the acoustic echo path and cancels a room acoustic echo. Residual echo characteristics are expressed in ERLE. Input Loudspeaker Test signal Output Adaptive Filter Echo replica y(k) { Acoustic echo path Echo y(k) Error Ambient noise n(k) Echo Canceller e(k)=y(k)-y(k)+n(k) Microphone Fig. 4 Measurement of residual echo cancellers. Fig. 3 An example of male voice. Table 4 shows echo cancellers used in this verification test. They are EC-A ( Natural Talk PRO ) and EC-B ( Smooth Talk 200 ) produced by NTT (Nippon Telegraph and Telephone Corporation). The algorithm of echo canceller is based on exponentially weighted step-size projection (ES-Projection) method [8][9]. This algorithm uses a different step size for each coefficient of an adaptive transversal filter, and these step sizes are time-invariant and weighted in proportion to the

expected variation of a room impulse response. For speech input, ES-Projection method has a convergence speed four times that of the normalized LMS (NLMS) but its computational load is almost the same. Table 4 Acoustic echo cancellers used in tests. EC-A EC-B Pass Band 0.1-7.3 khz 0.1-3.6 khz Algorithm ES-Projection ES-Projection Table 5 Test signals studied in this experiment. Test Signal Symbol Rec. Note Real Voice White Noise Freq. Weighted N. Artificial Voice Comp. Source S. RV WN FWN AV CSS P.64 P.50 P.501 reference conventional long-term spect. time and freq. AV & WN Eli. Time 250 ms 100 ms Eli. Mag. 35 db Unkwon 4.3 Test Signals Table 5 shows test signals studied in this experiment. To study of the characteristics of spoken languages required for objective quality measurement of echo cancellers, following five test signals are examined: real voice (RV), white Gaussian noise (WN), frequency weighted Gaussian noise (FWN), artificial voice (AV), and composite source signal (CSS). WN having flat frequency spectrum is conventional test signal for echo canceller performance. FWN reflects the long-term average spectra of the spoken language to the white Gaussian noise. AV reflects the average characteristics of the spoken language such as long-term average spectra, instantaneous amplitude distribution, level distribution of segmental power, spectral distribution of segmental power, voiced/unvoiced structure of speech waveform, and short-term spectral characteristics. CSS is composed of above artificial voice (AV) sequence in the voiced intervals, white Gaussian noise (WN) sequence, and silent sequences, mixed in random. RV is Japanese short sentences, and is thought as reference criterion for performance evaluation of each test signal. AV is recommended as P.50 used for measuring coded speech performances [3]. CSS specified by P.501 was proposed for measuring performance of telephonometry devices [10], and FWN specified by P.64 is used for measuring Loudness Rating (LR) in ITU-T [11]. Real voices composed of Japanese short sentences spoken by 6 males and 5 females. 4.4 Measurement of Residual Echo Characteristics ERLE is defined as logarithmic power ratio of the output from the loudspeaker and residual echo. The loudspeaker and microphone arrangement was set up conforming to ITU-T Recommendation P.340 which specifies acoustic measurement of hands-free telephone system. Sound level from the loudspeaker was 73 db, and ambient noise level was 45 db. ERLE characteristics were measured by five test signals, and compared with those of RV as the reference criterion. Figure 5 (a)-(c) shows an example of ERLE characteristics of each test signal and that of RV for echo canceller EC-A. In the Figure, the dark colored curves are for real voices (RV) as the reference, and the light colored curves are for each test signal other than real voices. Smooth curves approximate each ERLE characteristics, in which dotted curve shows for RV, and solid curve shows for test signal. ERLE characteristics measured by artificial voice according to P.50 are almost equivalent to those of real voices. However, ERLE characteristics measured by other test signal than P.50 differ from those of real voices, and consequently ERLE characteristics are very rapidly recovered compared with those of real voices. This may cause over-estimation for the echo canceller system if CSS, white Gaussian noise, and frequency weighted Gaussian noise are used as a test signal. 4.5 Performance Index The performance of a test signal is evaluated in terms of the consistency between the approximate curve for each test signal and that for real voice. The evaluation is done with performance index as root mean square error (RMSE). Table 6 shows RMSE for each test signal. It is concluded that female and male artificial voices reveal best performance among test signals studied in this experiment. Another performance index is convergence time. In this paper, convergence time is defined as the time to reach ERLE of 15 db from the least value. Table 7 shows convergence time of the echo canceller system by each test signal. It is concluded that convergence time measured by artificial voice is the nearest to that of real voice.

Table 6 RMSE between test signal and real voice. Test Signal WN FWN AV-Male AV-Female CSS EC-A 39.8 38.0 17.0 17.9 30.2 EC-B 72.1 56.9 17.0 17.7 40.4 (a) White Noise Table 7 Convergence time by each test signal. ] pm RV WN FWN AV-Male AV-Female CSS EC-A 8.7 1.7 3.8 7.3 7.2 3.6 EC-B impossible 2.2 3.9 impossible impossible 4.4 5. Conclusion (b) Female Artificial Voice This paper verified the availability of the artificial voice recommended by ITU-T Rec. P.50 for both objective QoS measurements of CELP codecs by PSQM and acoustic echo cancellers. It is concluded that artificial voice having average characteristics of spoken language in time and frequency domain are required for and satisfied with objective QoS measurement of CELP codec and acoustic echo canceller. Therefore, we propose that current recommendations should be specified on this point. Acknowledgments We would like to thank Ms. Mari Saito, Ms. Keiko Kawaguchi, Mr. Yasuhiro Hotta, University of Tsukuba, and Ms. Junko Sasaki, NTT-AT, for their cooperation to this work.. Composite Source Signal Fig. 5 An example of ERLE characteristics of each test signal and real voice for EC-A. Reference [1] Nobuhiko Kitawaki, Kenzo Itoh, Masaaki Honda, and Kazuhiko Kakehi: COMPARISON OF OBJECTIVE SPEECH QUALITY MEASURES FOR VOICEBAND

CODECS, IEEE Int. Conf. Acoust. Speech, Signal Processing, ICASSP'82, pp. S9.5.1-9.5.4, 1982. [2] Kenzo Itoh, Nobuhiko Kitawaki, Hiromi Nagabuchi, and Hiroshi Irii: A New Artificial Speech Signal for Objective Quality Evaluation of Speech Coding Systems, IEEE Trans. Commun., vol.42, no.2/3/4, pp.664-672, 1994. [3] ITU-T Recommendation P.50: Artificial voices, 1993. [4] Nobuhiko Kitawaki : Perceptual QoS Assessment for Wireless Personal Communications, IEEE The Fourth International Symposium on Wireless Personal Multimedia Communications, Proc. WPMC 01, pp. 541-546, September 2001. [5] ITU-T Recommendation P.861: Objective quality measurement of telephone-band (300-3400 Hz) speech codecs, 1998. [6] ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ), an objective method for endto-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001. [7] Nobuhiko Kitawaki, Takeshi Yamada, and Futoshi Asano: Objective QoS Measurement for Hands-Free Telecommunications, 2001 Asia-Pacific Symposium on Information and Telecommunication Technologies, Proc. APSITT'01, November 2001. [8] Shoji Makino, Yutaka Kaneda and Nobuo Koizumi: Exponentially weighted step-size NLMS adaptive filter based on the statistics of a room impulse response, IEEE Trans. Speech & Audio, vol. 1, no. 1, 1993. [9] Shoji Makino and Yutaka Kaneda: Exponentially weighted step-size projection algorithm for acoustic echo cancellers, Trans. IEICE Japan, E75-A, pp. 1500-1508, Nov. 1992. [10] ITU-T Recommendation P.501: Test signals for use in telephonometry on CD-ROM, 1996. [11] ITU-T Recommendation P.64: Determination of sensitivity/frequency characteristics of local telephone systems, 1997.