Subjective SNR measure for quality assessment of speech coders: A cross language study

Mamoru Nakatsui and Hideki Noda
Communications Research Laboratory, Ministry of Posts and Telecommunications, 4-2-1, Nukuikita, Koganei, 184 Japan
(Received 7 January 1994)

The subjective speech-to-noise ratio (SNR), derived from the forced-choice pair-comparison test using the psychometric analysis procedure, has well represented the overall speech quality of speech coders in a single dimension. No significant speaker or listener variation was found for a wide range of waveform coders in tests conducted in two separate sessions 14 months apart using different groups of English speakers and listeners. The purpose of this study is to investigate the reproducibility of the measure by conducting a test with the same framework using Japanese speakers and listeners. The test results show that the subjective SNR measure gives quite reliable scores when evaluated in different laboratories with different language backgrounds.

Keywords: Subjective quality, Speech coder, Pair-comparison test, Cross language study
PACS number: 43.71.Gv

1. INTRODUCTION

The ultimate performance measure for evaluating voice communication systems is the subjective quality of the received speech. Modern digital speech-coding techniques achieve high intelligibility. A high level of speech intelligibility is a necessary but insufficient condition for user acceptance of the systems. Quality, as well, must meet acceptability criteria.

Two general surveys of the methods of speech quality assessment have been given by Munson and Karlin1) and Hecker and Guttman,2) each with a different classification scheme. Munson and Karlin have classified the subjective assessment methods into two broad classes depending upon whether listeners are exposed only to the new system to be evaluated or to both the test and reference systems.
They report that in the former case, called "indirect comparison," responses of listeners depend upon their prior experience with other systems to make the desired decisions or categorizations and in the latter case, called "direct comparison," responses of listeners depend upon their overall preference judgments alone, and there is no requirement for categorizing the reasons for their preference. Hecker and Guttman have divided the methods for measuring overall quality into two types of approaches depending upon whether or not the method investigates the underlying psychological factors that govern the overall judgments obtained. The former approach, called "analytic," attempts to explore the psychological components of speech quality and to discover the acoustical correlates of these components and the latter approach, called "utilitarian," attempts to obtain a unidimensional measure of speech quality. They conclude that, even though the unidimensional scaling of speech quality is limited in the accuracy of the measure obtained, this utilitarian approach is of primary interest from an engineering point of view. Among the measures obtained through the methods taking the utilitarian approach, the most widely used are categorical ratings depending upon
indirect comparison.3) The comparability of ratings obtained with these methods on different occasions is generally low, since there is considerable variability due to speakers and listeners. Another widely used measure is the relative preference rating, derived from preference judgments on paired test-reference systems.4) This relative rating may be less influenced by speaker and listener variability than the absolute rating scale obtained through categorical judgments, but direct comparability of ratings obtained in different studies is difficult to ensure.

A subjective assessment method expected to provide a practical engineering criterion for overall speech quality should satisfy the following requirements of engineering practice: (1) test administration and data reduction are simple enough to be carried out in most speech-communication laboratories, without the assistance of specialists in psychological experiments or of trained or professional listeners, (2) the measure provides an adequate single absolute scale which is intuitively understandable by most communication engineers, (3) reproducibility of the results across studies is high enough to enable one to compare directly the measures obtained on different occasions or at different laboratories.

The subjective speech-to-noise ratio (SNR) has been proposed as one such measure, and experimental results of subjective tests with the measure have been reported.5) The test results show that 1) the subjective SNR measure provides an adequate single absolute scale for a wide range of speech waveform coders, 2) reliable scores are available using a reasonably small number of listeners and speakers, and 3) no significant speaker or listener variation is found in the scores of two separate test sessions 14 months apart using different groups of English speakers and listeners.
The purpose of this study is to investigate whether the reproducibility of the results across tests is high enough to enable one to compare directly the measures obtained at different laboratories with different nationalities and language backgrounds.

2. SUBJECTIVE SNR MEASURE

The concept of a subjective SNR can be found in the isopreference method originally introduced by Munson and Karlin.1) The subjective SNR is derived from the forced-choice pair-comparison test using the psychometric analysis procedure commonly used in the method of constants. A speech signal degraded by varying amounts of multiplicative white noise6) is selected as the reference system in our tests.

2.1 Reference System

The sampled reference signal r(i), corrupted by multiplicative white noise n(i), is defined by

$$ r(i) = s(i) + n(i) \tag{1a} $$
$$ n(i) = k \, s(i) \, e(i) \tag{1b} $$

where s(i) is the original speech signal, which also serves as the input signal to the speech coders evaluated, k is a coefficient for SNR control, and e(i) is a random variable taking on values of +1 and -1 with equal probability and independently of s(i) for each sample. Since this reference signal is identical to one of the speech signals processed by the modulated-noise reference unit (MNRU) of ITU-T (CCITT)7) and the SNR of the speech signal degraded by the MNRU is called Q, our subjective SNR measure can also be called the Equivalent Q measure.

2.2 Subjective Test Format

Each test signal to be evaluated is paired with five or six reference signals selected so that preferences ranging from 0% to 100% would result. During the test, the listeners are presented with repeated signal pairs in the order ABAB, as shown in Fig. 1. The listeners are asked to mark the member of each pair that they would prefer as a source of information.

Fig. 1 Time pattern of the test pair sequence.

2.3 Test Data Analysis

The proportion of listeners preferring the test signal p(i) is converted to the unit normal deviate z(i) using the equation

$$ p(i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z(i)} \exp\left(-\frac{t^{2}}{2}\right) dt \tag{2} $$

Applying Müller-Urban weighting to the converted
data, a weighted least square algorithm8) is used to fit a straight line to the data points: (3) where S represents the SNR of the reference signal. Thus the subjective SNR (SNRsubj) and its standard deviation s are given by (4a) (4b) Figure 2 shows an example of the least-square fit of the preference data. The 95% confidence interval of SNRsubj as an estimate of the population mean μ is defined by (5) where t(ν, α) is the t distribution with ν degrees of freedom and α is the significance level (5%).

Fig. 2 An example of the least-square fit of the preference data: black circles denote the rates of preference for the test signal and the solid curve denotes the fitted normal ogive.

M. NAKATSUI and H. NODA: SUBJECTIVE SNR MEASURE

3. EXPERIMENTAL PROCEDURE

Table 1 summarizes the experimental setups of the current test (Test III) and the previous tests (Tests I and II).5) Five kinds of coder configuration were simulated and evaluated in the current test, including 40 and 56 kb/s μ-255 companded pulse-code modulations (PCMs), 16 and 32 kb/s dual-adaptive delta modulators (DADMs),9) and a 16 kb/s adaptive delta modulator with one-bit memory (ADM).10) The two PCM configurations and the ADM configuration served as anchor points connecting the current tests in Japan with the previous tests in Canada.5)

Table 1 Experimental frameworks.

Four short sentences spoken by two male and two female speakers were bandlimited from 200 to 3,400 Hz and digitized to 12 bits at sampling frequencies of 8, 16 and 32 kHz. These digitized speech samples served as inputs to the two PCM, 32 kb/s DADM, and 16 kb/s ADM configurations to produce the test signals to be evaluated. Speech samples with a non-standard bandlimitation from 200 to 2,400 Hz served as inputs to the 16 kb/s DADM. The reference signals were also generated by Eq. (1) using these speech samples.
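The reference-signal generation described in Sec. 2.1 can be illustrated with a minimal sketch. It assumes that the noise is n(i) = k s(i) e(i) added to s(i), and that the SNR control coefficient relates to the target SNR by k = 10^(-SNR/20), as in the MNRU definition of Q; the function names `make_reference` and `measured_snr_db` are hypothetical and not taken from the paper.

```python
import math
import random


def make_reference(speech, snr_db, seed=None):
    """Return r(i) = s(i) + k*s(i)*e(i), with e(i) = +1 or -1 at random.

    Assumption: k = 10 ** (-snr_db / 20), so the speech-to-noise ratio
    of the result equals snr_db.
    """
    rng = random.Random(seed)
    k = 10.0 ** (-snr_db / 20.0)
    # e(i) is drawn independently for every sample, with equal probability.
    return [s + k * s * rng.choice((-1.0, 1.0)) for s in speech]


def measured_snr_db(speech, reference):
    """Speech-to-noise ratio in dB of a reference signal against the clean speech."""
    signal_power = sum(s * s for s in speech)
    noise_power = sum((r - s) ** 2 for s, r in zip(speech, reference))
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    # A 440 Hz tone sampled at 8 kHz stands in for a speech sample here.
    tone = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(8000)]
    ref = make_reference(tone, snr_db=20.0, seed=0)
    print(round(measured_snr_db(tone, ref), 3))
```

Because the noise amplitude follows the speech amplitude sample by sample, the SNR stays the same in loud and quiet passages, which is what distinguishes this reference from speech with additive stationary noise.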
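The data analysis of Sec. 2.3 can be sketched along the following lines. Several details are assumptions rather than the authors' formulas: the fitted line is written as z = a + bS, the Müller-Urban weight is taken in its textbook form (squared normal ordinate at z divided by p(1 - p)), the subjective SNR is read off as the reference SNR at 50% preference (z = 0), and the standard deviation as 1/|b|. The function name `fit_subjective_snr` is hypothetical, and the confidence interval is left out.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # unit normal: mean 0, standard deviation 1


def fit_subjective_snr(ref_snr_db, pref_test):
    """Fit z = a + b*S by weighted least squares.

    ref_snr_db: SNRs S (dB) of the reference signals paired with one test signal.
    pref_test:  proportions p of judgments preferring the test signal.
    Returns (snr_subj, sigma, a, b).
    """
    points = []
    for snr, p in zip(ref_snr_db, pref_test):
        if not 0.0 < p < 1.0:
            continue  # z is unbounded at 0% and 100%; the weight tends to zero
        z = STD_NORMAL.inv_cdf(p)                     # unit normal deviate
        w = STD_NORMAL.pdf(z) ** 2 / (p * (1.0 - p))  # assumed Mueller-Urban weight
        points.append((snr, z, w))
    if len(points) < 2:
        raise ValueError("need at least two proportions strictly between 0 and 1")

    w_sum = sum(w for _, _, w in points)
    s_bar = sum(w * s for s, _, w in points) / w_sum
    z_bar = sum(w * z for _, z, w in points) / w_sum
    s_var = sum(w * (s - s_bar) ** 2 for s, _, w in points)
    if s_var == 0.0:
        raise ValueError("reference SNRs must not all be identical")
    b = sum(w * (s - s_bar) * (z - z_bar) for s, z, w in points) / s_var
    a = z_bar - b * s_bar

    snr_subj = -a / b     # reference SNR at which z = 0 (50% preference)
    sigma = 1.0 / abs(b)  # spread of the fitted ogive, in dB
    return snr_subj, sigma, a, b


if __name__ == "__main__":
    # Synthetic, noise-free proportions from an ogive centred at 22 dB
    # with a 4 dB spread; illustrative values only, not data from the tests.
    ref = [12.0, 16.0, 20.0, 24.0, 28.0, 32.0]
    pref = [STD_NORMAL.cdf((22.0 - s) / 4.0) for s in ref]
    snr_subj, sigma, a, b = fit_subjective_snr(ref, pref)
    print(f"subjective SNR {snr_subj:.2f} dB, sigma {sigma:.2f} dB")
```

The slope b comes out negative because preference for the test signal falls as the reference SNR rises. The weights discount proportions near 0% and 100%, where a small change in p produces a large change in z.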
Two separate sessions of pair-comparison tests, one session using speech samples of standard bandwidth and the other session using those of non-standard (narrow) bandwidth, were conducted with eleven untrained listeners. Participants in both sessions were native Japanese speakers.

4. RESULTS AND DISCUSSION

The subjective SNR of each test signal and its 95% confidence interval were estimated from the test data pooled over the four utterances and all listeners, and are shown in Fig. 3 together with the results of the previous tests.5) Estimates obtained from the speech data having non-standard (narrow) bandwidth are indicated by an arrow in the figure (for the APC, ADPCM-V and ADPCM-F coder configurations in the figure, see the previous report5)).

Fig. 3 Subjective SNR and its 95% confidence interval of the test signals evaluated in three tests.

No statistically significant difference between the subjective SNR estimates at the 5% level is found for the following test signal pairs: (1) 40 kb/s PCMs in Tests I, II, and III, (2) 56 kb/s PCMs in Tests I and III, and (3) 16 kb/s ADMs in Tests I and III. The subjective SNR measure gives quite reliable scores when evaluated in different laboratories with different language backgrounds. No significant variation due to the factors of speaker and listener is found for the test signals of the current test, as has also been shown in the previous tests for the waveform coders.5)

A mean opinion score (MOS), which is the most widely used subjective measure of overall speech quality and is an absolute scale derived from absolute (categorical) judgements, has shown remarkable variations in test results obtained in different countries for the same speech transmission system.3) A MOS equivalent Q measure,11) which aims at interpreting a large set of pooled MOS data on the subjective SNR scale, has shown saturation (non-linearity) in the high quality range (56 to 64 kb/s PCMs), reflecting the inadequacy of the MOS measure in that range.

5. CONCLUSION

The subjective SNR measure, which is an absolute scale derived from relative judgements, well represents a wide range of overall speech quality in a single dimension. The reproducibility of the results across tests is high enough to enable one to compare directly the measures obtained at different laboratories with different language backgrounds. A limitation of the measure is that it may not extend to low-bit-rate coders, such as Code Excited Linear Prediction (CELP),12) whose distortions differ significantly from those of the reference signal in our test. In order to overcome this limitation, a reference signal having colored noise that reflects the distortions of the coder to be evaluated13) should be introduced in the test.

ACKNOWLEDGMENTS

The authors are grateful to Dr. Paul Mermelstein, Bell-Northern Research, Canada for his cooperation and suggestions in the initial phase of this series of tests and express their sincere thanks to all the subjects involved in the tests.

REFERENCES

1) W.A. Munson and J.E. Karlin, "Isopreference method for evaluating speech-transmission circuits," J. Acoust. Soc. Am. 34, 762-774 (1962).
2) M.H.L. Hecker and N. Guttman, "Survey of methods for measuring speech quality," J. Audio Eng. Soc. 15, 400-403 (1967).
3) D.J. Goodman and R.D.
Nash, "Subjective quality of the same transmission conditions in seven different countries," IEEE Trans. Commun. COM-30, 642-654 (1982).
4) P. Mermelstein, "Evaluation of a segmental SNR measure as an indicator of the quality of ADPCM coded speech," J. Acoust. Soc. Am. 66, 1664-1667 (1979).
5) M. Nakatsui and P. Mermelstein, "Subjective speech-to-noise ratio as a measure of speech quality for digital waveform coders," J. Acoust. Soc. Am. 72, 1136-1144 (1982).
6) M.R. Schroeder, "Reference signal for signal quality studies," J. Acoust. Soc. Am. 44, 1735-1736 (1968).
7) ITU-T (CCITT) Recommendation P.81, Vol. V, Blue Book (1988).
8) J.P. Guilford, Psychometric Methods (McGraw-Hill, New York, 1954).
9) M. Nakatsui and K. Nakata, "Dual adaptive delta modulation for mobile voice channel and its DSP implementation," Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 1684-1687 (1985).
10) N.S. Jayant, "Adaptive delta modulation with a one-bit memory," Bell Syst. Tech. J. 49, 321-342 (1970).
11) T. Watanabe, K. Itoh and N. Kitawaki, "Comparison of performance on voiceband codecs by several speech quality measures," Tech. Rep. Speech, Acoust. Soc. Jpn. S82-48 (1982) (in Japanese).
12) M.R. Schroeder and B.S. Atal, "Code-excited linear prediction (CELP): high-quality speech at very low bit rates," Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 937-941 (1985).
13) H. Nagabuchi, "A study on reference signals in subjective quality evaluation of low-bit-rate speech codings," Trans. Inst. Electr. Inf. Commun. Eng. J69-A, 896-903 (1986) (in Japanese).

Mamoru Nakatsui received the B. Eng. degree in 1963 from the University of Electro-Communications and the Dr. Eng. degree in 1976 from Tohoku University. In 1963 he joined the Radio Research Laboratory (now called the Communications Research Laboratory) of MPT, where he is now the Associate Director General. From 1978 to 1980 he was an invited professor at INRS-Telecommunications, University of Quebec, Canada. His main interests lie in speech communications.

Hideki Noda received the B.E. and M.E. degrees in electronics engineering from Kyushu University in 1973 and 1975, respectively. He received the Dr. Eng. degree in electrical engineering from Kyushu Institute of Technology in 1993. He worked at Daini-Seikosha Ltd. from 1975 to 1978 and at the National Research Institute of Police Science, Japan National Police Agency, from 1978 to 1989. From 1984 to 1985, he was a visiting scientist at the National Research Council of Canada. In 1989, he joined the Communications Research Laboratory, Japan Ministry of Posts and Telecommunications, where he is now the chief of the Auditory and Visual Informatics Section. His research interests include speech and image processing.