Voice Service Quality Evaluation Techniques and the New Technology, POLQA

White Paper

Prepared by: Dr. Irina Cotanis
Date: 3 November 2010
Document: NT11-1037

© Ascom (2010). All rights reserved. TEMS is a trademark of Ascom. All other trademarks are the property of their respective holders.

Contents

1 Today's Voice Service Challenges
2 Speech Quality Evaluation Techniques
2.1 Intrusive Techniques
2.2 Non-Intrusive Techniques
2.3 Standardization Status and Evolution Related to the Listening Quality of Voice Service
3 POLQA Technology
3.1 POLQA Algorithm's Overview
3.2 Operability Requirements
3.3 Telecommunication Test and Application Scenarios
3.4 Understanding POLQA Limitations
3.5 POLQA Algorithm's Performance Evaluation
4 Beyond the MOS Score
5 Ascom Network Testing Presence in the Standardization Work on Objective Evaluation Metrics for Listening Speech Quality
6 Conclusions
7 References

1 Today's Voice Service Challenges

Almost 10 years ago, operators and infrastructure vendors were struggling to provide speech quality on 2G networks at the level expected by users accustomed to PSTN quality. Network optimization and troubleshooting, as well as advanced speech processing techniques and an in-depth understanding of speech transport on wireless networks, helped operators bring the level of speech quality on 2G networks to that of fixed networks. With the 3G network evolution, the move to all IP, and the transition from narrowband (NB) to wideband (WB) speech, wireless voice services were expected even to surpass traditional PSTN quality.

However, today's voice services still pose a set of challenges for operators as they try to keep meeting their users' expectations. The roots of these challenges lie mainly in the convergence and coexistence of voice, data, and multimedia application services, which involve a multitude of factors that invariably produce new types of distortions that dynamically, variably, and sometimes randomly affect speech quality. These factors range from the increased demand for capacity generated by high and dynamic traffic with various application-dependent patterns, to low and adaptive bit rate codecs with different bandwidths (NB, WB, and super wideband (SWB)) and complex error concealment solutions. They also include voice enhancement devices (e.g., noise suppressors, automatic gain control, echo cancellers), which are designed to counter speech degradation but which, if not well designed and implemented, could have the opposite effect on speech quality.

In addition, with the evolution to next generation networks (NGN) (LTE/SAE-SON), network vendors and operators are facing a challenging transition from traditional circuit-switched (CS) voice to VoIP, and then from VoIP to VoIP over IMS (VoLTE). Details on these challenges can be found in [1].
2 Speech Quality Evaluation Techniques

Providing voice service on NGNs at the quality level demanded by subscribers, while supporting backward compatibility with 3G/2G networks and integrating voice with a myriad of multimedia and data services, increases the need for voice quality testing. Likewise, ensuring that speech quality testing and evaluation are themselves of high quality comes with its own series of challenges. The need for cost-efficient speech quality evaluation techniques that can replace subjective testing while remaining accurate across a larger variety of network configurations and conditions, codecs, bandwidths, and applications continues to drive network testing tool and infrastructure vendors, as well as operators and standardization organizations, to collaborate on speech quality evaluation techniques.

Extensive work has been performed during the last decade by both the ITU-T and the telecommunication industry in developing speech quality evaluation algorithms designed to accurately evaluate the impact of any network degradation on subscriber perception as well as to cope with the complex testing conditions of the 3G environment. These algorithms have been developed with different scopes and applications. They can be intrusive perceptual solutions, which perform end-to-end speech quality evaluation [2], [3], [6] on different types of networks (wireless, VoIP, or fixed) based on the speech signal, or non-intrusive solutions, which can evaluate speech quality at different nodes of the network (including the end node). Non-intrusive solutions are either perceptual (single-ended algorithms) [4], based on the degraded speech signal, or parametric [5], based on network parameters.

2.1 Intrusive Techniques

These algorithms provide speech quality scores by comparing reference (transmitted) and degraded (received) speech samples. Therefore, intrusive assessment techniques require access to both the transmission and reception ends of the communication. Comparing time-frequency processed reference and degraded speech samples based on human perception and cognition models gives an accurate estimate of the subjective perception of the speech quality received by the terminal.

This accuracy, however, comes at the cost of sending test samples through the network under test. The connection under test is therefore withdrawn from normal service and rendered unavailable to the customer. During peak hours, and for some technologies and certain areas, this situation may generate artificially low quality scores.

Intrusive perceptual metrics estimate end-to-end speech quality, and thus are useful and meaningful to network operators for monitoring the quality of experience (QoE) of their voice service subscribers.

2.2 Non-Intrusive Techniques

Non-intrusive metrics can be network parameter based or speech based. Parametric methods can use RF and/or IP parameters for predicting quality.
Their limitation is that these algorithms can predict quality as affected either by the radio access network or by the IP core network, but not by both. Only a few ongoing studies are investigating the possibility of combining the effects of both RF and IP parameters on speech quality.

The non-intrusive speech based methods need to make predictions about the transmitted original speech from the degraded signal. Strong degradations can easily affect the accuracy of these predictions and, therefore, the overall speech quality evaluation. As a result, even though they process the speech signal using human perception and cognition models, these algorithms are recommended only when a large number of samples is available for averaging [4].

Although less accurate than intrusive perceptual metrics, non-intrusive perceptual and parametric algorithms have an important role in network monitoring for service level agreements (SLAs) as well as in troubleshooting and optimization of different network elements.

2.3 Standardization Status and Evolution Related to the Listening Quality of Voice Service

Techniques for objective and subjective evaluation of voice service quality are developed within ITU-T Study Group 12, "Performance, QoS and QoE". Standardization organizations such as ETSI/3GPP and other industry forums work in liaison with ITU-T.

For almost a decade, the intrusive perceptual solution for listening speech quality evaluation has been the PESQ standard P.862 (along with P.862.1, P.862.2, and P.862.3) [2]. With the 3G network evolution towards all IP, particularly NGN (LTE/SAE-SON), ITU-T recognized the industry's immediate need for a new standard that would both improve on current PESQ performance under certain specific network conditions (e.g., CDMA networks, EVRC codecs) and cover the 3G network evolution for voice service: from traditional CS to VoIP and VoIP over IMS, from NB to WB and SWB, and from low codec rates to very low and adaptive codec rates. As a result, POLQA was developed [3], [6].

POLQA development and the wireless technology evolution toward NGN showed that more than a subjective mean opinion score (MOS) is needed for infrastructure vendors and operators to understand subscriber perception and to appropriately troubleshoot and optimize their networks for the voice service. Details related to new study items initiated in ITU-T are presented in [1].

The non-intrusive solution is covered by the perceptual metric P.563 and by the IP parametric based P.564.

Comprehensive summaries of standardized speech quality evaluation metrics, their characteristics, and their applications are presented in Figure 1 for perceptual based metrics and in Figure 2 for parametric based metrics.

Perceptual (signal based): E2E QoE monitoring; troubleshooting in correlation with perception metric

Intrusive: Uses test original and degraded speech signals to provide quality score
Advantages:
- Highly accurate estimator of subscriber's opinion
- Reflects the quality ensured by the entire network as perceived by users
- Requires access only to the end point
Disadvantages:
- Uses test stimuli that could artificially load the network
- Limited space-time granularity defined by the speech/video sample length requirement
Algorithms:
- ITU-T P.862 and P.862.1 to P.862.3 (PESQ)
- ITU-T P.863 (POLQA) (ITU-T consented on 17 September 2010)

Non-intrusive: Uses impaired, received speech to predict quality
Advantages:
- Normal usage of the network
- Troubleshooting the problem generating node
- High time and space granularity
Disadvantages:
- Low accuracy (high-order averaging is required and therefore possible problems could be smoothed out)
Algorithms:
- ITU-T P.563

Figure 1. Perceptual (Signal-Based)

No reference (parametric based): Troubleshooting in correlation with the network parameters and perceptual metric

Non-intrusive: Uses IP / transport parameters (or could possibly use RF, too)
Advantages:
- Normal usage of the network
- Troubleshooting the problem generating node (if access enabled)
- High time and space granularity
- Possibility for quick correlation with network behavior
Disadvantages:
- Low accuracy (high-order averaging is required and therefore possible problems could be smoothed out)
- Quality evaluation is one-dimensional, taking into consideration metrics belonging to a single segment of the entire network (such as IP)
Algorithms:
- ITU-T P.564 (IP parameter based)

Figure 2. No Reference (Parametric Based)

3 POLQA Technology

Today, voice service quality is determined by more than speech codecs used or frames lost.
Networks and devices now integrate many new components ranging from voice enhancement devices (e.g., automatic gain controllers, noise reduction, and smart loss concealment schemes) to new techniques and features such as time scaling (stretching and compression of the speech signals in the time domain). All these components have been designed to ensure, maintain, and possibly even improve the user's experience of voice service quality. However, due to the complexity of the speech processing involved, these components might cause new and unexpected degradation effects. POLQA is especially designed to handle disruptive effects caused by these multicomponent distortions.

3.1 POLQA Algorithm's Overview

As an intrusive perceptual metric, POLQA processes and compares the transmitted original speech signal and the degraded received speech signal in order to predict the quality that would be perceived by subjects (regular subscribers) in a subjective listening test. The high level architecture of the algorithm is presented in Figure 3.

POLQA processes both the original signal and the degraded signal before performing the comparison. The original signal is processed because subjective testing is carried out without a direct comparison against an original (Absolute Category Rating), so the ideal signal on which a subject bases his or her opinion is unknown during the test. The processing of the degraded signal is related to high-level cognitive processes (e.g., relative insensitivity to linear frequency response distortion and to steady state wideband noise [3]).

POLQA runs a time alignment of the degraded signal against the original speech signal before the comparison process. The estimated delay is used both to estimate and apply the proper sampling frequency and to compensate for delay in the comparison process, which is based on a perceptual model [3].

The accuracy of the comparison process is determined by the transformation of the original and degraded signals into an internal representation that is similar to the psychophysical representation of audio signals in the human auditory system.
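The time alignment step can be illustrated with a deliberately simplified sketch. It assumes a single constant delay and finds it with a brute-force cross-correlation search; the alignment specified in [3] is considerably more elaborate, so the function names and the approach below are illustrative assumptions and not the POLQA procedure.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which `degraded` best matches `reference`.

    Hypothetical helper: a plain cross-correlation search over 0..max_lag,
    assuming one constant delay for the whole sample.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate the reference with the degraded signal shifted back by `lag`.
        score = sum(r * d for r, d in zip(reference, degraded[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def compensate_delay(degraded, lag):
    """Drop the first `lag` samples so the degraded signal lines up with the reference."""
    return degraded[lag:]


# Toy signals: the degraded copy is the reference delayed by 3 samples and attenuated.
reference = [0.0, 1.0, -2.0, 3.0, -1.0, 0.5, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0] + [0.8 * x for x in reference]

lag = estimate_delay(reference, degraded, max_lag=5)
aligned = compensate_delay(degraded, lag)
```

After compensation, `aligned` has the same length as `reference` and differs from it only by the attenuation factor, which is the kind of difference the later level alignment step is meant to absorb.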
The transformation is applied in the perceptual frequency (Bark) and loudness (sone) domains, and runs in several steps: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling [3]. The internal representation takes into account several factors impacting the perceived quality, such as playback level mapping from the digital signal representation level, local gain variations, rapid variations, linear filtering, and noise levels. In addition, it applies different levels of compensation for these factors depending on their final contribution to the overall perceptual disturbance. Therefore, minor and stationary differences between the original and degraded speech signals are compensated, while more severe effects known to have a greater impact on the perceived quality are only partially compensated [3].

The final perceived quality at the output of the model is calculated from the difference between the original and degraded internal representations, based on a small number of quality indicators that are used to model all related subjective effects. The cognitive model calculates the following parameters: frequency response indicator, noise indicator, room reverberation indicator, and three more indicators describing the internal differences in the time-pitch-loudness domain. All these indicators are combined to give an objective listening quality expressed by the raw POLQA score [3].

The raw POLQA score is then mapped to the subjective MOS domain, MOS-LQO. The mapping is a third-order polynomial developed from a large set of databases (tens of thousands of speech samples) containing a broad range of network types (fixed, IP, and mobile) and conditions (simulated error patterns and live degradations), codecs (e.g., AMR NB/WB, G.722.1, iLBC, EVRC, EVRC-WB, EVRC-A/B, AAC/AAC LD, Skype, MP3 low bit rate, G.726, EFR), various background noise (BGN) types and levels, different languages (American and British English, German, Swedish, French, Dutch, Czech, Chinese, and Japanese), and three speech bandwidths (NB, WB, and SWB).

[Figure 3 block diagram. Inputs: original speech and degraded speech. Blocks: environment modeling; time alignment (delay estimates); perceptual model (psycho-acoustic model) producing the internal representations of the original (transmitted) and degraded (received) speech signals; difference between internal representations (user perceived); cognitive model (listening conditions / cognitive perception); raw POLQA; mapping to subjective domain, based on speech databases (NB/WB/SWB; variety of codecs; wireless / VoIP; simulated / live conditions; acoustic / electrical; BGN conditions; languages). Outputs: POLQA MOS-LQO and possibly various speech based diagnostics (e.g., delay, gain levels, noise).]

Figure 3. High Level Architecture of POLQA Algorithm

3.2 Operability Requirements

The POLQA algorithm is designed to predict overall listening speech quality under NB, WB, and SWB (50 to 14000 Hz) conditions in 3G/4G (LTE-SAE) networks, including advanced speech processing technologies, acoustical interfaces, and hands-free applications.

It should be noted that POLQA has two operational modes: SWB and NB. The main difference is the bandwidth of the original speech signal used by the model.

In SWB mode, the received (and potentially degraded) speech signal is compared with an SWB reference. Therefore, band limitations are considered to be degradations and are scored accordingly. The listening quality is modelled as perceived by a human listener using a diffuse-field equalized headphone with diotic presentation (same signal at both ear-caps).

In NB mode, the received (and potentially degraded) speech signal is compared with an NB (300 to 3400 Hz) original. Thus, normal telephone band limitations are not considered to be severe degradations. NB mode maintains compatibility with the previously developed ITU-T Recommendation P.862.1 (PESQ) [2]. The listening quality is modelled as perceived by a human listener using a loosely coupled IRS type handset at one ear (monotic presentation).
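To give a feel for what the two reference bandwidths mean on the perceptual frequency scale mentioned in Section 3.1, the sketch below converts the NB and SWB band edges quoted above to Bark. It uses Traunmüller's published approximation of the Bark scale, which is an assumption made here purely for illustration; the frequency warping actually used by POLQA is specified in [3] and may differ. The names `hz_to_bark`, `MODES`, and `band_in_bark` are hypothetical.

```python
def hz_to_bark(freq_hz):
    """Traunmueller's approximation of the Bark scale (an assumed stand-in,
    not necessarily the warping used inside POLQA)."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53


# Reference band edges in Hz, as quoted in Section 3.2.
MODES = {
    "NB": (300.0, 3400.0),
    "SWB": (50.0, 14000.0),
}


def band_in_bark(mode):
    """Return the (low, high) band edges of a mode on the Bark scale."""
    low_hz, high_hz = MODES[mode]
    return hz_to_bark(low_hz), hz_to_bark(high_hz)


for mode in MODES:
    low, high = band_in_bark(mode)
    print(f"{mode}: {low:.1f} to {high:.1f} Bark (width {high - low:.1f})")
```

Under this approximation the SWB reference covers a much wider span of the Bark scale than the NB reference, which is consistent with band limitations being scored as degradations in SWB mode but not in NB mode.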

3.3 Telecommunication Test and Application Scenarios

The telecommunication scenarios include current transmission technologies [3]:
- Public switched networks (e.g., fixed wire PSTN, GSM, WCDMA, CDMA)
- Push-over-Cellular, Voice over IP, PSTN-to-VoIP interconnections, and TETRA
- Commonly used speech processing components (e.g., codecs such as AMR NB/WB, G.722.1, iLBC, EVRC, EVRC-WB, EVRC-A/B, AAC/AAC LD, Skype, MP3 low bit rate, G.726, and EFR; noise reduction systems for different types of BGN such as office, street, car, and babble; adaptive gain control; comfort noise; and other types of voice enhancement devices) and their combinations

The tested distortion types [3] cover:
- Single speech codecs and speech codecs used in tandem, as currently used in telecommunication scenarios
- Packet loss and concealment strategies (packet-switched connections)
- Frame errors and bit errors (wireless connections)
- Interruptions (such as unconcealed packet loss or handover in GSM)
- Front-end clipping (temporal clipping)
- Amplitude clipping (overload, saturation)
- Variable delay (VoIP, video-telephony) / time warping
- Gain variations
- Influence of linear distortions (spectral shaping), including time-variant ones
- Non-linear distortions produced by the microphone / transducer at acoustical interfaces
- Reverberations caused by hands-free test setups in defined acoustical environments

The application scenarios cover both electrical and acoustical measuring interfaces as well as different terminal types (handset, headphone, or hands-free).
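A few of the distortion types listed above are simple enough to simulate directly on a buffer of samples, which can be useful when preparing degraded test material. The helpers below are hypothetical illustrations: their names, their parameters, and the zero-fill model of unconcealed loss are assumptions, and none of them is part of P.863.

```python
def amplitude_clip(samples, limit):
    """Amplitude clipping (overload, saturation): hard-limit each sample to +/- limit."""
    return [max(-limit, min(limit, s)) for s in samples]


def front_end_clip(samples, n_lost):
    """Front-end (temporal) clipping: the first n_lost samples are replaced by silence."""
    return [0.0] * min(n_lost, len(samples)) + list(samples[n_lost:])


def drop_frames(samples, frame_len, lost_frames):
    """Unconcealed packet loss: zero out the frames whose indices are in lost_frames."""
    out = list(samples)
    for idx in lost_frames:
        start = idx * frame_len
        for k in range(start, min(start + frame_len, len(out))):
            out[k] = 0.0
    return out


def apply_gain(samples, gain):
    """Constant gain change; a gain variation would use one factor per sample instead."""
    return [gain * s for s in samples]


# Toy sample buffer standing in for a short speech segment.
speech = [0.1, 0.5, -0.9, 0.7, -0.3, 0.2, 0.8, -0.6]
clipped = amplitude_clip(speech, limit=0.6)
late_start = front_end_clip(speech, n_lost=2)
lossy = drop_frames(speech, frame_len=2, lost_frames={1})
```

Each helper returns a new list of the same length as its input, so the reference and degraded buffers stay sample-aligned for a later intrusive comparison.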

3.4 Understanding POLQA Limitations

It should be noted that there are several conditions and applications for which POLQA was not designed. POLQA scores obtained in these types of conditions are not reliable and should not be considered for any kind of speech quality evaluation. These conditions include:
- Other dimensions of speech quality, such as conversational aspects and talking quality.
- Speech quality per call. POLQA is not intended to score longer sequences of speech. It is focused on prediction of quality for shorter speech utterances of 6 to 12 seconds.
- Noisy listening environments. POLQA does not predict perceived speech quality in these environments; it is designed in accordance with P.800 [7], ACR testing.
- Music (including multimedia).
- Evaluation of performance or ranking of voice enhancement devices (e.g., noise suppressors).
- Other technologies or components such as speech storage formats, or non-telephony applications such as public safety networks or professional mobile radio connections.

Although not yet tested or evaluated, POLQA could be cautiously applied to the following:
- Other languages (e.g., Russian, Arabic)
- Longer speech samples

Subjective tests for confirming POLQA performance on these types of applications are recommended.

3.5 POLQA Algorithm's Performance Evaluation

Understanding POLQA performance as an estimator of subscriber perception relies on the fact that the results of a subjective experiment reflect the relative quality between the tested speech samples, while the absolute values can vary from experiment to experiment depending on the listener group and the design of the subjective test. Unlike subjective results, POLQA is independent of test context and individual voter behavior. POLQA estimates the average subjective score obtained from a group of voters listening to the same speech sample.
Although it does not provide the exact absolute score of an individual experiment, POLQA does reproduce the relative quality ranking [3]. Therefore, POLQA performance evaluation involves comparison to subjective scores as well as consideration of the variability that exists within a listening panel. In addition, the differences between individual subjective experiments must be removed. This is achieved by determining and applying an optimal regression function (third-order polynomial) between the subjective and objective scores.

Due to the large number and variety of databases, as well as the variability of their content, a rigorous and extensive evaluation procedure has been developed for POLQA testing. A series of different statistical metrics as well as statistical significance testing have been used [3]. The core metric, against which the algorithm has been optimized, is the epsilon-insensitive root mean square error (rmse*), chosen because it best reflects the usability of POLQA and its performance in real life scenarios. It expresses the POLQA error against the average MOS of individual voters, counting only differences that fall outside an epsilon-wide band around the target average value. The uncertainty of a MOS panel is thus taken into account by the epsilon value, defined as the 95% confidence interval of the averaged MOS. The rmse* is defined as:

\[
\mathrm{rmse}^{*} = \sqrt{\frac{1}{N-d}\sum_{i=1}^{N} P_{\mathrm{error}}(i)^{2}}
\]

\[
P_{\mathrm{error}}(i) = \max\bigl(0,\ \lvert \mathrm{MOSLQS}(i) - \mathrm{MOSLQO}(i) \rvert - \mathrm{ci}_{95}(i)\bigr)
\]

where the index i denotes the condition of the speech sample, MOSLQS(i) and MOSLQO(i) are the subjective and objective listening quality scores for that condition, ci95(i) is the 95% confidence interval of the averaged MOS, N denotes the number of conditions or speech samples, and d denotes the degrees of freedom (d = 4 in the case of a third-order regression).

The results reported in [12] provide general information on POLQA performance across a broad range of databases containing a large variety of technologies, codecs, and bandwidths. Because they represent an overall performance, these results might be misleading to a certain extent. Due to the variety of databases and the statistical aggregation procedure applied to the results [3], [12], a weaker or better performance for a specific application and/or bandwidth could be smoothed out or hidden. Therefore, additional analysis is needed for a more detailed view or for a particular application.
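The rmse* definition above can be sketched in a few lines. The function names and the toy numbers below are illustrative assumptions. The cubic mapping stands in for the per-experiment third-order regression, assumed here to be applied to the objective scores, and the identity coefficients are placeholders rather than values from [3].

```python
import math


def map_third_order(score, coeffs):
    """Evaluate a third-order polynomial a0 + a1*x + a2*x^2 + a3*x^3 (Horner form)."""
    a0, a1, a2, a3 = coeffs
    return a0 + score * (a1 + score * (a2 + score * a3))


def epsilon_insensitive_rmse(mos_lqs, mos_lqo, ci95, d=4):
    """rmse* over N conditions; d=4 corresponds to a third-order regression."""
    n = len(mos_lqs)
    if not (n == len(mos_lqo) == len(ci95)):
        raise ValueError("all inputs must have one entry per condition")
    if n <= d:
        raise ValueError("need more conditions than degrees of freedom")
    # Perror(i): only the part of the error outside the ci95 band counts.
    perror = [max(0.0, abs(s - o) - c) for s, o, c in zip(mos_lqs, mos_lqo, ci95)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))


# Toy data: six conditions with subjective MOS, objective scores, and ci95 values.
mos_lqs = [1.5, 2.2, 3.0, 3.6, 4.1, 4.5]
raw_lqo = [1.7, 2.0, 3.4, 3.5, 4.0, 4.2]
ci95 = [0.1, 0.1, 0.2, 0.2, 0.1, 0.1]

# Identity mapping used as a placeholder for a fitted per-experiment regression.
identity = (0.0, 1.0, 0.0, 0.0)
mos_lqo = [map_third_order(x, identity) for x in raw_lqo]

result = epsilon_insensitive_rmse(mos_lqs, mos_lqo, ci95)
print(round(result, 3))
```

Conditions whose prediction error stays inside the confidence band contribute nothing to the sum, so an objective score is not penalized for disagreeing with the panel by less than the panel's own uncertainty.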
This additional analysis is planned by ITU-T during the POLQA characterization phase, and the results are expected to be published in the forthcoming POLQA Application Guide (estimated for June 2011).

4 Beyond the MOS Score

Due to the complexity of the NGN environment, as well as the challenges in supporting voice service on LTE-SAE/SON networks, several solutions for providing voice service are currently envisioned. Therefore, test and evaluation of speech quality in the NGN environment must be comprehensive. In order to understand and cost-efficiently control the speech degradation of different implementation solutions, evaluation techniques need to go beyond the MOS score.

To a large extent, as in the PESQ case, interim calculations of POLQA as well as the six degradation parameters used as input to the POLQA algorithm's cognitive model would allow some network diagnosis based on speech quality evaluation. Details are discussed in [1], but in general the diagnosis could cover aspects such as latency, jitter (variable delay), gain variations, speech signal and BGN level measurements, level clipping, dropouts (e.g., generated by packet loss), operability of voice activity detection (VAD), and short-term spectra (linear degradations caused by the frequency response of the devices and/or by the VoIP landline connection).

5 Ascom Network Testing Presence in the Standardization Work on Objective Evaluation Metrics for Listening Speech Quality

For more than 10 years, Ascom Network Testing has been an active member of ITU-T Study Group 12, which develops objective speech quality evaluation metrics. Our contributions to the standardization work cover different areas and stages of objective metric development.

Ascom Network Testing contributed live recorded speech databases needed for accurate training and tuning of the algorithms running in real life scenarios typical of network troubleshooting, optimization, and operation applications performed by operators.

Within ITU-T, we were the initiator and developer of the statistical evaluation procedure for objective metrics that was first applied to PESQ and later applied in a modified form to POLQA [8]. Recently, based on our initial work as well as work performed for POLQA performance evaluation, Ascom Network Testing introduced a new study item within ITU-T on a more general statistical evaluation procedure to be applied to various types of objective metrics [9].
This type of evaluation is increasingly essential for all kinds of objective metrics (e.g., speech, video, audio, multimedia) that are designed for testing in real life networks and therefore for implementation in network testing tools.

We also developed a technique for calibrating objective quality metrics to the MOS scale. As a result, we co-authored two standards in relation to PESQ: P.862.1 (Mapping PESQ to MOS domain) and P.862.3 (Guidance for PESQ usage) [2].

Additionally, Ascom Network Testing recently wrote a white paper contribution [10] on aspects related to POLQA implementation in field testing tools, as well as a white paper contribution on topics that need to be studied during the POLQA characterization phase [11].

6 Conclusions

The convergence and coexistence of voice, data, and multimedia application services involve a multitude of factors that invariably produce new types of distortions that dynamically, variably, and sometimes randomly affect voice service quality. Today, speech quality is determined by more than speech codecs used or frames lost. Networks and devices now integrate many new components ranging from voice enhancement devices to new techniques such as time scaling.

Extensive work has been performed during the past decade by both the ITU-T and the telecommunication industry in developing speech quality evaluation algorithms designed to accurately evaluate the impact of any network degradation on subscriber perception as well as to cope with the complex testing conditions of 3G networks and beyond. The new POLQA technology was developed to cope with the growing complexity of evolving networks. As with all new technologies, extensive real-life testing is expected to complete the picture of the POLQA algorithm's performance.

Ascom Network Testing, a proven veteran in ITU-T on the evaluation of objective quality metrics, continues to play an active role in the standardization work on this topic.

7 References

[1] I. Cotanis, Voice Services in the Next Generation Networks/LTE-SON as Perceived by Users, Ascom Network Testing white paper, November 2010.
[2] ITU-T P.862.x series: P.862 (PESQ algorithm), P.862.1 (Mapping to MOS domain), P.862.2 (WB-PESQ), P.862.3 (PESQ Application guide).
[3] ITU-T P.863, Perceptual Objective Listening Quality Assessment (POLQA), Geneva, January 2011.
[4] ITU-T P.563, Single-ended method for objective speech quality assessment in narrow-band telephony applications.
[5] ITU-T P.564, Conformance testing for voice over IP transmission quality assessment models.
[6] ITU-T TD SG 12 Gen 345, Final report of Working Party 2, Geneva, May 2010.
[7] ITU-T P.800, Subjective testing of overall listening speech quality.
[8] I. Cotanis, ITU-T SG12/Q9 C137, A procedure for statistical evaluation of the objective quality metrics performance, May 2008.
[9] I. Cotanis, ITU-T C151, Proposal on statistical evaluation framework for objective quality algorithms, submitted for ITU-T January 2011 meeting.
[10] I. Cotanis, ITU-T SG 12 C112, Some aspects related to P.OLQA standard, May 2010.
[11] I. Cotanis, ITU-T C142, Proposed study items for POLQA characterization phase, September 2010.
[12] Opticom, TNO, SwissQual, ITU-T C148, Performance of the joint POLQA model, September 2010.
[13] POLQA coalition, www.polqa.info, July 2010.