Understanding the Transition From PESQ to POLQA. An Ascom Network Testing White Paper

By Dr. Irina Cotanis
Date: 6 December 2011
Document: NT11-22759, Rev. 1.0

Ascom (2011). All rights reserved. TEMS is a trademark of Ascom. All other trademarks are the property of their respective holders.

Contents

1 Summary
2 Today's Voice Service Challenges
3 Understanding the Need for PESQ to POLQA Transition
  3.1 The Two MOS Scales: Narrowband and Super Wideband
  3.2 The Impact of Various Codecs
  3.3 Network-Centric Types of Speech Degradation
  3.4 Devices/Phones
  3.5 Voice Enhancement Devices
  3.6 Acoustical and Electrical Interfaces
  3.7 PESQ Weaknesses
4 Coexistence of Two Standards
  4.1 Which POLQA vs. PESQ Modes Should Be Compared
  4.2 When POLQA vs. PESQ Should Be Compared
  4.3 What Is the Expected Output of the Comparison?
  4.4 Why Are Differences Expected?
5 Conclusions
6 Glossary
7 Appendix: References

1 Summary

There is little doubt that POLQA will eventually replace PESQ [1] as the de facto standard for voice quality measurements. After all, POLQA (Perceptual Objective Listening Quality Analysis) was designed specifically to fix some of the known weaknesses of the reigning PESQ (Perceptual Evaluation of Speech Quality) algorithm, such as inaccuracies with wideband scenarios and CDMA codecs, sensitivity to certain GSM/WCDMA network conditions, and VoIP limitations (variable delay larger than one second, and time scaling). POLQA also shows greater flexibility than PESQ in providing consistent measurements across the three available bandwidths (narrowband, wideband, and super wideband, or NB, WB, and SWB), across different technologies such as GSM/WCDMA and LTE, both circuit-switched and packet-switched (VoIP and VoIP over IMS), and across a variety of interface combinations (electrical-electrical, acoustical-acoustical, electrical-acoustical, and acoustical-electrical). And, as the speech quality evaluation algorithm designed for 4G networks, POLQA is best positioned to cope with today's and tomorrow's variety of terminals, new speech codecs and sophisticated error concealment schemes, voice enhancement devices, and 4G network types of degradation.

However, like PESQ, POLQA has its own set of limitations, as described in the ITU-T P.863 standard [2], such as long conversational sequences and noisy listening environments. In addition, for now, due to the lack of speech databases, POLQA remains insufficiently tested on some languages such as Russian, Arabic, Spanish, and several Asian languages. An important remaining challenge for POLQA is VoLTE scenarios, especially real-life ones. Work is therefore under way within ITU-T to better characterize POLQA behavior in these scenarios. This work is being performed during the POLQA characterization phase, and evaluations are expected in the forthcoming P.863/POLQA Application Guide, currently under ITU-T development.
We at Ascom Network Testing are in a unique position not only to assist with the rollout of 4G/LTE (voice and data) networks but also to test POLQA's effectiveness in measuring voice quality over these next-generation wireless networks. In this white paper, we discuss aspects related to the two coexisting standards, PESQ (P.862/P.862.1 [1]) and POLQA (P.863 [2]), and provide recommendations for when continued use of PESQ makes sense and when the transition to POLQA becomes the inevitable choice.

2 Today's Voice Service Challenges

Today's voice services still raise a set of challenges for operators as they attempt to continue meeting their users' expectations while moving from circuit-switched networks, with voice as the main service, to packet-switched networks, with voice as one of many services. The convergence and coexistence of voice, data, and multimedia application services involve a multitude of factors that invariably produce new types of distortions that dynamically, variably, and sometimes randomly affect speech quality.

These factors cover a large spectrum, from codecs to voice enhancement devices. The low and adaptive bit rate codecs support standard (NB) speech and high definition voice (WB and SWB). At the same time, the new codecs used for voice services supported by 3G/4G networks need to use smart loss concealment schemes, such as time scaling, which is expected to affect human perception in a way completely different from previous solutions.

The new voice enhancement devices (e.g., noise suppressors, automatic gain control, and echo cancellers) are designed to counter 4G network speech degradation with advanced speech processing techniques that, if not well designed and implemented, could have the opposite effect to the desired speech quality enhancement. One noticeable aspect relates to acoustic echo cancellers designed for 4G/LTE devices. The LTE air interface is known to be a highly non-linear, time-variant environment, which could result in a larger range of delays and attenuations than those for which 2G/3G echo cancellers were designed.

In addition, with the introduction of 4G (LTE/SAE-SON) networks, infrastructure vendors as well as operators face the transformation of the voice service into either a more data-like service, such as VoIP or VoIP over IMS (VoLTE), or other solutions such as circuit-switched fallback and third-party over-the-top voice applications such as Skype.

3 Understanding the Need for PESQ to POLQA Transition

For more than a decade, the current P.862/P.862.1 (PESQ) standard has proven to be a well-performing and stable solution for voice quality evaluation on 2G/3G circuit-switched networks and, to a certain extent, on packet-switched networks. In addition, results obtained during the POLQA evaluation and validation process showed that in a large number of NB voice test scenarios, PESQ performs statistically equal to POLQA [2].
However, as described above, the complexity of new network scenarios might cause new and unexpected degradation effects. The P.863 (POLQA) standard has been designed to handle disruptive effects caused by these new multicomponent distortions. Details on POLQA operability requirements, test and application scenarios, and limitations are presented in a separate paper [3]. Therefore, the transition from PESQ to POLQA should be based on a good understanding of both technical and business aspects. To gain this understanding, it is necessary to look at the various applications for which POLQA has been designed. In the following paragraphs, we discuss these application-related aspects.

3.1 The Two MOS Scales: Narrowband and Super Wideband

POLQA has two operational modes: NB and SWB.

In NB mode, the received (and potentially degraded) speech signal is compared with the NB (300 to 3400 Hz) original. Thus, normal telephone band limitations are not considered to be severe degradations. NB mode aims to maintain backward compatibility with P.862.1 (PESQ) [1]. The listening quality is modeled as perceived by a human listener using a loosely coupled IRS type handset at one ear (monotic presentation). For a large number of NB scenarios, PESQ and POLQA NB mode showed statistically equal performance.

In SWB mode, the received (and potentially degraded or band-limited) speech signal is compared with an SWB reference. The listening quality is modeled as perceived by a human listener using a diffuse-field equalized headphone with diotic presentation (same signal in both ears). Therefore, NB or WB limitations are considered to be degradations and are scored accordingly on a unique SWB MOS scale. NB quality is therefore likely to be compressed into the lower end of the MOS scale, which could impact POLQA SWB performance in predicting NB scenarios. Another reason why NB scenarios are predicted less accurately in SWB mode than in NB mode is that POLQA uses different optimizations for each mode. NB scenarios are predicted more accurately in NB mode, since the POLQA optimization is NB-focused in this mode.

Therefore, as long as user applications are still mainly NB scenarios, which is the case in the majority of today's operational wireless networks, and no comparison with higher definition bandwidths (WB or SWB) is required, the transition to POLQA is not an imminent need.

3.2 The Impact of Various Codecs

POLQA has been evaluated and validated for a large range of multiband codecs, both standardized (e.g., AMR-WB, EVRC-WB, iLBC, AMB+, AAC, G.711.x, G.723, and G.729.x) and commercial (Skype/SILK), which are used or are planned to be used for the voice service supported by different network solutions (e.g., wireless: GSM/WCDMA/LTE, WiMAX fixed, and Bluetooth). In addition, CS, VoIP, and VoIP over IMS scenarios have been either simulated or collected live.
A large number of 3G networks still use only the traditional standardized NB codecs (e.g., AMR and EVRC). In addition, the 4G/LTE voice service solutions (CSFB or VoIP over IMS) are being, or are likely to be, launched with NB codecs (AMR), most likely in the second half of 2012. However, operators choosing an over-the-top (OTT) voice solution (e.g., Skype and Viber) will mainly have commercial codecs involved (e.g., Skype/SILK). Therefore, it could be concluded that if the user application does not yet involve any new commercial codec or any high definition (HD) codec, whether commercial or standardized, then the transition to POLQA is not a must in the very short term.

3.3 Network-Centric Types of Speech Degradation

The P.863 standard has been designed to cope with a large range of degradations that have emerged at both acoustical and electrical interfaces on a variety of networks (e.g., wireless or fixed, CS or PS/VoIP, or Bluetooth) and under different noise conditions [3].

Some of these types of degradations are not yet present in today's networks (e.g., wireless SWB, VoIP over IMS, and VoIP time scaling). In these particular cases, only simulated conditions have been used, which for some scenarios cannot emulate real network behavior, since the real network conditions are not yet exactly known. An important example is the time scaling (or speech frequency re-sampling) generated by the jitter buffer adaptation, which compensates for variable delays on the voice connection. This condition is typical for VoIP over IMS, which today can be evaluated only by using simulations and running time scaling algorithms [4], [5]. This challenge arises from the fact that there is currently no live experience available from which time scaling characteristics can be defined (e.g., time scaling length or ratio, its time distribution, and its frequency within the speech).

The various types of time scaling (e.g., lengths, distribution, and frequency), along with other aspects (e.g., speech content and speaker/gender dependency, time shifting of the reference sample, and variation between reference and encoded speech scores), are under investigation during the POLQA characterization phase and are expected to be finalized and published in the forthcoming ITU-T POLQA Application Guide (estimated for the second half of 2012).

Therefore, users handling applications as described in the previous paragraphs might consider waiting until there is an application guide that explains how all of these cases should be properly and accurately addressed.

3.4 Devices/Phones

Generally, there are two main categories of degradations related to devices: the time-variant linear distortions (e.g., spectral shaping implemented in some of the older phones), and the non-linear distortions produced by the microphone/transducer or by reverberations caused by hands-free set-ups at acoustical interfaces.
POLQA has been designed and proven to accurately predict the speech quality affected by these distortions. However, if the user is concerned only with NB applications that do not involve evaluations at an acoustical interface (which is specific to device testing), then PESQ could safely serve this need.

3.5 Voice Enhancement Devices

The speech processing systems dedicated to noise reduction and automatic gain control (AGC), i.e., voice enhancement devices that may exist either in the core network or in the terminals, have been considered in the POLQA design. However, like PESQ, POLQA should not be used to evaluate the performance of these devices, but rather to accurately evaluate the speech quality when these systems are present. Although not designed to cope with voice enhancement devices, PESQ could provide, along with the MOS score, additional output metrics that could be used to understand the behavior of the voice enhancement devices.

Therefore, as long as the user application is not focused on voice enhancement devices, or is known not to be affected by strong speech processing solutions (e.g., strong automatic gain level adjustments), an immediate transition to POLQA is not required.

3.6 Acoustical and Electrical Interfaces

POLQA has also been designed to allow measurements at various combinations of interfaces: electrical-electrical, acoustical-acoustical, electrical-acoustical, and acoustical-electrical. This makes it suitable for testing terminals and hands-free applications. However, if the user applications are focused on NB voice service quality evaluation and monitoring on wireless networks, then a short-term POLQA transition is not required.

3.7 PESQ Weaknesses

POLQA has been designed not only to provide an accurate MOS estimate for a large set of conditions specific to new codec and network technologies, but also to ensure higher accuracy for scenarios in which PESQ showed weaker performance. These scenarios are mainly related to WB measurements, strong AGC level variations, strong frequency shaping (implemented in some phones), and VoIP connections characterized by long variable delay values (generally more than about 1-2 seconds).

In recent years, there have been few proven cases of strong AGC behavior (e.g., some CDMA devices) or strong frequency shaping (e.g., some older versions of GSM phones). In addition, new devices/phones are mainly characterized by flat frequency responses (unfiltered up to 14 kHz). Therefore, NB circuit-switched applications, or VoIP applications with lower variable delay, could still use PESQ safely, in the short term at least.

4 Coexistence of Two Standards

As discussed in section 3, the timing of the transition to the new P.863 standard can be based on the user's application and its goals. Therefore, it is expected that, at least for a while, PESQ and POLQA will coexist.
Consequently, users will compare PESQ to POLQA in order to better understand the new standard's behavior and how it relates to their previous experience, built on PESQ speech quality databases. The new P.863 standard, released by ITU in January 2011, was made available for market use a few months later. Therefore, not enough experience has yet been gathered to cover a large range of POLQA vs. PESQ behavioral scenarios. Ascom Network Testing is actively involved as an independent expert in the work going on within ITU related to this topic. Results are expected to be co-authored by several parties (including Ascom Network Testing) in the forthcoming POLQA Application Guide.

Without being exhaustive, we present in this section a set of main guiding rules for how POLQA scores could be related to PESQ scores. It should be noted that the recommended guidance is focused on evaluation and real-time monitoring of speech quality on wireless networks.

Before going into details, it should also be noted that experts and industry users of speech quality evaluation metrics know well that comparing two different objective quality assessment metrics is generally not recommended. There are two main reasons for this. First, the two metrics have been designed to cope with different types of speech degradations, which significantly impacts the filtering and processing of the speech as well as the perceptual model implemented in the algorithm [6]. Second, different subjective tests have been used for training, evaluation, and validation of the two metrics. Therefore, different subjective-to-objective mappings have been used, resulting in clearly different estimated MOS values.

4.1 Which POLQA vs. PESQ Modes Should Be Compared

As already mentioned, the POLQA NB mode is the mode that is backward compatible with PESQ NB. Therefore, the only feasible comparison is PESQ NB with the POLQA NB mode. This is possible because both have been designed based on subjective NB tests and therefore report scores on the NB MOS scale. It is not feasible to compare PESQ NB scores on the NB MOS scale with POLQA scores for NB speech obtained in SWB mode (that is, on the SWB MOS scale). This is expected, since the SWB mode differs significantly from the NB mode in terms of speech filtering and processing as well as human perception (simulated by the perceptual model) of the degradations. The same type and level of NB degradation will be perceived more sensitively on the SWB scale than on the NB scale, resulting in lower MOS predictions on the SWB scale.

4.2 When POLQA vs. PESQ Should Be Compared

A valid comparison can be performed only if the test set-up requirements for PESQ NB and POLQA NB are carefully met, as described in the standards [1], [2]. The main concerns relate to the level and the frequency shaping of the reference samples and to the NB-limited (3.4 kHz) transmission channel. The main difference between PESQ and POLQA is that PESQ has to be fed with IRS send filtered references, while POLQA must be fed with flat (unfiltered up to 14 kHz) references since, unlike PESQ, POLQA applies the filtering itself. In addition, using the same speech references (speech content and talkers) and running the algorithms simultaneously is recommended.

4.3 What Is the Expected Output of the Comparison?

POLQA NB mode has been designed not only to be backward compatible with PESQ NB, but also to be an improved solution that predicts subscriber perception accurately in a larger range of test conditions [3], such as VoIP time scaling, new commercial codecs (e.g., Skype/SILK), and both acoustical and electrical interfaces. Therefore, the same type and level of degradation will be rated by PESQ NB and POLQA NB with different predicted MOS scores.
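Score differences of this kind are most meaningfully summarized over whole sets of measurements rather than sample by sample. The following is a minimal sketch, assuming that PESQ NB and POLQA NB scores have been collected on the same test set-up and that a "difference at a percentile" means the difference between the corresponding percentiles of the two score distributions. The function names, the choice of percentiles, and the sample values are illustrative assumptions and are not part of either standard.

```python
import statistics


def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def compare_nb_scores(pesq_nb, polqa_nb, percentiles=(5, 50, 95)):
    """Summarize POLQA NB minus PESQ NB differences on the NB MOS scale.

    Both inputs are hypothetical lists of MOS scores obtained in NB mode.
    Scores obtained in SWB mode must not be mixed in (see section 4.1).
    """
    if not pesq_nb or not polqa_nb:
        raise ValueError("both score lists must be non-empty")
    summary = {
        "mean_diff": statistics.mean(polqa_nb) - statistics.mean(pesq_nb)
    }
    for pct in percentiles:
        # Difference between the same percentile of each distribution.
        summary[f"p{pct}_diff"] = (
            percentile(polqa_nb, pct) - percentile(pesq_nb, pct)
        )
    return summary


if __name__ == "__main__":
    # Illustrative values only, not measured data.
    pesq = [2.0, 3.0, 4.0]
    polqa = [2.5, 3.5, 4.5]
    print(compare_nb_scores(pesq, polqa))
```

A positive value means that POLQA NB scores are higher than PESQ NB scores for that statistic. Whether a given difference is significant depends on the size of the database and on the network and devices under test.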

The difference should be analyzed based on statistical metrics only. Recommended metrics are average (mean) values as well as percentiles. The expected statistical differences might be larger or smaller, and might show positive or negative trends, depending on the tested network as well as the device used.

For example, in the case of live CDMA networks, preliminary results show the POLQA average score exceeding the PESQ NB score by almost 0.4 MOS. This is expected, since PESQ NB is known to underpredict in this particular case, while POLQA proved more accurate.

Some preliminary results on a very small number of speech databases collected live on WCDMA wireless networks show differences that might reach about +/-0.25 MOS. In addition, the same preliminary results show that differences recorded at the 95th percentile might reach about +/-0.15 MOS. The preliminary results also show that the differences fall below +/-0.1 MOS for scores below about 3 MOS.

These are very preliminary results, and investigations on this topic are still going on within ITU-T. Ascom Network Testing, as an independent third party, is driving these ITU-T investigations to ensure that the forthcoming POLQA Application Guide provides a thorough understanding of this comparison as well as guidance on how these differences can be dealt with.

4.4 Why Are Differences Expected?

As already mentioned, one of the main reasons for an expected difference between POLQA NB and PESQ NB has its roots in the algorithms themselves. As one would imagine, there are several significant algorithmic differences in all of the phases: time alignment, level scaling and filtering, perceptual and cognitive modeling, and calibration (mapping of the raw POLQA score to the MOS scale). Here are some of the most important aspects.

The POLQA time alignment procedure needs to deal with time scaling, which translates into variable frequency sampling of the speech. In addition, this POLQA procedure needs to take into consideration frequency re-sampling with and without pitch preservation. Neither of these was needed in the case of PESQ.

The level scaling and filtering take into consideration both the electrical and the acoustical interfaces, which is not the case with PESQ. The level scaling needs to account for strong AGC variations as well as various types and levels of noise, something that was less developed in the PESQ algorithm.

Unlike in PESQ, the POLQA perceptual model uses both time and frequency masking, which significantly increases the accuracy in emulating the human perception of various and more complex distortions (e.g., speech re-sampling or time scaling, and reverberations or noise conditions).

Unlike in PESQ, the POLQA cognitive model also takes into consideration a reverberation indicator, a frequency indicator, and a noise indicator, generated based on the perceptual model.

Different speech signal processing techniques associated with both the perceptual and the cognitive model require that POLQA run various internal optimizations different from those of PESQ. In addition, the statistical evaluation metrics against which the two algorithms have been optimized are different (for PESQ, the Pearson correlation coefficient; for POLQA, the modified root mean square error). As one would expect, these completely different optimizations result in different behaviors.

The final difference related to the algorithms is the calibration, which has been performed to a large extent on different databases, resulting in different mapping functions and coefficients. Thus, different predicted MOS scores are expected.

Another equally important reason why a score difference should be expected between PESQ and POLQA is their performance. Although proven to show similar accuracy for a significantly large number of scenarios, POLQA NB slightly outperforms PESQ NB in several of the test cases (e.g., CDMA wireless networks).

5 Conclusions

Detailed analysis of the PESQ to POLQA transition shows that the new ITU-T P.863 standard is the long-term solution for the evaluation of voice service quality as perceived by subscribers. However, the analysis also suggests that, depending on the application, PESQ NB can still be a short-term solution. In most cases, operators are expected to move to POLQA once their WB voice service is widely deployed or a full VoLTE voice service solution is implemented in both networks and devices. The expected time frame for this is 2012-2013.

Regardless of when POLQA becomes widely adopted, operators will face the coexistence of two standards. This paper provides general guidance on how this scenario could be handled and what the expectations would be. However, as with all new technologies, extensive live testing is expected to build more experience and may later reveal new aspects. Ascom Network Testing, a proven veteran in ITU-T on the evaluation of objective quality metrics, continues to play an active role in the standardization work on this topic.
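The application-dependent guidance given in sections 3.1 to 3.7 can be condensed into a simple checklist. The following is a minimal sketch under stated assumptions: the `VoiceApplication` fields, the function name, and the default delay limit of one second are hypothetical and chosen for illustration only (section 3.7 cites a range of about 1-2 seconds). An empty result suggests that PESQ NB may still serve in the short term. It is not a formal decision procedure from either standard.

```python
from dataclasses import dataclass


@dataclass
class VoiceApplication:
    """Hypothetical description of a voice quality test application."""
    needs_wb_or_swb: bool = False            # section 3.1
    uses_new_or_hd_codecs: bool = False      # section 3.2, e.g. Skype/SILK
    has_time_scaling: bool = False           # section 3.3, jitter buffer adaptation
    uses_acoustical_interface: bool = False  # sections 3.4 and 3.6
    has_strong_agc_or_shaping: bool = False  # sections 3.5 and 3.7
    max_variable_delay_s: float = 0.0        # section 3.7


def reasons_for_polqa(app, delay_limit_s=1.0):
    """Return the list of reasons that point toward a POLQA transition.

    delay_limit_s is an assumed threshold for long variable delay.
    """
    checks = [
        (app.needs_wb_or_swb,
         "WB or SWB measurement or comparison is required (3.1)"),
        (app.uses_new_or_hd_codecs,
         "new commercial or HD codecs are involved (3.2)"),
        (app.has_time_scaling,
         "VoIP time scaling is present (3.3)"),
        (app.uses_acoustical_interface,
         "measurements involve an acoustical interface (3.4, 3.6)"),
        (app.has_strong_agc_or_shaping,
         "strong AGC or frequency shaping is present (3.5, 3.7)"),
        (app.max_variable_delay_s > delay_limit_s,
         "variable delay exceeds the assumed limit (3.7)"),
    ]
    return [reason for triggered, reason in checks if triggered]


if __name__ == "__main__":
    nb_monitoring = VoiceApplication()
    hd_voip = VoiceApplication(needs_wb_or_swb=True, max_variable_delay_s=2.5)
    print(reasons_for_polqa(nb_monitoring))  # no reasons: PESQ NB may suffice
    print(reasons_for_polqa(hd_voip))
```

Keeping the criteria as a flat list of independent checks mirrors the structure of section 3, where each subsection names one condition under which the transition to POLQA becomes necessary.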
6 Glossary

AAC    Advanced audio coding
ACR    Absolute category rating
AGC    Automatic gain control
AMB    Advanced memory buffer
AMR    Adaptive multi-rate
EFR    Enhanced full rate
EVRC   Enhanced variable rate codec
iLBC   Internet low bit rate codec
IMS    IP multimedia subsystem
IRS    Intermediate reference system
NB     Narrowband
QoE    Quality of experience
QoS    Quality of service
SAE    System architecture evolution
SON    Self optimizing network
SWB    Super wideband
VoLTE  Voice over LTE
WB     Wideband

7 Appendix: References

[1] ITU-T P.862.x series; P.862 (PESQ algorithm), P.862.1 (Mapping to MOS domain), P.862.2 (WB-PESQ), P.862.3 (PESQ-Application guide); PESQ algorithm.
[2] ITU-T P.863, Perceptual Objective Listening Quality Assessment (POLQA), Geneva, January 2011.
[3] I. Cotanis, Voice Service Quality Evaluation Techniques and the New Technology, POLQA, Ascom Network Testing white paper, November 2011.
[4] I. Cotanis, ITU-T SG 12 C112, Some aspects related to P.OLQA standard, Geneva, May 2010.
[5] I. Cotanis, ITU-T C142, Proposed study items for POLQA characterization phase, September 2010.
[6] John Beerends et al., Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement, October 2011, to be published in Journal of Audio Engineering Society.