Voice Service Quality Evaluation Techniques and the New Technology, POLQA




White Paper
Prepared by: Dr. Irina Cotanis
Date: 3 November 2010
Document: NT11-1037
Ascom (2010). All rights reserved. TEMS is a trademark of Ascom. All other trademarks are the property of their respective holders.

Contents
1 Today's Voice Service Challenges
2 Speech Quality Evaluation Techniques
2.1 Intrusive Techniques
2.2 Non-Intrusive Techniques
2.3 Standardization Status and Evolution Related to the Listening Quality of Voice Service
3 POLQA Technology
3.1 POLQA Algorithm's Overview
3.2 Operability Requirements
3.3 Telecommunication Test and Application Scenarios
3.4 Understanding POLQA Limitations
3.5 POLQA Algorithm's Performance Evaluation
4 Beyond the MOS Score
5 Ascom Network Testing Presence in the Standardization Work on Objective Evaluation Metrics for Listening Speech Quality
6 Conclusions
7 References

1 Today's Voice Service Challenges

Almost 10 years ago, operators and infrastructure vendors were struggling to provide speech quality on 2G networks at the level expected by users accustomed to PSTN quality. Network optimization and troubleshooting, as well as advanced speech processing techniques and an in-depth understanding of speech transport on wireless networks, helped operators bring the level of speech quality on 2G networks to that of fixed networks. With the 3G network evolution, the move to all IP, and the transition from narrowband (NB) to wideband (WB) speech, it was expected that wireless voice services would even surpass traditional PSTN quality.

However, today's voice services still raise a set of challenges for operators as they attempt to continue meeting their users' expectations. The roots of these challenges lie mainly in the convergence and coexistence of voice, data, and multimedia application services, which involve a multitude of factors that invariably produce new types of distortions that dynamically, variably, and sometimes randomly affect speech quality. These factors range from the increased demand for capacity generated by high and dynamic, application-dependent traffic patterns, to low and adaptive bit rate codecs with different bandwidths (NB, WB, and super wideband (SWB)) and complex error concealment solutions, to voice enhancement devices (e.g., noise suppressors, automatic gain control, echo cancellers). These devices are designed to counter speech degradation with speech processing techniques that, if not well designed and implemented, could have the opposite effect and degrade speech quality. In addition, with the next generation network (NGN) (LTE/SAE-SON) evolution, network vendors and operators are facing a challenging change from traditional CS, and then from VoIP, to VoIP over IMS (VoLTE). Details on these challenges can be found in [1].
2 Speech Quality Evaluation Techniques

Providing voice service on NGNs at the quality level demanded by subscribers, while supporting backward compatibility with 3G/2G networks and integrating voice with a myriad of multimedia and data services, increases the need for voice quality testing. Likewise, ensuring that the testing and evaluation of speech quality is itself of high quality comes with its own series of challenges. The need for cost-efficient speech quality evaluation techniques to replace subjective testing, while ensuring high accuracy on a larger variety of network configurations and conditions, codecs, bandwidths, and applications, continues to drive network testing tool and infrastructure vendors, as well as operators and standardization organizations, to collaborate on speech quality evaluation techniques.

Extensive work has been performed during the last decade by both the ITU-T and the telecommunication industry in developing speech quality evaluation algorithms designed to accurately evaluate the impact of any network degradation on subscriber perception as well as to cope with the complex testing conditions of the 3G environment. These speech quality evaluation algorithms have been developed with different scopes and applications. They can be either intrusive perceptual solutions, which perform end-to-end speech quality evaluation [2], [3], [6] on different types of networks (wireless, VoIP, or fixed) based on the speech signal, or non-intrusive solutions, which can evaluate speech quality at different nodes of the network (including the end node). Non-intrusive perceptual (single-ended) algorithms [4] are based on the degraded speech signal, while non-intrusive parametric algorithms [5] are based on network parameters.

2.1 Intrusive Techniques

These algorithms provide speech quality scores by comparing reference (transmitted) and degraded (received) speech samples. Therefore, intrusive assessment techniques require access to both the transmission and reception ends of the communication. Comparing time-frequency processed reference and degraded speech samples based on human perception and cognition models facilitates an accurate estimation of the subjective perception of the speech quality received by the terminal. This accuracy, however, comes at the cost of sending test samples through the network under test. The connection under test is therefore withdrawn from normal service and rendered unavailable to the customer. During peak hours, and for some technologies and certain areas, the additional test load may generate artificially low quality scores. Intrusive perceptual metrics estimate end-to-end speech quality, and thus are useful and meaningful to network operators for monitoring the quality of experience (QoE) of their voice service subscribers.

2.2 Non-Intrusive Techniques

Non-intrusive metrics can be network parameter based or speech based. Parametric methods can use RF and/or IP parameters for predicting quality.
Their limitation is that these algorithms can predict quality as affected either by the radio access network or by the IP core network, but not by both. Only a few ongoing studies are investigating the possibility of combining the effects of both RF and IP parameters on speech quality.

The non-intrusive speech based methods must infer the transmitted original speech from the degraded signal. Strong degradations can easily affect the accuracy of these predictions and, therefore, the overall speech quality evaluation. As a result, even though they process the speech signal using human perception and cognition models, these algorithms are recommended only when a large number of samples is available for averaging [4].

Although less accurate than intrusive perceptual metrics, non-intrusive perceptual and parametric algorithms have an important role in network monitoring for SLAs as well as in the troubleshooting and optimization of different network elements.

2.3 Standardization Status and Evolution Related to the Listening Quality of Voice Service

Techniques for objective and subjective evaluation of voice service quality are developed within ITU-T Study Group 12 (Performance, QoS and QoE). Standardization organizations such as ETSI/3GPP and other industry forums work in liaison with ITU-T. For almost a decade, the intrusive perceptual solution for listening speech quality evaluation has been the PESQ standard P.862 (along with P.862.1, P.862.2, and P.862.3) [2]. With the 3G network evolution towards all IP, particularly NGN (LTE/SAE-SON), ITU-T recognized the industry's immediate need for a new standard that would both improve current PESQ performance under certain specific network conditions (e.g., CDMA networks, EVRC codecs) and cover the 3G network evolution for voice service: from traditional CS to VoIP and VoIP over IMS, from NB to WB and SWB, and from low codec rates to very low and adaptive codec rates. As a result, POLQA was developed [3], [6].

POLQA development and the wireless technology evolution toward NGN showed that more than a subjective mean opinion score (MOS) is needed for infrastructure vendors and operators to understand subscriber perception and to appropriately troubleshoot and optimize their networks for the voice service. Details related to new study items initiated in ITU-T are presented in [1].

The non-intrusive solution is covered by the perceptual metric P.563 and by the IP parametric based P.564. Comprehensive summaries of standardized speech quality evaluation metrics, their characteristics, and their applications are presented in Figure 1 for perceptual based metrics and in Figure 2 for parametric based metrics.

Perceptual (signal based)
E2E QoE monitoring
Troubleshooting in correlation with perception metric

Intrusive: Uses test original and degraded speech signals to provide quality score
Advantages:
- Highly accurate estimator of subscriber's opinion
- Reflects the quality ensured by the entire network as perceived by users
- Requires access only to the end point
Disadvantages:
- Uses test stimuli that could artificially load the network
- Limited space-time granularity defined by the speech/video sample length requirement
Algorithms:
- ITU-T P.862 and P.862.1 to P.862.3 (PESQ)
- ITU-T P.863 (POLQA) (ITU-T consented on 17 September 2010)

Non-intrusive: Uses impaired, received speech to predict quality
Advantages:
- Normal usage of the network
- Troubleshooting the problem generating node
- High time and space granularity
Disadvantages:
- Low accuracy (high-order averaging is required and therefore possible problems could be smoothed out)
Algorithms:
- ITU-T P.563

Figure 1. Perceptual (Signal-Based)

No Reference (parametric based)
Troubleshooting in correlation with the network parameters and perceptual metric

Non-intrusive: Uses IP / transport parameters (or could possibly use RF, too)
Advantages:
- Normal usage of the network
- Troubleshooting the problem generating node (if access enabled)
- High time and space granularity
- Possibility for quick correlation with network behavior
Disadvantages:
- Low accuracy (high-order averaging is required and therefore possible problems could be smoothed out)
- Quality evaluation is one-dimensional, taking into consideration metrics belonging to a single segment of the entire network (such as IP)
Algorithms:
- ITU-T P.564 (IP parameter based)

Figure 2. No Reference (Parametric Based)

3 POLQA Technology

Today, voice service quality is determined by more than the speech codecs used or the frames lost.
Networks and devices now integrate many new components, ranging from voice enhancement devices (e.g., automatic gain controllers, noise reduction, and smart loss concealment schemes) to new techniques and features such as time scaling (stretching and compression of the speech signals in the time domain). All these components have been designed to ensure, maintain, and possibly even improve the user experience of the perceived voice service quality. However, due to the complexity of the speech processing involved, these components might cause new and unexpected degradation effects. POLQA is specifically designed to handle disruptive effects caused by these multicomponent distortions.

3.1 POLQA Algorithm's Overview

As an intrusive perceptual metric, POLQA processes and compares the transmitted original speech signal and the degraded received speech signal in order to predict the quality that would be perceived by subjects (regular subscribers) in a subjective listening test. The high level architecture of the algorithm is presented in Figure 3.

POLQA processes both the original signal and the degraded signal before performing the comparison. The original signal is processed because subjective testing is carried out without a direct comparison against an original (Absolute Category Rating), so the ideal signal on which the subject bases his or her opinion is unknown during the test. The processing of the degraded signal models high-level cognitive processes (e.g., relative insensitivity to linear frequency response distortion and to steady state wideband noise [3]).

POLQA runs a time alignment of the degraded signal against the original speech signal before the comparison process. The determined delay is used both for estimating the proper sampling frequency and for delay compensation in the comparison process, which is based on a perceptual model [3]. The accuracy of the comparison process is determined by the transformation of the original and degraded signals into an internal representation that is similar to the psychophysical representation of audio signals in the human auditory system.
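As a rough illustration of the time alignment idea described above, the following Python sketch estimates a single fixed delay between a reference and a degraded signal by brute-force cross-correlation. It is not the alignment procedure of P.863, which also has to cope with variable delay and time scaling. The function names estimate_delay and lag_to_ms, the max_lag parameter, and the toy sample values are assumptions made for this example only.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag in samples (0..max_lag) that maximises the
    cross-correlation between the reference and the degraded signal.

    Hypothetical helper: assumes one constant delay and no time scaling.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate the reference with the degraded signal shifted by `lag`.
        score = sum(r * d for r, d in zip(reference, degraded[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def lag_to_ms(lag, sample_rate_hz):
    """Convert a delay in samples to milliseconds."""
    return lag * 1000.0 / sample_rate_hz


# Toy data: the degraded signal is the reference delayed by three samples.
reference = [0, 1, 3, -2, 4, 0, -1, 2]
degraded = [0, 0, 0] + reference

lag = estimate_delay(reference, degraded, max_lag=5)
print(lag, lag_to_ms(lag, sample_rate_hz=8000))
```

The estimated lag can then be used to shift the degraded signal before any sample-by-sample comparison. A real implementation would typically estimate the delay per speech segment rather than once per file.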
The transformation to the internal representation is applied in the perceptual frequency (Bark) and loudness (Sone) domains, and runs in several steps: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling [3]. The internal representation takes into account several factors impacting the perceived quality, such as playback level mapping from the digital signal representation level, local gain variations, rapid variations, linear filtering, and noise levels. In addition, it applies different levels of compensation for these factors depending on their final contribution to the overall perceptual disturbance. Therefore, minor and stationary differences between the original and degraded speech signals are compensated, while more severe effects known to have a greater impact on the perceived quality are only partially compensated [3].

The final stage, which yields the perceived quality at the output of the model, calculates the difference between the original and degraded internal representations based on a small number of quality indicators that are used to model all related subjective effects. The cognitive model calculates the following parameters: frequency response indicator, noise indicator, room reverberation indicator, and three more indicators describing the internal differences in the time-pitch-loudness domain. All these indicators are combined to give an objective listening quality expressed by the raw POLQA score [3].

The raw POLQA score is then mapped to the subjective MOS domain, MOS-LQO. The mapping is a third order polynomial developed based on a large set of databases (tens of thousands of speech samples) containing a broad range of network types (fixed, IP, and mobile) and conditions (simulated error patterns and live degradations), codecs (e.g., AMR NB/WB, G.722.1, iLBC, EVRC, EVRC-WB, EVRC-A/B, AAC/AAC LD, Skype, MP3 low bit rate, G.726, EFR), various background noise (BGN) types and levels, different languages (American and British English, German, Swedish, French, Dutch, Czech, Chinese, and Japanese), and three speech bandwidths (NB, WB, and SWB).

Figure 3. High Level Architecture of POLQA Algorithm. Blocks shown: original speech and degraded speech inputs; environment modeling (listening conditions / cognitive perception); time alignment (delay estimates); perceptual (psycho-acoustic) model producing the internal representations of the original (transmitted) and degraded (received) speech signals; difference between internal representations (user perceived); cognitive model; raw POLQA; mapping to subjective domain, based on speech databases (NB/WB/SWB; variety of codecs, wireless / VoIP simulated / live conditions, acoustic / electrical, BGN conditions, languages); POLQA MOS-LQO. Possibly various speech based diagnostics (e.g., delay, gain levels, noise).

3.2 Operability Requirements

The POLQA algorithm is designed to predict overall listening speech quality under NB, WB, and SWB (50 to 14000 Hz) conditions in 3G/4G (LTE-SAE) networks, including advanced speech processing technologies, acoustical interfaces, and hands-free applications. It should be noted that POLQA has two operational modes: SWB and NB. The main difference is the bandwidth of the original speech signal used by the model. In SWB mode, the received (and potentially degraded) speech signal is compared with an SWB reference.
Therefore, band limitations are considered to be degradations and are scored accordingly. The listening quality is modelled as perceived by a human listener using a diffuse-field equalized headphone with diotic presentation (same signal at both ear-caps).

In NB mode, the received (and potentially degraded) speech signal is compared to an NB (300 to 3400 Hz) original. Thus, normal telephone band limitations are not considered to be severe degradations. NB mode maintains compatibility with the previously developed ITU-T Recommendation P.862.1 (PESQ) [2]. The listening quality is modelled as perceived by a human listener using a loosely coupled IRS type handset at one ear (monotic presentation).
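The difference between the two operational modes can be summarized in a small Python sketch. The band edges and presentation types are taken from the text above; the OperationalMode type, the is_band_limited helper, and the idea of describing a received signal by its lower and upper band edge are assumptions for illustration and are not part of the P.863 interface.

```python
from typing import NamedTuple


class OperationalMode(NamedTuple):
    """Hypothetical descriptor of a POLQA operational mode."""
    name: str
    ref_low_hz: int    # lower edge of the reference signal band
    ref_high_hz: int   # upper edge of the reference signal band
    presentation: str  # listening situation assumed by the model


SWB = OperationalMode("SWB", 50, 14000,
                      "diotic, diffuse-field equalized headphone")
NB = OperationalMode("NB", 300, 3400,
                     "monotic, loosely coupled IRS type handset")


def is_band_limited(mode, signal_low_hz, signal_high_hz):
    """True if the received signal covers less than the mode's reference
    band, in which case the band limitation counts as a degradation."""
    return signal_low_hz > mode.ref_low_hz or signal_high_hz < mode.ref_high_hz


# A telephone-band (300 to 3400 Hz) signal is penalized in SWB mode only.
print(is_band_limited(SWB, 300, 3400), is_band_limited(NB, 300, 3400))
```

The sketch only captures the bandwidth rule. The mode also changes the modelled listening situation, which a simple band comparison cannot express.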

3.3 Telecommunication Test and Application Scenarios

The telecommunication scenarios include current transmission technologies [3]:
- Public switched networks (e.g., fixed wire PSTN, GSM, WCDMA, CDMA)
- Push-over-Cellular, Voice over IP, PSTN-to-VoIP interconnections, and Tetra
- Commonly used speech processing components (e.g., codecs such as AMR NB/WB, G.722.1, iLBC, EVRC, EVRC-WB, EVRC-A/B, AAC/AAC LD, Skype, MP3 low bit rate, G.726, and EFR; noise reduction systems for different types of BGN such as office, street, car, and babble; adaptive gain control; comfort noise; and other types of voice enhancement devices) and their combinations

The tested distortion types [3] cover:
- Single speech codecs and speech codecs used in tandem, as currently used in telecommunication scenarios
- Packet loss and concealment strategies (packet-switched connections)
- Frame errors and bit errors (wireless connections)
- Interruptions (such as unconcealed packet loss or handover in GSM)
- Front-end clipping (temporal clipping)
- Amplitude clipping (overload, saturation)
- Variable delay (VoIP, video-telephony) / time warping
- Gain variations
- Influence of linear distortions (spectral shaping), including time-variant ones
- Non-linear distortions produced by the microphone / transducer at acoustical interfaces
- Reverberations caused by hands-free test setups in defined acoustical environments

The application scenarios cover both electrical and acoustical measuring interfaces as well as different terminal types (handset, headphone, or hands-free).
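To make two of the listed distortion types concrete, the Python sketch below applies amplitude clipping and unconcealed frame loss to a list of samples. It is a toy simulation for building intuition, not a description of how the POLQA training databases were produced. The function names, the frame length, and the sample values are assumptions chosen for the example.

```python
def amplitude_clip(samples, limit):
    """Simulate overload/saturation: limit every sample to [-limit, +limit]."""
    return [max(-limit, min(limit, s)) for s in samples]


def drop_frames(samples, frame_len, lost_frames):
    """Simulate unconcealed packet loss: zero out the frames whose indices
    are listed in `lost_frames` (frame 0 starts at sample 0)."""
    out = list(samples)
    for frame in lost_frames:
        start = frame * frame_len
        for i in range(start, min(start + frame_len, len(out))):
            out[i] = 0
    return out


# Toy signal with a frame length of two samples.
signal = [1, 2, 3, 4, 5, 6, 7]
print(amplitude_clip(signal, limit=4))
print(drop_frames(signal, frame_len=2, lost_frames=[1, 3]))
```

In an intrusive test, signals degraded in such ways would play the role of the received signal, while the untouched input would be the reference.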

3.4 Understanding POLQA Limitations

It should be noted that there are several conditions and applications for which POLQA was not designed. POLQA scores obtained in these types of conditions are not reliable and should not be considered for any kind of speech quality evaluation. These conditions include:
- Other dimensions of speech quality such as conversational aspects and talking quality.
- Speech quality per call. POLQA is not intended to score longer sequences of speech. It is focused on predicting quality for shorter speech utterances of 6 to 12 seconds.
- Noisy listening environments. POLQA does not predict perceived speech quality in these environments; it is designed in accordance with P.800 ACR testing.
- Music (including multimedia).
- Evaluation of performance or ranking of voice enhancement devices (e.g., noise suppressors).
- Other technologies or components such as speech storage formats, or non-telephony applications such as public safety networks or professional mobile radio connections.

Although not yet tested or evaluated, POLQA could be cautiously applied to the following:
- Other languages (e.g., Russian, Arabic)
- Longer speech samples

Subjective tests for confirming POLQA performance on these types of applications are recommended.

3.5 POLQA Algorithm's Performance Evaluation

Understanding POLQA performance as an estimator of subscriber perception relies on the fact that results from a subjective experiment reflect the relative quality between the tested speech samples, while the absolute values could vary from experiment to experiment depending on the listener group and the design of the subjective test. Unlike subjective results, POLQA is independent of test context and individual voter behavior. POLQA estimates the average subjective score obtained from a group of voters listening to the same speech sample.
Although it does not provide the exact absolute score of an individual experiment, POLQA does reproduce the relative quality ranking [3]. Therefore, POLQA performance evaluation involves comparison to subjective scores as well as consideration of the variability that exists within a listening panel. In addition, the differences between individual subjective experiments must be removed. This is achieved by determining and applying an optimal regression function (3rd order polynomial) between the subjective and objective scores.

Due to the large number and variety of databases, as well as the variability of their content, a rigorous and extensive evaluation procedure has been developed for POLQA testing. A series of different statistical metrics as well as statistical significance testing have been used [3], but the core metric against which the algorithm has been optimized is the epsilon insensitive root mean square error (rmse*), which brings statistical significance and accuracy in the sense that it best emulates the usability of POLQA and its performance in real life scenarios. The epsilon insensitive root mean square error expresses the POLQA error against the average MOS of individual voters, counting only the part of each difference that exceeds an epsilon-wide band around the target average value. The uncertainty of a MOS panel is thus taken into account by the epsilon value, defined as the 95% confidence interval of the averaged MOS. The rmse* is defined as:

\[ rmse^{*} = \sqrt{\frac{1}{N-d} \sum_{i=1}^{N} P_{error}(i)^{2}} \]

\[ P_{error}(i) = \max\bigl(0,\ \lvert MOS_{LQS}(i) - MOS_{LQO}(i) \rvert - ci_{95}(i)\bigr) \]

where the index i denotes the condition or speech sample, N denotes the number of conditions or speech samples, and d denotes the degrees of freedom (d = 4 in the case of a 3rd order regression).

The results reported in [12] provide general information on POLQA performance on a broad range of databases containing a large variety of technologies, codecs, and bandwidths. These results, which represent an overall performance, might be misleading to a certain extent. Due to the variety of databases and the statistical aggregation procedure of the results [3], [12], a weaker or better performance for a specific application and/or bandwidth could be smoothed out or hidden. Therefore, additional analysis is needed for a more detailed picture or for a particular application.
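The epsilon insensitive root mean square error defined above can be computed in a few lines. The following Python sketch assumes that the objective scores have already been passed through the 3rd order regression, so that d = 4. The function name epsilon_insensitive_rmse and the example scores and confidence intervals are made up for illustration and do not come from any POLQA database.

```python
import math


def epsilon_insensitive_rmse(mos_lqs, mos_lqo, ci95, d=4):
    """Compute rmse*: differences inside the 95% confidence interval of the
    subjective score count as zero error; d is the degrees of freedom
    (4 for a 3rd order polynomial regression)."""
    n = len(mos_lqs)
    if not (n == len(mos_lqo) == len(ci95)):
        raise ValueError("all inputs must have the same length")
    if n <= d:
        raise ValueError("need more conditions than degrees of freedom")
    # Perror(i) = max(0, |MOS-LQS(i) - MOS-LQO(i)| - ci95(i))
    perror = [max(0.0, abs(s - o) - c)
              for s, o, c in zip(mos_lqs, mos_lqo, ci95)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))


# Hypothetical per-condition scores (subjective, objective) and ci95 values.
subjective = [4.0, 3.0, 2.0, 1.5, 3.5, 2.5]
objective = [3.5, 3.1, 2.6, 1.5, 3.5, 2.5]
ci95 = [0.2] * 6

print(round(epsilon_insensitive_rmse(subjective, objective, ci95), 4))
```

In this example only two of the six conditions fall outside their confidence band, so only those two contribute to the error. A plain root mean square error would also penalize the small difference in the second condition.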
This additional analysis is planned by ITU-T during the POLQA characterization phase, and the results are expected to be published in the forthcoming POLQA Application Guide (estimated for June 2011).

4 Beyond the MOS Score

Due to the complexity of the NGN environment, as well as the challenges in supporting voice service on LTE-SAE/SON networks, several solutions for providing voice service are currently envisioned. Therefore, the testing and evaluation of speech quality in the NGN environment must be comprehensive. In order to understand and cost-efficiently control the speech degradation of different implementation solutions, evaluation techniques need to go beyond the MOS score.

To a large extent, as in the PESQ case, interim calculations of POLQA as well as the six degradation parameters used as input to the POLQA algorithm's cognitive model would allow some network diagnosis based on speech quality evaluation. Details are discussed in [1], but generally the main diagnosis could cover aspects such as latency, jitter (variable delay), gain variations, speech signal and BGN level measurements, level clipping, dropouts (e.g., generated by packet loss), operability of VAD, and short-term spectra (linear degradations caused by the frequency response of the devices and/or by the VoIP landline connection).

5 Ascom Network Testing Presence in the Standardization Work on Objective Evaluation Metrics for Listening Speech Quality

For more than 10 years, Ascom Network Testing has been an active member of ITU-T Study Group 12, which develops objective speech quality evaluation metrics. Our contributions to the standardization work cover different areas and stages of objective metric development. Ascom Network Testing contributed live recorded speech databases needed for accurate training and tuning of the algorithms running in real life scenarios typical of the network troubleshooting, optimization, and operation applications performed by operators. Within ITU-T, we initiated and developed the statistical evaluation procedure for objective metrics that was first applied to PESQ and later applied in a modified form to POLQA [8]. Recently, based on our initial work as well as work performed for POLQA performance evaluation, Ascom Network Testing introduced a new study item within ITU-T on a more general statistical evaluation procedure to be applied to various types of objective metrics [9].
This type of evaluation is increasingly a must for all kinds of objective metrics (e.g., speech, video, audio, multimedia) that are designed for testing in real life networks and therefore for implementation in network testing tools. We also developed a technique for calibrating objective quality metrics to the MOS scale. As a result, we co-authored two standards in relation to PESQ: P.862.1 (Mapping PESQ to MOS domain) and P.862.3 (Guidance for PESQ usage) [2]. Additionally, Ascom Network Testing recently wrote a white paper contribution [10] on aspects related to POLQA implementation in field testing tools, as well as a white paper contribution on topics that need to be studied during the POLQA characterization phase [11].

6 Conclusions

The convergence and coexistence of voice, data, and multimedia application services involve a multitude of factors that invariably produce new types of distortions that dynamically, variably, and sometimes randomly affect voice service quality. Today, speech quality is determined by more than the speech codecs used or the frames lost. Networks and devices now integrate many new components, ranging from voice enhancement devices to new techniques such as time scaling.

Extensive work has been performed during the past decade by both the ITU-T and the telecommunication industry in developing speech quality evaluation algorithms designed to accurately evaluate the impact of any network degradation on subscriber perception as well as to cope with the complex testing conditions of 3G networks and beyond. The new technology, POLQA, was developed to cope with the complexity of the evolving networks. As with all new technologies, extensive live testing is expected to complete the picture of the POLQA algorithm's performance. Ascom Network Testing, a proven veteran in ITU-T objective quality metrics evaluation, continues to play an active role in the standardization work on this topic.

7 References

[1] I. Cotanis, Voice Services in the Next Generation Networks/LTE-SON as Perceived by Users, Ascom Network Testing white paper, November 2010.
[2] ITU-T P.862.x series: P.862 (PESQ algorithm), P.862.1 (Mapping to MOS domain), P.862.2 (WB-PESQ), P.862.3 (PESQ application guide).
[3] ITU-T P.863, Perceptual Objective Listening Quality Assessment (POLQA), Geneva, January 2011.
[4] ITU-T P.563, Single-ended method for objective speech quality assessment in narrow-band telephony applications.
[5] ITU-T P.564, Conformance testing for voice over IP transmission quality assessment models.
[6] ITU-T TD SG 12 Gen 345, Final report of Working Party 2, Geneva, May 2010.
[7] ITU-T P.800, Subjective testing of overall listening speech quality.
[8] I. Cotanis, ITU-T SG12/Q9 C137, A procedure for statistical evaluation of the objective quality metrics performance, May 2008.
[9] I. Cotanis, ITU-T C151, Proposal on statistical evaluation framework for objective quality algorithms, submitted for ITU-T January 2011 meeting.
[10] I. Cotanis, ITU-T SG 12 C112, Some aspects related to P.OLQA standard, May 2010.
[11] I. Cotanis, ITU-T C142, Proposed study items for POLQA characterization phase, September 2010.
[12] Opticom, TNO, SwissQual, ITU-T C148, Performance of the joint POLQA model, September 2010.
[13] POLQA coalition, www.polqa.info, July 2010.