Packet Loss Concealment Algorithm for VoIP Transmission in Unreliable Networks Artur Janicki, Bartłomiej KsięŜak Institute of Telecommunications, Warsaw University of Technology E-mail: ajanicki@tele.pw.edu.pl, bksiezak@op.pl Abstract In this paper the authors propose an algorithm for packet loss concealment (PLC) in transmission over IP-based networks with high packet loss rate. The algorithm is a sender-receiver-based extension of ANSI T1.521a Annex B PLC standard for G.711 voice codec. It consists in adding to a transmitted packet redundant parameters describing speech signal in another packet. Efficiency of the proposed algorithm was verified using subjective Absolute Category Rating (ACR) method and objective PESQ algorithm, and compared with original ANSI T1.521a Annex B standard. The intelligibility of speech was assessed using Semantically Unpredictable Sentences (SUS) tests. For high packet loss rates all assessment methods proved superiority of the proposed algorithm over the original ANSI standard. The ACR tests showed that the proposed method can maintain speech quality above 3 in MOS scale even for packet loss rates of 20%-25%. 1. Introduction Voice over IP (VoIP) technology nowadays provides users with speech transmission quality similar (or higher) to channel-switching telephony, on condition that it is realized via network with guaranteed Quality of Service (QoS). But quite often it is transmitted using best-effort networks such as majority of the Internet, or exposed to burst errors as in WLAN s. Such networks can be quite unreliable in the sense of packet loss and packet delays. If the network is highly congested the delay values increase and voice packets do not arrive on time, resulting again in packet loss. In this case the quality of the transmitted speech can be severely impaired. There are different strategies how to cope with packet loss in VoIP transmission they are called packet loss concealment (PLC) algorithms and some of them are briefly presented in the next section. In this paper the authors propose a PLC algorithm able to cope with high packet loss rates (PLR). On the receiver side it is an improved version of ANSI T1.521a Annex B algorithm elaborated for popular ITU-T G.711 codec (PCM 64 kbps), but it adds slight redundancy on the sender side, to improve the quality of transmitted speech in lossy IP-based network environment. Section 3 explains in details suggested algorithm. A comparative study between original ANSI T1.521a Annex B standard, the proposed PLC algorithm and no-plc version is then presented. Section 4 describes testing methodology and section 5 presents results of experiments and their discussion. 2. Packet loss concealment schemes PLC methods can be generally divided into coderdependent and coder-independent techniques [16]. The former ones are codec-oriented they use specificity of given coder operation and for example prevent them from loss of synchrony in case of packet loss. An example of PLC for G.722 codec is proposed in [13], where authors suggest adding site information in order to enable packet regeneration and to help the decoder to re-synchronize after a packet loss. Coder-independent techniques do not need to know much about the used codec. Depending on where the PLC algorithm operates they can be divided into: receiver-based techniques, such as ANSI T1.521a Annex B standard [1] developed for G.711 codec, which is reconstructing the lost packet(s) using parameters (linear prediction coefficients, F0, excitation) of the last correctly received packet. Another example using Hidden Markov Models (HMM) for estimating lost packet parameters is presented in [11];
sender-receiver-based algorithms, using e.g. Forward Error Correction (FEC), retransmission schemes or adding redundancy. Redundancy-based techniques can be quite effective as for quality improvement, but require additional bandwidth. The idea is that a voice packet carries ( piggybacks ) redundant data for the other packets. In the simplest approach it can carry duplicated neighboring packet. More sophisticated techniques consist in duplicating another packet, but encoded with a different voice codec, usually less bandwidth demanding. Such a method of combining G.711, ADPCM, GSM 06.10 is proposed in [3]. Other researchers proposed adding redundancy in the form of parameters of other packets: LPC (Linear Prediction Coding) coefficients [7], voicing and F0 (fundamental frequency) information [12], energy and zero crossing ratio [6]. In [14] the authors propose adding redundancy in the form of excitation parameters added to the preceding packet, but only for the most important packets. However, the proposed methods were usually meant for lower PLR values. 3.2. Idea of the proposed algorithm The proposed algorithm avoids the above described problem by adding some redundancy to the encoded signal. To do this we suggest introducing the following changes at the sender side: implementing buffering of τ packets; adding to the currently transmitted n-th packet some parameters of the packet number n+τ. Following other research results mentioned in section 2 it is proposed to choose parameters able to characterize speech signal contained in a given packet in the best possible way, and also in a concise form. Having that in mind and to be compliant with a G.711 decoder working with ANSI T1.521a Annex B PLC algorithm, the following parameters were selected: 20 LPC coefficients, characterizing the speaker s vocal tract transmittance in a given time instant; T0 parameter a pitch period, equal 1/F0, where F0 is a pitch of a packet being characterized; E short-time energy of the signal. 3. Proposed algorithm This section presents the idea of the proposed packet loss concealment algorithm able to cope with high PLR. On the receiver side the algorithm is in fact using a modified ANSI T1.521a Annex B standard, which takes different parameters. In addition the proposed solution involves changes on the sender side, so this technique becomes sender-receiver-based. 3.1. Drawbacks of ANSI T1.521a Annex B ANSI T1.521a Annex B is a receiver-based PLC technique for G.711 which performs satisfactory if single packets are lost. For unreliable networks with high PLR, when loss of consecutive 2-3 or more packets is quite likely, the output speech quality is impaired, because the algorithm keeps reproducing the same last correctly received packet over and over, only decreasing amplitude of the samples according to the scaling function. This causes a discomfort for a listener and also impacts speech intelligibility, as it lengthens some phonemes and often allows other ones to disappear. It can be especially noticeable and harmful when speech is changing from voiced to unvoiced or vice versa. Figure 1. Algorithm for selection of reconstruction parameters.
If a receiver realizes that the current packet is lost, it is trying to reconstruct it using the excitation signal of the last correctly received packet (as in ANSI T1.521a Annex B standard), but with LPC parameters taken from packet number n-τ and with T0 value proper for the lost packet, instead of repeating T0 of the last correctly received packet. E parameter is used to scale the whole packet to keep the same energy of the reconstructed signal as in the original one. If packet n-τ is not available (because it got lost, too), then the closest neighboring set of parameters should be used to reconstruct the speech signal (see Figure 1). pitchdetat algorithm [17], so this one was used. For unvoiced speech T0 value was chosen randomly, so that even if a unvoiced packet was reconstructed using excitation signal of a voiced one it was loosing its harmonicity. E energy parameter was encoded using a quantizer with logarithmic characteristics and stored as the last byte of total 20 redundant ones. Assuming 20 ms frame size and sending one speech frame in one Ethernet packet, the increase of payload of a G.711 packet is not high: the redundant parameters cause 12.5% overhead, see Figure 2. 3.3. Additional requirements of the proposed PLC algorithm The proposed algorithm has some drawbacks if compared with ANSI T1.521a Annex B. Firstly it requires changes on the transmitter side, so e.g. the transmitting softphone or a VoIP router has to be PLC-algorithm aware. It also introduces additional computational load, because the signal parameters, unlike as in ANSI T1.521a Annex B standard, would need to be calculated (on a sender side) for every packet, not only in case of packet loss (on a receiver side). The elaborated algorithm also forces the packets to carry increased payload, due to redundancy. Last but not least, it is introducing additional delay due to signal buffering in the encoder. The higher the τ parameter is, the higher the delay is. Too low τ can make the PLC algorithm not prove to losses of even small group of packets. Anyway the authors believe that potential benefits of this algorithm, when working in a network of high PLR, can overcome some drawbacks of this approach. 3.4. Implementation of the proposed PLC method The proposed PLC algorithm was implemented in Matlab environment, using Voicebox toolbox for speech processing [15]. The LPC parameters were encoded by first converting them to reflection coefficients (RF), so that they ranged between (-1; 1) for easier quantization. A linear 7-bit quantizer was used for quantization of RF values, so the prediction coefficients filled up 18 bytes. T0 (the pitch period in samples) occupied the next byte, ranging from 16 to 255, corresponding to F0 of 50-500 Hz. During experiments it turned out that the most precise F0 detection was achieved using Figure 2. Content of Ethernet packets for original G.711 (above) and for G.711 with redundant parameters for PLC. The value of τ had to be chosen as a trade-off between efficiency and delay and experimentally set to 5. This is a quite high value, but lower values met in other works (e.g. 1-3 in [3]) turned to be insufficient for higher PLR. Figure 3 presents visualization of packet loss concealment using both ANSI T1.521a Annex B and the proposed algorithm. Even by observing only the waveform one can notice that in case of severe signal damage (5 packets lost) the proposed method reconstructed the speech signal much better than the ANSI T1.521a Annex B standard, which kept copying the last properly received packet regardless of voicing loss, phoneme s boundary etc. 4. Testing methodology To verify if the proposed packet loss algorithm performs better than ANSI T1.521a Annex B standard, comparative experiments were run. 24 semantically unpredictable sentences (SUS) were recorded in Polish
Figure 3. Illustration of packet reconstruction: (A) original signal, (B) signal with 5 packets loss and no PLC, (C) signal reconstructed with ANSI T1.521a Annex B, (D) signal reconstructed with the proposed algorithm. by a male and female speaker. They were transmitted through software packet network simulator at different packet loss rate (0-35%), using Gillbert-Elliott model of packet loss [8]. For a given PLR the same pattern of loss packets was maintained when testing all PLC methods. Three different PLC variants were used: no PLC algorithm at all; lost packets reconstructed using PLC algorithm according to ANSI T1.521a Annex B standard; lost packets reconstructed by the proposed PLC algorithm. The reconstructed signal was recorded, thus forming 3 sets of output speech signal, henceforth called: no PLC, T1.521a, proposed PLC. As quality assessment methods the following measures were selected: PESQ (Perceptual Evaluation of Speech Quality) algorithm for objective quality measure; ACR (Absolute Category Rating) method for subjective quality assessment; SUS (Semantically Unpredictable Sentences) tests for intelligibility testing.
PESQ is an ITU-T standard for objective doublesided quality assessment for speech [9] which estimates listener s opinion by comparing transmitted and original speech signal, using human perception model. ACR is a subjective method which can be used for assessment of speech quality; among others it can be used for testing VoIP transmission quality [4]. It requires participation of listeners who assess the output speech quality, assigning scores on a 5-degree MOS (Mean Opinion Score) scale ranging from very bad to excellent. SUS method of intelligibility testing consists in transmitting nonsense (semantically unpredictable) sentences through a telecommunication channel. A sample SUS sentence in English can be: A table dances through the yellow future. Listeners also need to participate in this test they are asked to write down the sentence as they can hear it. Due to lack of sense in SUS sentences, they are able to correctly note them only if the speech is intelligible enough. This method has been previously successfully used in testing speech synthesis [2] and also in VoIP intelligibility assessment [10]. PESQ and ACR methods result in MOS score (0.5-4.5), while intelligibility tests result in percentage of correctly conveyed sentences and words. 48 subjects took part in SUS and ACR tests; they were exposed to 576 recordings (24 SUS sentences x 8 PLR values x 3 PLC algorithms), so that each version was assessed by 2 listeners. The results are presented and discussed in the next section. 5. Results of experiments Figure 4 presents average MOS results achieved both from calculations of PESQ algorithm (objective method) and from ACR technique during listening tests (subjective method). Both tests showed obviously the maximum score (4.5 MOS) for signal with no loss. They also demonstrated decreasing quality along with increasing packet loss rate and none of algorithms was able to prevent it. However, in both tests the proposed PLC algorithm significantly outperformed the original ANSI T1.521a Annex B standard. According to subjective assessment transmission over a network with PLR equal 25% resulted still in MOS above 3. The higher packet loss was, the higher MOS gain was observed, sometimes exceeding even 0.5 MOS. What is more, it turned out that for heavily unreliable (lossy) networks a receiver with ANSI T1.521a Annex B standard PLC algorithm according to subjective judgments in ACR tests performs even worse than lack of any PLC algorithm at all. Apparently listeners preferred silence in lost frames rather than several times repeated last correctly received packet, as it is according to the ANSI standard. The results of both subjective and objective methods were highly correlated the correlation coefficient between them was 0.9414, what confirms results reliability. The results for higher PLR values are better for ACR than for PESQ probably if the listeners were able to understand the utterance, then they also were more favorably disposed when assigning the MOS value. PESQ algorithm proved to be more demanding. Figure 5 presents an example of MOS variation in PESQ test for individual SUS sentences. It shows that some sentences were less damaged by the loss of packets and got higher MOS results than the others. Also here it shows that the proposed PLC helped almost in all the cases, apart from sentence number 15, where the gain was hardly noticeable. Figure 6 and Figure 7 present results of intelligibility tests, showing respectively word and sentence intelligibility. Here also the results worsen when packet loss ratio is increasing. Absolute sentence intelligibility is lower than the word one, because a SUS sentence is disqualified even if only one constituting word was wrongly received. Results of intelligibility confirm the conclusions from the speech quality tests. While the intelligibility results are very similar or identical for 0% and 5% PLR values, the proposed PLC algorithm performs significantly better than ANSI T1.521a Annex B standard for almost all PLR values higher than 5%. For the rate of 35% the difference exceeds even 10% and 20%, respectively for word and sentence intelligibility. A loss rate of 25% is a remarkable exception however in this case sentence intelligibility for the original PLC algorithm is more than 10% higher than for the proposed one, and also higher than for original algorithm at 20% packet loss rate. This local increase of sentence intelligibility was confirmed neither by word intelligibility result, nor by MOS scores. Closer analysis of this particular case showed that accidentally the wrongly heard words happened to be grouped together in the same SUS sentences, while the other SUS sentences were relatively intelligible, thus causing higher sentence intelligibility score. The confidence intervals marked in the intelligibility figures, estimated for confidence level of 0.9, prove satisfactory trustworthiness of the presented results.
5 4,5 no PLC - ACR ANSI T1.521a - ACR proposed PLC - ACR no PLC - PESQ ANSI T1.521a - PESQ proposed PLC - PESQ 4 3,5 MOS 3 2,5 2 1,5 1 0% 5% 10% 15% 20% 25% 30% 35% PLR Figure 4. MOS results for different PLR values using subjective ACR method and objective PESQ algorithm. 3,5 no PLC ANSI T1.521a proposed PLC 3 2,5 MOS 2 1,5 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 SUS sentence No. Figure 5. Sample MOS results using PESQ algorithm for individual SUS sentences (PLR = 25%).
100% 95% no PLC ANSI T1.521a proposed PLC 90% 85% intelligibility 80% 75% 70% 65% 60% 55% 50% 0% 5% 10% 15% 20% 25% 30% 35% PLR Figure 6. Word intelligibility results from SUS tests. 100% 90% no PLC ANSI T1.521a proposed PLC 80% 70% intelligibility 60% 50% 40% 30% 20% 10% 0% 0% 5% 10% 15% 20% 25% 30% 35% PLR Figure 7. Sentence intelligibility results from SUS tests.
6. Conclusions The results of quality and intelligibility tests showed that the proposed PLC algorithm improves significantly the quality of speech transmitted over an unreliable network with high packet loss rate. It performs also better than a receiver-based ANSI T1.521a Annex B algorithm; however, it introduces additional delay which needs to be considered. Anyway an increased delay is often a necessary expense if the signal quality is a priority. The proposed PLC algorithm is in fact an extension of ANSI T.521a Annex B to a sender-receiver-based version and it is compliant with its original version the receiver can switch to work according to ANSI T1.521a Annex B standard if it finds no redundant data in a G.711 packet. τ parameter is adjustable it can be modified, also dynamically depending on the network conditions: it can be decreased e.g. if the WLAN link quality improves. An optional additional byte (or part of it, as 3 bits are enough) can be set to inform about the current value of τ parameter. 7. References [1] ANSI American National Standard, Supplement to T1.521-1999, Packet Loss Concealment for Use with ITU-T Recommendation G.711, Alliance for Telecommunications Industry Solutions, Washington, 2000. [2] Ch. Benoit, M. Grice, and V. Hazan, The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences, Speech Communication 18, Elsevier Science B.V., 1996, pp. 381-392. [3] J. C. Bolot, and A. V. Garcia, Control mechanisms for packet audio in the Internet, In Proc. of IEEE INFOCOM, Apr. 1996, pp. 232-239. [4] S. Brachmański, VoIP quality assessment of speech transmission using ACR and DCR methods (in Polish), Przegląd Telekomunikacyjny i Wiadomości Telekomunikacyjne 8-9/2003, Warsaw, 2003. [5] G. Carle, T. Hoshi, L. Senneck, and H. Le, Active Concealment for Internet speech transmission, International Workshops on Active Networks (IWAN), Tokyo, 2000. [6] N. Erdol, C. Castelluccia, and A. Zilouchian, Recovery of missing speech packets using the short-time energy and zero-crossing measurements, IEEE Trans. on Speech and Audio Processing, 1(3), July 1993, pp. 295-303. [7] V. Hardman, M. A. Sasse, M. Handley, and A. Watson, Reliable audio for use over the Internet, In Proc. of Int l Networking Conf., June 1995, pp. 171-178. [8] ITU-T COM 12-D104-E, Modelling Burst Packet Loss within the E Model, Geneva, 27-31 January 2003. [9] ITU-T Recomendation P.862 Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codec, Geneva, 02.2001. [10] A. Janicki, B. KsięŜak, J. Kijewski, and S. Kula, Speech quality assessment in the Internet telephony using semantically unpredictable sentences (in Polish), Przegląd Telekomunikacyjny i Wiadomości Telekomunikacyjne 8-9/2006, Warsaw, 2006. [11] Ch. A. Rødbro, M. N. Murthi, S. V. Andersen, and S. H. Jensen, Hidden Markov Model-Based Packet Loss Concealment for Voice Over IP, IEEE Transactions On Audio, Speech, And Language Processing, Vol. 14, No. 5, 10. 2006 [12] H. Sanneck, Concealment of lost speech packets using adaptive packetization, In IEEE Int l Conf. on Multimedia Computing and Systems, 1998, pp. 140 149. [13] N. Shetty, and J, D. Gibson, Packet Loss Concealment for G.722 using Side Information with Application to Voice over Wireless LANs, Journal of Multimedia, vol. 2, No. 3, June 2007, pp. 66-76. [14] L. Tosun, and P. Kabal, Dynamically Adding Redundancy for Improved Error Concealment in Packet Voice Coding, In Proc. European Signal Processing Conf. (Antalya, Turkey), 4 pp., Sept. 2005 [15] Voicebox Speech Processing Toolbox for MATLAB, http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html [16] B. W. Wah, X. Su, and D. Lin, A survey of errorconcealment schemes for real-time audio and video transmissions over the internet, In Proc. Int'l Symposium on Multimedia Software Engineering, Taipei, Taiwan, December 2000, pp. 17-24. [17] A. Wong Pitchdetat, 01.04.2003 http://hebb.mit.edu/courses/9.29/2003/athena/auditory/birdso ng/pitchdetat.m