Universidad Politécnica de Madrid Escuela Técnica Superior de Ingenieros de Telecomunicación ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP Tesis Doctoral Pablo Pérez García Ingeniero de Telecomunicación 2013
Universidad Politécnica de Madrid Departamento de Señales, Sistemas y Radiocomunicaciones Escuela Técnica Superior de Ingenieros de Telecomunicación Tesis Doctoral ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP Autor: Pablo Pérez García Ingeniero de Telecomunicación Director: Narciso García Santos Doctor Ingeniero de Telecomunicación 2013
Tesis Doctoral ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP Autor: Pablo Pérez García Director: Narciso García Santos Tribunal nombrado por el Magfco. y Excmo. Sr. Rector de la Universidad Politécnica de Madrid, el día...... de........................ de 2013. Presidente:......................................................... Vocal:.............................................................. Vocal:.............................................................. Vocal:.............................................................. Secretario:......................................................... Realizado el acto de defensa y lectura de la Tesis el día...... de..................... de 2013 en.............................................................................. Calificación:........................................................ EL PRESIDENTE LOS VOCALES EL SECRETARIO
If you make listening and observation your occupation you will gain much more than you can by talk. Robert Baden-Powell
UNIVERSIDAD POLITÉCNICA DE MADRID Abstract TESIS DOCTORAL ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP by Pablo Pérez García This thesis proposes a comprehensive approach to the monitoring and management of Quality of Experience (QoE) in multimedia delivery services over IP. It addresses the problem of preventing, detecting, measuring, and reacting to QoE degradations, under the constraints of a service provider: the solution must scale for a wide IP network delivering individual media streams to thousands of users. The solution proposed for the monitoring is called QuEM (Qualitative Experience Monitoring). It is based on the detection of degradations in the network Quality of Service (packet losses, bandwidth drops...) and the mapping of each degradation event to a qualitative description of its effect on the perceived Quality of Experience (audio mutes, video artifacts...). This mapping is based on the analysis of the transport and Network Abstraction Layer information of the coded stream, and allows a good characterization of the most relevant defects that exist in this kind of service: screen freezing, macroblocking, audio mutes, video quality drops, delay issues, and service outages. The results have been validated by subjective quality assessment tests. The methodology used for those tests has also been designed to mimic as much as possible the conditions of a real user of those services: the impairments to evaluate are introduced randomly in the middle of a continuous video stream.
Based on the monitoring solution, several applications have been proposed as well: an unequal error protection system which provides higher protection to the parts of the stream which are more critical for the QoE, a solution which applies the same principles to minimize the impact of incomplete segment downloads in HTTP Adaptive Streaming, and a selective scrambling algorithm which ciphers only the most sensitive parts of the media stream. A fast channel change application is also presented, as well as a discussion about how to apply the previous results and concepts in a 3D video scenario.
UNIVERSIDAD POLITÉCNICA DE MADRID Resumen TESIS DOCTORAL ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP por Pablo Pérez García Esta tesis estudia la monitorización y gestión de la Calidad de Experiencia (QoE) en los servicios de distribución de vídeo sobre IP. Aborda el problema de cómo prevenir, detectar, medir y reaccionar a las degradaciones de la QoE desde la perspectiva de un proveedor de servicios: la solución debe ser escalable para una red IP extensa que entregue flujos individuales a miles de usuarios simultáneamente. La solución de monitorización propuesta se ha denominado QuEM (Qualitative Experience Monitoring, o Monitorización Cualitativa de la Experiencia). Se basa en la detección de las degradaciones de la calidad de servicio de red (pérdidas de paquetes, disminuciones abruptas del ancho de banda... ) e inferir de cada una una descripción cualitativa de su efecto en la Calidad de Experiencia percibida (silencios, defectos en el vídeo... ). Este análisis se apoya en la información de transporte y de la capa de abstracción de red de los flujos codificados, y permite caracterizar los defectos más relevantes que se observan en este tipo de servicios: congelaciones, efecto de cuadros, silencios, pérdida de calidad del vídeo, retardos e interrupciones en el servicio. Los resultados se han validado mediante pruebas de calidad subjetiva. La metodología usada en esas pruebas se ha desarrollado a su vez para imitar lo más posible las condiciones de visualización de un usuario de este tipo de servicios: los defectos que se evalúan se introducen de forma aleatoria en medio de una secuencia de vídeo continua. 
Se han propuesto también algunas aplicaciones basadas en la solución de monitorización: un sistema de protección desigual frente a errores que ofrece más protección a las partes del vídeo más sensibles a pérdidas, una solución para minimizar el impacto de la interrupción de la descarga de segmentos de Streaming Adaptativo sobre HTTP, y un sistema de cifrado selectivo que encripta únicamente las partes del vídeo más sensibles. También se ha presentado una solución de cambio rápido de canal, así como el análisis de la aplicabilidad de los resultados anteriores a un escenario de vídeo en 3D.
Acknowledgements This thesis would not have been possible without the help of all the people with whom I have been so lucky to share my way in these more than eight years. Let me express my gratitude to all of them in my mother tongue. La vida es un conjunto de relaciones; y enumerar todas las que se pueden forjar en los ocho años que ha durado este trabajo ocuparía más espacio del que, probablemente, sea razonable dedicar en una tesis doctoral. De modo que es probable que esté siendo injusto con algunas personas que, por descuido, olvido, o falta de espacio, no aparecerán aquí citadas. Vaya de antemano mi disculpa (y agradecimiento) también para ellas. Gracias ante todo a Narciso García, que sigue logrando sacar huecos en su cada vez más complicada agenda para acompañarme en esta aventura. Es un privilegio contar con él como director de tesis. Gracias también, muy especialmente, a Jaime Ruiz, que ha sido mucho más que un manager en estos ocho años. No exagero si digo que, si no fuera por él, difícilmente podría yo haber terminado este trabajo. Gracias al excepcional equipo humano y profesional con el que he tenido la suerte de trabajar a lo largo de estos años en Telefónica I+D y Alcatel-Lucent. A Jesús Macías, que me enseñó a mirar el vídeo de otra manera. A Álvaro Villegas, en cuyo trabajo se apoya buena parte del mío. A Silvia Varela, por ayudarme a encontrar el enfoque de este espinoso asunto de la calidad. A Enrique Estalayo y José M. Cubero, con los que he compartido tanto en tantos proyectos. A Ernesto Puerta, por las conversaciones sobre cuantificación y otros asuntos arcanos. A Javier López Poncela, por guiarme por los entresijos de los descodificadores. Gracias también a la gente del Grupo de Tratamiento de Imágenes, que me ha seguido acogiendo como en casa durante todos estos años. Muy en particular a Jesús Gutiérrez, por todo el trabajo de las pruebas de calidad subjetiva: sin él, acabar esta tesis habría resultado mucho más difícil. 
Gracias también a Julián Cabrera y Fernando Jaureguizar, siempre dispuestos a echar una mano en lo que hiciera falta. Mi sincero agradecimiento a todas aquellas personas que, a lo largo de estos años, han puesto también su granito de arena en esta tesis. A Juan Casal, por compartir su experiencia sobre codificación de vídeo. A Rocío Bravo, por la ayuda con las audiencias de televisión. A todos los socios del CENIT VISION, donde se gestó buena parte de la investigación que ahora presento.
Finalmente, muchas gracias a mi familia y amigos. A mis hermanos Lucas y David, que marcaron el camino a seguir. A mi hermano Jesús, de quien he aprendido lo poco que sé de audio digital (y algún que otro truco de televisión). A mi madre Teresa, que tanto ha puesto de su parte para empujarme a terminar la tesis. A mi padre Juan, a quien seguro que le habría gustado verla acabada, y con quien también he discutido alguna de las ecuaciones que en ella aparecen. Y a Graciela, por todo lo que hemos compartido, y lo que queda por venir; tanto, que no se puede resumir en una frase. Gracias, en definitiva, a todos los que han hecho posible que esta tesis se haya escrito. Aun de aquellos que, por la falta de espacio, no he tenido ocasión de mencionar en estas líneas, guardo un buen recuerdo en el corazón. Gracias a ti, que te estás tomando el trabajo de leer estas páginas. Y gracias a Dios por habernos puesto en contacto.
Contents

Abstract
Resumen
Acknowledgements
List of Figures
List of Tables
Abbreviations

1 Introduction
1.1 Motivation
1.2 Overview

2 Understanding Quality of Experience
2.1 Quality of Experience and its relatives
2.2 A word about multimedia services
2.2.1 Players
2.2.2 Coding standards and transport protocols
2.2.3 Artifacts
2.3 Who is who in the QoE metrics
2.3.1 Subjective quality assessment
2.3.2 Full-Reference quality metrics
2.3.3 Reduced-Reference quality metrics
2.3.4 No-Reference quality metrics
2.4 Other topics related to QoE in IPTV services
2.4.1 Media formats in IPTV deployments
2.5 Conclusions

3 Designing QoE-Aware Multimedia Delivery Services
3.1 Introduction
3.2 Delivering multimedia over IP
3.2.1 Architecture of a multimedia service delivery platform
3.2.2 Impairing the Quality of Experience
3.3 QuEM: a qualitative approach to QoE monitoring
3.3.1 Problem statement
3.3.2 System design
3.3.3 Qualitative Impairment Detectors
3.3.4 Severity Transfer Function
3.4 A Subjective Assessment methodology to calibrate Quality Impairment Detectors
3.4.1 Design principles
3.4.2 Test methodology
3.4.3 Selection of impairments
3.5 QoE enablers
3.5.1 Headend metadata architecture
3.5.2 Intelligent Packet Rewrapper
3.5.3 Edge Servers for IPTV and OTT
3.6 Conclusions

4 Quality Impairment Detectors
4.1 Introduction
4.2 Video Packet Loss Effect Prediction (PLEP) model
4.2.1 Description of the model
4.2.2 Experiment
4.2.3 Subjective analysis
4.3 Audio packet loss effect
4.3.1 Objective analysis
4.3.2 Subjective analysis
4.4 Coding quality and rate forced drops
4.4.1 Analysis of feature-based RR/NR metrics as estimators of video coding quality
4.4.2 Managing coding quality drops
4.5 Outages
4.5.1 Detection of outages
4.5.2 Subjective impact of outages
4.6 Latency
4.6.1 Lag
4.6.2 Channel Change time
4.6.3 Latency trade-offs
4.7 Mapping to Severity
4.8 Conclusions

5 Applications
5.1 Introduction
5.2 Unequal Error Protection
5.2.1 Priority Model
5.2.2 Experimentation and results
5.2.3 Applications
5.3 Fine-grain segmenting for HTTP adaptive streaming
5.3.1 Description of the solution
5.4 Selective Scrambling
5.4.1 Problem statement and requirements
5.4.2 Algorithms
5.4.3 Results
5.5 Fast Channel Change
5.6 Application to 3D Video

6 Conclusions

A Experimental setup
A.1 Introduction
A.2 Subjective Assessment based on QuEM approach
A.2.1 Selection and preparation of content
A.2.2 Selection of impairments
A.2.3 Test sessions
A.3 Subjective quality assessment of H.264 video encoders
A.4 Test sequences from IPTV deployments

Bibliography
List of Figures

2.1 Layer and domain model for multimedia services
2.2 Protocol stack for multimedia services over IP
2.3 Models for objective quality assessment: FR/RR/NR
2.4 Hierarchical GOP structure
3.1 Network architecture for IPTV and OTT services
3.2 Delivery chain of a multimedia service
3.3 QuEM architecture design
3.4 Test sequences in ACR
3.5 Test sequences in our proposed method
3.6 Questionnaire for subjective assessment tests
3.7 Structure of the content streams in the subjective assessment test session
3.8 Schematic representation of a modular headend
3.9 RTP header and extension introduced by the rewrapper processing
4.1 Video sequence used for qualitative analysis
4.2 MSE and PLEP for all sequences under study, varying the loss position
4.3 Detail of MSE and PLEP for all sequences under study
4.4 MSE vs PLEP (log scale) and linear fit
4.5 % of different macroblocks vs PLEP and linear fit
4.6 % of different macroblocks and PLEP for all sequences under study, varying the loss position
4.7 Results of the subjective assessment for Video Loss impairments
4.8 Detailed results for each of the individual segments for Video Loss
4.9 Waveform of a lossy audio file
4.10 Effect of audio losses: measured vs. expected
4.11 Short-length audio losses
4.12 Results of the subjective assessment for Audio Loss impairments
4.13 Detailed results for each of the individual segments for Audio Loss
4.14 Results of TI and Contrast NR metrics
4.15 Results of the subjective assessment for Rate Drop impairments
4.16 Detailed results for each of the individual segments for Rate Drop
4.17 Results of the subjective assessment for Outage impairments
4.18 Detailed results for each of the individual segments for Outage
4.19 Simplified transmission chain for real-time video
4.20 Decoding delay for video and audio components of a MPEG-2 Transport Stream
4.21 Results for all the QuIDs mentioned in the chapter
5.1 Example of the packet priority model
5.2 Implementation of the prioritization model
5.3 Effect of the window size in packet prioritization results
5.4 Values of MSE comparing random vs. priority-based packet loss
5.5 Effect of varying the loss burst size
5.6 Contribution of each term to the prioritization equation
5.7 Effects of a limited bit budget to encode the priority
5.8 Priority-based HTTP Adaptive Streaming segment structure
A.1 Structure of the content streams in the subjective assessment test session
A.2 Summary of the subjective quality assessment test results
A.3 Subjective MOS for a football sequence
List of Tables

2.1 ACR and DCR evaluation scales
3.1 Priority values used in the RTP header extension
4.1 Coefficient of determination (R²) of MSE vs PLEP fit for several video sequences
4.2 PLEP impairments analyzed in the subjective assessment tests
4.3 Audio losses analyzed in the subjective assessment tests
4.4 Comparison of NR/RR results with subjective tests
4.5 Quality drops analyzed in the subjective assessment tests
4.6 Outage events analyzed in the subjective assessment tests
4.7 Example Channel Change time ranges and their mapping to QoE
5.1 Priority value for each slice type
5.2 Values of the Aggregated Gain Ratio
5.3 Bit budget assignation to encode priority
5.4 Minimum scrambling rate required to completely lose the video signal
A.1 Video test sequences: bitrate and resolution
A.2 Bitrate drops
A.3 Frame rate drops
A.4 Audio losses
A.5 Macroblocking errors
A.6 Video freezing
A.7 Impairment sets
A.8 Example of a sequence of impairments
A.9 Test sequences
Abbreviations

3G Third generation of mobile communication technology
ACR Absolute Category Rating
AL-FEC Application Layer Forward Error Correction
ARQ Automatic Repeat request
AVC Advanced Video Coding (also H.264 or MPEG-4 part 10)
CA Conditional Access
CABAC Context-Adaptive Binary Arithmetic Coding
CBR Constant Bit Rate
CDN Content Delivery Network
CoD Content on Demand
DCR Degradation Category Rating
DRM Digital Rights Management
DSL Digital Subscriber Line
DTS Decoding Time Stamp
DVB Digital Video Broadcasting
FCC Fast Channel Change
FEC Forward Error Correction
FR Full Reference
GOP Group Of Pictures
GPON Gigabit-capable Passive Optical Network
HAS HTTP Adaptive Streaming
HDS HTTP Dynamic Streaming
HLS HTTP Live Streaming
HNED Home Network End Device
HTTP Hypertext Transfer Protocol
IDR Instantaneous Decoding Refresh
IP Internet Protocol
IPTV Television over Internet Protocol
ITU International Telecommunication Union
LMB Live Media Broadcast
LTE Long Term Evolution
MDI Media Delivery Index
MOS Mean Opinion Score
MPEG Moving Picture Experts Group
MSE Mean Square Error
MVC Multi-view Video Coding
NAL Network Abstraction Layer
NR No Reference
OTT Over The Top multimedia delivery services
PCR Program Clock Reference
PLEP Packet Loss Effect Prediction metric
PLP Packet Loss Pattern
PLR Packet Loss Rate
PSNR Peak Signal to Noise Ratio
PTS Presentation Time Stamp
QoE Quality of Experience
QoS Quality of Service
QuEM Qualitative Experience Monitoring
QuID Quality Impairment Detector
RAP Random Access Point
RET Retransmission (synonym of ARQ)
RGW Residential Gateway
RR Reduced Reference
RTP Real-Time Transport Protocol
SS Smooth Streaming
STF Severity Transfer Function
TCP Transmission Control Protocol
UDP User Datagram Protocol
VBR Variable Bit Rate
VQEG Video Quality Experts Group
To the loving memory of Juan To Teresa
Chapter 1 Introduction

1.1 Motivation

There is little doubt about the social relevance of audiovisual delivery services since the first television broadcasts. During the second half of the 20th century, broadcast television channels controlled the audiovisual market and were the main communication path for information, culture, and entertainment. But in the last decades, though the traditional broadcasters are still quite relevant players in the content marketplace, their offer has been complemented by a plethora of new services: IP television, video on demand, web video portals, user-generated content...

The way in which content is consumed is rapidly changing, and two technological drivers have made this possible: digital video and IP networks. With the standardization of MPEG video in the 1990s, it became possible to consume video products at home with high quality and at an affordable cost. The popularization of the internet, at about the same time, brought the possibility to easily interconnect any two points in the world. The combination of both events allowed video content to be managed, stored, and distributed homogeneously with the rest of the information. Somehow, the distribution of video to the households had just become a problem of digital data communication and storage. And the main problem to solve was, consequently, finding enough bandwidth to fit the transmission requirements of video assets.

The first decade of the 21st century witnessed a quantitative change which resulted in a qualitative jump: improvements in video codec technologies and in the capacity of the xDSL access networks made it possible to distribute real-time video over IP networks with a quality that could compete with that of television and DVDs. This gave birth to
the television over IP (IPTV), which introduced real interactivity and personalization into the audiovisual ecosystem. And in a few years' time, with subsequent generations of technological improvements, it has been possible to obtain a competitive video distribution service even over the standard best-effort internet, in what has been called over-the-top video delivery (OTT). This has significantly reduced the barriers to entry into the multimedia business. And, as this happens, new services are appearing beyond the classic television channels, covering from huge video-clubs over the internet to the distribution of personalized, or even user-generated, video content.

Together with the evolution of the services comes the problem of how to provide them with enough quality for the end users. The transmission of high-quality video can be demanding for the capabilities of IP networks, especially in the access segment. Errors happen, and service providers struggle to keep them under control. The monitoring of Quality of Service (QoS) parameters, such as bit rate, packet loss rate, or delay, is not straightforward when the service is distributed over a complex IP network topology. And even when a suitable QoS monitoring system has been set up in the delivery service network, it proves insufficient. The interesting concept to monitor is not strictly the QoS, but the QoE: the Quality of Experience perceived by the final customer.

There has been an important effort in the last decade to characterize the perceived quality of an audiovisual content, as well as to find algorithms able to model it. A first method is using subjective quality assessment tests, where a panel of viewers evaluates the perceived quality of the video clips under study. This can provide quite accurate information about video quality and user preferences, but at the high cost of having a group of users involved in the assessment.
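The standard outcome of such a panel test is a Mean Opinion Score (MOS): the average of the viewers' ratings, usually reported with a confidence interval. As a minimal illustrative sketch (the panel ratings and the function name below are hypothetical, not taken from any standardized tool):

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_ci(ratings, t_value=1.96):
    """Mean Opinion Score of a panel's ratings (e.g. on a 1-5 ACR scale),
    with an approximate 95% confidence interval (valid for large panels)."""
    n = len(ratings)
    m = mean(ratings)
    ci = t_value * stdev(ratings) / sqrt(n) if n > 1 else 0.0
    return m, ci

# A hypothetical eight-viewer panel rating one impaired clip:
score, ci = mos_with_ci([4, 5, 3, 4, 4, 5, 3, 4])
print(f"MOS = {score:.2f} +/- {ci:.2f}")  # MOS = 4.00 +/- 0.52
```

For small panels a Student's t value would replace the 1.96 factor; the sketch only shows the shape of the computation.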
The complementary approach is developing objective quality metrics: algorithms which try to emulate the responses of those viewers by computer analysis of the video sequences. It has been a very active field of research, especially during the last decade. Dozens of algorithms have been developed, from simple measures of the mean square error between images up to complex metrics which include information about Human Visual System (HVS) perception and about the visual structure of the impairments introduced in the video by the coding and transmission chain. However, few of those methods have had a relevant impact on the market. There are commercially available quality probes which implement this kind of algorithm, but they are typically used just to measure the quality of the video compression process, and not always in real time. For the monitoring of the quality in the distribution and access network, only network-based measures are used: packet losses, router failures... Moreover, in recent years, the manufacturers of measurement equipment seem to have reduced their efforts to introduce these complex metrics in their equipment.
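For reference, the simplest of those objective metrics, the mean square error between images and the PSNR derived from it, can be sketched in a few lines (the synthetic test frames below are purely illustrative):

```python
import numpy as np

def mse(reference, degraded):
    """Mean Square Error between two frames of the same shape."""
    diff = reference.astype(np.float64) - degraded.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(reference, degraded, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB (infinite for identical frames)."""
    e = mse(reference, degraded)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)

# Hypothetical frames: a flat gray frame vs. the same frame mildly distorted.
ref = np.full((64, 64), 128, dtype=np.uint8)
deg = ref.copy()
deg[::2, ::2] += 4  # perturb one quarter of the pixels by 4 gray levels
print(f"MSE = {mse(ref, deg):.2f}, PSNR = {psnr(ref, deg):.2f} dB")
```

Note that full-reference metrics like these need the original sequence at the measurement point, which is precisely what a probe in the distribution or access network usually lacks.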
There are good reasons for that. Video QoE metrics are complex to develop and expensive to deploy in the field. They also cover a very specialized field of interest, frequently critical in the video headend and video production departments, but much rarer in the service definition and in the network operation. In many cases the teams operating the network already have an overwhelming amount of QoS data which is hardly manageable, so there is little use in increasing the complexity of this information. Besides, monitoring algorithms need to be implemented in heavily-loaded routers or low-processing user terminals, and thus need to be extremely lightweight in processing power, which may disqualify a large number of the metrics available in the literature. Finally, some metrics are even impossible to apply due to the unavailability of the information at the monitoring point, as is the case, for instance, when parts of the video stream are encrypted by digital rights management (DRM) or conditional access (CA) systems.

In summary, service providers are still mainly using QoS metrics to monitor their networks, but this happens because those are the ones applicable under the budgetary, computing, and information availability restrictions that they have to cope with. There is still room for improvement. And this thesis aims to be a step in this direction, trying to reduce the gap between QoE expertise and multimedia delivery service providers. The focus of the work is precisely analyzing how to model, monitor, and manage the Quality of Experience under the mentioned restrictions. The research of the thesis has been carried out over the last eight years in the framework of the Grupo de Tratamiento de Imágenes research group at Universidad Politécnica de Madrid, in parallel to a professional career in the multimedia competence center of Alcatel-Lucent in Madrid.
In this time, services, products, research areas, and standardization efforts have evolved significantly. During the first years of the research, the line that we are proposing in this thesis was almost nonexistent in the most relevant journals, save for a couple of remarkable exceptions. In recent years, however, there has been an increasing interest in the research and standardization of monitoring strategies which are easier to apply in real operation environments.

1.2 Overview

The aim of this thesis is providing an architecture, models, and results which make it possible for multimedia service providers to control the Quality of Experience offered by their service in a way which is relevant for their interests, practical, and better than QoS-only monitoring schemes. It intends to answer the most frequently asked questions that
a service provider can raise about the QoE it is offering: which elements determine the quality of the multimedia stream, which are the most relevant impairments in the perceived quality, what causes them, and how they can be monitored, prevented, and minimized. The thesis proposes a comprehensive strategy to address this problem as a whole, as well as detailed solutions for most of its elements.

Part of the input taken to create the approach presented in this thesis has come from the day-to-day experience of assessing IPTV service providers, designing solutions for them, and developing products for the content delivery market. All the assumptions taken in the development of the thesis will be supported either by the work itself or by previous works published in the scientific literature. However, broader decisions, such as the relevance of the problem to study or the general approach to it, are influenced by the experience of listening to the customers, capturing their requirements, and understanding the advantages and disadvantages of different measurement schemes from a service provider point of view. This fact has no effect on the scientific quality of this work, but it may help understand its underlying motivation.

As a consequence, the work is probably biased towards this application-oriented approach in two different ways. On the one hand, there is a stronger focus on the ideas and concepts, rather than on the training of mathematical models or extensive analysis of experimental results. As it is virtually impossible to simulate the working conditions of every possible service provider in the world, the research has been aimed at building models which have as little dependency as possible on the context where they are applied, or that can be easily adapted to any specific deployment. In a word: clean and generic models have been preferred to trained and optimized ones.
On the other hand, there has been an explicit effort to ensure that any architecture or algorithm proposed in this thesis can be directly applied to real multimedia delivery services. And, in fact, some of them have already been included in products which are currently deployed in the field.

The study starts by analyzing several aspects of the state of the art (Chapter 2). It defines what a multimedia delivery service is, which technologies it implies, and which are the most relevant problems affecting its quality. Although the market applicability of multimedia services is quite wide, their underlying technological problem is much more restricted. The existing techniques to model, analyze, and monitor multimedia quality are covered, with special focus on their applicability to content delivery services, and including the published studies which support or formalize the knowledge obtained by work experience.

Chapter 3 contains guidelines to design a multimedia delivery service which takes into account the Quality of Experience. It describes a reference architecture model for the service with some QoE-specific elements. It also proposes a specific design for a monitoring
system, which explicitly includes the most relevant requirements that any commercially deployable system should fulfill. The design is complemented with a methodology of subjective assessment tests that can be used to select, validate, and calibrate its quality monitoring metrics.

Chapter 4 dives into the quality metrics themselves. It presents a novel approach to predict the effect of packet losses on video quality, as well as some complementary metrics for audio losses, coding quality drops, and outages. The effect of latency on the quality is analyzed as well. All the metrics also include the results of their respective subjective assessment tests.

Chapter 5 shows some applications which derive from the previous work and go beyond the pure monitoring of quality. The knowledge of the effect of packet losses can be used as input to a packet prioritization model, usable for error protection in IPTV channels or to improve the error resiliency of HTTP Adaptive Streaming schemes. Other proposed applications are a method to increase the effectiveness of selective scrambling and a system to reduce zapping time in IPTV and hybrid environments.

Finally, Chapter 6 presents the conclusions of the thesis, also summarizing which parts of it contain work which has been published in national and international scientific journals and conferences. There is also an appendix with some ancillary work: Appendix A, which describes the details of some subjective and objective quality assessment tests used for several results in Chapters 3 and 4.
Chapter 2 Understanding Quality of Experience

2.1 Quality of Experience and its relatives

Quality of Experience is defined as the overall acceptability of an application or service, as perceived subjectively by the end-user. It includes the complete end-to-end system effects (client, terminal, network, services infrastructure, etc.) and may be influenced by user expectations and context [43]. Some identifiable factors which impact the QoE are the following [120, 121]:

- Individual interests of the viewer in the content.
- Audiovisual quality of the content.
- Viewing conditions (screen resolution and type...).
- Interaction with the service or display device (e.g. EPG, zap time, remote control...).
- Individual experience and expectations (previous experiences...).

The concept of Quality of Experience is therefore quite wide, including aspects from the subjective preferences of each user to the objective technical conditions under which the service was provided. Roughly speaking, there are elements related to the content itself (the movie, TV show...) and others related to the service (how the content is delivered and presented to the end user). Most analyses of the Quality of Experience are restricted to the service-related factors, which can be effectively monitored and managed
from an engineering point of view: media compression and synchronization, network transmission performance, channel zapping time... [43] A step down in the abstraction scale, we find the audiovisual quality or multimedia quality (MMQ), which is the study of the quality of the video and audio signals (either separately or together). Within the framework of multimedia services, the multimedia quality is by far the most relevant element of the QoE, to the point that both terms are frequently used interchangeably. Likewise, the analysis of MMQ is typically focused on the video quality, which is the most critical in most multimedia services. An additional concept is the multimedia Quality of Service (M-QoS, or just QoS). By QoS we understand the complete and uninterrupted delivery of the multimedia stream through the network, from one end of the communication to the other. It is the quality offered by the transmission chain (from the output of the multiplexer to the input of the demultiplexer) [32], without taking into account the contribution of the encoder, decoder, capture, and display devices to the final quality. These three quality concepts have a tight relationship. The QoS describes the capabilities of the communication network (bandwidth, delay...) and their possible degradations (bit errors, packet losses, jitter...). It therefore limits the level of MMQ that can be obtained in two senses: on the one hand, limitations in bandwidth result in limitations in the coding quality of the sequence; on the other, QoS degradations can cause impairments in the transmitted multimedia signal and, hence, in its MMQ. The final QoE will depend on the final MMQ, as well as on other factors which are influenced by the QoS: interactivity, end-to-end latency, zap time...
2.2 A word about multimedia services

The concept of multimedia service used in this work is, basically, the possibility of watching audiovisual content at home, usually assuming as well that the content is delivered to the household at the time when it is going to be viewed. Multimedia services, thus, have been universally present in homes for the last half century, first in the form of television broadcasting and, later, with the possibility to watch recorded content on video recording systems. However, in recent years, this scenario has been evolving rapidly, with the emergence of at least three significant technology changes, which have led to the three most relevant families of existing multimedia delivery services. The first one was the switch from analog to digital video, which increased the availability of different television channels in the households, fostering the growth of channels
for specific target audiences (documentary, sport, children's channels...) and impacting strongly on the business models of the television marketplace. As a side (but relevant) effect, the experience of watching television changed, with increased received quality (including high definition video), the appearance of new video defects, the rise of zapping times, the presence of Electronic Program Guides... This technology supports the existing television broadcast services: terrestrial, cable, and satellite. As a second step, some of those broadcast television services started their evolution towards all-IP delivery networks [1]. IP delivery networks offer an easy integration with triple-play offers (voice, internet access, and television), as well as inherent interactivity, which makes it possible to deliver personalized services and, especially, Video on Demand (VoD), a remote access to stored video content (i.e. the experience of renting a film in a video club, integrated with the television service). A response to this evolution is the standardization of IPTV architectures, such as DVB-IPTV [19], focusing on the delivery of continuous high-quality video services and covering the natural evolution of the television services (High Definition, stereoscopic video...); and, in parallel, the deployment of IPTV platforms all over the world. The third technology change has been the arrival in the marketplace of the latest generation of smartphones and tablets, which has given rise to new video delivery services, based on the streaming of multimedia content over unmanaged networks [23]. These services, which do not require a specialized end-to-end network, are experiencing very fast growth. As an example, the website of the BBC delivered 106 million requests for online video during the London 2012 Olympic Games [73].
The result is that, in the near future, multimedia services will have to handle a complex scenario ranging from 3.5-inch smartphone screens to 100-inch wall-mounted plasmas, covering services coming both from the television and from the internet worlds [72][107]. Consequently, content sources will span a wide range of formats and qualities, from user-generated content in social TV to the high-budget 3D movie produced by Hollywood studios. Nevertheless, the core of the multimedia delivery services is the same for all of them (television broadcasting, IPTV, or internet video): taking a multimedia content and delivering it to an end user, providing the best possible Quality of Experience within the limitations imposed by the available network Quality of Service. In the rest of this section we will explore the common properties of all those multimedia services: the players or entities which take part in the service chain, the standards and protocols used to compress and transport the media stream, and the most relevant quality degradations or limitations present in those services. The focus will be on multimedia
services over IP networks, but most of the concepts are applicable to other transmission means as well.

2.2.1 Players

The first step in the analysis of multimedia services is characterizing the players and their roles. We will use the model proposed by the DVB-IPTV standard [19], depicted in Figure 2.1. This model is applicable to most service scenarios and has the advantage of showing the different players (or domains) and their relationships with regard to the OSI layer model.

Figure 2.1: Layer and domain model for multimedia services

The Content Provider is the entity that owns or is licensed to sell content or content assets. The Content Provider may have a direct relationship with the end user for the management of usage rights to the content, or it can even be the entity which has the commercial agreement with the end user (the end user being then a direct customer of the Content Provider). However, regarding the content flow, the Content Provider delivers content assets only to the Service Provider. The content offered by the Content Provider is already finished, in the sense that it is a content asset which is deliverable to an end user (a TV channel, a live event, a movie...). All the complexity of the content generation is outside this model and out of the scope of our work.
The Service Provider is the entity providing a service to the end-user. It is the one which has the direct logical connection with the end user for the purpose of delivering video content. The Service Provider is also responsible for controlling the Quality of Experience offered to the end user, and is therefore the subject of the quality monitoring services covered in our work. The Delivery Network is the entity connecting clients and service providers. According to DVB-IPTV, the delivery network is transparent to the IP traffic, although there may be timing and packet loss issues relevant for A/V content streamed over IP. In practice, however, the Service Provider will need to impose specific requirements on the delivery network, which leads to two different delivery scenarios: Managed IPTV (or simply IPTV). The Service Provider controls (and typically owns) the end-to-end IP distribution to the Home domain. The most relevant implication here is that it is possible to distribute UDP traffic over IP multicast with sufficient Quality of Service. This scenario has been the most important (sometimes the only one) in recent years, and therefore it has also been the main focus of our research and of this work. Over The Top content (or simply OTT). Video delivery is done over the top of the internet, i.e., using a delivery network which is neither owned nor controlled by the Service Provider. As such, some of the IPTV-related delivery network features (multicast support, controlled QoS) are not available. In this context, however, Service Providers normally make use of (or even own) Content Delivery Networks (CDNs). CDNs are distributed networks which deliver the video content in an efficient way to points of presence closer to the end users, thus shortening the part of the delivery chain which actually goes over the top. Home is the domain where the A/V services are consumed.
The Home domain is the property of the content consumer (the end customer or subscriber), and includes the User Terminal or Home Network End Device (HNED), in DVB-IPTV terminology. Since IPTV is traditionally delivered to a TV screen, the Home domain is normally depicted as the end user's own home. However, the User Terminal may also be a mobile device with a direct connection to the Delivery Network. The Home domain may, but does not need to, include a home local area network.

2.2.2 Coding standards and transport protocols

The multimedia codec and transport technologies used in IPTV and OTT services derive from the ones used in digital television. There are several families of digital television
standards around the world: Digital Video Broadcasting (DVB), adopted in Europe, Africa, Australia, and parts of Asia; Advanced Television Systems Committee (ATSC), used mainly in North America; Integrated Services Digital Broadcasting (ISDB), used in Japan and most of Central and South America; and Digital Terrestrial Multimedia Broadcast (DTMB), adopted in China. All of them are quite similar at their core: transport of audiovisual services, multiplexed in MPEG-2 Transport Stream, over different physical media and using different modulation techniques. When needed, we will take DVB as a reference, considering that the differences with other standards are almost insignificant for the purposes of our work. DVB (and the others) standardize the transport of audiovisual services multiplexed in MPEG-2 Transport Stream [36]. Video elementary streams are coded in MPEG-2 video [37] or MPEG-4 AVC/H.264 [38], while audio is coded in MPEG-1, MPEG-2, Dolby AC-3, or MPEG-4 AAC [18]. Both video codecs use similar concepts for compression: motion prediction (to exploit temporal redundancy), block transformations (to exploit local spatial redundancy), quantization of the transform coefficients, entropy coding of the resulting data, and packaging of the data into a bitstream, adding headers with meta-information (such as the delimitation and characterization of the different video frames). Audio codecs are also quite similar to each other in their basic concepts (encoding of different frequency sub-bands of a block of audio samples). As a result, the key elements which affect multimedia quality will be very similar among all the different digital television scenarios, regardless of the underlying transport. Both IPTV and OTT platforms may offer several different services around the distribution of multimedia content. However, we will focus here on the pure delivery of content assets to the Home domain.
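The block-transform-and-quantize principle shared by these video codecs can be illustrated with a toy Python sketch that codes a single 8x8 block with an orthonormal DCT and uniform quantization. This is a simplified illustration for intuition only, not the actual MPEG-2/AVC coding tools, and the function names are ours:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, the block transform used
    (in various sizes) by MPEG-2 and AVC."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def code_block(block, qstep):
    """Toy transform coding of one block: forward 2-D DCT, uniform
    quantization of the coefficients (the lossy step), inverse DCT."""
    d = dct_matrix(block.shape[0])
    coeff = d @ block @ d.T                   # exploit spatial redundancy
    quant = np.round(coeff / qstep) * qstep   # quantize + dequantize
    return d.T @ quant @ d                    # reconstructed block

# A coarser quantization step (larger qstep) discards more detail, so the
# reconstruction error grows; independent errors in adjacent blocks are
# perceived by the viewer as the blocking effect described in Section 2.2.3.
```

Running `code_block` on the same block with increasing `qstep` shows the growing reconstruction error, which is exactly the coding-quality versus bitrate trade-off the service provider manages.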
In both cases, there are two basic service types: Live content (Live Media Broadcast, or LMB, in DVB-IPTV terminology). The most typical examples are live broadcast TV channels, which are still the main contributor in IPTV deployments and one of the most popular audiovisual services in any deployment. Their most important property is the real-time constraint: the end-to-end latency must remain constant for the whole playout of the stream to avoid discontinuities in the received multimedia session. Live content must be ingested, processed, and delivered by the Service Provider in real time. On-demand content (Content on Demand, or CoD, in DVB-IPTV terminology). This content is pre-loaded by the Content Provider into the Service Provider domain. It may take some time for the Service Provider to process it before it is ready for delivery to the end user.
Figure 2.2: Protocol stack for multimedia services over IP

Those audiovisual services are delivered over IP. Figure 2.2 shows the protocol stack used for this purpose, where there is a clear differentiation between the IPTV and OTT protocol families: MPEG-2 TS / RTP / UDP / IP. This is the standard scenario for an IPTV deployment over a managed network, as considered in [19], [76], and [55]. It follows a push paradigm: the server controls the bit rate of the delivery. HTTP Adaptive Streaming (HAS) / TCP / IP. This is the upcoming scenario for OTT environments. It follows a pull paradigm: the client decides which video segments it downloads and when. HTTP Adaptive Streaming (HAS) is a solution to deliver multimedia content to users in which the bitrate is adapted to the network conditions. Although the distribution of video over the internet can be done in dozens of different ways, the use of adaptive streaming is becoming the most popular one, especially in the context of OTT services offered by IPTV service providers [75]. It is also natively supported by most smartphones, tablets, and set-top boxes. HAS works as follows: the content is encoded at several bitrates, each variant being a concatenation of small segments, each containing a few seconds of the stream, with the property that at segment boundaries the terminal can switch from one variant (at a particular bitrate) to another (at a different bitrate) without any visible effect on the screen or the audio. Each of these segments is accessible as an independent asset with its own URL, so once it is present in an HTTP server it can be retrieved by a standard web client using pure HTTP mechanisms.
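The pull paradigm described above can be sketched in a few lines of Python. The variant list and segment URL pattern below are hypothetical placeholders for what a real client would obtain from a manifest (an HLS playlist or a DASH MPD):

```python
import urllib.request

# Hypothetical variant list: (bitrate in kbit/s, base URL).
VARIANTS = [
    (400, "http://example.com/video/400k"),
    (1200, "http://example.com/video/1200k"),
    (3000, "http://example.com/video/3000k"),
]

def pick_variant(measured_kbps, variants=VARIANTS):
    """Pull paradigm: the client (not the server) selects the
    highest-bitrate variant that fits under the measured network
    throughput, here with a 20% safety margin."""
    usable = [v for v in variants if v[0] <= 0.8 * measured_kbps]
    return max(usable) if usable else min(variants)

def fetch_segment(base_url, index):
    """Each segment is an independent asset with its own URL, so a
    plain HTTP GET retrieves it (segment naming is also hypothetical)."""
    with urllib.request.urlopen("%s/seg%05d.ts" % (base_url, index)) as r:
        return r.read()
```

A terminal would loop over `pick_variant` and `fetch_segment` once per segment, re-measuring throughput each time; this per-segment decision is what makes switching between variants possible at segment boundaries.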
There are several different HAS implementations. The most widely deployed in the market come from the initiatives of individual companies: Apple HTTP Live Streaming (HLS), Microsoft Smooth Streaming (SS), and Adobe HTTP Dynamic Streaming (HDS). All of them are based on the same principles and use similar codecs. Their main differences are the signaling of the segments and the multiplexing layer: HLS uses MPEG-2 Transport Stream, while SS and HDS use extensions of the ISO base media file format. MPEG has also recently standardized a proposal for HTTP adaptive streaming called MPEG DASH (Dynamic Adaptive Streaming over HTTP) [39]. MPEG DASH supports both MPEG-2 TS and ISO file format profiles.

2.2.3 Artifacts

The best possible media quality for a multimedia service is the quality of the audiovisual content just after the production process has finished. This reference production quality shows the product exactly as its creators wanted it to be. Of course, there might be defects in the capture, recording, and production process, but, in a professional product, it is reasonable to assume that they will be very rare and have a small impact on the perceived quality. Producers must then deliver their products to the service provider. This is usually done by encoding the content with very lightweight compression, to avoid a perceptible loss of quality, resulting in a product with contribution quality. It can be assumed that a product with contribution quality has the highest possible multimedia quality, with no perceptible visual or sound artifacts or impairments. However, due to the impairments produced in the delivery chain, the final multimedia quality received by the end users may be far from the contribution quality. We will consider three main types of impairments, according to the place where they are generated: compression artifacts, transmission errors, and display errors [113].
Other terminologies and classifications are also possible [2, 7]. Compression artifacts are defects introduced when compressing the video from contribution to distribution quality, which must fit into the bitrate budget that the service provider has reserved for that specific media stream. In this compression process, several impairments can be introduced [105]: Blocking effect appears as a pattern of square-shaped blocks in the compressed image. It is caused by the independent quantization of adjacent groups of pixels, processed in 4x4, 8x8, or 16x16 blocks, which leads to discontinuities at the block boundaries. This effect is easy to notice due to the regularity of the
generated pattern, and it is typically the most salient defect in MPEG-2 video. In AVC video it is partially mitigated by the use of smaller blocks and the effect of the deblocking filter. Blurring is the loss of spatial detail and edge sharpness in the image. It is generated by the application of strong quantization to the high-frequency components, and it is emphasized by the application of deblocking filters, thus being typically the most relevant artifact in AVC video. Flickering is a defect introduced in highly textured regions which are compressed with different quantization factors along time (normally having higher quality in key frames than in predicted frames). As a result, the coding quality of those regions fluctuates periodically over time, and so does the perceived level of detail. Ringing (also known as the Gibbs effect) produces ring-like periodic intensity variations around image edges in areas which should not have a perceptible texture. It is caused by a strong quantization of high-frequency coefficients in regions with strong edges. Chromatic dispersion is produced by the suppression of high-frequency components in the chrominance signal, resulting in cross-talk and loss of color definition in areas with strong color variation. Motion jerkiness is caused by the use of a lower frame rate than the one needed to properly display the image motion. Transmission errors are produced by the loss, corruption, or excessive delay of some packets in the transmission chain, which results in stream discontinuities or buffer underrun events in the receiver. They typically result in stronger versions of the compression defects: Macroblocking: a highly visible blocking effect produced by the loss of video information, which forces the receiver to build the picture using wrong references (normally repeating a correctly received frame instead of the lost one).
The result is a strong blocking pattern, sometimes also causing other perceptual artifacts (parts of the image, typically blocks or horizontal stripes, with a different color or texture than they should have). Freezing or continual jerky motion, caused by the unrecoverable loss of video frames. Mutes or audio glitches, caused by the loss of packets with audio information. Outages, i.e. temporary loss of service due to network problems.
Finally, there is a heterogeneous set of errors that can be caused by the user terminal and display, such as an incorrect aspect ratio [113] or a malfunction in the terminal itself. Transmission errors are normally the most damaging for the perceived QoE. In a study done on a real IPTV deployment [7], it was shown that about 82% of the multimedia quality impairments reported by customers were directly related to them: Breaking Up Into Blocks (macroblocking, 29%), Screen Freezes (20%), Choppy Screen Transitions (or jerky motion, 18%), and Distorted Audio (mutes or glitches, 15%). As the customers were asked to report perceived errors, it is possible that a fraction of them were caused in the encoding process. However, the description of the errors as given by the customers suggests that most of them refer to the stronger (and more visible) effects of the artifacts, i.e. the ones resulting from transmission errors. The remaining 18% of the errors is divided into Edges Shimmer (11%), visible artifacts around edges in the image (caused by coding artifacts, as edges are one of the places where they are most visible), and Error Stoppage (7%), i.e. problems with the end terminal (which has to be reset).

2.3 Who is who in the QoE metrics

In contrast with the relatively fast standardization of audio [41] and speech [51, 52] quality metrics, the efforts to standardize video quality metrics have produced slower results [15]. The Video Quality Experts Group (VQEG) has been the most relevant contributor to this standardization process [111, 112], producing an extensive evaluation of quality metrics which has led to several standardization initiatives [45, 46, 47, 48, 49, 50].
The study of multimedia quality, and more specifically of video quality, has been of great interest for the last 15 years, and therefore it is relatively easy to find good surveys, reviews, and classifications of the different existing metrics and approaches [15, 33, 78, 92, 121]. This section will present the most common classification of video quality assessment strategies, as well as some example methods which are relevant for our work. More detailed surveys can be found in the given references. The first division in the quality assessment approaches is between subjective and objective methods. Subjective quality assessment implies having a panel of users watching the target content and evaluating its quality by giving a score to each fragment of content under study. The result is normally presented in terms of Mean Opinion Score (MOS), which is the average of the results from the different users, possibly with some statistical processing such as the removal of outliers. Objective quality assessment is done
automatically, by computational processes which analyze the multimedia stream to produce quality values. In most cases, the aim of objective metrics is to provide MOS values which correlate well with those provided by subjective assessments, which are used as a benchmark.

Figure 2.3: Models for objective quality assessment: Full-reference method (top), Reduced-reference method (middle), No-reference method (bottom)

Objective quality assessment methods can be classified into three different types, depending on how much information they use from the original signal (see Figure 2.3): Full-Reference (FR). The impaired signal is compared with the original one to obtain a quality value. This is the most appropriate method in cases where it is possible to have access to the original and impaired signals simultaneously (for instance, to analyze the compression defects introduced by a video encoder). Reduced-Reference (RR). Reduced descriptions of the original and impaired signals are generated and compared to produce a quality value. This model is useful when the original signal is not available at the measurement point (for instance, when both signals are at different points in the network), but it is possible to receive ancillary data through a lower-bitrate channel. No-Reference (NR). The quality measure is generated only by analyzing the impaired signal, without having any information about the original. This is the most generic model, because it can be introduced in a non-intrusive way at any point of the transmission chain.
A second classification criterion for objective metrics refers to the type of data they use: Picture metrics, which operate in the baseband domain, analyzing the pixel values of the original and/or decoded frames to produce their results. Bitstream metrics, which operate in the coded domain, analyzing the video stream without fully decoding it or, in some cases, analyzing just the quality of service information (losses, delays...). Bitstream metrics are usually No-Reference as well.

2.3.1 Subjective quality assessment

The aim of quality assessment is to know, for a specific set of content assets and impairments, what the opinion of an average user would be. As such, the best way to find out is, in fact, to ask the users. Subjective quality assessment methods provide guidelines about how to ask users about multimedia quality in the most effective way. There are several standards which provide these methods of subjective assessment, mainly ITU-R BT.500 [42], ITU-T P.910 [53], and ITU-T P.911 [54]. All of them are quite similar in the way they propose to structure, perform, and evaluate tests. Most of the subjective assessment tests reported in the literature are based on these standards, with the VQEG validation tests being the most relevant example [119]. In test sessions, a number of subjects are asked to watch a set of audiovisual clips and rate their quality. The total number of viewers for a test must be between 4 and 40 (they can be distributed across different viewing sessions). In general, at least 15 observers should participate in the experiment. They should not be professionally involved in multimedia quality evaluation, and they should have normal or corrected-to-normal visual acuity and color vision.
The location and the displays where the tests are conducted must comply with a set of requirements regarding lighting, screen brightness and contrast, distance and angle from viewers to screen... Guidelines are provided to work either with professional monitors or with domestic TV sets [42]. Sessions should not last more than half an hour. At the beginning of the session, viewers are presented with a set of example clips where they can see the type of defects that they are supposed to judge. The content samples to be evaluated may be preceded by about five dummy presentations, whose results are not taken into account, to stabilize the
observers' opinions. Besides, the video clips under study should be distributed randomly along the session.

Table 2.1: ACR and DCR evaluation scales

  Score | ACR       | DCR
  ------+-----------+------------------------------
    5   | Excellent | Imperceptible
    4   | Good      | Perceptible but not annoying
    3   | Fair      | Slightly annoying
    2   | Poor      | Annoying
    1   | Bad       | Very annoying

Different evaluation strategies are used. Although there are some variations in the details from one standard to another, they are basically the following [54]: Absolute Category Rating (ACR), or Single Stimulus method (SS). The test sequences are presented one at a time and are rated independently on a category scale. After each presentation, the subjects are asked to evaluate the quality of the sequence presented using an absolute scale, normally with five levels (see Table 2.1). Nine-level and eleven-level rating scales have also been suggested to increase resolution, but they do not seem to produce significantly different results [35]. Degradation Category Rating (DCR), or Double Stimulus Impairment Scale method (DSIS). In this case, each presentation consists of two different video clips: the reference content (without impairments) and the processed or impaired version of the same content. Both videos are watched consecutively, and the subject is asked to rate the impairment of the second stimulus in relation to the reference. Five-level scales are also used (see Table 2.1). Pair Comparison method (PC). Test sequences are presented in pairs as in the case of DCR, but now the sequences are two different processed versions of the same original one (i.e. with two different levels or types of impairments). After each pair is presented, the subject has to select which one is preferred in the context of the test scenario. Single Stimulus Continuous Quality Evaluation (SSCQE). This method considers long-duration sequences (3 to 30 min).
While the sequence is being played, subjects are asked to continuously evaluate its quality, normally by controlling a slider. For the other methods, the proposed duration of each sequence is about 10 seconds, followed by another 10-second period (showing a grey screen) to vote on each of the sequences. When sequence pairs are
used (DCR and PC), both sequences within a pair should be separated by a short (about 2 seconds) grey screen.

2.3.2 Full-Reference quality metrics

Full-Reference metrics compare the original and impaired versions of the sequence, thus having access to more information than RR or NR metrics. For this reason, FR metrics were the first ones to be developed and they are also the ones which produce the most accurate results. For years, video engineers have used simple FR objective metrics such as the Peak Signal-to-Noise Ratio (PSNR) or the Mean Square Error (MSE) of the impaired video with respect to the reference. They are computed as follows:

MSE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ I(i,j) - K(i,j) \right]^2    (2.1)

PSNR = 10 \log_{10} \frac{(\max_I)^2}{MSE}    (2.2)

where I(i,j) and K(i,j) are the two compared images, whose size is M x N pixels, and max_I is the maximum possible intensity value for any pixel in the image (for instance, 255 for 8-bit pixel values). These metrics compare the pictures on a pixel-by-pixel basis, ignoring the image structure, and their capability to predict the perceived MOS is quite limited. However, they are still used for some applications, and especially as a benchmark for other FR quality metrics: the acceptability criterion for any FR quality metric is having a correlation with subjective MOS which is significantly better (statistically speaking) than that obtained by PSNR [111]. The first attempts to improve the performance of PSNR and MSE resulted from the application of psychophysical models of the Human Visual System (HVS) to improve the measurements, in a way that is known to produce good results in audio quality estimation (and in the development of audio codecs) [78, 120].
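Equations (2.1) and (2.2) translate directly into code; a minimal NumPy sketch:

```python
import numpy as np

def mse(ref, deg):
    """Mean Square Error between reference and degraded images (Eq. 2.1)."""
    ref = ref.astype(np.float64)
    deg = deg.astype(np.float64)
    return np.mean((ref - deg) ** 2)

def psnr(ref, deg, max_i=255.0):
    """Peak Signal-to-Noise Ratio in dB (Eq. 2.2); max_i is the peak
    pixel intensity (255 for 8-bit pixel values)."""
    e = mse(ref, deg)
    return float("inf") if e == 0 else 10.0 * np.log10(max_i ** 2 / e)
```

Identical images yield infinite PSNR; typical broadcast-quality compression sits in the 30-40 dB range, though, as noted above, the correlation of these values with perceived MOS is limited.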
A second family of FR algorithms appeared with a different approach: instead of modeling human vision, they try to detect the impairments related to the known processing applied to the image, i.e. the specific artifacts that are expected to appear. Some metrics following this engineering approach [121] were able to outperform the PSNR in the second round of the VQEG tests for television signals [111]. They are
the ones included in the ITU-T Recommendation J.144 [45], the first standard for FR video quality metrics:

- BTFR (BT Full Reference). It makes a weighted linear combination of several individual measures, such as: percentage of correctly estimated blocks, PSNR of matching blocks, segmental PSNR (error in the matching vectors), energy of edge differences, texture degradation, and pyramidal PSNR.
- EPSNR (Edge PSNR). It measures the PSNR between both images, considering only the regions where there are edges. The result is afterwards scaled non-linearly to generate a MOS value.
- CPqD-IES. The image is segmented into three regions: flat, edges, and textured. The Absolute Sobel Difference (ASD) is computed for each region: the result of applying a Sobel filter and finding the MSE of the resulting images. The result is fed into a trained model to obtain the final MOS value.
- VQM. This metric also computes seven different parameters of the image, which are afterwards combined linearly with experimentally obtained weights. Measured features are: loss of spatial information, loss of horizontal and vertical edges, gain of horizontal and vertical edges, chroma spread, spatial information gain at edges, errors in high-contrast areas, and extreme chrominance errors. An implementation of VQM is publicly available on the internet [89].

Subsequent test projects of the VQEG have resulted in additional ITU-T Recommendations for slightly different scenarios. For instance, ITU-T J.341 [49] introduces VQualHD, another FR metric specialized for HDTV contents, which combines picture similarity, spatial degradation, and temporal degradation to obtain a quality metric. ITU-T J.247 [47] proposes metrics for multimedia environments, more focused on internet frame resolutions and bit rates (lower than in digital television, as a general rule).
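Several of the metrics above (BTFR, VQM) share the same final step: a weighted linear combination of individual feature measures, with weights obtained by training. A minimal sketch (the weights shown are hypothetical, not the standardized ones):

```python
def combine_features(features, weights, bias=0.0):
    """Weighted linear combination of individual quality measures, the scheme
    used as the final stage of metrics such as BTFR and VQM. The caller supplies
    the (trained) weights; nothing here reproduces the standardized values."""
    assert len(features) == len(weights), "one weight per feature"
    return bias + sum(f * w for f, w in zip(features, weights))

# Hypothetical example: two feature scores fused into one quality value.
score = combine_features([0.8, 0.4], [0.7, 0.3], bias=0.1)
```

In the standardized metrics both the feature set and the weights are fixed by the recommendation; here they are free parameters.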
ITU-T J.147 [46] proposes embedding hidden data in the original signal and measuring their degradation in the received one.

In addition to them, it is relevant to mention the Structural Similarity Index (SSIM) [116]. SSIM considers image degradation as a perceived change in structural information. Structural information captures the idea that pixels have strong inter-dependencies, especially when they are spatially close. The metric is computed over several windows in the image, and its value between two windows x and y (assumed to be in the same position of two different images) is:

SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}   (2.3)
where \mu_x and \mu_y represent the averages, \sigma_x^2 and \sigma_y^2 the variances, and \sigma_{xy} the covariance of the signals, and c_1 and c_2 are constants used to stabilize the division when the denominator is small. Although the metric has some limitations [13], SSIM has become increasingly popular over recent years, since it seems to offer better results than PSNR while being a simple metric to implement (the source code is available on the internet as well).

In any case, most of the FR metrics (and especially the ones included in ITU-T recommendations) have been specifically designed to provide good MOS estimations for relatively subtle impairments, such as the ones generated by video encoders. However, when the errors are generated by packet losses or other network problems, and are therefore perceptually more aggressive, PSNR, SSIM, and VQM show reasonably good correlation with MOS [40]. For such cases, it can be more useful to use simpler metrics (such as PSNR or SSIM) rather than the complex schemes proposed by the standards.

2.3.3 Reduced-Reference quality metrics

The basic strategy used to design Reduced-Reference metrics is extracting a set of statistical parameters that characterize the video and comparing them between the original and the impaired sequences (see [15] for a short survey). We can distinguish between two types of features:

- Features which describe image properties: temporal and spatial information [63, 98, 117], structural similarity [106], image statistics [114]...
- Known impairments on the image, normally obtained by applying No-Reference quality estimators to both pictures (original and impaired) and comparing the results [16].

Simple RR measures can be combined to generate a more complex metric, in a similar way as FR metrics are generated from individual measures.
This is the case of the RR metrics selected by the RR-NR project of VQEG [112], which are now included in the ITU-T Recommendations J.249 (for Standard Definition TV) [48] and J.342 (for High Definition TV) [50]:

- Yonsei University metric. It is a Reduced-Reference version of the EPSNR included in ITU-T J.144 [45]. The algorithm selects some pixels in the edge region of the original image and computes their PSNR with the same pixels in the impaired image.
Temporal, spatial, and gain registrations are performed to enhance the pixel mapping. Besides, the EPSNR of each picture is post-processed to take into account some defects or features: EPSNR is reduced if there are high blurring, blocking, or freezing effects, and enhanced for high-motion or high-complexity pictures.

- NEC metric. A reduced version of the image is transmitted, containing the activity values of the 16x16-pixel blocks of the original luminance image. The activity of a block (ACT) is computed as the average of the absolute differences from the pixel intensities to the average intensity, as in eq. (2.4). The MSE of the activity images is obtained. It is then post-processed (weighted) if the impaired image exceeds thresholds on different features: psychophysical features (spatial frequency, color), scene changes, blocking effect, or local impairments.

ACT = \frac{1}{256} \sum_{i=0}^{15} \sum_{j=0}^{15} |X_{i,j} - \bar{X}|   (2.4)

where X_{i,j} are the pixel intensities of the block and \bar{X} is their average.

- NTIA metric. It is a Reduced-Reference version of the VQM included in ITU-T J.144 [45], called fast low-bandwidth VQM. It extracts color, spatial, and temporal features, which are transmitted and compared to the same features of the processed (impaired) image. Several complex comparisons are used, so that the original and processed features are used to generate parameters, similar to the ones available in the FR metric, measuring modifications in horizontal and vertical edges, in spatial information, in color information, and in absolute temporal information. The resulting parameters are linearly combined (with fixed weights obtained by training) to generate the final VQM value.

The Yonsei University EPSNR metric is the only one included both in the SDTV (ITU-T J.249) and HDTV (ITU-T J.342) standards, while the other two were only included in J.249.
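The block-activity feature of eq. (2.4) is simple enough to sketch directly (a pure-Python illustration of the equation, not the NEC implementation):

```python
def block_activity(block):
    """Activity (ACT) of a luminance block, eq. (2.4): the mean absolute
    deviation of the pixel intensities from the block average. `block` is a
    list of rows of pixels (16x16 in the NEC metric)."""
    pixels = [p for row in block for p in row]
    avg = sum(pixels) / len(pixels)
    return sum(abs(p - avg) for p in pixels) / len(pixels)
```

A flat block has zero activity; textured blocks score higher.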
It is also relevant to note that, even though the models described in the recommendations matched (and, at some points, outperformed) a Full-Reference PSNR, none of them reached the accuracy of the normative subjective testing [112], i.e. they are not good enough to replace subjective assessment tests.

2.3.4 No-Reference quality metrics

There are two basic families of No-Reference video quality metrics: pixel-based (also called baseband or picture-based) and bitstream-based. The former operate in a similar way to the described FR and RR metrics: they analyze some features of the images and sequences (but without any information about the original image). They typically focus on detecting
one specific impairment, normally the ones expected to be introduced in the coding phase (see section 2.2.3). The latter analyze the bitstream of the coded video sequence, trying to obtain a quality metric from the syntax and semantics of the coded video. They are normally used to handle packet losses and other network impairments, but some of the bitstream metrics are also applied to coding defects. It is also possible to find hybrid schemes which combine both approaches. Several surveys can be found which describe all these metrics in detail (for instance, [33] or [15]). We will describe some of the most relevant ones.

Yang et al. [123] propose a metric for temporal consistency. They compute the MSE between two consecutive pictures on motion-compensated areas with high spatial complexity and homogeneous movement. Kuszpet et al. [62] propose a metric to detect flickering based on the error of motion-compensated areas with smooth (homogeneous) textures.

Several authors propose blocking metrics, trying to detect the patterns produced by block coding. For instance, Wu and Yuen propose GBIM (Generalized Block-edge Impairment Metric) [122], based on the energy of the difference between pixels at both sides of a block boundary. Vlachos [110] estimates the block effect by comparing the cross-correlation between pixels within the same block with that of pixels between adjacent blocks. Wang et al. [115] search for peaks in the transform domain (FFT) at multiples of the block spatial frequency.

Blurring is normally measured by studying the width of edges in the image. An edge detector (usually Sobel or Canny) is applied to the image and then some statistics are computed to provide a value for the edge width (see, for instance, [21, 68]). Other metrics include measuring other less common artifacts, such as additive white Gaussian noise (AWGN), edge continuity, motion continuity... and combinations of them [21, 74].
However, pixel-based NR metrics are not able to provide good enough performance when evaluated against subjective quality assessments [64]. In fact, VQEG has not been able to recommend any NR metrics for standardization; only RR and FR ones [119]. For such reason, pixel-based NR metrics are normally not directly applied to the measurement of video quality. However, they are sometimes used as building blocks for more complex FR and RR metrics.

The second family of no-reference metrics is the bitstream-based one. These metrics have become increasingly popular in recent years for two reasons. On the one hand, the lack of success of pixel-based NR metrics fosters the search for different ways of measuring quality. On the other, there is a need for measurement schemes that are easy to apply to
large platforms of multimedia services (such as IPTV), where using the decoded video could have an excessive cost which would prevent a scalable deployment.

The benchmark bitstream-based metric for video delivery over UDP/IP is the Media Delivery Index (MDI), described in IETF RFC 4445 [118]. MDI is a combination of two different values, the Delay Factor (DF) and the Media Loss Rate (MLR), which are usually shown separated by a colon:

MDI = DF : MLR   (2.5)

DF shows how many milliseconds of data must be buffered in the receiver to completely remove the effect of jitter. It is the additional delay that must be available in the system to avoid that jitter generates packet losses. In other words, when DF grows over the de-jitter buffer size in a video receiver, some packets will get lost due to buffer underrun, adding their effect to the losses accounted by the MLR part of the MDI. Let \Delta be the instantaneous variation of the fill level (in bits) of the de-jittering buffer; the Delay Factor over a period of time (typically one second) is computed as:

DF = \frac{\max(\Delta) - \min(\Delta)}{bitrate}   (2.6)

The Media Loss Rate is computed simply as the number of packets lost per time interval:

MLR = \frac{packets\ expected - packets\ received}{interval}   (2.7)

MDI is in fact a pure QoS metric, with no knowledge of the effects produced by packet losses or jitter. However, due to its simplicity, it has become a de-facto standard in commercial IPTV deployments (see, for instance, [67]). Besides, for randomly distributed errors, it is possible to find a linear correlation between the packet loss rate and the mean square error [102]. However, these results can vary when losses are not randomly distributed along time.

Different authors have proposed enhancing these metrics by analyzing how errors are distributed along time and how they propagate between protocols. Liang et al. analyze the effect of different packet loss patterns for low-bitrate applications [65]. Pattara-Atikom et al.
analyze the propagation of errors, coming either from packet losses or from excessive delay, from the IP layer to the video layer, also considering different structural factors related to the loss pattern and to how the protocol stack is built [80].

Reibman et al. developed a model which can compute the MSE from the received bitstream without decoding it [95]. The algorithm, designed for MPEG-2 video, estimates the error at macroblock level and follows its propagation along the following video
frames. The same research group has evolved these results to predict the visibility of packet losses for MPEG-2 and AVC video, based on some parametrization of the packet loss and using Generalized Linear Models to combine the parameters [58, 66, 93]. In a different approach, the Picture Appraisal Rating (PAR) is a metric which estimates the PSNR of the stream from the values of the quantization parameters in MPEG-2 coded video [59].

These schemes are evolving towards hybrid metrics, which combine several bitstream measures, sometimes also with additional picture measures, to obtain better quality estimates. The V-Factor proposed by Winkler et al. uses several measures, such as quantization parameters, bitrate, packet losses, video stream structure... to produce a single quality value [121]. Erman and Matthews analyze Key Quality Indicators (KQIs) such as blockiness, jerkiness... and predict their value from the measurement of network quality of service (bitrate, packet loss rate, buffering) using a trained model for that mapping [17]. This approach is also being used in the upcoming multimedia quality standards which are being developed in ITU-T Study Group 12: P.1201 (ex P.NAMS) and P.1202 (ex P.NBAMS) [8]. They are intended to be used both for network planning and for QoE monitoring. P.1201 uses only transport information, while P.1202 adds video bitstream information. Only video headers are used; neither of them requires decoding the video.

Most of the work developed in this PhD Thesis is also located within the framework of bitstream-based and hybrid quality estimation. We propose a simple but effective method to predict packet loss effects on video quality [86]. It can be used as the basis of a full quality monitoring scheme which provides a meaningful mapping between quality values and the qualitative effect observed by the user [85].
This model can also be applied to different scenarios, such as unequal error protection [82] or selective scrambling [83], among others.

2.4 Other topics related to QoE in IPTV services

When managing multimedia Quality of Experience, there is some implicit knowledge which is not always easy to find in the surveys of metrics, such as a proper definition of QoE, the relevant fact that transmission errors are much worse than coding errors, or the difficulty of generating a good no-reference metric [91]. This section compiles some miscellaneous results extracted from the literature, which can be used to support design decisions.
Cermak et al. [6] study the relationship among video quality, screen resolution, and bitrate to show that, as expected, the perceived quality increases with the bitrate for a given screen resolution. Besides, they conclude that it would be reasonable to choose a bit rate given a screen resolution; it would not be reasonable to choose a screen resolution given a bit rate. Jumisko et al. [56] study the effect of the selected content in subjective assessment of video quality on mobile devices, finding that the content selection may have strong effects on the results of subjective assessments. Specifically, for audiovisual content, it seems that errors are perceived as more severe in contents which are recognized by the users than in unrecognized contents.

There are also several studies which characterize the levels and patterns of packet losses (and other network issues) in IPTV services, so that they can provide valuable inputs to the metrics that monitor the effect of those losses. Hohlfeld et al. [34] provide a model to simulate packet loss patterns based on Markov chains whose parameters are computed from session capture logs. Ellis and Perkins [14] characterize the packet losses in residential access networks (cable and ADSL), performing an intensive study of packet loss rates in 4 cable and 1 ADSL links, at several bitrates (1-4 Mbps). Most sequences had an error rate lower than 1%. Typical error bursts were short: 1 to 5 packets. However, this can change if the DSL service activates Forward Error Correction and interleaving, which reduces the error rate at the cost of spreading the errors. With a typical interleaving of about 10 ms [5], any error which is not corrected by the ADSL FEC will result in a potential loss of 10 ms worth of video.

Mahimkar et al. perform an extensive analysis of a large commercial IPTV network [67].
They collect a large amount of data from the field and develop a method to find the root cause of a problem through statistical analysis and correlations. Beyond that, they provide an interesting insight into what happens inside a real IPTV deployment:

- Video traffic is monitored using MDI. Other monitoring data used are the logs of the network elements and routers (collected in a centralized syslog), logs of Set-Top-Box reboots, and reports from customer care centers.
- There is a high correlation between MDI events and network events (syslog), as expected. However, there is low correlation between MDI and call center events (bursty video losses rarely generate a call). On the other hand, most customer complaints (46%) are related to video (presumably sustained problems).
- About 5% of STBs had at least one reboot event in a 3-month period.
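Monitoring based on MDI, as in the deployment above, reduces to computing eqs. (2.6) and (2.7) over each measurement interval. A minimal sketch under the virtual-buffer interpretation given in section 2.3.4 (function names are ours):

```python
def virtual_buffer_levels(arrival_times_s, packet_bits, media_rate_bps):
    """Fill level (in bits) of a virtual buffer that receives each packet on
    arrival and drains at the nominal media rate, sampled after each arrival."""
    levels, received = [], 0.0
    t0 = arrival_times_s[0]
    for t in arrival_times_s:
        received += packet_bits
        levels.append(received - (t - t0) * media_rate_bps)
    return levels

def delay_factor_ms(levels_bits, media_rate_bps):
    """Delay Factor of eq. (2.6), expressed in milliseconds."""
    return 1000.0 * (max(levels_bits) - min(levels_bits)) / media_rate_bps

def media_loss_rate(packets_expected, packets_received, interval_s=1.0):
    """Media Loss Rate of eq. (2.7): packets lost per time interval."""
    return (packets_expected - packets_received) / interval_s
```

A perfectly paced stream yields a DF near zero; a single packet arriving 5 ms late on a 1 Mbps stream raises the DF to 5 ms.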
Another field to consider is the composition of audio and video qualities to generate an audiovisual quality model of the content. It is widely accepted that the multimedia quality m can be modeled parametrically from the audio and video qualities (a and v),

m = \alpha a + \beta v + \gamma (a \cdot v) + c   (2.8)

and that those parameters depend on the specific application [31]. A recent analysis of 12 different subjective assessment tests [90] has shown that audio quality and video quality are equally important in the overall audiovisual quality. The application drives the range of audio quality and video quality examined, and thus produces the appearance that one factor has greater influence than the other; the underlying perceptual model is invariant to the application. The most important overall conclusion is that only the cross term (a \cdot v) is needed to predict the overall audiovisual quality. These results are in line with others which showed that instantaneous errors were similarly unacceptable, whether they were produced in audio or in video [57].

Audio quality metrics as such are not usually included in the quality assessment for multimedia applications. This might be caused by a bias in the studies of multimedia quality, since most of them come as an evolution of video-only quality analysis. Even if this is partially true, there is a good reason for not being too concerned about audio coding quality: while audio and video are equally important for multimedia quality, audio requires at least an order of magnitude less bitrate to reach a similar quality level [43]. Therefore audio coding quality should not be a problem in a well-dimensioned multimedia service.

The measurement of audio coding quality has been standardized in the recommendation ITU-R BS.1387-1 [41], which defines a Full-Reference audio quality metric called Perceptual Evaluation of Audio Quality (PEAQ). The model divides the audio signal into segments (called frames).
Each frame is divided into different sub-band components (using an FFT or a filter bank) in the Bark scale, modeling also the frequency response of the peripheral ear and the time and frequency masking of the human hearing system. There are a total of 54 sub-bands between 80 Hz and 18 kHz. Afterwards both signals are adjusted and equalized, and their relative error (per sub-band) is computed. These results are used to compute several Model Output Variables (MOVs), which characterize the level and structure of the error signal (bandwidth, modulation, noise-to-mask ratio... ). Those MOVs feed a neural network which provides the final quality value.
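Going back to the audiovisual composition of eq. (2.8), the model is trivial to evaluate once its coefficients are known. The defaults below are hypothetical values chosen only to illustrate the cross-term-only finding discussed above, not fitted coefficients:

```python
def audiovisual_quality(a, v, alpha=0.0, beta=0.0, gamma=0.18, c=1.0):
    """Parametric audiovisual quality model of eq. (2.8):
    m = alpha*a + beta*v + gamma*(a*v) + c.
    With alpha = beta = 0 only the cross term contributes, mirroring the
    conclusion of the 12-test analysis cited in the text; the numeric values
    of gamma and c here are illustrative assumptions, not fitted data."""
    return alpha * a + beta * v + gamma * (a * v) + c
```

Note that the cross term alone already makes the prediction monotonic in both a and v.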
Audio packet losses are relatively simple to analyze, since they basically produce a mute in the audio output. The effect of the mute length on audio quality has been evaluated by Pastrana et al. [79], from which it is possible to extract rough guidelines for audio losses:

- Mutes below 500 ms produce a low to moderate impact on quality.
- Mutes from 500 to 1000 ms produce a strong impact on quality.
- Mutes longer than 1 second produce a very strong impact on quality.

The end-to-end delay is not frequently considered a critical design factor for multimedia services (IPTV and similar), as it is for conversational services. However, it could become a relevant factor in some specific situations. For instance, experiments show that for specific contents, such as important sport matches, having an end-to-end delay which is 2-4 seconds higher than that of other services (SDTV vs HDTV, for instance) can be a reason for a user to switch services [70]. And sport matches are in fact the most relevant content in current digital television platforms: for instance, in the Spanish pay-TV channel Canal+, more than 40% of the aggregated audience comes from live football matches 1.

2.4.1 Media formats in IPTV deployments

Video coding and transport standards provide a reasonable degree of freedom to implement them. However, when designing, implementing, or testing QoE measuring strategies, some assumptions must be made about the specific parameters used to encode the content. To support these assumptions, we have analyzed video streams from existing IPTV deployments (or field trials of IPTV service providers) in several countries, such as Spain, USA, UK, Brazil, Chile, Argentina, Austria, Cyprus, Dubai, Czech Republic, Slovenia, France, Italy, Japan, Taiwan, India, Turkey, South Korea, and Australia, from 2007 to 2011.
From this survey, the following conclusions were obtained:

- Video format is MPEG-4 AVC (H.264) in all the scenarios, plus some MPEG-2 part 2 video in some of them (in all cases for legacy support and with the intention of migrating to MPEG-4 AVC). VC-1 has limited use, mainly in North America. Other formats, such as MPEG-4 part 2 (visual) Simple Profile, which were relatively popular in internet video in recent years, have virtually no presence in the IPTV world. Main profile is used for SDTV, and main or high profile for HDTV.
- Video resolutions are the typical ones for television distribution: 720x576 (25 fps) and 720x480 (30 fps) for SD, as well as 1920x1080 (25/30 fps) and 1280x720 (50/60 fps) for HD. Fractions of the full horizontal resolution (e.g. 1440x1080 or 544x576) are also frequent, especially at the lowest bitrates, to reduce the amount of data to transmit.
- Video bit rates lie between 1.5 and 3 Mbps for SD. HD bitrates are more variable: from 6 to 20 Mbps. Constant bit rate is used in most of the deployments, although not necessarily with strict CBR constraints (some local variations of the bitrate are acceptable).
- GOP lengths are between 12 and 100 frames (0.5 to 4 seconds, approximately). The most typical values are between 24 and 48 frames. GOP structures are IBBP or IBBBP, the latter being more frequent with longer GOPs. Besides, IBBBP GOP structures are normally hierarchical, with the reference structure represented in figure 2.4. Dynamic GOP sizes, i.e. changing the GOP size depending on the structure of the video (normally to insert I frames at scene changes), are frequently used. However, the number of consecutive B frames (between I and P frames) does not change.
- Video start-up delay (PTS-PCR difference for I frames) is imposed by the encoder end-to-end delay. In last-generation low-delay encoders, the coding delay is typically around 1 second and the video start-up delay is between 700 and 900 ms. Medium-delay configurations have values around 2 seconds (normally with the benefit of better coding quality). The first generation of H.264 encoders had delays in the range of 4 seconds.
- Audio formats are MPEG-1 layer 2 (typically associated with MPEG-2 video), MPEG-4 AAC (both low-complexity and high-efficiency profiles), and Dolby AC-3. Audio bitrates range from 96 to 512 kbps.

1 Audience data from January to September 2012. Source: Kantar Media.
There may be several audio streams (with different languages). Other data streams are usually present, of which the subtitles and the teletext are the most relevant ones. Their presence, relevance, and format vary significantly from one country to another.
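The parameter ranges observed in the survey can be condensed into a small sanity-check helper. The dictionary below transcribes ranges from the survey conclusions; the helper itself, including its field names, is our illustration:

```python
# Typical ranges observed in the 2007-2011 IPTV deployment survey.
TYPICAL_RANGES = {
    "sd_video_kbps": (1500, 3000),   # SD video bit rate
    "hd_video_kbps": (6000, 20000),  # HD video bit rate
    "gop_frames": (12, 100),         # GOP length in frames
    "audio_kbps": (96, 512),         # audio bit rate
}

def within_typical_range(name, value):
    """True if `value` falls inside the typical range observed for `name`."""
    lo, hi = TYPICAL_RANGES[name]
    return lo <= value <= hi
```

A stream falling outside these ranges is not necessarily wrong, but it is atypical for the surveyed deployments and worth flagging.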
Figure 2.4: Hierarchical GOP structure

2.5 Conclusions

Even though the market and the technology are in constant evolution, it is possible to define a subset of common elements which covers a big fraction of the multimedia service playground: delivery of MPEG digital video and audio contents over a packet network. The main task of service providers is indeed offering this delivery with enough Quality of Experience to the end user. To achieve it, they must control three elements: coding quality, network quality of service, and overall service availability.

Multimedia coding quality will depend mainly on the available bit rate, which constrains the video quality more strongly than the audio quality. Service providers use the codec, among the ones widely available in the market, which offers the best quality for a given bit rate budget, which is currently H.264. The coding process can (and should) be screened to assess its quality, and this should be done by the best available means: subjective assessment tests or Full-Reference quality metrics. Regarding FR metrics, the ones that have passed VQEG tests are available in ITU-T recommendations J.144 and J.341. If they are not available, simpler metrics such as SSIM (or even PSNR) can be used, as long as one is aware of their limitations.

The most relevant risk to Quality of Experience in field deployments, however, is a drop in the QoS offered by the network, resulting in the loss of information. These losses can produce both spatial and temporal artifacts (macroblocking, jerky motion, freezing), with strong impact on the QoE. This impact could be monitored using No-Reference or, even better, Reduced-Reference metrics, such as the ones proposed in ITU-T J.249. However, practical reasons mean that only QoS monitoring schemes, such as the Media Delivery Index, are widely used in service deployments.
Bitstream-based NR QoE metrics can overcome those practical limitations and enhance pure QoS models, as intended by the recently approved ITU-T recommendations P.1201 and P.1202. Regarding the contributions of the different elements to quality, audio and video can be considered equally important. There are also reasons to consider end-to-end delay as a relevant factor.

Overall service availability, understood as the ability to receive the multimedia service, must also be considered in any practical scenario. Customers suffer from user
terminal software issues or other outage events, in a number that can be estimated roughly at 1-5%, according to the studies presented in the literature. However, unlike the coding and network qualities, the problems with service availability will be specific to each service deployment.
Chapter 3

Designing QoE-Aware Multimedia Delivery Services

3.1 Introduction

Monitoring multimedia quality of experience (QoE) in a multimedia service is a complex task. Quality monitoring implies generating quality data in real time at all the relevant points of the network, raising alarms whenever a critical event happens, and being able to retrieve and process all the data to obtain significant statistical information about the network performance in terms of QoE. A quality monitoring framework also typically presumes that the original signal is rarely available at the monitoring point, and therefore reduced-reference (RR) or no-reference (NR) video metrics need to be used.

There are dozens of video, audio, and multimedia RR/NR quality metrics which could be applicable to the monitoring of multimedia QoE (see, for instance, [33]). Although their performance is not as good as that of Full-Reference metrics [111, 112], they can provide relevant results about the video quality of the measured signal. In fact, this kind of metric has also been introduced in some commercial monitoring probes during the last decade. The cost of those probes makes them usable for deployment at several points within the communication network (such as the video head-end or the local points of presence), but not at the end user's home network.

However, in communication networks, errors typically occur in the last mile, where the computing power of the equipment (network routers, access gateways, or set-top boxes) is rarely dedicated to the implementation of complex processing algorithms for QoE monitoring. In practical terms, the monitoring information available in field deployments is obtained only at transport level: packet loss rate (PLR) and packet loss
pattern (PLP), as well as jitter [3, 67], frequently expressed using the Media Delivery Index (MDI) [118]. Some derivative QoE monitoring metrics, such as [4, 17], are built assuming that packet loss and jitter are the only available inputs regarding network impairment issues. Since the effect of jitter is creating a packet loss in the receiver (because the packet arrived too late to be used), this is the same as saying that the only available network quality source information is the packet loss pattern at the end device.

Despite its limitations, this approach makes sense because, for random packet losses, the effects (errors in the decoded video) are quite correlated with the (effective) packet loss rate [95] and pattern [65]. PLR/PLP monitoring systems also have many other advantages: they make no assumptions about the content, can be widely deployed in a non-intrusive way, and provide data which are easy to understand, aggregate, and analyze. Besides, they are repeatable: if we can assume that a specific packet loss pattern creates an aggregated effect (let us say, an impairment in global perceived quality of x%, within some error margin), we can recreate the same effect by replicating the causes, i.e. by generating the same error pattern.

However, using PLR/PLP as the only description of network losses implies considering all the media packets as homogeneous data, which is certainly sub-optimal. The impact of an isolated packet loss may vary strongly depending on whether the lost data belong to audio or video, and also depending on the part of the audio or video stream which has been lost. Besides, as discussed in section 5.2, the impact of packet losses can be dramatically mitigated by a simple re-arrangement of the transport stream packets within RTP and an appropriate prioritization model [82].
Hence, with an appropriate packet priority model applied in the service deployment to reduce the randomness of the packet loss events, the significance of the pure PLR/PLP could also be reduced.

Fortunately, there is additional information at transport level, or at the network abstraction layer of the media level, that is also available to the network elements with very little additional effort (or, in other words, without needing to decode, even partially, the video or the audio): elementary stream type (video or audio), type of coded video frame (I, P, B... ), position of the frame boundaries within the bitstream... This information, which we could call rich transport data, can be used to better predict the effect of packet losses [86]. The key point here is that the rich transport data are obtained directly from the bitstream in a deterministic way (they are syntactic information which is always available in the media stream). This way, these data share the main properties that made PLR/PLP so useful: content-agnosticism, non-intrusiveness, simple processing and, most relevantly, repeatability (in the same terms as PLR/PLP).

The aim of this chapter is building a framework for QoE monitoring based on the information available at the rich transport data level. This framework will be built from
a pure bottom-up approach: the target is making the best possible use of the available information within this rich transport data level, and trying to find out whether the obtained results could be sufficient in a network monitoring scenario, as well as whether they would provide more information than the one obtained only by PLR/PLP analysis. Before going on with the analysis, it is important to consider that the final target of the network monitoring is anticipating, or at least explaining, the degradations in user quality of experience that could cause complaints from end customers. In such a context, when users complain about errors in the field, they do not speak of packet losses, but of video artifacts [7], such as blockiness, screen freezes, choppy transitions, or distorted audio. The aim of our monitoring framework is identifying the root causes of these kinds of impairments, so that they can be detected when they happen. On the one hand, for impairments which are caused by network errors, we will use the information of the rich transport data to obtain the most accurate description and characterization of the effect. On the other hand, for impairments related to the coding process itself, there are also elements in the transport data that can be used as proxies to monitor them. It is also important to consider that, in typical multimedia service deployments (with a few hundred video streams for hundreds of thousands of users), the quality of the encoded content should be high enough in normal operating conditions, and the monitoring of coding quality could also be done with more complex (and expensive) tools. To validate the characterization of the different impairments based on rich transport data, we have also designed a set of subjective quality assessment tests, where the impairments to analyze are based on that characterization.
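To make the notion of rich transport data concrete, the following illustrative sketch extracts frame-level information from an H.264 Annex-B byte stream using only the one-byte NAL unit headers, without decoding any video. The classification shown (IDR refresh point, reference slice, disposable non-reference slice) is a simplification: distinguishing P from B slices would additionally require parsing the Exp-Golomb-coded slice_type in the slice header, which is omitted here.

```python
def nal_units(bitstream: bytes):
    """Scan an H.264 Annex-B bitstream for NAL unit headers.

    Yields (nal_ref_idc, nal_unit_type) per NAL unit: purely syntactic
    information, obtained without decoding. The 3-byte start-code search
    also matches the 4-byte form (00 00 00 01).
    """
    i = 0
    while True:
        i = bitstream.find(b"\x00\x00\x01", i)
        if i < 0 or i + 3 >= len(bitstream):
            return
        header = bitstream[i + 3]
        yield (header >> 5) & 0x03, header & 0x1F
        i += 3

IDR_SLICE, NON_IDR_SLICE = 5, 1   # nal_unit_type values from the H.264 spec

def classify(bitstream: bytes):
    """Coarse per-NAL classification: IDR (intra refresh point),
    reference slice, or disposable (non-reference) slice."""
    labels = []
    for ref_idc, nal_type in nal_units(bitstream):
        if nal_type == IDR_SLICE:
            labels.append("IDR")
        elif nal_type == NON_IDR_SLICE:
            labels.append("ref" if ref_idc else "disposable")
    return labels

# Tiny synthetic example: an IDR slice, a reference slice, and a
# non-reference slice (NAL header bytes 0x65, 0x41, 0x01).
print(classify(b"\x00\x00\x01\x65" b"\x00\x00\x01\x41" b"\x00\x00\x01\x01"))
```

A loss hitting an "IDR" or "ref" NAL unit will propagate until the next refresh point, while a loss hitting a "disposable" one will not: this is precisely the kind of per-packet differentiation that pure PLR/PLP cannot express.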
They compare the effect of the same type of degradation for several contents and different users. The results of the test can be used both to validate the characterization of the error (i.e., to determine to what extent it makes sense) and to calibrate its subjective impact. Quality monitoring tools are aimed at estimating the quality perceived by the end users. Therefore, to obtain from subjective tests meaningful conclusions to be applied in the development of the monitoring architecture, these assessment tests should be designed respecting as far as possible real home viewing conditions. Thus, a novel subjective methodology, based on well-known standard procedures, was used in the tests covered in the present work to obtain results representative of what end users perceive in their households when typical transmission errors degrade the received video. The main target of our work is to sketch the steps required to build a consistent monitoring framework. This way, it is possible to identify the main impairment sources, find
out which information can be obtained about them (under reasonable assumptions), propose a framework to sort and classify this information, and design and implement a set of subjective assessment tests based on this framework. We believe that this approach can be easily enriched with other algorithms and models, and could also provide useful tools to other monitoring schemes that are being developed (for example, recent standardization efforts such as ITU-T P.NAMS and P.NBAMS [8]). The chapter is structured as follows. In section 3.2 we will describe the architecture of a multimedia delivery service, as well as the main quality impairment events that are present in a field deployment. Once they are identified, in section 3.3 we will propose the architecture for the monitoring process, aimed at detecting and characterizing each of those events. Section 3.4 describes the design of the subjective assessment tests used to validate and parameterize the proposed solution. Finally, section 3.5 describes some QoE enablers: network elements focused on enhancing the QoE offered by the service.

3.2 Delivering multimedia over IP

The monitoring system has to be designed according to the architecture of the monitored service. For this reason, a fine characterization of what a multimedia service is and how it works is quite relevant for the purposes of our work. This section proposes an architecture for multimedia services, based on the principles described in Section 2.2. It also describes a set of impairments that appear in those deployments, and how they could be detected based on the monitoring of rich transport data.

3.2.1 Architecture of a multimedia service delivery platform

Figure 3.1 shows a schematic architecture for an IPTV and OTT service. Although it is a simplification, it shows the main elements that are present in most commercial deployments [11, 55, 67].
The architecture also shows the most relevant quality monitoring points according to Recommendation ITU-T G.1081 [44]. They are labeled as PT1-PT5 in the figure, following the terminology proposed in the Recommendation. The main building blocks of a multimedia service delivery architecture are thus the following:

- The video contribution, coming from the Content Provider. The ingestion of the video contribution is the monitoring point PT1.
Figure 3.1: Schematic representation of the network architecture for IPTV and OTT services, including reference monitoring points (PT1-PT5)
- A Central Headend, where the content preparation occurs. This is normally owned or controlled by the Service Provider. There may also be local headends, which are smaller versions of the central headend used for local content. PT2 is located at the output of the headend.
- The core network, with different configurations depending on the type of service distributed. It is assumed to be a high-quality network, with negligible error rate.
- The Point of Presence. This is the last point in the network chain where the Service Provider has control. PT3 is located here.
- The access network, which is an IP link between the PoP and the Home Domain.
- The Home Domain, which includes the Residential Gateway (RGW, the entry point of the home, where PT4 is placed) and the HNED or user terminal (whose output is PT5).

The video contribution is received, by definition, in contribution quality, which is the maximum multimedia quality available to the service provider. The contribution is ingested into the video headend and processed once in a centralized way (or locally at local headends, with video streams that may have regional or local distribution only). The key principle of the headend is that any processing is done once for each content asset or stream (or, in other words, each processed asset will be common to all the users of the service). For this reason, the processing done in the headend is usually performed by dedicated equipment. Processing or storage capacity in the headend is not a strong limitation in the deployment. Typical headend functionality includes [87]:

- Coding (or transcoding) of the contribution. The contribution source is encoded using a format, resolution, and bitrate that fits the dimensioning of the service and the capabilities of the network and user terminal.
After this, the encoding can be assumed to be left untouched, and the multimedia quality of the content at this point (delivery quality) is the quality expected to be perceived by the end users.
- Encryption of the content using a Digital Rights Management (DRM) system [109]. The coded media stream, or a fraction of it, is scrambled using cryptographic algorithms. The scrambled data can only be deciphered by authorized user terminals.
- Other video processing: multiplexing, remultiplexing, labeling, signaling of entry and exit points for local content splicing, metadata insertion...
- Ingestion into the core network, for Live/OD and for IPTV/OTT, using the appropriate multiplex and transport protocol stack, as described in section 2.2.2. For the case of RTP streams (IPTV), it may also imply adding Forward Error Correction (FEC) redundancy packets, as defined in DVB-IPTV AL-FEC.

Quality monitoring in the headend is oriented to guaranteeing a sufficient degree of delivery quality. It normally requires intensive monitoring, as any quality impairment at this point affects all the users in the deployment. It allows FR measurement of the delivery quality with respect to the contribution quality, between points PT1 and PT2. The core network is different for each of the service types. In the case of IPTV, the live video is distributed using a multicast-enabled IP network. IPTV Video on Demand is ingested into a centralized master VoD server, which may distribute it to video pumps located closer to the end users. The core network for OTT is a Content Delivery Network (CDN). CDNs ingest the master copy of the content into a centralized server (usually called the origin server), which stores it permanently (for on-demand content) or for some time window (for live content). The video is then distributed towards the edge through a hierarchy of caches. The point of presence (PoP) is, by definition, the last point where the service provider may have control of the delivered video. Although it has been displayed as a common point for all the networks, it does not need to be this way: it is not infrequent that CDN PoPs, for instance, cover a wider area (and more users) than IPTV PoPs. The key point of the PoP is that all the processing done here is done in a per-user way. It is also the last common point for unicast services: any communication between the PoP and the end user will be different for each user (except for the case of live IPTV over multicast).
As a consequence, the scalability of the PoP processing must be handled on a per-user basis (contrary to the per-asset scalability of the video headend), and therefore the cost of processing and storage in the PoPs is very relevant for the overall performance of the service. The PoP is also PT3: the last monitoring point in the service provider domain, where two different elements are monitored:

1. Errors in the core network. This allows using RR or NR metrics, depending on the capability of the headend to generate RR information. Each error detected here affects all the users in that PoP; therefore intensive monitoring is recommended.
2. Errors in the delivery and home networks. This is the real monitoring of the quality delivered to the end user, which must be done between the PoP servers and the user terminal. In the cases where the service provider does not control any
home network element (which is typical in OTT), the monitoring must be done in the PoP (with the feedback data provided by the client in the communication). As performance is critical, in almost all cases only bitstream NR measures will be available. The access network is the IP link between the PoP and the home domain. Strictly speaking, the term access network is normally used only for the last mile, i.e., the part of the network covering the data link to the home domain (DSL, GPON, 3G, LTE...). However, we will use the term in a broader sense, so that it may also cover the second mile metropolitan network or, in general, any required IP access between the home domain and the PoP. IPTV access networks must support UDP traffic and, more specifically, UDP over IP multicast. OTT traffic is less restrictive, typically involving only HTTP connections (TCP over port 80). Finally, the home domain comprises all the equipment located in the end user premises. Depending on the type of service, the Service Provider may have some kind of control of what is happening in the home domain. For instance, in IPTV services it is frequent that the Service Provider owns the residential gateway and/or the user terminal, which are provided as part of the service itself. In OTT services it is more frequent that the user terminal is owned by the end user, but it might include some Service Provider specific application software. In the former case, it is possible to take NR bitstream-based measures at PT4. In the latter, monitoring in the user terminal is not possible. In any case, PT5, which is the final quality displayed to the end user, cannot be effectively monitored in real time service-wide. Only selected users, either with objective monitoring probes or as subjects of subjective assessment tests, will be able to provide quality information. Two additional considerations are relevant.
The first one is that the possibility to take some measures (as well as to perform error correction actions when possible) may depend on specific QoE capabilities of the deployment, such as the ones that will be described in section 3.5. The second one is that, if a DRM system is in place, it is virtually impossible to apply pixel-based metrics beyond the headend, as the content will be scrambled and will not be decodable by the monitoring probes. It is, as a general rule, still possible to apply bitstream-based metrics, since the scramblers can normally be configured to leave in the clear all the relevant rich transport data of the stream.
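One reason bitstream-based metrics survive scrambling is that the 4-byte MPEG-2 TS packet header is never encrypted: PID, payload_unit_start_indicator, the scrambling control bits, and the continuity counter always stay in the clear. A minimal sketch of what a probe can read from any TS packet (the packet bytes below are synthetic, built only to illustrate the field layout):

```python
TS_PACKET = 188   # fixed MPEG-2 TS packet size in bytes

def ts_header(packet: bytes) -> dict:
    """Parse the 4-byte MPEG-2 TS packet header, which DRM systems
    never scramble, so these fields are always readable by a probe."""
    assert len(packet) == TS_PACKET and packet[0] == 0x47, "lost sync"
    pid = ((packet[1] & 0x1F) << 8) | packet[2]      # 13-bit PID
    return {
        "pusi": bool(packet[1] & 0x40),      # payload_unit_start_indicator
        "pid": pid,                          # which elementary stream
        "scrambled": (packet[3] >> 6) != 0,  # transport_scrambling_control
        "cc": packet[3] & 0x0F,              # continuity_counter
    }

# Synthetic packet: PID 0x100, payload start, scrambled, cc = 7.
pkt = bytes([0x47, 0x41, 0x00, 0x97]) + bytes(184)
print(ts_header(pkt))
```

Gaps in the continuity counter per PID, for instance, already give a per-stream loss indication even when the payload itself cannot be descrambled.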
3.2.2 Impairing the Quality of Experience

The first step in designing a monitoring system is to understand the elements which can affect the quality of experience perceived by the end user. The concept of QoE is used on purpose in this work, since the framework established here is applicable to any element of the QoE that can be monitored. However, the work will have a special focus on the aspects of QoE which are directly related to multimedia quality. Having said this, this section presents a first classification of the possible causes of degradation of the quality of experience, as well as their possible consequences. The classification is mainly based on the causes, because that is what is measured by monitoring systems. In the description of the different causes, we will also identify the mapping to the quality impairments reported by the end users, as described by Cermak in [7].

3.2.2.1 Coding quality

Video coding quality is one of the most relevant elements of the QoE, and establishes an upper bound for the global perceived quality. The artifacts which appear in video coding, as well as several ways of measuring them, have been widely discussed in the literature [15]. Among them, the edge shimmer reported in [7] is one of the effects of problems in video coding. Low coding quality can also cause a blocking effect on the pictures (although it is less visible in AVC video, due to the deblocking filter), but it is less aggressive than in the case of video packet losses. Audio coding quality is normally a less relevant issue in video delivery services, because its bitrate is typically one order of magnitude smaller than that of the video, while its impact on the quality is similar [31]. Estimating video quality from rich transport data is not obvious.
Without any better proxy to measure coding quality, bitrate normally makes a good one, especially when comparing quality from the same encoder and the same content [6]. Under stable conditions (same encoder implementation and bit rate), the quantization parameters may also provide an estimation of video quality [59].

3.2.2.2 Packet losses

The most relevant impairments in video transmission services (for instance, those reported in [6]) come from errors in the network: either packet losses or jitter. Packet losses can be corrected using either FEC or ARQ techniques (see, for instance, the proposal for IPTV in [19]), while jitter can be corrected by using a reception buffer.
However, when the error (loss or jitter) exceeds the capabilities of the correction strategy, the effect is always a loss of data in the decoded stream. Those effective packet losses are the main target of the quality monitoring. The effect of packet losses depends on the error recovery strategy. In most cases, the decoder tries to conceal the error by inferring an appropriate replacement for the affected data sections (video or audio frames), such as repeating previous data or inserting silence or noise. Losses in the video stream produce a blockiness effect or freezes in the video playout. The former causes the appearance of incoherent blocks in regions of the frame, while the latter can be perceived as a screen freeze or choppy transition, depending on the length of the effect [86, 95]. Losses in the audio stream produce an audio degradation (distorted audio) with a duration of the same order of magnitude as the length of the data loss [79, 84]. However, the behavior for on-demand content may be different, as an alternative error recovery strategy is allowed: stopping the playout and waiting until all necessary data have arrived. Nevertheless, this only makes sense when data retransmission is possible (e.g., a TCP transport layer, where the integrity of the received data segment is guaranteed). Besides, it generates the buffering events typical of internet video.

3.2.2.3 Latency

In bidirectional real-time communication (such as videoconferencing), end-to-end latency is the most critical parameter to consider. However, in unidirectional content delivery, latency is typically much less important. Coding quality management and packet loss correction are normally done at the cost of latency. Latency is only important in live events (especially sports events). However, to the best of our knowledge, the effect of the global latency on the user QoE has not been widely studied in the literature.
Except for the previously mentioned buffering events, end-to-end latency is constant. As such, it is established at the beginning of the multimedia session and remains constant from then on. In fact, except for pure transport latency (which is only significant for satellite broadcast), latency is decided at the design phase. Another latency-related QoE element is the initial wait time, which is the time that the user has to wait to start viewing (and hearing) a multimedia service. In the context of linear TV, it is called channel change time, or zapping time, and it has been modeled as a component of the QoE [60]. Since digital TV typically has long zapping times, there have been significant efforts in recent years to develop systems that can reduce it (see, for instance, [11]).
3.2.2.4 Outages

Service outages are interruptions of the whole service for a relevant period of time. Although there may be quite different sources for this kind of error, they must be taken into consideration in any global QoE monitoring system: on the one hand, because it does not make sense to monitor the less relevant errors if the most critical ones are not controlled; on the other hand, because they effectively happen and are reported by the end users (they are labeled as error stop in [7]). Besides possible failures in the service equipment (either in the customer premises or in the network), an outage can also be produced by an abrupt loss of the video and audio signal, which can be monitored with measures such as the ones defined in [94]. If the origin of the outage is in the contribution media source, it should be monitored in the service headend (before or after the video coders). Outages caused by the network are equivalent to long packet losses, and are easily monitored as well.

3.2.2.5 Quality degradations in new multimedia scenarios

Nowadays multimedia services are starting to popularize two features which affect QoE management: scalability and stereoscopy. The concept of scalable video, where the video is coded using several quality layers, each one refining the quality provided by the previous one, has been included in the coding standards for years. However, it has not gained wide acceptance and is not significantly present in current multimedia services. Nevertheless, the concept of scalability has recently been introduced in the marketplace with the emergence of HTTP adaptive streaming [104]. In this kind of system, the media stream is coded in parallel using different bitrates, and the streaming can switch among them at pre-defined switching points.
The advantage is that the streams are fully compatible with current AVC decoders, thus simplifying their implementation and deployment. As there are different bitrates, there are different coding qualities for the same video and audio stream (and all the considerations made for coding quality apply). Besides parallel coding, another way to create a codec-compatible lower-bitrate version of a video stream is dropping some non-reference frames. This technique, called denting, has already been used in different IPTV applications [85]. Regardless of the method used to create the different bitrate versions, they will have different quality (and therefore a different impact on the perceived QoE). Any monitoring system has to be aware of the version being received by the client at any moment.
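As a sketch of how such awareness could be obtained from the stream properties alone, the following illustrative sliding-window estimator tracks received bit rate and frame rate: a sustained drop in bit rate suggests a switch to a lower adaptive streaming representation, while a drop in frame rate alone suggests that non-reference frames are being discarded (denting). The class, its window length, and the sample format are assumptions of this sketch, not part of any standard.

```python
from collections import deque

class RateMonitor:
    """Sliding-window bit-rate and frame-rate estimator over
    per-interval samples (timestamp in seconds, bits, frames)."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s
        self.samples = deque()           # (timestamp, bits, frames)

    def add(self, t: float, bits: int, frames: int) -> None:
        self.samples.append((t, bits, frames))
        # Drop samples that fell out of the observation window.
        while self.samples and t - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def rates(self):
        """Return (bits/s, frames/s) averaged over the current window."""
        if len(self.samples) < 2:
            return 0.0, 0.0
        span = self.samples[-1][0] - self.samples[0][0]
        bits = sum(b for _, b, _ in self.samples)
        frames = sum(f for _, _, f in self.samples)
        return bits / span, frames / span

mon = RateMonitor(window_s=10.0)
for t in range(10):                      # one sample per second
    # nominal stream: 4 Mbit/s at 25 fps; bit rate halves after t = 5,
    # as would happen on a switch to a lower representation
    bits = 4_000_000 if t < 5 else 2_000_000
    mon.add(float(t), bits, 25)
bitrate, framerate = mon.rates()
print(bitrate, framerate)
```

The window average reacts gradually to the representation switch; an alarm rule on top of it (thresholds, hysteresis) would be provider-specific, in line with the configurable STFs discussed later in the chapter.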
Stereoscopic video is also being introduced in broadcast and streaming services, as 3D productions are increasingly popular in the entertainment market. There are basically two coding options for stereoscopic services: either coding the right and left views as parts of a single coded frame (typically side-by-side), using a common 2D video encoder; or coding them separately, normally using MVC. In either case, the coding schemes are basically the same as in 2D video (based on blocks and prediction), and therefore the effects on the decoded picture are equivalent to the ones produced in 2D video. However, those artifacts can produce different impacts on the final stereoscopic reconstruction done by the human visual system, and therefore they have to be studied specifically [24].

3.3 QuEM: a qualitative approach to QoE monitoring

The aim of this section is to propose an architectural design aimed at monitoring the quality of experience in a multimedia service delivery network. First we will provide a definition of the problem, trying to make explicit all the assumptions taken into account in the design. Afterwards we will propose the architecture structure, as well as some proposed implementations for its main building blocks.

3.3.1 Problem statement

The problem addressed by this architecture is the monitoring of multimedia QoE in an IPTV or OTT network. Figure 3.2 shows the delivery chain of multimedia services based on the network architecture described in section 3.2: source media, coding, transport, decoding, and presentation. The most typical realization of this delivery chain is an IPTV deployment of MPEG-2 Transport Stream video over RTP/UDP over an IP network [19]. However, the main ideas and elements described later will also be easily applicable to HTTP adaptive streaming scenarios.
The main assumption taken is that the monitoring is applied to a network of a service provider offering some kind of video distribution service to a high number of end users. This assumption imposes two conditions. On the one hand, scalability is a must. As such, any monitoring metric should require little processing power, be applicable in real time, be a no-reference metric, and assume no prior knowledge of the source content. On the other hand, it is expected that the service provider has established a target quality which is considered sufficient, and which is the one offered by the service in normal conditions. Therefore the aim of the monitoring system will be detecting the
Figure 3.2: Delivery chain of a multimedia service

moments where this quality gets impaired and establish some measure or description of such impairment. In other words, the monitoring system should provide a relative value of the quality, with respect to the target quality that would be obtained in the absence of impairments.

3.3.2 System design

To detect and measure those impairing events, we take a simple approach based on a typical quality estimator architecture: measure, pool, and map to quality [33]. Figure 3.3 shows the block architecture of the design, which we have called QuEM (Qualitative Experience Monitoring) [85]. The basic building block of the solution is the Qualitative Impairment Detector (QuID). It performs the measurement step by identifying each of the sources of content degradation. Its output is the (approximate) perceived degradation in the user experience. The key property of this block is that it must be, as much as possible, a systematic description of the effect of the error which has been produced (e.g. "half of the picture is blurred for one second"), and not only a single quality value (e.g. "Mean Opinion Score equal to 2"). This property of significance of the QuID output is what makes the approach qualitative (in the sense that there is not only a quantitative value of the degradation, but also a qualitative description).
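The measure-pool-map chain can be sketched in a few lines. In this illustrative sketch, the event fields, STF shapes, and numeric values are placeholder assumptions (each provider would calibrate its own, as discussed in this section); the maximum-severity aggregation follows the proposal made later in this chapter for combining QuIDs.

```python
from dataclasses import dataclass

@dataclass
class QuIDEvent:
    """Output of a Qualitative Impairment Detector: a description of
    the effect, not just a number. Field names are hypothetical."""
    kind: str        # e.g. "freeze", "macroblocking", "audio_mute"
    t: float         # seconds from stream start
    duration: float  # seconds
    extent: float    # fraction of the picture affected (video defects)

def stf(event: QuIDEvent) -> float:
    """Illustrative Severity Transfer Function on a 0-5 scale.
    These mappings are tunable per provider without touching the
    detectors; the coefficients below are placeholders, not
    calibrated values."""
    if event.kind == "freeze":
        return min(5.0, event.duration)                    # longer is worse
    if event.kind == "macroblocking":
        return min(5.0, 10 * event.extent * event.duration)
    if event.kind == "audio_mute":
        return min(5.0, 2 * event.duration)
    return 0.0

def pooled_severity(events, window_start: float, window_len: float = 10.0) -> float:
    """Aggregate over a pooling window by taking the maximum severity
    of the QuIDs in play."""
    in_window = [e for e in events
                 if window_start <= e.t < window_start + window_len]
    return max((stf(e) for e in in_window), default=0.0)

events = [
    QuIDEvent("freeze", t=2.0, duration=0.5, extent=1.0),
    QuIDEvent("macroblocking", t=4.0, duration=1.0, extent=0.3),
    QuIDEvent("audio_mute", t=14.0, duration=1.0, extent=0.0),
]
print(pooled_severity(events, 0.0))   # macroblocking dominates the first window
```

The point of the structure is that the qualitative description (the QuIDEvent) is preserved end-to-end, so a high pooled severity can always be traced back to the concrete impairment that produced it.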
Figure 3.3: QuEM architecture design

At this point of the chain, repeatability is also very important. Therefore it should be possible, as a general rule, to force the introduction of an error of each type, as it is possible in the pure PLR-based methods. The next step is the Severity Transfer Function (STF). The idea here is mapping the error to quality values which, in the case of packet monitoring, would be the severity of the error. This STF is applied within a pooling window. Synchronizing the pooling window across all the different errors (and across different clients) is important, because it allows checking whether an error has occurred for different users at the same time. The length of the pooling window is another configurable parameter of the model. It should be in the range of the duration of what could be considered a single impairment event. To cover macroblocking error propagation along the video Group Of Pictures, segments of adaptive streaming, or short outages [94], for instance, pooling windows from 5 to 20 seconds can be considered appropriate. The scale used for the STF may be anything which is significant for the user of the monitoring system, including a Mean Opinion Score (MOS) scale. However, unlike in typical MOS-based quality metrics, the STF is known by the user, thus making it possible to trace the MOS value back to the qualitative description of the impairment that generated it. The last step is the aggregation of errors for their use in statistics and in alarm systems. As with the STF, the aggregation function can also be modified by the service provider. Due to the complexity of taking into consideration all the possible interactions between
different QuIDs, and provided that the severity of each of the QuIDs has already been established, our proposal for this block is simply to always take the maximum severity of the ones in play [57].

3.3.3 Qualitative Impairment Detectors

The key to the usability of this architecture is the definition of the QuIDs in a way that makes them significant and repeatable. As a relevant example, we are going to build a monitoring system which can operate in an extensive multimedia network, based on the following QuIDs, which will be further discussed in chapter 4:

- Packet Loss Effect Prediction (PLEP) [86], described in section 4.2, models the effect of video packet losses, which depends mainly on the video coding structure and the position of the packet loss within the stream. The PLEP metric provides a good estimation of the effect of the loss: macroblocking (with a reasonable estimation of the area affected and the duration of the artifact) and video freeze.
- Audio packet losses, described in section 4.3. Their effect can be measured by monitoring loss patterns, since there is a high correlation between the length of the loss burst and the duration of the resulting distortion (normally silence or noise) [84].
- Drops of coding quality, discussed in section 4.4. We will assume that the coding quality that enters the core network is the desired quality (or, alternatively, that it can be monitored on the encoder side with more suitable mechanisms). However, this quality can decrease in cases of network congestion or bandwidth drops, through HTTP adaptive streaming or packet prioritization mechanisms [82]. Besides the switch to a lower-bitrate coded stream, we will consider the dropping of non-reference frames (denting). These events can be measured by monitoring bit and frame rates.
- Service Outages, or interruptions in the continuity of the delivered content, which are described in section 4.5.
They are basically severe versions of the video and audio packet loss effects, and they can be measured with the same techniques.

These measures cover the most relevant defects which appear in IPTV deployments [13] and can be easily measured in the bitstream, without needing to decode the video or audio (only NAL unit headers and slice headers beyond the transport layer). All the measures are fully compliant with the requirements of repeatability and significance needed to qualify as a QuID, with the possible exception of the bitrate, whose significance is more questionable. However, for the sake of this analysis, we will consider it
enough to provide quality information to the service provider (which should be able to easily observe the subjective quality of each of the different bitrates produced by its video encoders). Before continuing with the discussion, it is important to point out that we are using a bottom-up approach to build the QuID measures. We start from the information that can be obtained by tracking audio and video headers in the bitstream, as well as the main properties of the stream itself (bitrate and frame rate). Then we provide some simple measures that can offer information about impairments at a computing cost similar to that of the PLR/PLP metrics. The key is finding out whether these QuID measures can provide relevant information about the QoE of the received stream. To validate this point, we have designed a methodology for subjective quality assessment tests, which will be discussed in section 3.4.

3.3.4 Severity Transfer Function

Our proposed way to build the Severity Transfer Function is using subjective quality assessments which evaluate the effect of the different QuIDs under consideration. In any case, due to the significance property of the QuIDs, STFs can be established by the service provider (or network operator) according to its own severity criteria. This way, the relative severity of screen freeze events versus blockiness events, for instance, can be modified by the service provider by tuning the STF blocks, without needing to modify the QuIDs. The subjective quality assessment tests proposed in section 3.4 can also cover this point, as they provide a way to design and calibrate STFs.

3.4 A Subjective Assessment methodology to calibrate Quality Impairment Detectors

We have included some subjective tests to assess the validity of the approach and to calibrate the results (and design a first level of STFs).
A new test methodology, described in the following subsections, has been designed to adapt the tests to their purpose. The methodology has also been put into practice to assess the impact of the defects that are being monitored by our QuEM proposal. A description of those tests can be found in Appendix A.2.
3.4.1 Design principles

The objective pursued with these subjective assessment tests is twofold. On the one hand, the tests should validate the selected QuID measures. Each of the impairments under consideration is characterized in the monitoring architecture in a way that can be measured with precision and repeatability. The aim of the tests is to validate that those characterizations are good enough to provide information with sufficient independence of the context or, at least, to know to what extent these characterizations are usable without knowing the context. This way, if a QuID provides, for instance, an estimation of screen freezes and their duration, different realizations of an event detected as a 500 ms screen freeze should have similar evaluation results among them, and be distinguishable from events detected as 5-second screen freezes. On the other hand, the tests can also be used to establish a severity transfer function from the QuID outputs to a severity scale. With this in mind, the tests should evaluate the effect of the same impairments detected by the QuIDs, using evaluation periods similar to the pooling windows of the QuEM architecture. Moreover, since the aim of the QuEM system is precisely to estimate the effect of network impairments on real users of the system (and in real time), it is desirable that the tests respect as far as possible realistic domestic viewing conditions. This allows mimicking the audiovisual experience of an end user watching multimedia services, and evaluating the QuID elements under conditions matching as much as possible their final operation, in order to obtain meaningful results. This fact is especially relevant in the current work compared to other subjective assessment scenarios, and makes the most common approaches for subjective quality evaluation unsuitable.
The main reason is that the methodologies should be designed according to the specific aspects of the study being pursued. Therefore, in the present case, to respect real viewing conditions, many aspects of the test should be adapted, such as:

- The test material should be similar to what people usually watch at home, e.g. movies, sports, news, etc. In addition, the sequences should be long enough to attract the attention of the observers. This way, as happens in households, the viewers will be interested in the content and not only focused on detecting the impairments.
- The equipment used in the tests should be similar to that used in domestic environments; in particular, the TV sets should be consumer products.
- The sequences should be shown to the observers following a single stimulus procedure, which means that no unimpaired reference is presented to them to compare
with the test video. This makes the test similar to the home environment, where there is no explicit reference.
- The evaluation should be carried out in a nearly continuous way, since the effects of transmission errors are highly dependent on the instant when they occur, and they are not stationary.

These aspects mean that most of the international standard methodologies are not appropriate (e.g., the procedures proposed in ITU-R BT.500 [42] or ITU-T P.910 [53]), since they were designed to evaluate the performance of video coding algorithms and, in many cases, some of their conditions distance the observers from real viewing situations. Nevertheless, these standard recommendations have been taken into account in the design of the novel methodology proposed here to evaluate the impact of typical transmission artifacts, so that the results are more easily comparable with those from other sources.

3.4.2 Test methodology

Our main objective is to mimic home viewing conditions. Therefore, the proposed methodology is based on standard single stimulus methods, such as those recommended by the ITU [42] and the Absolute Category Rating (ACR) [112]. These methods do not present an explicit reference to compare with the content being evaluated, a situation similar to home environments where people watch video sequences. However, these assessment methodologies limit the maximum duration of the video sequences (usually to 10 seconds) to allow silence periods for voting, during which a fixed grey background is displayed. Figure 3.4 shows the structure of a test sequence according to the standard evaluation methodologies. To allow for a QoE assessment closer to a real-life situation, we have considered a new evaluation scenario where subjects view long test video sequences, so that they are immersed in the watching experience.
As we are interested in the evaluation of different types of impairments within this continuous stream, we have divided the whole sequence into segments. The impairments under study can only be inserted in the first half of each segment, while the second half remains undistorted. Therefore, while this second half is being displayed, observers can carry out the evaluation of the distortion introduced in the first half. To indicate to the observers when and which segment they have to evaluate, the second half of each segment displays a number in the bottom-right corner of the screen. During
these periods, the observers can avert their eyes if needed to look at the questionnaires, without affecting the result of the evaluation. In addition, a first segment is used to indicate to the observers the beginning of the test and to provide a coding quality reference; thus it is also left unimpaired and marked with a zero. Therefore, the structure of the test sequence is as depicted in Figure 3.5. This methodology better simulates real viewing situations and therefore allows, in contrast to ACR, a nearly continuous evaluation of the quality of the sequence without losing the continuity of the video. For simplicity, the observers provide their ratings using a questionnaire; however, other methods could be investigated. The evaluations are done according to the five-grade impairment scale proposed in [42] (Imperceptible; Perceptible but not annoying; Slightly annoying; Annoying; Very annoying). Thus, the questionnaire contains boxes where the subjects mark a cross in the one corresponding to the evaluated segment and its score, as depicted in Figure 3.6.

[Figure 3.4: Diagram of the structure of the test sequences in ACR]
[Figure 3.5: Diagram of the structure of the test sequences in our proposed method]
[Figure 3.6: Questionnaire for subjective assessment tests]
3.4.3 Selection of impairments

The impairments introduced in the video sequence are selected among the effects measured by the QuIDs under study. With the aim of controlling and reducing the possible limitations of using long continuous sequences, such as content dependency and context effects, N versions of the same original sequences are created by introducing different impairments in the same time segments. These versions are called variants and, in each of them, for the same value of i, the impairments introduced in segment T_i all correspond to the same QuID. This kind of distribution allows the parallel evaluation of controlled combinations of degradations, defined as an impairment set when concerning the same segment. Each impairment set is then made of N different intensities of the same QuID, to be evaluated in parallel. For instance, for a QuID detecting a video screen freeze of x seconds, the impairment set would consist of N different values of x. The impairment set may, but does not need to, also include hidden references. This way, the structure of the content streams in the test session is as depicted in Figure 3.7. Each row (A, B, C, D) represents a different variant of the same original sequence, each one divided into aligned segments (T_1, T_2,...). The colored sections in the segments are the halves where the impairments are present, while the white-background sections are the evaluation halves (when the segment number is shown on the screen). The segments in the same position (e.g. the T_1 segments in all the variants) contain different impairments from the same impairment set. Once the number of segments and impairment sets to be tested has been selected, the position of each impairment set in the sequence is chosen randomly. For each segment position T_i, each of the impairments in the impairment set is also assigned randomly to one of the variants.
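As a sketch of this randomized assignment (a minimal illustration under our own naming, not the actual thesis tooling), the layout of impairment sets over segments and variants can be generated as follows, assuming each impairment set contains exactly one intensity per variant:

```python
# Illustrative generator of the test layout: each segment position gets one
# impairment set at random (with repetition, so a set can appear several
# times along the sequence), and the N intensities in that set are shuffled
# across the N variants.
import random

def build_test_layout(impairment_sets, n_segments, n_variants, seed=None):
    """Return layout[variant][segment] = (set_name, intensity)."""
    rng = random.Random(seed)
    positions = [rng.choice(sorted(impairment_sets)) for _ in range(n_segments)]
    layout = [[None] * n_segments for _ in range(n_variants)]
    for seg, set_name in enumerate(positions):
        intensities = list(impairment_sets[set_name])
        rng.shuffle(intensities)          # random variant assignment
        for variant in range(n_variants):
            layout[variant][seg] = (set_name, intensities[variant])
    return layout

# Example: hypothetical freeze and mute QuIDs, each with N=4 intensities
# (a zero intensity can act as a hidden reference).
sets = {"freeze": [0.0, 0.5, 2.0, 5.0], "mute": [0.0, 0.2, 1.0, 3.0]}
layout = build_test_layout(sets, n_segments=6, n_variants=4, seed=1)
```

Each row of `layout` would then correspond to one variant (one viewing session), with aligned segments containing different intensities from the same impairment set.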
In the experiments that we have performed using this methodology, N = 4 variants were selected. Each variant was assigned to a different viewing session. This way, the evaluators view each content asset only once, which is in line with the intention of simulating home viewing conditions as much as possible. Each impairment set was introduced at least three times in each of the sequences under evaluation, in order to obtain a relevant number of measures, as well as to take the context and content effects into account. A detailed description of the assessment tests can be found in Appendix A.2.
[Figure 3.7: Structure of the content streams in the subjective assessment test session]

3.5 QoE enablers

The previous architecture can be enhanced by adding QoE enablers: specific features that help simplify the management of QoE in the service. In this section we describe three of them: a headend architecture proposal to integrate synchronized metadata, an intelligent way to build RTP packets, and a network element to manage QoE between the PoP and the user terminal. Although they are described briefly in this section, all of them have evolved to the point of becoming parts of commercial products which are either currently available or on the roadmap to become available in the upcoming months. In the rest of this work, it will be assumed that those elements exist or, at least, could be added to the deployment when required. This is not a very restrictive assumption: since the quality monitoring strategies described are targeted at Service Providers that want to improve the QoE offered by their service, it is reasonable to suppose that they may decide to include QoE-enhancing elements such as the ones described.

3.5.1 Headend metadata architecture

Introducing metadata synchronized with the media stream can be necessary for a number of purposes [9]. The most obvious one is the possibility to use Reduced Reference quality
algorithms. But, in general, any preprocessing that can be done in the headend will be more efficient there than anywhere further down the chain. A first QoE enabler is thus a headend architecture which allows the introduction and synchronization of metadata, enhancing the interoperability of different headend elements, as we propose in [87]. The proposed architecture is modular, based upon a combination of components fulfilling different functions. To avoid duplicating the same functions several times, it is necessary that the results of each processing step can be reused by the following components. All the information generated in each step, which can be considered as metadata (data about the data), is propagated along the chain so that it can be used in further processing components [99], as shown in Figure 3.8. The key point here is that each of the components is homogeneous in terms of interfacing, so that both the management of the headend and the integration of new elements are simplified. All the processing components share a common time reference and exchange a set of metadata describing the content. This architecture resembles that of software multimedia frameworks, such as GStreamer or DirectShow, but applied to a distributed scenario. All the meta-information available at each point of the processing chain is kept untouched at the output, and the additional information generated in that processing step is added as well. This way, all the stream analysis done in the different processing components can be reused by the others by simply not breaking the metadata chain. That would allow, for example, having access to the Access Unit structure of the stream even after it has been scrambled (if the scrambling module does not filter out AU metadata).

[Figure 3.8: Schematic representation of a modular headend]
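The metadata-chain behavior just described can be sketched as follows. This is a minimal illustration under our own naming; the class and field names are assumptions, not the actual headend interfaces:

```python
# Sketch of the metadata chain: every metadata item carries a Transport Time
# Stamp (TTS) so that parallel processing branches can be re-synchronized,
# and each component appends its own items without dropping upstream ones.
from dataclasses import dataclass, field

@dataclass
class MetadataItem:
    tts: int        # Transport Time Stamp, in units of the original clock
    source: str     # component that generated this item
    payload: dict

@dataclass
class MediaBlock:
    tts: int                 # time stamp where the block starts
    data: bytes
    metadata: list = field(default_factory=list)

def process(block: MediaBlock, component: str, analysis: dict) -> MediaBlock:
    """A well-behaved component: it keeps upstream metadata untouched and
    appends its own items, time-stamped with the block's TTS."""
    block.metadata.append(MetadataItem(block.tts, component, analysis))
    return block

blk = MediaBlock(tts=90000, data=b"...")
blk = process(blk, "au_analyzer", {"au_type": "IDR"})
blk = process(blk, "scrambler", {"scrambled": True})
# The AU metadata is still available after scrambling, because the
# scrambler did not break the metadata chain.
```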
Synchronization is possible by keeping a reference to the clock of the original stream: all the components shall keep the same time base, so that parallel processing can be resynchronized afterwards. Each block of video data shall include a Transport Time Stamp (TTS) as part of its metadata, representing the time stamp (using the original clock
reference) where the block starts. Metadata shall always have a TTS reference, and can be sent in-band or out-of-band. In this context, in-band means that, together with the multimedia stream, the metadata form a valid MPEG-2 TS or ISO file. In that case, however, they shall be correctly signaled as private data within the resulting stream, so that they do not disturb the multimedia decoding. In both cases they will follow the same interface strategy (e.g. push or pull) as the multimedia stream. A video headend which implements this architecture offers several advantages to the whole network: the possibility to pre-process the content to help the video analysis in the edge servers (see section 3.5.3); a global synchronization of all the elements with respect to the internal video clock signal, which would help synchronize the different QuIDs throughout the network; or support for a metadata stream that can be used to implement Reduced-Reference metrics, among others.

3.5.2 Intelligent Packet Rewrapper

When MPEG-2 Transport Stream is used as the multiplexing layer, as is the case in IPTV, video and audio data are separated at TS packet level, but mixed again when several TS packets are encapsulated in RTP. However, the behavior of the network elements with respect to QoE could be improved if each RTP packet contained homogeneous information allowing, for instance, simple prioritization schemes. This can be achieved by a specialized headend element: the intelligent packet rewrapper (or, in short, the rewrapper) [96]. The rewrapper reorders MPEG-2 Transport Stream packets and encapsulates them in RTP packets in such a way that TS packets of the same type (e.g. video elementary stream packets) are grouped together in the same RTP packet. Besides, frame boundaries are split between different RTP packets, so that an RTP packet never contains information from two different frames.
The elementary streams are further analyzed (deep packet inspection) in order to include, in an RTP header extension, some information useful for different applications running further down the network. The RTP header generated by the rewrapper, shown in Figure 3.9, follows the syntax of RFC 5285 [101], defining, for ID=1, an extension element with the following semantics:

- B. Frame Begin (set to 1 if a video frame starts in the payload of this RTP packet).
- E. Frame End (set to 1 if a video frame finishes in the payload of this RTP packet).
- ST. Stream Type (0=video, 1=audio, 2=data, 3=reserved).
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|        Profile=0xbede         |           length=1            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ID=1  | len=2 |B|E|ST |S|r|PRI| FPRI|r|   a   |   b   |   c   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3.9: RTP header and extension introduced by the rewrapper processing

- S, r. Reserved.
- PRI, FPRI. Priority (coarse and fine). The values used for H.264 video are described in Table 3.1.
- a: time from the current packet to the transmission time of the RTP packet containing the last piece of the current frame (in 5-millisecond units).
- b: time from the end of the current frame to the end of the next frame of the same priority within the current GOP (in 20-millisecond units).
- c: time from the end of the next frame with the same priority in this GOP to the end of the GOP (in 20-millisecond units).

Table 3.1: Coarse (PRI) and fine (FPRI) priorities used in the RTP header extension when the main video stream is H.264

  PRI   FPRI   Decimal   Meaning
   3     7       31      Video IDR frame
   3     0       24      Audio
   2     0       16      Reference frame
   1     7       15      Non-reference frame
   0     4        4      Rest of cases (data, secondary videos, etc.)
   0     1        1      Padding packets

The use of the rewrapper enables the development of value-added services over an IPTV deployment. It allows the different types of elements in the coded stream to be easily identified and isolated at RTP level.
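As a small illustration of Table 3.1 (the helper names below are ours, not part of the thesis), the decimal priority carried in the extension is simply the 2-bit coarse priority concatenated with the 3-bit fine priority:

```python
# Hypothetical helper reproducing the "Decimal" column of Table 3.1:
# decimal = (PRI << 3) | FPRI, i.e. PRI in the upper two bits and FPRI
# in the lower three.
PRIORITY_MEANING = {
    (3, 7): "Video IDR frame",
    (3, 0): "Audio",
    (2, 0): "Reference frame",
    (1, 7): "Non-reference frame",
    (0, 4): "Rest of cases (data, secondary videos, etc.)",
    (0, 1): "Padding packets",
}

def decimal_priority(pri: int, fpri: int) -> int:
    """Combine coarse (2-bit) and fine (3-bit) priorities."""
    return (pri << 3) | fpri

# e.g. an IDR frame maps to 31, audio to 24, padding to 1.
```

A network element could then prioritize (or drop) packets, e.g. padding first and non-reference frames next, by reading only this value from the RTP header, without parsing the TS payload.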
Since there is one RTP packet per UDP packet, and one UDP packet per IP packet, this means that each IP packet contains only one class of information, which can be identified simply by reading its RTP header. Therefore it is possible to implement video-related functionality
which does not require rebuilding the IP packets or, in other words, with a level of performance and scalability similar to that achieved by IP routers. In fact, some of the applications proposed in this thesis use a rewrapper as part of their implementation, such as the Unequal Error Protection algorithm or the Fast Channel Change solution (see sections 5.2 and 5.5 respectively).

3.5.3 Edge Servers for IPTV and OTT

Even though enhancing the capabilities of the video headend can help improve the final Quality of Experience, the QoE is, in the end, something experienced by individual users. Besides, the parts of the distribution chain with the highest error probability are the access network and the home domain. For these reasons, to enhance the perceived QoE it is necessary to add new systems at PoP level, together with possible modifications on the user terminal side. For the case of IPTV streams, a video management server (generically called a video services appliance, VSA) can be located in the PoP, in parallel with the main video traffic flow and receiving it as well. User terminals can establish individual sessions with the VSA in order to request QoE-related services such as:

- Retransmission of lost RTP packets, as defined by RFC 4585 [77].
- Unicast delivery of personalized streams (for instance, to accelerate the channel change time, as proposed in RFC 6285 [103]).
- Collection of quality measures, such as those proposed in RFC 3611 [22].

These kinds of services have been standardized as part of DVB-IPTV (the first two as DVB-RET and DVB-FCC respectively), and the most relevant IPTV technology providers offer solutions around this concept. A similar concept can be applied to CDNs.
The proposed idea is to dynamically modify the properties of the stream at the edge, in specialized servers (Tailoring Servers) placed at the same level as the CDN delivery servers, or even closer to the end terminals [108]. These Tailoring Servers have access to the CDN, retrieve from it all available media (segments and manifests), and are able to process them and offer the same media with some added-value functionality to the end devices, using the same HAS API. From the end user perspective, the network will be providing a much better quality of service, and the service provider (the one operating the Tailoring Servers) will achieve it without any modification in the headend or additional load on
the CDN core network. In this case, it is essential that the Tailoring Server operates in a fully transparent way, since in OTT environments it is frequent that the service provider does not have any control over the user terminal. Both concepts will be referred to generically as Edge Servers from this point onwards.

3.6 Conclusions

In this chapter we have proposed a reference architecture for multimedia delivery services over IP. This reference architecture provides a homogeneous view of the most relevant scenarios: IPTV and OTT, both for live and on-demand content, and includes quality monitoring points as well. We have also introduced the QuEM quality monitoring framework, which is applicable to almost the same scenarios as PLR/PLP systems, but offers a more detailed analysis. Specifically, the basis of this approach has been set up with the objective of developing a system that is able to characterize what is happening in the network, and is simple enough to implement, integrate, and deploy in real video delivery systems. Moreover, the proposed approach and the metrics that compose the monitoring architecture have been validated by means of subjective assessment tests, analyzing the effects of several transmission impairments on the QoE of the observers, and the relations among those degradations. Those studies are also useful to calibrate the measurement elements of the architecture to obtain reliable estimations of the impact of the distortions on the perceived quality. Finally, we have described some enablers: network elements that facilitate the implementation of QoE functionality in the delivery network. In the next chapters we will fill this framework with information. In chapter 4 we will describe metrics to monitor the most relevant impairments using rich transport data.
Those metrics will comply with the requirements established in the QuEM framework, and will be validated using the proposed subjective assessment methodology. In chapter 5 we will use the knowledge obtained in the generation of metrics to propose new value-added applications in the context of multimedia QoE. The implementation of these applications will also rely on the presence of some of the QoE enablers that we have described in this chapter.
Chapter 4

Quality Impairment Detectors

4.1 Introduction

This chapter describes the different metrics proposed for the monitoring of the Quality of Experience in multimedia delivery services. Using the terminology defined in the previous chapter, they are the Quality Impairment Detector (QuID) blocks needed to build a Qualitative Experience Monitoring (QuEM) system. Each section is devoted to a different QuID. The general approach to study each of the QuIDs has been similar. First, the impairment to be detected is defined and characterized. This implies identifying the cause of the impairment, proposing a technique to monitor it, and understanding its impact on the perceived quality. Afterwards, this analysis is completed with specific subjective quality assessment tests, which use the methodology described in section 3.4. A common set of subjective tests has been used for this purpose; they are described in Appendix A.2. In specific sections of this chapter, additional subjective and objective experiments have been used. They are described in different sections of Appendix A, and referenced in the appropriate sections of the text when needed. The metrics described in this chapter are the ones proposed in section 3.3.3. They cover the most relevant defects described by users [7], and each of them fulfills the requirements imposed by the QuEM architecture in sections 3.3.1 and 3.3.2: scalability, significance, and repeatability. Section 4.2 describes a video Packet Loss Effect Prediction (PLEP) metric. It predicts how the loss of a video packet can lead to freezing or macroblocking effects, by analyzing the propagation of the error within the video frame, as well as to adjacent frames throughout the inter-frame prediction reference chain. The results of this metric are
analyzed objectively and subjectively using the test sequences described in Appendix A.4 and the test set described in Appendix A.2, respectively. Section 4.3 follows the same structure as 4.2, but analyzing the effect of the loss of audio packets. Section 4.4 analyzes the media coding quality, in two differentiated subsections. First, in 4.4.1, the video artifacts produced by compression are analyzed with a specific set of subjective quality assessment tests, described in Appendix A.3. The results of these tests are used to explore the possibility of using RR or NR metrics to monitor video coding artifacts in the context of a QuEM framework. Afterwards, in 4.4.2, a different approach is presented to analyze the effect of quality drops produced by strong variations in the effective channel bandwidth (a typical OTT scenario with HTTP Adaptive Streaming). In this case, two main alternatives are compared: switching to a version with a different bitrate, and dropping frames. Their effects are analyzed with the subjective assessment tests of Appendix A.2. Section 4.5 describes outage events, understood as the total loss of video, audio, or both signals for a period of time. Techniques to measure outages are described, as well as their subjective effect according to the tests described in Appendix A.2. Section 4.6 analyzes latency-related issues: lag and channel change time. This type of analysis is sometimes excluded from the discussion of QoE, but it has been included in this chapter for two reasons. On the one hand, lag and channel change are relevant only in some specific scenarios; but these scenarios may have a great impact on the overall perceived quality of the multimedia delivery service (live delivery of sports events is the most typical case). On the other hand, there is a design trade-off between latency and other quality factors, such as video coding quality or packet loss probability.
Acknowledging this relationship is relevant when considering the overall QoE of our services. Section 4.7 describes the relationship, in terms of perceived quality, between the different impairments that have been studied. Finally, section 4.8 summarizes the main conclusions of the whole chapter.

4.2 Video Packet Loss Effect Prediction (PLEP) model

Packet losses are the main cause of errors in multimedia services and, more specifically, in IPTV. The loss of video packets can cause macroblocking and image freezing, which account for about half of the QoE impairments reported by customers in a field deployment [7].
For this reason, packet losses are a relevant QoS issue to monitor in IPTV networks. In existing deployments, it is typical to use pure QoS metrics, such as the Media Delivery Index (MDI), to monitor them [67]. On the one hand, MDI is a useful metric to estimate QoE because, in the long term and for random losses, the packet loss rate correlates reasonably well with the Mean Square Error, which, in this scenario, can be a reasonably good predictor of the perceived quality [40, 95]. On the other hand, in most cases there is simply no other metric applicable in the context of real-time service monitoring, either because they need information that is not available at the monitoring point, or because they are too costly to apply. However, other approaches are possible. If we have access to rich transport data, such as the information provided by the rewrapper described in section 3.5.2, we can take into account the structure of the video stream to improve the prediction of the effect of losing certain packets, instead of applying the sort of flat rate used by MDI. Another important fact to consider is that the network QoS provided for IPTV should be good enough to make it difficult to assume that the MSE correlates with the PLR. Besides, QoS-management decisions are taken in the short term (some dozens of packets or so; otherwise the delay is too high). Therefore we need to analyze the short-term effect of isolated packet losses in order to improve quality management in IPTV. We will focus in this section on the analysis of the packet-loss effect in the short term. We will build a model to predict the effect of packet losses on video, based on the information available at transport level in a real deployment. In particular, we will analyze the transport information (RTP and MPEG-2 Transport Stream), as well as the network abstraction layer of H.264: NAL Unit Headers and Slice Headers.
We will not analyze deeper than the Slice Header in any case: firstly because, when any scrambling is applied (even partial), some parts of the slice are always unavailable; and secondly because it would require decoding the entropy coding (CABAC), which would increase the computational cost of the monitoring tool excessively for practical applications, thus violating the scalability requirements imposed on QuIDs (see section 3.3.1). The analysis has been performed in the context of an IPTV service, where the transport unit (the minimum block that can get lost) is the RTP packet. It has also been assumed that, to simplify the network processing, the MPEG-2 TS has been packaged into RTP using a rewrapper. However, the model can be easily extended to other multimedia delivery scenarios, just by adjusting the size and nature of the packets that can get lost.
4.2.1 Description of the model

We need a packet loss effect prediction (PLEP) model which is based on the analysis of rich transport data, provides meaningful information to the operator using it, and is as general as possible. To comply with these requirements, we propose a metric which estimates the fraction of each frame that is affected by artifacts resulting from packet losses. Therefore a frame with a degradation value of, e.g., 50 percent will have half of its surface affected by artifacts. The main advantage of this approach is that it focuses on the structure of the error in the image, i.e. on the most direct impact of the packet loss, which is the absence of correct information in parts of the image for some time. This metric does not depend on the statistics of the image itself, and it is therefore usable in environments where the picture intensity values are not available. Besides, it provides an easy qualitative description of the impairment, which makes it suitable for our QuEM architecture. Our solution encompasses two steps which are applied iteratively: we first compute the degradation value in one frame, and then estimate the error propagation to the neighboring pictures. The model only makes use of information available in the slice header of H.264 slices: the slice type and reference picture buffer indexes. No data is obtained from either the original (unimpaired) stream or from the decoded video.

4.2.1.1 Degradation Value

The first component of the impairment is the error generated in the frame where the packet loss occurs. In an IPTV environment, video frames will typically be transported over several transport packets (typically RTP). For that reason, a loss in one of the packets does not necessarily mean the loss of the whole frame.
In fact, the effect of the loss of a single packet within the frame can be estimated by considering two well-known properties of H.264 coding:

- The information of the macroblocks within a picture is transported in scan order (unless flexible macroblock ordering is used, which is not the case in the Main and High profiles).
- When there is an error in a NAL Unit, decoders usually cannot resynchronize video decoding until the beginning of the next NAL Unit.

We measure the degradation value on a scale of 0 to 100, where 0 represents that an image has been received without errors, and 100 indicates that it is completely impaired.
The metric will estimate the percentage of the image which is affected by the error:

E_0 = 100\% \cdot \frac{1}{N} \sum_{S=0}^{N-1} f\!\left(1 - \frac{L(S)}{L_{avg}}\right) \qquad (4.1)

where S represents each slice, N is the number of slices per frame, and L(S) represents the length in bytes of the fragment of the slice which is not lost. It is assumed that the rest of the slice is lost from the moment an error is produced. Similarly, as macroblock information is sequentially introduced in a slice (i.e., one macroblock after another), it is reasonable to assume that the larger the portion of the slice affected, the larger the region of the image impaired. L_avg is an estimation of the length of the slice if there had been no losses. Depending on the size of the loss and the video transport layer, it may be estimated with higher or lower accuracy. In any case, it is always possible to assume that the slice byte size will be similar to a sliding average of the sizes of the K previous slices of the same type (I, P, B) and position in the image. f is a function which must be monotonically increasing. We select the identity function saturated to the value 1, so that no slice can contribute more than 100 percent of its size. The equation assumes that all slices in the image have the same size (in pixels). Otherwise, values should be weighted by their relative surface in the whole image.

4.2.1.2 Error Propagation

Most of the pictures in an H.264 video sequence use other pictures as references in their decoding process. This technique, needed to encode the stream with a reasonably low bit rate, causes errors in one frame to propagate to all frames which reference it. If those frames, in turn, serve as references for others, the impairment propagates further along the reference chain. Therefore a picture with no losses can also show artifacts which have been propagated from its reference frames.
We compute this propagated error E_p from the value E of each of the frames which are used as a reference by the picture under study. Given a picture x, depending on a set of references {y_k}, the propagated error is:

E_p = \gamma \sum_k \omega_k E(y_k) \qquad (4.2)

where E(y_k) is the error level in the frame y_k. This error can be the result of a packet loss in that frame (E_0) or a propagated error itself (E_p), and the values of \omega_k and \gamma model how to estimate the fraction of affected pixels in the predicted picture.
The constant γ represents the attenuation of the error effect along the reference chain. In a typical H.264 coding scenario, instantaneous decoding refresh (IDR) pictures are introduced periodically (every few seconds, at most). Therefore, regardless of the value of γ, the error will only propagate until the next IDR frame in the worst case (which is γ = 1). However, this assumption is not stable for long IDR repeat periods, or for cases where I frames are not IDRs and there can be references beyond GOP boundaries 1. For this reason γ < 1 is recommended (for instance, γ = 0.9). Factors ω_k represent the weight of the different pictures which contribute as reference to the picture under study. We use a model where higher error levels have a higher weight, as they propagate in a more perceptible way:

\omega_k = \frac{E(y_k)}{\sum_k E(y_k)} \qquad (4.3)

This allows us to write:

E_p = \gamma \, \frac{\sum_k E^2(y_k)}{\sum_k E(y_k)} \qquad (4.4)

4.2.1.3 Error Composition

Finally, it is possible that one picture suffers a packet loss while its reference pictures have errors as well. In this situation, both error contributions must be combined. In the best scenario, both contributions will overlap and the total error level will be the maximum:

E_{bc} = \max\{E_0, E_p\} \qquad (4.5)

In the worst case, contributions will be independent and the error will be the sum:

E_{wc} = \min\{E_0 + E_p, 100\%\} \qquad (4.6)

Therefore we assume that the error will be somewhere in between:

E = \alpha E_{bc} + (1 - \alpha) E_{wc}, \quad 0 \le \alpha \le 1 \qquad (4.7)

1 In H.264, it is possible to define an I frame which is not an IDR. As an I frame, it can be decoded without needing other frames for prediction. However, unlike an IDR, it allows subsequent frames in decoding order to use previous frames as references. This can slightly improve the obtained video quality for a given bitrate constraint, and it is frequently used by IPTV video encoders.
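The complete model (equations 4.1 to 4.7) can be sketched as follows. This is a minimal illustration in Python, not a reference implementation: the default γ = 0.9 follows the recommendation above, while α = 0.5 is an assumed midpoint that the text does not fix, and clamping f at 0 for overestimated slice lengths is our own choice.

```python
def degradation_value(received, expected):
    """E0 (eq. 4.1): percentage of the frame affected by a packet loss.

    received -- L(S): bytes received for each slice S of the frame
    expected -- L_avg: estimated loss-free length of each slice
    f is the identity saturated at 1 (and clamped at 0, our assumption,
    in case the estimate L_avg is smaller than the received length).
    """
    f = lambda x: max(0.0, min(x, 1.0))
    n = len(received)
    return 100.0 * sum(f(1.0 - l / la) for l, la in zip(received, expected)) / n

def propagated_error(ref_errors, gamma=0.9):
    """Ep (eq. 4.4): gamma * sum(E^2(y_k)) / sum(E(y_k)) over the references."""
    s = sum(ref_errors)
    return gamma * sum(e * e for e in ref_errors) / s if s else 0.0

def total_error(e0, ep, alpha=0.5):
    """E (eqs. 4.5-4.7): combination of direct and propagated errors."""
    e_bc = max(e0, ep)            # best case: error regions overlap
    e_wc = min(e0 + ep, 100.0)    # worst case: independent regions
    return alpha * e_bc + (1.0 - alpha) * e_wc
```

For example, a frame that lost half of its only slice has E_0 = 50; a frame referencing it and one undamaged frame receives E_p = 0.9 · 50² / 50 = 45.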
4.2.2 Experiment

To test the proposed PLEP model, it is necessary to design an experiment which focuses on the effect of where packet losses occur. Instead of generating random error patterns, we have designed an experiment where packet losses are set deterministically and where it is possible to observe the effect of changing the loss position in the stream. The sequences are pre-processed with the rewrapper described in section 3.5.2. This way, each video frame is transported in an integer number of RTP packets, and so is each GOP. With the aim of analyzing the effect of different packet losses within the stream structure, a single GOP is selected in which to generate packet losses. We apply the following steps, with K taking values from 0 to the number of RTP packets in the selected GOP:

1. In the selected GOP, the RTP packet in position K is dropped.
2. The PLEP metric is obtained for the resulting sequence.
3. The video sequence is then decoded using the open-source decoder FFmpeg 2 (with default error concealment) and stored on disk without compression.
4. The obtained sequence is compared with the original one (without errors) using MSE.

This experiment was conducted with the sequences A, B, C, and D described in Appendix A.4. The following discussion considers sequence A, as it is the one with the longest GOP (100 frames), and therefore the one producing the most test cases. However, the same process was repeated with sequences B, C, and D, with similar results; a comparison will be provided later. Sequence A is encoded in H.264 over MPEG-2 TS at 2.8 Mb/s (with the video stream at 2.3 Mb/s). Each frame has only one slice, which is the most typical situation for commercially available video encoders for IPTV. The GOP structure is a hierarchical IBBBP structure, such as the one discussed in section 2.4.1 and depicted in Figure 2.4 on page 31. All I frames are IDR pictures.
The sequence is encapsulated in RTP using the rewrapper. Each GOP occupies about 1000 RTP packets and, in particular, the GOP under study had exactly 958 packets. Therefore 958 different impaired sequences (each one with the error in a different position within the GOP) were generated, decoded, and processed. 2 http://www.ffmpeg.org
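Step 4 of the procedure above can be illustrated with a minimal MSE computation. This is a pure-Python sketch under our own conventions: frames are assumed to be lists of rows of luma values, and the per-sequence figure is the MSE aggregated along all the frames, as used later in the quantitative analysis.

```python
def frame_mse(a, b):
    """Mean square error between two equally sized luma frames
    (each frame is a list of rows of pixel values)."""
    sq = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return sq / (len(a) * len(a[0]))

def aggregated_mse(decoded, reference):
    """MSE aggregated (summed) along all frames of the impaired sequence."""
    return sum(frame_mse(d, r) for d, r in zip(decoded, reference))
```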
It is worth noting that, due to the rewrapping process, all the losses affected only one video frame, although the visual impairment will affect more than one frame due to error propagation in the prediction process.

4.2.2.1 Qualitative Analysis

Before analyzing the results of the measurements, it is interesting to examine the video itself, to better understand what happens when one packet is lost. We mainly consider the results in sequence A since, having a longer GOP, it produces more data in the one-GOP analysis. Figure 4.1 is used as an example for this analysis, although the ideas described in this section are applicable to the majority of sequences generated for the study, including both the other sequences generated from sequence A and those from sequences B, C, and D. Figures 4.1(a), (c), and (e) show an IDR frame where RTP packets #11, #28, and #29 have been lost, respectively. Figures 4.1(b), (d), and (f) show the next P frame in display order for the same sequences. Figures 4.1(g) and (h) show the original unimpaired IDR and P frames, respectively. In all the measurements, the frame with the highest MSE is the one where the loss occurred. However, this is not the frame where artifacts are most visible. This is illustrated in Figure 4.1(a): in the frame where the packet is lost, the MSE is high but the visibility of the error is low. However, four frames later, in Figure 4.1(b), once the error has been propagated by inter-frame prediction, the error has higher visibility even with a lower MSE than before. This effect is also produced from Figure 4.1(c) to Figure 4.1(d), and from Figure 4.1(e) to Figure 4.1(f). This is due to error concealment: when part of the frame is lost, it is simply replaced by the most recent reference frame available. The visual effect of this replacement is a frame with a spatial discontinuity (part of the frame is the correct one, part is the previous one), which is not very disturbing visually.
However, when the frame is used for prediction, the predicted macroblocks will have errors, and the macroblocking effect will appear. It is also important to consider that in real situations, error concealment techniques may not be as predictable as desired. For example, Figure 4.1(c) and Figure 4.1(e) show the same frame for two different sequences: Figure 4.1(c) with the loss of packet #28, and Figure 4.1(e) with the loss of packet #29, with both packets affecting the same frame. In the first instance, the FFmpeg concealment manages to reuse the last reference frame to replace the missing portion of the frame, and as a result the error has low visibility. In the second instance, even though the lost packet, #29, is directly adjacent to the previous one, #28, the FFmpeg concealment has failed and the
Figure 4.1: Video sequence used for qualitative analysis. The left column shows an IDR frame where one RTP packet is lost, while the right column shows the following P frame. The red line in each frame indicates the position in the image of the first macroblock which got lost. The RTP packets lost are #11 (a,b), #28 (c,d), and #29 (e,f); (g,h) show the original unimpaired IDR and P frames.
error has high visibility. These kinds of concealment failures can occur in real decoders, either software ones or consumer set-top boxes. Therefore one must be careful when making a priori assumptions about how impaired frames appear on the user's screen. We also found that the earlier an error is produced within an encoded frame, the higher the fraction of the decoded frame that is affected. The lines in Figure 4.1 show the position of the error within the frame. The frames in Figure 4.1(a) and Figure 4.1(b), where the error was produced in packet #11, have more visible and extensive artifacts than the frames in Figure 4.1(c) and Figure 4.1(d), where the error was produced in packet #28. The underlying idea is that once a fragment of the H.264 slice is lost, the rest of the slice becomes useless to the decoder, which discards it completely, since it is not trivial to resynchronize CABAC decoding. As there is only one slice per frame, when an error occurs within a video frame, the rest of the frame is lost. Finally, we should mention a specific case of interest: when the first video packet in the GOP is lost, the whole I frame gets lost as well, including any GOP-level header (such as the Sequence Parameter Set, Picture Parameter Set, or SEI messages). As a result, and with the decoder implementation that we have used, the whole GOP becomes impossible to decode and the image freezes until the next I frame arrives.

4.2.2.2 Quantitative Results

We have computed the Packet Loss Effect Prediction (PLEP) values for each one of the sequences under study. As IDRs are used at GOP boundaries, sensitivity to γ is not so critical; we have taken the default value of γ = 0.9. Since there is only one packet loss, there is no error composition situation, and therefore the value of α is not relevant. We selected MSE (aggregated along all the impaired frames) as the method to measure the impact of the error in the sequence.
Although there are other methods which correlate better to subjective MOS, such as structural similarity index (SSIM) [116], MSE has been shown to perform better when predicting packet loss visibility [93]. Figure 4.2 shows the MSE for all the sequences (varying the loss position) generated from sequence A. The grey line shows the aggregated MSE of the whole sequence while the green line shows the MSE only of the frame where the loss was produced. The red line shows the MSE obtained by just substituting the frame where the error occurs with the previous available reference frame (i.e., the concealment error at frame level). And the blue line shows the result of the PLEP metric. Figure 4.3 shows the same values for a reduced number of the sequences.
Figure 4.2: Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study, varying the loss position: aggregated MSE (grey), MSE at the frame where the loss occurs (green), concealment error (red), and PLEP (blue).

Figure 4.3: Detail of Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study.

It can be seen that the error has a higher impact at higher levels of the reference hierarchy: when the error occurs in an I frame or a P frame, it generates a higher MSE than when it occurs in a (reference) B frame, which in turn is higher than the error generated by losses in (non-reference) b frames. This is mainly due to the fact that errors in reference frames propagate, and therefore affect more frames. Error concealment also produces more visible results in I frames and P frames, since the previous available reference frame is further back in time (four frames distant) than in the case of B frames (two frames away) or b frames (one frame away). The analysis also indicates that the error decreases with the position of the loss within the frame. This is due to the fact that losing a single packet of a slice means losing the rest of the slice completely, since the decoder is unable to resynchronize the CABAC decoding. Of course this decrease is not completely monotonic, as the reconstruction of the damaged frame is not always perfect. Sometimes concealment techniques fail or are just less effective than expected.
Figure 4.4: Mean Square Error versus Packet Loss Effect Prediction metric (log scale) and linear fit between them (R² = 0.67)

Figure 4.5: Percentage of macroblocks which are different between both images versus Packet Loss Effect Prediction metric, both in log scale, as well as linear fit (R² = 0.85)

There is also some tendency for the error to decrease along the GOP, because the earlier the error occurs in the GOP, the greater the number of frames it affects. However, due to the fact that there are some scene changes within the GOP, this effect is not very strong. Figure 4.2 shows that the PLEP model follows the shape of the error, and in Figure 4.4 both magnitudes are directly compared. There is a reasonably good correlation (R² = 0.67) between both values, which suggests that the PLEP model is robust enough to predict packet loss effects. It is worth noting that in this scenario, unlike in other experiments reported in the literature, there is no correlation between the MSE (which is variable) and the PLR (which is constant and equal to 1/958 for all the sequences). This
Figure 4.6: Percentage of macroblocks which are different between both images (blue) and Packet Loss Effect Prediction metric (red) for all sequences under study, varying the loss position

means that our PLEP model is able to explain the effect of packet losses reasonably well, even in situations where the packet loss ratio does not provide any valuable information. The results obtained from the other sequences are qualitatively quite similar. Table 4.1 shows the R² between PLEP and MSE for all video sequences.

Table 4.1: Coefficient of determination (R²) of MSE vs PLEP fit for several video sequences.

Sequence   A      B      C      D
GOP size   100    24     24     12
R²         0.67   0.63   0.74   0.91

With this in mind, it is also important to consider that the PLEP method is more robust to failures in error concealment than MSE estimation methods. Indeed, error concealment is quite unpredictable in a real case, and not easy to fit into a predefined model, as we illustrated previously in Figure 4.3, where the MSE in the frame where the loss occurred is shown in green, while the MSE in dashed black depicts an instance where an error occurred and the damaged frame was replaced by the previous available frame. This suggests that, even knowing the MSE produced by replacing one frame by its predecessor, there is no specific pattern which can easily model the MSE of a specific frame when the loss occurs in the middle of a GOP. However, predicting the part of the frame affected is much more stable, since it does not depend on the error concealment techniques used. Thus, a metric defined as the ratio (in percent) of macroblocks which differ on a pixel-to-pixel basis between both images provides a better approximation than MSE to the concept of "part of the frame affected".
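As an illustration, the macroblock-difference ratio just defined can be computed as follows. This is a pure-Python sketch under our own conventions: frames are assumed to be lists of rows of luma values, and the 16×16 block size is the usual H.264 macroblock size.

```python
def diff_macroblock_ratio(a, b, mb=16):
    """Percentage of mb x mb macroblocks that differ on a pixel-to-pixel
    basis between two luma frames; a proxy for 'part of the frame affected'
    that is insensitive to the concealment method used by the decoder."""
    h, w = len(a), len(a[0])
    total = diff = 0
    for y in range(0, h, mb):
        for x in range(0, w, mb):
            total += 1
            differs = any(
                a[yy][xx] != b[yy][xx]
                for yy in range(y, min(y + mb, h))
                for xx in range(x, min(x + mb, w))
            )
            diff += differs  # bool counts as 0/1
    return 100.0 * diff / total
```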
Figures 4.5 and 4.6 show that the PLEP model is indeed a good predictor of the ratio of macroblocks which differ between the original and the impaired images. The correlation with
the PLEP model increases so that, for the sequence under study, R² = 0.85.

4.2.3 Subjective analysis

The next step in the analysis is to discover whether the prediction of the fraction of the image affected by errors can be effectively used to model impairments in the perceived Quality of Experience. To this end, the subjective assessment test session described in Appendix A.2 included some impairments based on the PLEP model. The impairments were caused under the same conditions as in the previously discussed objective experiment: the video is sent through a rewrapper process, only one RTP packet is lost, and the loss contains data from only one frame. The position of the RTP loss within the GOP structure is varied to produce different effects. The different impairment conditions are described in Table 4.2. We consider the simplified version with γ = 1, so that we assume that the error propagates until the end of the GOP. Impairment N is the hidden reference (no packet loss). Impairment E1 loses the first packet of the first non-reference B frame in the GOP; thus the error does not propagate to other frames. Impairments E2, E3, and E4 lose one packet in the first reference P frame of the GOP, so that the error propagates along the GOP. To vary the resulting effect, the packet is lost at the beginning (E4), in the middle (E3), or at the end (E2) of the frame, which varies the packet loss effect according to what has been discussed previously. Finally, impairment V1 has a special effect: it loses the very first packet of the GOP (in the I frame). In this case, as the most relevant headers of the GOP are lost, the resulting effect is not macroblocking, but a freeze of the image for the duration of the GOP (until another I frame is received).
Table 4.2: PLEP impairments analyzed in the subjective assessment tests

Code  Frame    % frame affected  Description
N     n/a      n/a               Hidden reference
E1    B(nr)    100               Loss of one frame
E2    P(ref)   25                25% of frame affected during one GOP
E3    P(ref)   50                50% of frame affected during one GOP
E4    P(ref)   95                95% of frame affected during one GOP
V1    I(ref)   100               Video freeze during one GOP

The results obtained from the tests are shown in Figure 4.7, differentiating the three content sources under study: an action movie (Avatar, in blue), a football match (yellow), and a documentary (red). The global average value is also displayed, together with its confidence intervals. The description of the sources, as well as more details about the tests, can be found in Appendix A.2.
Figure 4.7: Results of the subjective assessment for Video Loss impairments

As a first conclusion, the results suggest that the PLEP metric is applicable to the characterization of video packet losses, as they confirm that the position of the error within the GOP structure significantly affects the quality perceived by the end user. This conclusion has to be taken with some degree of caution, because there is variability in the results, especially from one content source to another. However, it is clear that the PLEP model outperforms simple packet loss rate metrics. More specifically, losing one single frame (without propagation) or a small part of a frame (even with propagation along the GOP) is, in general, either not perceived or perceived as not annoying, and statistically indistinguishable from the hidden reference. Beyond that, the bigger the fraction of the frame affected, the higher the severity. Finally, freezing the video for the whole GOP has a more severe impact on quality than the macroblocking effect. The errors E2, E3, E4, and V1 belong to the same impairment set, as defined in section 3.4.3. That means that they are evaluated in parallel over the same segments. Figure 4.8 shows the detailed results for each of the segments of this impairment set for the three sequences under study. Most of the segments follow the same pattern as the general results, and it is also possible to see that the inter-segment variability for the same error event is lower than the intra-segment variability for the different errors applied to each segment. The segments labeled as Doc-10 and Avatar-20, from the documentary and the movie sequences, respectively, may be considered outliers; they share the property of having a low MOS for the least perceptible error (E2).
This suggests that in both cases the delivery quality of the unimpaired version of those segments might be lower than expected, and that a characterization of the properties of the video in the headend could lead to an RR metric improving the performance of PLEP.
Figure 4.8: Detailed results for each of the individual segments for Video Loss

4.3 Audio packet loss effect

When packets containing audio information get lost, there is also an impairment in the perceived quality: either a temporary interruption in the reproduced sound or a distortion (glitch or noisy sound). Audio distortions are less frequent than video artifacts or, at least, less frequently perceived by end users [7]. However, they are still common enough for any monitoring system to consider them, especially if we take into account that they are as unacceptable as video artifacts [57]. It is also relevant to consider that, as audio streams normally have a very stable bitrate, they normally require a relatively small buffer in the receiver (around 50 ms, compared to the 500-2000 ms typical for video streams). As a consequence, audio packets are much more sensitive to delay variation than video packets, and high values of jitter will easily increase the losses in the audio stream. In this section we will study the effects of those packet losses, both objectively and subjectively. We will take as the baseline scenario an IPTV channel over MPEG-2 Transport Stream. To simplify the analysis, we will assume that the stream has been encapsulated into RTP packets by a rewrapper. This way, a packet loss at RTP level will impair either the audio or the video signal, but not both simultaneously.

4.3.1 Objective analysis

Audio coding formats used in multimedia systems normally use block coding: they take a time window of the audio waveform, divide it into spectrum sub-bands, and code each sub-band according to spectral masking criteria (obtained from a psychophysical model
Figure 4.9: Waveform of a lossy audio file

of the human hearing system), aimed at maximizing the perceived quality for a target bit rate. There is some overlap between adjacent windows, but no long-term prediction or complex prediction structures. All the audio codecs considered in our IPTV and OTT scenarios (MPEG-1 layer 2, MPEG-4 AAC, and Dolby AC3) have this kind of design. Consequently, the impairment produced by the loss of one audio RTP packet will affect only the time window to which the packet belongs. Therefore we can hypothesize that the impairment will be a silence whose length is proportional to the length of the packet loss burst. This, which is exact for uncompressed audio (PCM), is a sufficiently good approximation for compressed audio as well. Figure 4.9 shows the waveform obtained after decoding an audio file with losses. It is the audio stream of the sequence A described in Appendix A.4, encoded in MPEG-1 layer 2 at 192 kb/s. 70 TS-packet losses (around 550 ms) were introduced every 1000 TS packets (7.8 s). Silence intervals are clearly visible in the waveform, and their duration is effectively around 0.5 seconds each. In some cases, signal peaks can be observed next to the silence intervals. They are perceived as glitches or audio discontinuities, and they may also appear in the event of packet losses. In principle, and for the sake of the analysis of the losses, we will consider only the silences as the base impairment, since they cannot be distinguished from the glitches just by the analysis of the lost packets. Another 2-minute cut of the aforementioned sequence A (with MPEG-1 layer 2 audio at 192 kb/s) has been taken to introduce audio packet losses, varying the number of
Figure 4.10: Effect of audio losses: measured vs. expected (R² = 0.98)

consecutive packets lost (the loss burst). The expected duration of each TS packet loss would be:

\frac{188 \cdot 8}{192000} = 7.8 \cdot 10^{-3} \ \text{s} \qquad (4.8)

Afterwards, the resulting stream has been decoded by a software decoder and the length of the silences has been determined. The result is shown in Figure 4.10. Blue points show the length of the silence events (Y axis) as a function of the number of packet losses (expressed in seconds, X axis). Most of the silence events have a length which is similar to the expected one (although there is a small fraction of outliers, which represent the short silence periods just after or before a glitch effect). Once the outliers have been removed, fitting the data to a regression line (in red) allows us to assess the validity of the approach. The line has a slope of 1.05 and an ordinate at the origin of 0.18, with a determination coefficient R² = 0.98. With this data, the following conclusions can be obtained:

- The model is sufficiently good to be used as a QuID.
- The slope is approximately 1, so that we can say that the perceptible duration of the loss is quite similar to the length of the packet loss.
- Each packet loss, even the shortest ones, generates a silence of at least 180 ms.

This last figure of 180 ms should be taken with appropriate caution. Firstly, because the offline software decoder is not very robust under packet loss events (and, in fact, extracting the silence length has required a careful analysis of the recovered data).
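Equation 4.8 generalizes directly to loss bursts: the expected silence length is the number of lost TS packets times the duration covered by one packet at the audio bitrate. A small sketch follows; the slope and offset defaults are the fitted values reported above (1.05 and 0.18 s), and treating the offset as an additive minimum-silence term is our own reading of the regression, not something the text states.

```python
TS_PACKET_BYTES = 188  # fixed MPEG-2 TS packet size

def ts_packet_duration(bitrate_bps):
    """Audio play-out time covered by one TS packet (eq. 4.8)."""
    return TS_PACKET_BYTES * 8 / bitrate_bps

def expected_silence(n_lost, bitrate_bps, slope=1.05, offset=0.18):
    """Silence length (s) predicted by the regression line of Figure 4.10
    for a burst of n_lost consecutive TS packet losses."""
    return slope * n_lost * ts_packet_duration(bitrate_bps) + offset
```

At 192 kb/s, one TS packet lasts about 7.8 ms, and the 70-packet bursts of the waveform experiment cover about 0.55 s of audio.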
Figure 4.11: Short-length audio losses

And secondly, because the number of samples used in the model is not high enough to be sure about the quantitative significance of this result. However, from a qualitative point of view, it seems clear that there is a minimum silence length that occurs in most of the cases. In Figure 4.11, which shows the values of Figure 4.10 for the smallest loss durations, it can be seen that the four columns of blue points on the left side (which refer to losses of 1, 3, 5, and 7 TS packets) indistinctly generate silences between 150 and 300 ms. Without considering the quantitative significance of those figures, it is possible to say that, qualitatively, the effect of losing one single TS packet is similar to the effect of a short burst of packet losses. A side effect of this conclusion is that the encapsulation of 7 audio TS packets into a single audio RTP packet by the rewrapper does not significantly increase the effect of the minimum audio loss, which would be 1 TS packet (plus probably some video packets as well) for non-rewrapped streams, and is 7 TS packets (without additional loss of video) for rewrapped streams.

4.3.2 Subjective analysis

The subjective assessment test session described in Appendix A.2 also included impairments produced by the loss of audio packets. As the transmitted packets have been processed by the rewrapper, every 7 audio MPEG-2 TS packets are grouped into one audio RTP packet. As described before, the coded audio bitstream does not have complex prediction structures (as video does), and the effect of a packet loss is basically related to its duration. Therefore the different types of audio losses differ only in the number
Figure 4.12: Results of the subjective assessment for Audio Loss impairments

of packets that have been lost (it is similar to a packet loss rate / packet loss pattern metric, but with the important distinction that we know that the lost packets are audio packets). The RTP audio packet loss patterns used in the subjective assessment tests are described in Table 4.3.

Table 4.3: Audio losses analyzed in the subjective assessment tests.

Code  Duration of the burst
N     0 (hidden reference)
A1    1 packet
A2    500 ms
A3    2 s
A4    6 s

The results obtained from the tests are shown in Figure 4.12, differentiating the three content sources under study: the action movie in blue, the football match in yellow, and the documentary in red. The global average value is also displayed, together with its confidence intervals. The results are stable and consistent with other research on the topic [79]: the longer the loss, the higher the severity. Isolated one-packet audio losses seem to be admissible under real viewing conditions. The acceptability of short bursts (up to 500 ms) depends strongly on the selected content: it is acceptable in the soundtrack of a movie, but not in the narration of a sports match. Long bursts (2 seconds or longer) are unacceptable in all cases. Since A1, A2, A3, and A4 belong to the same impairment set, it is possible to compare their results segment by segment. This is shown in Figure 4.13, which confirms the conclusions mentioned before. In this case, since the audio structure is simpler and the original audio quality is, as in a real deployment, high enough for the purpose, the probability of having clear outliers is low.
Figure 4.13: Detailed results for each of the individual segments for Audio Loss

4.4 Coding quality and rate forced drops

Another relevant element for the Quality of Experience is the multimedia quality obtained at the end of the encoding process: the coding quality. The coding quality is important for the overall QoE, but it is not so critical for a monitoring system, for two main reasons. On the one hand, its impairments are less frequently reported by the users than the ones produced by packet losses [7]. On the other hand, the target coding quality is something that must be controlled in the design phase of the service, when selecting the encoder which is going to be used and the conditions, especially the bitrate, under which it is going to work. Once in runtime, there should be fewer unexpected events in the encoder than in, for instance, the access network. When considering coding quality, we will focus only on the video stream, and not on the audio. The reason is that, while both of them contribute similarly to the final multimedia quality [90], video requires much more bandwidth than audio [6] and, as a result, video encoders will be working under more stressful conditions. In this section we will study the coding quality from two different perspectives. First we will explore the options to control or estimate the coding quality using simple RR or NR metrics (which have a chance of being applicable in the QuEM framework). Then we will analyze different scenarios of strong quality drops, such as the ones produced when the stream jumps from one bitrate to a much lower (or higher) one. This scenario is typical of OTT services using HTTP adaptive streaming.
4.4.1 Analysis of feature-based RR/NR metrics as estimators of video coding quality

The first step in the analysis of video quality has been to find out whether it is possible to estimate the perceived coding quality (or, at least, some salient impairments) from elementary Reduced-Reference or No-Reference metrics computed in the pixel domain. The main motivation is to build a quality estimator that can be of use in scenarios similar to the ones proposed in our QuEM architecture. The approach taken has been to analyze several NR and RR metrics from the literature. Those metrics have been applied to video at contribution quality (high-quality recordings of television content, obtained directly from the television studios in uncompressed D1 format), and to the result of encoding it with commercial H.264 video encoders at different bit rates. The obtained values have been compared with the outputs of subjective assessment tests performed for the same video segments. The work described in this subsection 4.4.1 was done during the first steps of the research activity of this thesis [81], before the development of the QuEM strategy and its associated subjective assessment test methodology, described in chapter 3. Therefore, the subjective tests referred to in this subsection, described in Appendix A.3, are different from the QuEM-based subjective tests used in the rest of this chapter, described in Appendix A.2. The experiments, main results, and conclusions are described next.

4.4.1.1 Metrics under study

The aim of the experiment is to determine whether it is possible to detect degradations in the video quality by using lightweight Reduced Reference (RR) and No Reference (NR) metrics. Most RR metrics are based on comparing some image features before and after the impairment process. These features usually model the amount of movement and spatial detail.
NR metrics are normally based on the detection of known artifacts produced in the coding process, such as blocking or blurring [121]. To compare different possible strategies homogeneously, we will extract the same features from the original and the processed (impaired) sequences, and measure their relative degradation, averaged along time:

M = mean_t { (X[F_orig(t)] - X[F_proc(t)]) / X[F_orig(t)] }    (4.9)

Four groups of features have been compared: spatial information (obtained from several RR metrics), temporal information (from RR metrics as well), blocking (from NR metrics), and blurring (from NR metrics as well).
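Equation (4.9) translates directly into code. The sketch below is illustrative only; the function name and the example feature values are hypothetical:

```python
# Sketch of equation (4.9): relative degradation of a feature X[F(t)]
# between the original and processed sequences, averaged over time.

def relative_degradation(feature_orig, feature_proc):
    """M = mean_t (X[F_orig(t)] - X[F_proc(t)]) / X[F_orig(t)]  (eq. 4.9)."""
    ratios = [(o - p) / o
              for o, p in zip(feature_orig, feature_proc)
              if o != 0]                       # skip frames with a zero feature
    return sum(ratios) / len(ratios)

# Example: a processed sequence whose feature drops by 10% on every frame
M = relative_degradation([10.0, 20.0, 30.0], [9.0, 18.0, 27.0])
# M is 0.1 (a 10% average feature loss)
```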
Different feature extractors have been considered for spatial information (or texture):

- Le Callet et al. [63] propose a pair of complementary measures based on the intensity and direction of borders, which they call GHV and GHVP. They compute GHV as the average magnitude of the intensity gradient over all the pixels in which this gradient is horizontal or vertical, and GHVP as the average magnitude of the intensity gradient over all the pixels in which this gradient is neither horizontal nor vertical.
- The BTFR metric in ITU-T J.144 [45] includes a texture measure computed as the zero-crossing rate of the horizontal gradient.
- Saha and Vemuri [98] propose using the average value of the absolute vertical and horizontal differences, which they call IAM 4.
- Webster et al. [117] propose a Spatial Information feature (SI), defined as the standard deviation of the Sobel-filtered frame.

When characterizing temporal variations, there is less diversity of metrics in the literature. We will consider Le Callet's Temporal Information (TI), defined as the energy of the difference image along time [63]. Regarding the blocking effect, we have studied three of the most frequently cited metrics:

- GBIM (Generalized Block-edge Impairment Metric) [122]. It measures the differences between both sides of the block boundaries (which must present a regular and well-known pattern).
- Vlachos' metric [110], which uses a method based on the spectral analysis of the pixels at block boundaries.
- Wang's metric [115]. It analyzes the Fourier transform of the image to detect energy peaks at the multiples of the inverse of the block period.

The other relevant artifact to study is blurring. Most blurring metrics are based on the measurement of the average width of borders in the image [21]. We have selected the implementation proposed by Marziliano et al. [68].
Finally, we have also included two basic measures: global brightness (mean value of intensity) and global contrast (standard deviation of intensity).
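Two of the features above can be sketched as follows. This is a simplified, assumed implementation: a plain finite-difference gradient stands in for the full Sobel filter of Webster's SI, and TI is taken as the mean energy of the frame difference.

```python
import numpy as np

def spatial_information(frame):
    """SI-like feature: std of the gradient magnitude of one luminance frame.
    (Simplification: np.gradient instead of a true Sobel filter.)"""
    gy, gx = np.gradient(frame.astype(float))
    return float(np.std(np.hypot(gx, gy)))

def temporal_information(prev_frame, frame):
    """TI-like feature: mean energy of the difference between two frames."""
    diff = frame.astype(float) - prev_frame.astype(float)
    return float(np.mean(diff ** 2))
```

Applied per frame, these values would feed equation (4.9) as the feature series X[F(t)].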
4.4.1.2 Evaluation

Reference data to benchmark these video quality metrics were obtained from the results of a study of subjective quality for real-time H.264 encoders, described in Appendix A.3. The same sequences used for the subjective tests were provided as input to all the feature extractors described in the previous subsection. Reduced-Reference metrics were obtained for all the features by applying equation (4.9). Besides, the blocking and blurring metrics were also considered as individual No-Reference metrics, just by computing their average along each test sequence. The output of all the metrics, both RR and NR, was compared with the MOS obtained from the subjective tests, to check whether any of the features under study could be a reasonable predictor of MOS variations. Pearson correlation and Spearman rank correlation (with p-test) were computed. Results are shown in Table 4.4.

Table 4.4: Comparison of NR/RR results with subjective tests

Metric        | Pearson | Spearman | p-test
Brightness    | 0.41    | 0.44     | OK
Contrast      | 0.61    | 0.64     | OK
BTFR Texture  | 0.16    | 0.22     | NO
SI            | 0.54    | 0.59     | OK
GHV           | 0.43    | 0.43     | OK
GHVP          | 0.42    | 0.44     | OK
IAM 4         | 0.51    | 0.52     | OK
TI            | 0.70    | 0.68     | OK
GBIM RR       | 0.16    | 0.17     | NO
Vlachos RR    | 0.21    | 0.24     | NO
Marziliano RR | 0.29    | 0.26     | OK
GBIM NR       | 0.17    | 0.19     | NO
Vlachos NR    | 0.31    | 0.24     | NO
Wang NR       | -       | -        | -
Marziliano NR | 0.32    | 0.33     | OK

As a rule, correlations are quite low (below 0.7). The best results are obtained from the TI metric and from the contrast difference. No-Reference metrics obtained quite poor results: no blocking metric provides a statistically meaningful result, and Marziliano's blur metric achieves only a very slight correlation. Wang's method did not even provide any stable result. Furthermore, even when used as the basis for a Reduced-Reference metric, the results were not any better.
Figure 4.14 shows the value pairs used to obtain the results mentioned above, i.e., the output of the metrics versus the subjective MOS, for the two metrics which
provided the best results: TI and contrast degradation. Each shape represents one of the sequences (cross and triangle for the football match; square and circle for the music show). Each color represents one of the encoders. The regression line is also shown.

Figure 4.14: Results for (a) TI loss vs MOS and (b) Contrast loss vs MOS

Two observations can be made:

- Low values of the metrics are closer to low MOS than high values of the metrics are to high MOS. This means that a bad result in one metric probably implies bad quality, but a high result does not imply a high MOS.
- Results may vary significantly depending on the content. This is especially clear for contrast loss and one of the music sequences (the square-shaped markers in the figure). For both metrics, the music sequences show better correlation than the football ones.
We can conclude that simple feature-based RR/NR metrics are hardly usable in the context of continuous video quality monitoring, under the conditions that we have established for the design of monitoring systems. Some of the results that have been reported by other authors regarding the performance of those methods on, for instance, JPEG-encoded images need to be taken cautiously when applied to H.264 video. Although the use of more complex metrics could improve the results, their performance could hardly reach the capabilities of FR metrics [112]. For those reasons, we will not consider NR/RR pixel-based metrics for inclusion in the QuEM architecture. The source video quality should, as a general rule, be sufficiently high by network design. And the monitoring of the variations of that reference quality along time would be better performed at the video headend, where FR metrics can be applied in dedicated equipment.

4.4.2 Managing coding quality drops

The previous discussion suggested that direct monitoring of the video quality in the access link (between the PoP and the user terminal) is difficult to achieve with cost-effective RR/NR metrics. However, it is also true that, in the typical monitoring scenario, the video coding quality is selected by the Service Provider in the network design, and any drop in the original quality can be monitored at the headend under better conditions. There is still room in the access network for drops in the coding quality, if variations in the video quality can be introduced in the Edge Server. The most typical example is HTTP Adaptive Streaming, where the user terminal may choose to download different versions of the same video segment (at different bitrates), depending on the instantaneous quality of service provided by the access network.
The same principle might also be applied to IPTV: once the video is encoded in parallel at several bitrates, an IPTV Edge Server could force a downgrade of the video quality to overcome a network congestion event. In some production or delivery environments it may be necessary to obtain a lower-bitrate version of a media stream when there is no possibility of performing a full transcoding process, either because of lack of resources or because of timing issues. In these cases the denting concept may be helpful: the idea is to dynamically remove frames from the video elementary stream while keeping the original audio (or audios). The result is a stream with a lower frame rate, but also a lower output bitrate, which keeps the rest of the variant characteristics (codec, resolution, etc.) unaltered.
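The denting idea can be illustrated with a toy model. The data model below (frames tagged only with a reference flag) is a deliberate simplification, and the function is hypothetical; a real denting component needs the full decoding hierarchy of the codec, as discussed next.

```python
# Toy sketch of denting (hypothetical data model): drop video frames that
# no other frame depends on, until the kept/total ratio reaches a target.

def dent(frames, target_ratio):
    """frames: list of (frame_id, is_reference). Keeps all reference frames
    and drops non-reference frames, from the back of the list forward,
    until at most target_ratio * len(frames) frames remain (if possible)."""
    keep = list(frames)
    budget = int(len(frames) * target_ratio)
    for f in reversed(frames):
        if len(keep) <= budget:
            break
        if not f[1]:                 # only non-reference frames are droppable
            keep.remove(f)
    return keep

# Example GOP alternating reference / non-reference frames:
gop = [(0, True), (1, False), (2, True), (3, False),
       (4, True), (5, False), (6, True), (7, False)]
halved = dent(gop, 0.5)             # keeps frame ids 0, 2, 4, 6
```

This mimics the F1 case of Table 4.5 (half of the frames dropped) when every other frame is a non-reference frame.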
The denting component performs exactly this process: based on a configuration parameter (target bitrate, target frame rate, remove all B frames, etc.), it sends to its output the same media received at the input, except for some video frames which are carefully selected to meet the desired requirements. Due to the encoding properties of most codecs, video frames can usually not be removed arbitrarily, because the absence of a frame may prevent other frames which remain in the stream from being properly decoded. For this reason the denting component requires deep information about the video frames: not only about their boundaries, but also about their decoding hierarchy. Padding packets can also be removed by the denting component, but non-audio/video streams (application data, teletext, subtitles, etc.) should only be removed if explicitly allowed by configuration parameters. Denting can be used in the Edge Server to dynamically generate lower-bitrate versions of the main stream, either to create or enhance HAS structures or to reduce the bitrate of a unicast transmission between the Edge Server and the user terminal. In particular, denting has been successfully used in Fast Channel Change solutions to increase the apparent bitrate of the unicast session without effectively allocating a higher bitrate for it. These quality drops (bitrate reduction and denting) have also been included in the subjective assessment tests described in Appendix A.2. Table 4.5 shows the different values considered. R1 and R2 are reductions of 50% and 75% of the bit rate. F1 and F2 are reductions of 50% and 75% of the frame rate. The effective bitrate reduction of F1 and F2 depends on how the video was encoded; however, typical values for the content assets under study are about 25-30% of bitrate reduction for F1, and 35-50% for F2.

Table 4.5: Quality drops analyzed in the subjective assessment tests.
Code | Type    | Description
N    | n/a     | Hidden reference
R1   | Bitrate | Bitrate reduced to 1/2
R2   | Bitrate | Bitrate reduced to 1/4
F1   | Denting | 1/2 of all frames dropped
F2   | Denting | 3/4 of all frames dropped

The results of the subjective assessment tests for these impairments are shown in Figure 4.15. The following conclusions can be drawn:

- The results of the hidden reference are high. This means that coding defects introduced at the reference quality are perceived with much less severity than other
defects (forced quality drops in this case, but also other defects considered in other sections).

Figure 4.15: Results of the subjective assessment for Rate Drop impairments

Figure 4.16: Detailed results for each of the individual segments for Rate Drop

- The impact of this kind of impairment depends on the source content, at least up to a point. In general, the quality variations between bitrates are relevant (and between frame rates as well). However, their specific impact differs from one asset to another, and from one segment to another. This can be better seen in the comparison within the impairment set formed by R1, R2, F1, and F2, shown in Figure 4.16.
- Denting has a higher impact on the perceived quality than the drop of coding quality, which was expected: in the latter case the quality-rate trade-off has been optimized by the encoder, while in the former case it has not.
4.5 Outages

All the issues considered so far are caused by isolated errors. Now we will analyze a different case: the outage, i.e., a loss of service for a period of time. The relevance of this case is that sometimes users report errors described as a complete stop in the video playout, which in some cases is only recoverable after a reboot of the user terminal [7]. Any system that monitors the global QoE must be aware of this kind of error since, although such errors are less frequent than the ones caused by isolated packet losses, they have a higher impact on the final quality. Outages can be roughly classified into two categories: short and long. By long outages we understand those caused by service unavailability for several minutes or hours. The most typical example is a software problem in the user terminal, but there could be more severe situations (such as a critical failure in the delivery equipment, for instance). Short outages are the ones caused by a brief stop (a few seconds) in the video service delivery, typically caused by discontinuities in the service, an issue in the delivery equipment followed by a recovery of the service from a redundant one, etc. Long outages should always be monitored and managed by the Service Provider and are, in fact, outside the scope of our work: the impact of having no service at all is not easy to measure on the same scale that we are considering. We will focus exclusively on the detection and impact measurement of short outages.

4.5.1 Detection of outages

The outage can happen in the contribution (detectable in the headend), in the core network (detectable in the PoP), or in the access network (detectable in the HNED, possibly with the help of the Edge Server). If it happens in the contribution, it should be monitored by continuity monitors in the headend. An effective way to do it is using the VODA algorithm proposed by Reibman and Wilkins [94].
This algorithm detects an outage when there is a sudden and simultaneous drop in three different factors: average brightness (i.e., the picture changes abruptly to black), spatial information, and audio signal power. The three factors must also remain low for some seconds for the outage to be detected. If the outage happens in the network, it will be an extreme case of packet loss with high impact (loss of several seconds' worth of video and/or audio), which can normally be detected with packet loss effect estimators (and probably with simpler packet loss detectors).
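The detection rule can be sketched as follows. This is our simplified reading of a VODA-style detector, not the actual algorithm by Reibman and Wilkins; the thresholds, sample structure, and function name are hypothetical.

```python
# Sketch of a VODA-style rule: an outage is flagged when average brightness,
# spatial information, and audio power all stay below their thresholds for
# at least `min_samples` consecutive samples.

def detect_outages(samples, thresholds, min_samples):
    """samples: list of (brightness, spatial_info, audio_power) per time unit.
    Returns the indices where confirmed outage conditions start."""
    b_th, s_th, a_th = thresholds
    run, outages = 0, []
    for i, (b, s, a) in enumerate(samples):
        run = run + 1 if (b < b_th and s < s_th and a < a_th) else 0
        if run == min_samples:                # confirm once per low-level run
            outages.append(i - min_samples + 1)
    return outages

# Example: two normal samples, three "black and silent" samples, then recovery
trace = [(100, 50, 1.0), (100, 50, 1.0),
         (1, 0.1, 0.001), (1, 0.1, 0.001), (1, 0.1, 0.001),
         (100, 50, 1.0)]
starts = detect_outages(trace, thresholds=(10, 1, 0.01), min_samples=3)
# starts == [2]: the outage condition is confirmed at sample index 2
```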
Additionally, short outages in the contribution can be detected in the coded stream (with less accuracy, but it can be enough for our purposes) by monitoring the global video and audio signal level:

- For video, with the analysis of the frame size and structure (coded long freezes have almost zero-byte P and B frames).
- For audio, either from the analysis of the energy values of each sub-band (exact) or from the analysis of the dynamic range compression parameters, when available.

4.5.2 Subjective impact of outages

Some outage events have also been included in the subjective assessment tests described in Appendix A.2. Table 4.6 shows the different values considered: stops of 2 and 6 seconds for audio, video, or both. The results are shown in Figure 4.17, with the comparison of the impairment set A4, V3, AV in Figure 4.18. In general, and for the same sequence, the longer the outage, the worse the perceived quality. However, the specific impact and the relative importance of video and audio are quite dependent on the specific content.

Table 4.6: Outage events analyzed in the subjective assessment tests

Code | Outage Duration | Elementary Stream Affected
A3   | 2 s             | Audio
A4   | 6 s             | Audio
V2   | 2 s             | Video
V3   | 6 s             | Video
AV   | 6 s             | Both

4.6 Latency

A final QoE factor to consider is latency. Latency issues are usually disregarded in many QoE analyses, because they are only perceived in very specific scenarios. However, the study of latency is relevant for two different, but related, reasons. On the one hand, as discussed in section 2.4, the scenarios where latency is relevant (mainly live sport events) are important enough to make latency a meaningful QoE element. On the other hand, there is a trade-off between latency and other QoE components that makes it difficult to have low-latency video delivery services without compromising the
perceived quality. These trade-offs will be summarized at the end of this section, in subsection 4.6.3.

Figure 4.17: Results of the subjective assessment for Outage impairments

Figure 4.18: Detailed results for each of the individual segments for Outage

Latency will be studied from two different perspectives. First we will analyze the end-to-end latency, or lag. Afterwards we will analyze the channel change time, which is also a latency-related scenario with a significant contribution to the overall QoE.

4.6.1 Lag

End-to-end latency, or lag, refers to the delay observed by the user in the displayed video with respect to the moment when the event is being recorded. With such a definition, lag only makes sense for live content streams: those which are being watched while they are being captured. Although it is possible to provide an equivalent definition for on-demand content, the reality is that lag is only a QoE factor in live events. And even
Figure 4.19: Simplified transmission chain for real-time video

for live television channels, there are very few cases where the lag is really an issue, where receiving the video with a few additional seconds of delay makes any difference. However, the few cases where lag is important are also important for service providers and users, the most typical ones being sport matches. For those reasons, keeping the lag under control is very relevant for IPTV service providers [70]. Lag must be constant end-to-end, to avoid losing video continuity. As such, any protocol layer that imposes timing constraints must also have a constant end-to-end delay, because it cannot assume that the delay variation will be absorbed by the upper layers. Figure 4.19 illustrates this. Points A and Z represent the decoded video stream. In the absence of errors, the video reproduced at A and Z should be identical, and therefore the delay between those points, T_AZ, must be constant. A first component of this delay is introduced by the encoding process, and it is due to two main causes. On the one hand, the coding of video using frame prediction normally implies that the frames are encoded and transmitted in a different order than they are displayed, to allow the use of bidirectional prediction. On the other hand, this kind of compression also causes the size, in bytes, of the different frames to differ strongly from one another and along time. This generates local peaks of bitrate that normally need to be smoothed before transmission, introducing additional delay, to comply with bandwidth restrictions. Those two sub-components of the video delay are introduced by the encoder and depend only on coding decisions (and therefore can be known at point B). MPEG-2 Transport Stream allows the encoder to manage the coding delay end-to-end.
The transport stream includes a clock signal called PCR (program clock reference), which indicates the rate at which the coded stream is produced at point B and, therefore, the rate at which it is expected to be delivered at point Y. The stream also includes, for each video, audio, or data access unit, its presentation time stamp (PTS) in the same clock base. The total encoder-decoder delay T_AB + T_YZ is constant. This way, if the
network is able to keep a constant delay T_BY, the end-to-end delay T_AZ will be constant as expected. However, the real delay in the transmission network, T_CX, which is an IP network, cannot be guaranteed to be constant. Therefore, network elements are introduced to control the network ingestion and the reception in the user terminal, to flatten network jitter and also to manage error correction protocols. The delay introduced by the server-side elements and by the decoder (T_AB + T_BC + T_YZ) is established by the network design and known a priori by the service provider. The network buffer T_XY depends on the implementation of the user terminal, and it is normally set individually for each video session. Once it is established, however, the end-to-end network delay T_BY will remain constant for the whole video session, and therefore each video packet whose jitter exceeds this buffer will arrive too late to be sent to the decoder, and will be counted as a network loss. Therefore, when establishing the length of the network buffer, there is a trade-off between end-to-end delay and packet loss probability. Additionally, if the video multiplexing protocol is the ISO File Format, it does not include transport timing information equivalent to the PCR. In that case, the user terminal must set the value of T_YZ arbitrarily for the first decoded video frame, and assume that it will be enough to present the following ones on time from then onwards. As a result, buffer sizes are normally overdimensioned, to avoid buffer emptying events, at the cost of suffering a higher lag. This overdimensioning is also generally applied to the network buffer T_XY, especially in the case of Over The Top services (where network capacity variations can be very strong).
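The buffer/loss trade-off can be illustrated with a small Monte Carlo experiment. The exponential jitter model and all parameter values below are assumptions chosen purely for illustration, not measurements from a real network:

```python
import random

# Illustrative Monte Carlo of the buffer/loss trade-off: a packet whose
# network jitter exceeds the buffer (T_XY) arrives too late and counts as
# a loss. Jitter is modeled (arbitrarily) as exponentially distributed.

def late_packet_ratio(buffer_ms, mean_jitter_ms, n=100_000, seed=1):
    rng = random.Random(seed)
    late = sum(1 for _ in range(n)
               if rng.expovariate(1.0 / mean_jitter_ms) > buffer_ms)
    return late / n

# A larger buffer trades extra lag for fewer late packets: with 20 ms mean
# jitter, a 50 ms buffer loses far fewer packets than a 10 ms buffer.
r_small = late_packet_ratio(10, 20)   # roughly exp(-0.5), about 0.61
r_large = late_packet_ratio(50, 20)   # roughly exp(-2.5), about 0.08
```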
4.6.2 Channel change time

We will define the channel change time (or zapping time) as the time between the moment when the end user presses a channel change key on their user terminal and the instant when the new channel (video and audio) starts playing on their screen. This time can be divided into the following components:

T_CC = T_term + T_net + T_buf + T_vid    (4.10)

where
- T_term is the delay between the user key stroke and the moment when the user terminal effectively requests the new video stream from the network (by issuing an IGMP join, an HTTP request, or whatever is suitable for each scenario).
- T_net is the delay between the moment the new video is requested and the moment the first byte of the new stream arrives back at the user terminal.
- T_buf is the time needed to fill the network buffer in the user terminal.
- T_vid is the time needed to present the first video frame at the decoder output.

From the analysis done in the previous subsection, it follows immediately that T_buf is equal to T_XY as depicted in Figure 4.19. T_vid abstracts all the delay introduced by the video stream on the decoding side. It can be inferred just by analyzing the video stream, and depends only on the encoding process. It can be modeled as:

T_vid = T_RAP + T_dec    (4.11)

T_RAP is the time that the decoder has to wait to reach a Random Access Point (RAP). A RAP is a specific point in the video stream where it is possible to start decoding it, which corresponds approximately to the beginning of the intra-coded frames. Therefore T_RAP can be easily modeled with a random variable of uniform distribution between 0 and the intra-frame period T_I, whose mean value is T_I/2. T_dec is the interval between the RAP and the moment when the frame can be presented to the user. It is equal to the stationary delay of the video decoder, i.e., T_YZ in Figure 4.19. It represents the decoding part of the end-to-end coding delay for each of the media components (audio, video, and data) and, in MPEG-2 Transport Stream, it is:

T_dec = PTS(first access unit) - PCR(first packet)    (4.12)

It is relevant to consider that the value of T_dec will, in general, be different for each of the elementary streams.
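Equations (4.10) to (4.12) can be combined into a simple expected-value model of the channel change time. The function and all parameter values below are hypothetical, chosen only to make the decomposition concrete:

```python
# Sketch of the channel change model of equations (4.10)-(4.12):
# T_RAP is replaced by its mean T_I / 2, and T_dec is derived from the
# PTS/PCR pair at the first random access point (same clock base, seconds).

def expected_channel_change(t_term, t_net, t_buf, intra_period,
                            pts_first, pcr_first):
    t_rap = intra_period / 2.0               # mean of uniform(0, T_I)
    t_dec = pts_first - pcr_first            # eq. (4.12)
    return t_term + t_net + t_buf + t_rap + t_dec   # eqs. (4.10) and (4.11)

# Hypothetical values: 0.1 s terminal, 0.05 s network, 0.5 s buffer,
# 2 s intra period, and PTS/PCR values giving a 1.2 s decoder delay
t_cc = expected_channel_change(0.1, 0.05, 0.5, 2.0, 10.0, 8.8)
# t_cc is roughly 2.85 s
```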
Even though the end-to-end delay (T_AB + T_YZ) is constant and equal for all of them, it is usual that the part of the delay left to the decoder (T_dec = T_YZ) varies strongly from one component to another. A typical example taken from a commercial encoder is shown in Figure 4.20: the audio T_dec is constant and below 100 ms, while the video T_dec varies along time between approximately 800 and 1400 ms. With these elements, it is possible to build a QuID which monitors the channel change time in the network in the following way:
- T_term and T_buf depend on the user terminal implementation, which is the only point where they are available. However, they are normally quite stable, so they can be known a priori and introduced into the model as parameters.
- T_net, T_RAP, and T_dec can be easily monitored in the network.

Figure 4.20: Decoding delay (PTS-PCR) in milliseconds for the video (blue) and audio (red) components of an MPEG-2 Transport Stream, and its variation along time (in seconds)

It is worth noting that most of the components of the channel change time are frequently sacrificed in the process of enhancing the available end-to-end quality of experience. In particular, T_buf, as mentioned in the previous subsection, represents the buffering required to absorb network jitter and to correct packet losses. T_RAP and T_dec also provide a higher degree of freedom to the encoder to distribute its bit budget flexibly, according to the coding complexity of the images, therefore optimizing the coding quality. Reducing any of those parameters, which would reduce the channel change time by the same amount, could also have undesired side effects on the global quality. Unlike the case of the global lag, channel change time is a QoE element which is relevant for many IPTV deployments, and for all the video channels. However, the mapping of channel change events onto a global scale of severities (or qualities) is very dependent on the expectations of the service provider, and there is no standard way to do it. Table 4.7 shows an example that could be used as a reference, based on informal laboratory experimentation.
Table 4.7: Example channel change time ranges and their mapping to QoE

Time (s)  | QoE description
< 0.4     | Very Fast
0.4 - 1   | Fast
1 - 2.5   | Normal
2.5 - 5   | Slow
> 5       | Very Slow

4.6.3 Latency trade-offs

Since lag and channel change time can be considered relevant elements for the global QoE, we may ask whether it is possible to improve them by reducing some of their components. The answer is that it is possible, but at a cost: degrading other QoE factors. We will show here why. Regarding end-to-end lag, the encoding latency T_AB + T_YZ is used to provide a buffer for rate-control operations in the video encoder. Reducing this buffer will impair the video quality that the encoder is able to produce at its output. The network processing delay T_BC + T_XY provides a buffer to protect the decoder against network jitter. This buffer can be reduced, but only at the cost of increasing the packet loss probability. The channel change components T_buf and T_dec are T_XY and T_YZ respectively, so the same considerations apply. T_RAP is also a design parameter of the encoder: if it is reduced, the frequency of I frames will increase, which will degrade the video quality (if the bitrate is kept, as is assumed). The rest of the delay components are limited by the technology itself, and are normally outside the control of the service provider: T_CX and T_net depend on the performance of the communication network, and T_term depends on the performance of the user terminal software. As a conclusion: there is a strong relationship between the latency and the video quality components of the QoE. Therefore latency should always be controlled in any multimedia delivery service. Even in the cases where lag or channel change time are not important by themselves, managing the latency parameters is always a good strategy. Service providers should be aware that reducing those latency elements in the future will always come at the cost of putting the video quality at risk.
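The example mapping of Table 4.7 can be expressed as a small lookup function. The threshold values are taken directly from the table; the handling of exact boundary values is our own choice:

```python
# Lookup for the example mapping of Table 4.7: channel change time (s)
# to a qualitative QoE description.

def channel_change_qoe(t_cc_seconds):
    for limit, label in [(0.4, "Very Fast"), (1.0, "Fast"),
                         (2.5, "Normal"), (5.0, "Slow")]:
        if t_cc_seconds < limit:
            return label
    return "Very Slow"
```

In a QuEM deployment, these labels (or numeric severities derived from them) would be the output of the channel change QuID.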
4.7 Mapping to Severity

One of the most complex problems to solve when managing a QoE monitoring system in a large multimedia service deployment is the comparison and aggregation of a large quantity of data. In our QuEM model, this problem is addressed by referring all the measures to a common severity scale and synchronizing the measurement windows, so that one single severity value is produced for each monitoring period at each monitoring point (section 3.3.2). These values should then be processed statistically according to the needs of the monitoring service with the particularity that, even though the aggregated value only has meaning in terms of average severity, each of the individual impairment events is easily traceable to a qualitative description of what happened. Each QuEM system should be calibrated according to the specific needs of the service provider, and should also be adjusted during the operation phase with the feedback retrieved from the field. The best way to calibrate the different QuID elements to produce severity values is by performing subjective quality assessment tests such as the ones described in section 3.4. This way, each service provider can feed the tests with the type of content and impairments that best fit their deployment, keeping the Severity Transfer Functions completely under their control. The results of the subjective assessments described in Appendix A.2 can provide an initial approach to the problem, which should be used as a starting point for real deployments of a QuEM infrastructure. Figure 4.21 shows a summary of the different results that have been discussed along this chapter. The most relevant conclusions for each type of error have already been discussed, but we can summarize them as follows:

- Video packet losses can have very different effects depending on the part of the stream which is lost.
We have proposed a simple but effective metric (PLEP) to model this variability.

- Audio packet losses depend mostly on the packet loss rate and pattern. We have also modeled this in our proposal for the audio loss QuID.
- Bitrate is a reasonably good proxy to monitor video coding quality in the context of a QuEM system. The comparative effect of bitrate change and denting has been studied: the former technique has less impact than the latter on the final QoE, but it requires generating and transporting the different versions of the content stream from the headend to the network edge.
Figure 4.21: Results for all the QuIDs mentioned in the chapter

- Outages can be monitored as more severe versions of the rest of the impairments, but they must be considered separately because of their high impact on the perceived quality.

- Latency effects (end-to-end lag and channel change) have to be taken into account, both for their impact on the final QoE and for their relationship to other quality issues.

Besides, the cross-analysis of different QuIDs can also provide some additional insights:

- In case of network congestion or any other error situation, the decision of which packet or packets to discard is critical for the final impact on the Quality of Experience. Losing all the no-reference frames for six seconds (F1) has an impact similar to losing all the audio for only half a second (A2) or having relevant macroblocking (90% of the picture) for half a second (E4), and is even less severe than any of the video screen freezes (V1-V3). All those impairments are produced by the loss of fewer packets than F1.

- Video freezing is probably the worst artifact (relative to the minimum loss burst needed to produce it). For this reason, it should be avoided by any means. This is especially relevant in scenarios where the network buffer is small because low latency is required. In such cases, countermeasures such as bitrate drop or frame rate drop are preferable to an empty buffer resulting in loss of video and audio signal.
4.8 Conclusions

This chapter has presented strategies to monitor all the relevant sources of quality impairments in multimedia delivery services. We have proposed metrics to analyze the effect of packet losses on video and audio, which are currently the most frequent errors in multimedia services, and in particular in IPTV. We have also covered the analysis and monitoring of media coding quality, with a special focus on the strong bitrate variations which are typical of OTT scenarios. We have also analyzed the causes and effects of service outages, as well as the effects of latency on the final QoE. All the metrics proposed in this chapter can be integrated as Quality Impairment Detectors in the QuEM architecture described in chapter 3. Besides, we have analyzed a set of subjective quality assessment test results which support the selection of QuIDs and provide a way to compare their relative severities. The results of this analysis have provided relevant information about the relative severity of the errors under study. The ideas discussed in this chapter suggest that, with the right knowledge of the effect of network events on QoE, it is possible to design network systems whose policies are optimized towards the final perceived quality. The next chapter will present and discuss some of these applications.
Chapter 5

Applications

5.1 Introduction

This chapter describes applications which, by making use of the knowledge obtained in previous chapters about the Quality of Experience, can enhance the functionality of existing multimedia delivery services. In fact, some of the applications described here have been applied to products and services which are currently deployed in the field. Section 5.2 describes a variation of the Packet Loss Effect Prediction model which can be used to establish packet priorities in a video communication network. This can be used to support Unequal Error Protection schemes which make the best use of the error correction capabilities of the network. A similar idea is applied in section 5.3 to an HTTP Adaptive Streaming scenario: by composing HAS segments in priority order (instead of in the traditional decoding order), it is possible to react better to dynamic variations in the effective network bandwidth without needing to increase the buffering delay excessively. Section 5.4 describes a selective scrambling algorithm which can be used to efficiently protect video content in scenarios where the processing power of the deciphering elements is small. By encrypting only the most relevant packets (with respect to their impact on the QoE) it is possible to get very effective protection with a low packet scrambling rate. Section 5.5 proposes a solution to overcome the channel change limitations described in section 4.6. Finally, section 5.6 discusses the application of the results to stereoscopic video.
5.2 Unequal Error Protection

Not all packet losses have the same impact on the QoE. For instance, the effect of isolated packet losses on perceived video quality depends on several factors, such as the coding structure (the type of prediction in the frame, or the part of the frame which gets lost), camera motion, or the presence of scene changes, among others [86, 93]. When the number of errors grows, the effects of those factors tend to compensate one another, so that the impact of random errors depends mainly on the packet loss rate [95] and the loss burst structure [124]. Audio packet losses have a strong impact on the perceived quality, depending mainly on the frequency and length of the bursts of lost packets, with no significant differences between individual packets [79, 84]. When they are studied jointly, video errors seem to be more acceptable than audio errors, except for high error rates [57]. Most of the studies mentioned so far analyze the effect of packet losses for relatively high loss rates. In practical situations, however, real-time video services are expected to provide a quality of experience resulting in less than one visible error per hour, with users showing sensitivity to higher impairment rates [7]. In terms of network quality of service, this means that only a few packet loss bursts per hour are allowed, at most. Home networks typically have error rates which are some orders of magnitude above these figures, especially in the case of Wi-Fi (802.11) [97]. If the media stream is to be delivered through the home network, the residential gateway must provide some kind of error correction mechanism (FEC or ARQ) in order to keep the required level of service. This protection comes at the cost of introducing end-to-end delay in the transmission chain [61], as well as increasing the required bandwidth.
The understanding of how packet loss can affect video and audio quality has been used to propose several unequal error protection (UEP) schemes, where packets with a higher impact on quality are better protected [29, 66]. This allows keeping a good QoE without an excessive increase in the required protection and, consequently, in the additional delay introduced. However, they usually require an in-depth video analysis which is difficult to integrate in cost-effective consumer electronic devices. Lightweight UEP designs also exist, but they usually focus on the characteristics of the loss patterns and use limited approaches to characterize the priority of the packets [12, 71]. We have shown in the previous chapter that, even with its limitations, the PLEP model we describe is a promising approximation for blind packet loss effect estimation. However, it is based on reading and building a reference frame list for each frame. Simple as it is, this could be too expensive for some applications, such as packet QoS policies applied in routers, and it may require the use of information which is not
available in real service deployments, perhaps because the elementary video stream is completely scrambled. Here we will show how it is possible to strongly reduce the effect of packet losses by applying a simplified version of the PLEP metric to label video packet priorities (even using a low number of bits to encode them). This technique can be applied to congestion control in home gateways or to buffer management in dynamic HTTP adaptive streaming. In addition, it can improve other lightweight UEP schemes by enriching their characterization of the video sequence. This approach requires low processing capabilities while clearly outperforming a random packet drop. The solution specifically addresses short-term protection decisions, where the error correction system has to decide which packets to protect (or which ones to drop) within a short window of time. Thus it is especially suitable for real-time multimedia transmissions. This solution is applicable not only to error correction, but also to congestion control.

5.2.1 Priority Model

5.2.1.1 Effects of packet losses

The proposed priority model is based on the fact that not all video packets contain the same kind of information and, therefore, the loss of different kinds of packets will produce different effects on the perceived video quality. In fact, even the loss of a single video packet can produce a wide range of different effects, depending on the kind of packet which is lost. There are several factors which influence the effect of a single packet loss. They can be roughly classified into two sets: content-based (camera motion, scene changes...) and coding-based (type of video frame, position of the packet within the frame...). Only the latter are considered in this approach, since they are the ones which can be easily identified in the analysis of the coded media stream. It will be shown later that they suffice to provide a good performance of UEP algorithms.
The factors considered are based on the following previous knowledge: 1. The effect of a loss is higher when it is produced in a reference frame (a frame used by the encoding system to predict the following ones), because the error will propagate to the frames which have it as reference [95].
Table 5.1: Priority value for each slice type

  NALU Type          P_S
  IDR (I)            1
  Reference (R)      0.5
  No-Reference (N)   0

2. If a packet in the middle of a video slice is lost, then the rest of the slice gets lost too, as the decoder cannot easily re-synchronize in the middle of a slice. This is especially relevant in H.264 video, where most commercial encoders use a low number of slices per frame (typically one). In such cases, the sooner the error is produced within a frame, the higher its impact is [86].

3. If packets are lost in two different frames, their contributions to the final error (in terms of mean square error, MSE) can be considered additive, as errors are typically uncorrelated [95].

4. Audio packet loss effects are basically related to the length and structure of the loss burst, with no meaningful differences between individual audio packets [79, 84].

5.2.1.2 Packet Priority

A packet priority model is proposed in order to assign higher priority to packets whose loss is going to produce a stronger effect on QoE. The model is based on the type of video slice carried by the packet and the position of the packet within the slice (assuming that a video slice is typically carried in several transport packets). As mentioned before, losses have a higher effect in reference slices than in no-reference ones, and at the beginning of the slice and of the GOP, where error propagation effects are higher [66, 86]. The priority model is defined as follows:

    P = α P_S + β H + γ T_S + δ T_G    (5.1)

where P_S is the priority of the slice type as described in Table 5.1, H is a flag indicating whether the packet contains a NALU (Network Abstraction Layer Unit) header, T_S indicates the number of packets until the next slice in the stream, and T_G is the number of packets until the next I frame. All the parameters are normalized between 0 and 1. According to their relevance, the following coefficients are selected: α = 10^3, β = 10^2, γ = 10, δ = 1.
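As an illustration, the priority of equation (5.1) reduces to a few lines of code. The sketch below is ours, not the thesis reference implementation; it assumes that T_S and T_G are already normalized to [0, 1], and it includes the fixed audio priority P_A discussed later in the text.

```python
# Slice-type priorities from Table 5.1
SLICE_PRIORITY = {"IDR": 1.0, "REF": 0.5, "NOREF": 0.0}

# Coefficients selected in the text: alpha >> beta >> gamma >> delta
ALPHA, BETA, GAMMA, DELTA = 1e3, 1e2, 10.0, 1.0

AUDIO_PRIORITY = 900.0  # fixed audio priority P_A = 0.9 * ALPHA

def packet_priority(slice_type: str, has_nalu_header: bool,
                    t_slice: float, t_gop: float) -> float:
    """Priority of a video packet according to equation (5.1).

    slice_type: "IDR", "REF" or "NOREF" (Table 5.1).
    has_nalu_header: True if the packet carries a NALU header (H).
    t_slice: normalized number of packets until the next slice (T_S).
    t_gop: normalized number of packets until the next I frame (T_G).
    """
    p_s = SLICE_PRIORITY[slice_type]
    h = 1.0 if has_nalu_header else 0.0
    return ALPHA * p_s + BETA * h + GAMMA * t_slice + DELTA * t_gop
```

Because the coefficients are separated by an order of magnitude and each parameter lies in [0, 1], the resulting value sorts packets lexicographically by (P_S, H, T_S, T_G), and audio packets (P_A = 900) fall between IDR packets and all other video packets.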
Figure 5.1: Example of the packet priority model applied to one GOP of a coded video sequence

Figure 5.1 shows an example of the application of the model to a sequence of video packets in transmission order. Each box represents an RTP packet, while different colors represent different frames. The figure shows all the elements of the prioritization model. P_S depends on the NALU type (IDR, Reference slice or No-reference slice), indicated as I, R or N within the boxes. H = 1 (presence of a NALU header) is represented as a black bold frame. Finally, T_S and T_G are shown for the packet marked by the red circle. Audio packets can be easily introduced in this model just by assigning them a fixed priority value P = P_A. In line with the idea that audio losses are more relevant than video ones, except in case of high video degradations [57], P_A is set to 900. This way, audio packets have lower priority than IDR packets (for α = 10^3, P_A = 0.9α), but higher than any other video packet. Different values could be considered depending on the specific application. It is important to remark that this is not a scale of priorities, but only an ordering. The intention of the model is to provide a way to sort a group of packets in priority order, so that the higher the priority is, the higher the impact of its loss is. However, there is no information about the relative magnitudes of the losses. Another relevant property of the model is that, once the priority for each packet is known, no more analysis is required. This allows the unequal error protection schemes to be stateless in the following sense: the decision of whether one packet is protected or not has no effect on the priority value applied to other packets. This simplifies significantly the work of the UEP mechanisms.
Figure 5.2: Implementation of the prioritization model

5.2.1.3 Implementing the model

Figure 5.2 shows the basic implementation modules required to apply the described prioritization model to a video source. As mentioned before, the priority labeling is applied independently from the unequal error protection mechanism itself, and before it. To each packet x in the sequence, a priority P(x) is assigned and signaled to the UEP module. In the specific case of an IPTV scenario, each packet x is an RTP packet containing H.264 or MPEG-2 video, or MPEG audio (MPEG-1, AAC or similar), over MPEG-2 Transport Stream. To assign the priorities correctly to the transport packets, it is necessary that audio and video are carried in different packets. It is also advisable that no packet carries data from more than one slice; which, for the typical H.264 stream with one slice per frame, means that no packet should carry data from two or more different frames. All these conditions are satisfied if the packing of MPEG-2 TS into RTP is done by the rewrapper described in section 3.5.2. Priorities assigned to packets can be signaled in the RTP header extension, so that the network processing elements can read them and use them to apply unequal error protection techniques. This has the advantage that the extension is transparent to other RTP receivers, so that the application of priority labels is backwards compatible with any RTP-aware system. This compatibility has been successfully tested with several commercial set-top boxes, and this use of signaling in RTP header extensions is currently deployed in the field in some commercial IPTV systems. Other implementation options are possible. For example, priorities can be signaled using different protocols, such as the DSCP bits of the IP header. In such cases, the number of bits available to encode the priorities can be reduced.
The next section will show that even a few bits can be enough to encode the priority in an efficient way. One of the main advantages of this model is its simplicity, which makes lightweight implementations possible: to assign a priority to a packet, only the video NALU header has to be read and analyzed. This way, the prioritization algorithm can be implemented in devices with limited processing capabilities, such as home network gateways. In such cases, the priority labeling and the unequal error protection modules would both reside in the same hardware device.
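Since the model only needs the NALU header, the per-packet analysis reduces to reading one byte. The sketch below (our own illustration, not part of the thesis) classifies an H.264 NAL unit into the slice categories of Table 5.1 using the standard one-byte header layout (2-bit nal_ref_idc, 5-bit nal_unit_type); the "OTHER" category for non-slice units is an assumption of this sketch.

```python
def classify_nalu(first_byte: int) -> str:
    """Classify an H.264 NAL unit from its one-byte header.

    Returns "IDR", "REF" or "NOREF" for slice NAL units, or "OTHER"
    for non-slice units (SPS, PPS, SEI...).
    """
    nal_ref_idc = (first_byte >> 5) & 0x3   # 0 means not used for reference
    nal_unit_type = first_byte & 0x1F
    if nal_unit_type == 5:                  # coded slice of an IDR picture
        return "IDR"
    if nal_unit_type == 1:                  # coded slice, non-IDR picture
        return "REF" if nal_ref_idc else "NOREF"
    return "OTHER"
```

For example, the common header byte 0x65 (nal_ref_idc = 3, type 5) is an IDR slice, while 0x01 (nal_ref_idc = 0, type 1) is a no-reference slice.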
5.2.2 Experimentation and results

5.2.2.1 Description of the experiment

To test the performance of the model, three different short video sequences (4-12 seconds), encoded by commercial IPTV encoders, have been selected. They are sequences A, B and C from Appendix A.4. All of them are encoded in H.264 over MPEG-2 TS and packed in RTP in the way described before, with each RTP packet containing information from at most one video frame. Audio is not considered in the experiment. Within each possible window of W consecutive RTP packets in the sequence, the K packets with the lowest priority are discarded. Then the resulting sequence is decoded, using the repetition of the last reference frame as error concealment strategy, and the Mean Square Error of the resulting impaired sequence, MSE_PRIO, is computed. For the same W-packet window, the MSE resulting from randomly dropping K packets, MSE_RAND, is also computed. The calculation of the random loss is performed by randomly selecting 1000 of all the possible combinations of K lost packets within the window; if there are fewer than 1000 combinations, then all of them are selected. MSE_RAND is obtained as the average of the MSE of each of the (up to) 1000 combinations. For each window, the MSE gain is computed as

    MSE_gain(dB) = 10 log10 (MSE_RAND / MSE_PRIO)    (5.2)

Based on this, an Aggregated Gain Ratio (AGR) can be defined to measure the performance of the model. For each sequence and each pair (W, K), AGR_{W,K}(G) is defined as the proportion of windows whose MSE gain is equal to or greater than G, expressed as a percentage on a 0-100 scale. Table 5.2 shows the values of AGR for some relevant values of MSE gain, W and K, for the three sequences under study (A, B and C), summarizing the results of the experiment. They will be discussed and analyzed in the following subsections.

5.2.2.2 Single-packet loss

The first test considered is the case where K = 1, for several values of W.
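The bookkeeping of this experiment (the drop decision, equation (5.2) and the AGR aggregation) can be sketched as follows. Decoding real video and measuring its MSE are out of the scope of this snippet; all function names are our own illustrative assumptions.

```python
import math

def lowest_priority_indices(priorities, k):
    """Indices of the K lowest-priority packets within a window,
    i.e., the packets the prioritized scheme chooses to drop."""
    order = sorted(range(len(priorities)), key=lambda i: priorities[i])
    return set(order[:k])

def mse_gain_db(mse_rand: float, mse_prio: float) -> float:
    """MSE gain of equation (5.2), in dB."""
    return 10.0 * math.log10(mse_rand / mse_prio)

def agr(gains_db, threshold_db):
    """Aggregated Gain Ratio: percentage of windows whose MSE gain is
    equal to or greater than the threshold (0-100 scale)."""
    hits = sum(1 for g in gains_db if g >= threshold_db)
    return 100.0 * hits / len(gains_db)
```

For instance, a window whose random-drop MSE is 100 times the prioritized-drop MSE contributes a 20 dB gain to the AGR statistics.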
For each original sequence it is necessary to individually discard each one of the RTP packets, and then decode and process the result of each individual discard. This way, more than 1500 impaired sequences have been obtained and used for the analysis.
Table 5.2: Values of the Aggregated Gain Ratio for some relevant values of MSE gain, W and K

  MSE gain (dB)   W    K    AGR% A   AGR% B   AGR% C
       10         15    1    73.7     50.9     64.9
       20         15    1    50.2     17.4     47.4
       10         20    1    87.4     65.8     72.9
       20         20    1    62.3     21.3     56.0
       10         15    3    61.8     49.1     59.6
       20         15    3    47.9     30.9     49.7
       10         20    3    78.9     60.0     71.7
       20         20    3    59.5     33.8     58.4
       10         15    6    44.1     42.6     35.7
       20         15    6    34.7     41.7     22.2
       10         20    6    66.7     55.5     48.8
       20         20    6    48.2     55.4     30.1
       10         15   10    18.5     30.0     15.8
       20         15   10    14.0     26.9     14.0
       10         20   10    36.8     47.1     22.3
       20         20   10    26.0     44.0     18.7

The results for sequence A, K = 1 and several values of W are shown in Figure 5.3. Each curve refers to a different value of W and represents, for several values of MSE gain, the proportion of windows that obtained at least that gain value. The range of values of W is selected to cover typical loss burst lengths in a wireless home network [97]. Gains of 20 dB in MSE are reached in from 20% of the cases (W = 5) up to 85% (W = 30), using window sizes which are reasonable for a home network device. The figure also shows that the longer the window is, the better the results are, since it is easier to find a low-priority packet within the window. Figure 5.4 shows some values of MSE for sequence A, K = 1, W = 15. As can be seen, the MSE varies greatly between different windows along the sequence, independently of the protection method used. However, focusing on any specific window (any value on the horizontal axis), using the prioritization method results in a lower MSE in almost all cases, and in most of them this reduction is very strong. This means that the specific error will depend heavily on the specific window which is selected but, once the window is there (i.e., once the error is bound to happen), a good UEP decision can mitigate the error effect dramatically.
Figure 5.3: Effect of the window size: Aggregated Gain Ratio for K = 1 and several values of W

Figure 5.4: Values of MSE for some possible windows within sequence A, comparing random packet loss (grey line) with priority-based packet loss (red line) for K = 1 and W = 15

5.2.2.3 Multiple-packet loss

The second test considered sets W to a fixed value and analyzes the effect of the burst size by changing the value of K. To simplify the implementation of the test bed, the results of the different (W, K) combinations have been derived from the (W, 1) case of the previous section, according to the considerations described in section 5.2.1.1. This way, only the first error within a slice is considered (as the rest of the slice is lost anyway) and errors in two different frames are assumed to be uncorrelated.
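These two derivation rules can be made explicit in code. The sketch below (an illustration under the stated assumptions, with names of our own) combines single-packet MSE results into a multi-packet estimate: within each slice only the earliest lost packet counts, and contributions from different slices add because the errors are assumed uncorrelated.

```python
def burst_mse(slice_of_packet, dropped, single_packet_mse):
    """Approximate the MSE of a multi-packet loss from single-packet
    results (section 5.2.1.1 assumptions).

    slice_of_packet: slice identifier for each packet in the window.
    dropped: indices of the lost packets within the window.
    single_packet_mse: MSE measured when each packet is lost alone.
    """
    earliest = {}
    # Keep only the earliest lost packet of each affected slice:
    # the rest of the slice is lost anyway once a packet is missing.
    for idx in sorted(dropped):
        earliest.setdefault(slice_of_packet[idx], idx)
    # Uncorrelated errors in different slices: MSE contributions add.
    return sum(single_packet_mse[i] for i in earliest.values())
```

For a window covering two frames, dropping three packets of which two fall in the same slice therefore costs only the MSE of the earliest loss in that slice plus the MSE of the loss in the other frame.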
Figure 5.5: Effect of varying the loss burst size (K) for a window of W = 15 packets

Figure 5.5 shows the results for sequence A and W = 15. This value has been selected as representative of the range that was considered in Figure 5.3. Qualitatively, curves for other values of W within that range show similar behaviors. Results for the other sequences are summarized in Table 5.2. When the values of K are high, the effectiveness of the model drops, as there is very little margin to select low-priority packets. It is also interesting that the curves gradually reduce their decreasing rate. For example, Figure 5.5 shows that, for K = 8, only 10% of the sequences have an MSE gain between 10 and 30 dB, while 20% reach gains over 30 dB. This behavior is due to the fact that the prioritization method concentrates errors firstly in no-reference frames (versus reference ones) and secondly at the end of the frame (versus the beginning). When the window lies entirely within one frame, the gains against the random loss are limited. However, when the window covers parts of two different frames, the priority strategy concentrates the error in the less-impacting part of the window, thus reaching high MSE gains. As a consequence, even for severe error patterns, the prioritization method allows that, in a representative proportion of the cases, the error effect is negligible.
Figure 5.6: Contribution of each term to the prioritization equation: only P_S (red), P_S + H (green), P_S + H + T_S (cyan), and all of them (blue). Computed for W = 15 and K = 1

5.2.2.4 Contribution of each priority factor

An additional analysis of the performance of the model is represented in Figure 5.6. It shows the contribution of each of the terms in equation (5.1) to the aggregated MSE gain of the method. The red line represents the use of only P_S as prioritization parameter. The green line then adds the effect of H to that of P_S. Afterwards, the effects of T_S and T_G are added. Several aspects of the graphic are notable. First of all, the very simple prioritization method of just considering the frame type of the packets (P_S) can be good enough for some applications. Secondly, the most relevant contribution afterwards is that of T_S, which allows dramatic improvements in the performance. Therefore, in addition to P_S and H, the parameter T_S should always be considered. As the scope of the study is focused on the short term, and therefore window sizes are relatively small, each packet window typically spans a small number of frames, at most. This is the main reason why the contribution of T_G is so limited in the current scenario. Nevertheless, additional tests show that when the window size is enlarged, the relative weight of T_G increases, supporting the choice of a model with four parameters.
Figure 5.7: Effects of a limited bit budget to encode the priority

5.2.2.5 Limiting the bit budget

The results so far assume a scenario where the priority can be established with as high a resolution as possible, meaning that there is a high bit budget to encode priority values. In the mentioned case of an RTP header extension, for example, this bit budget could typically be around 16 bits in each RTP packet. However, in other signaling implementations, such as DSCP, the number of bits available to encode the priority can be lower. Figure 5.7 shows the effect of using a reduced number of bits to encode the priority. The lines for 2 and 3 bits are equivalent to the ones representing the use of only P_S and P_S + H. The rest of the lines are built according to the results shown in the previous section, i.e., devoting more bits to the encoding of T_S than of T_G. The proposed assignation of bits to each of the components is shown in Table 5.3. According to the results of this experiment, very few bits are necessary to encode packet priority. In particular, by using only 3 bits to encode T_S, plus 3 more for P_S and H, the deviation from the reference curve is already quite small.
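One possible way to encode the priority into a reduced bit budget is to quantize each term and pack the fields most-significant first, so that plain integer comparison of two codes preserves, up to quantization, the ordering induced by equation (5.1). This sketch follows the budget split of Table 5.3 (e.g., 6 bits = 2 for P_S, 1 for H, 3 for T_S, 0 for T_G); the exact field layout and quantizer are our own assumptions, not the thesis specification.

```python
def quantize(value: float, bits: int) -> int:
    """Uniformly quantize a value in [0, 1] to an unsigned field of
    the given width (a 0-bit field encodes nothing)."""
    if bits == 0:
        return 0
    levels = (1 << bits) - 1
    return min(levels, int(round(value * levels)))

def encode_priority(p_s: float, h: int, t_s: float, t_g: float,
                    bits=(2, 1, 3, 0)) -> int:
    """Pack (P_S, H, T_S, T_G) into one integer, most significant
    field first, following a Table 5.3 bit-budget split."""
    fields = ((p_s, bits[0]), (float(h), bits[1]),
              (t_s, bits[2]), (t_g, bits[3]))
    code = 0
    for value, width in fields:
        code = (code << width) | quantize(value, width)
    return code
```

With the default 6-bit budget, the code fits in the unused low bits of a DSCP-like field while still ranking, for instance, a bare IDR packet above a reference-slice packet that carries a NALU header.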
Table 5.3: Bit budget assignation to encode priority

  Total   P_S   H   T_S   T_G
    2      2    0    0     0
    3      2    1    0     0
    4      2    1    1     0
    5      2    1    2     0
    6      2    1    3     0
    8      2    1    3     2
   12      2    1    5     4

5.2.3 Applications

This packet prioritization model can be applied to several scenarios related to unequal error protection. Some of them are covered in this section: weighted random early detection (WRED), automatic repeat request (ARQ) and forward error correction (FEC).

5.2.3.1 Weighted Random Early Detection (WRED)

Random early detection (RED), sometimes also called random early drop, is a technique used in routing devices to handle congestion: when packet queues reach some fill level, some packets are dropped randomly in order to avoid buffer overflow. Weighted RED (WRED) is an enhancement of RED which allows assigning different priorities to packets, so that each packet's probability of being dropped depends on its priority. Using the prioritization algorithm in WRED is straightforward: when some packet has to be discarded, it should be the one with the lowest priority within the buffer.

5.2.3.2 Unequal Forward Error Correction (FEC)

Unequal Forward Error Correction can also make use of this prioritization method. In many cases, FEC cannot protect all the packets within a specific sequence, thus only recovering part of them. In this case, where the protection happens before the actual error is produced, the FEC server has to decide whether to protect all the packets equally or whether to use this simplified approach to find out which packets have the lowest priority. Typical FEC structures used in IPTV are based on M-by-N matrices of packets, where XOR redundancy is applied packet-wise, either vertically, horizontally or both [19]. When applying unequal FEC, there is a limited bitrate budget to transmit these FEC packets, so that only part of them are generated and sent. By introducing prioritization,
it is possible to reduce the overhead introduced in the sequence, while keeping a good protection for the high-priority packets. The total number of packets within a matrix is below, but typically close to, 100. In this case, the window size is usually of the same order of magnitude. In that application, as stated before, the effect of the term T_G will be higher than in the scenarios that we have discussed [12]. However, the principles presented here are fully applicable.

5.2.3.3 Unequal Automatic Repeat reQuest (ARQ)

The priority model is particularly suitable for unequal ARQ. When transmitting multimedia over a lossy network, such as 802.11g, it is common that the loss bursts are longer than what can be retransmitted within the available bitrate budget. In such cases, the decisions of whether to retransmit or not, as well as which packets to retransmit, are based on the priority of each specific packet [71]. The problem can be modeled as follows: when the receiver detects a loss of r packets, it requests a retransmission of the whole burst. However, due to bitrate constraints, the server can only guarantee that the first n will arrive on time, i.e., before they have to be consumed from the reception buffer. The strategy of the server is then to retransmit only the n most important packets, and to do it in priority order [30]. In such a case, the decision in the server is which packets to drop: from a window of W = r possible packets, only n will be retransmitted, meaning that K = r - n will be lost. The improvement obtained by introducing this kind of prioritization in the recovery process, instead of just randomly dropping some of the packets, is the one discussed and characterized in subsection 5.2.2.

5.3 Fine-grain segmenting for HTTP adaptive streaming

An HTTP Adaptive Streaming (HAS) client requests segments at one bitrate or another depending on the bandwidth of the TCP connection.
Basically, when the buffer in the client is emptying, the client requests lower bitrates; when it stays full enough, it can use higher bitrates. By keeping a sufficient amount of buffering, the playout of the content can remain seamless during this process. If by any chance (e.g., an abrupt drop in network quality) the buffer gets empty, then the playout stops (freezes) while the buffer fills again. In cases where network capabilities vary strongly, this mechanism requires a significant amount of buffering. When using HAS to transmit live events, however, increasing
the buffer means increasing the latency, which is undesirable in such scenarios. For non-live content the effect is a slower startup, which is also undesirable. On the other hand, if the buffer is small, underflows may be frequent, which is undesirable in all kinds of transmissions (as a general rule, freezing the video for a while and then resuming from the same point is not an option). HTTP adaptive streaming works as follows. The source video is encoded at several bitrates (and therefore at different qualities) and then chunked into segments, typically 2-10 seconds long; depending on the specific application, the lengths of the segments may be similar, or even equal, but this is not a strict requirement. The segment size is a compromise between flexibility and efficiency: shorter segments allow changing bitrate more frequently, and therefore reduce the required buffering time; however, they increase the server complexity (as it has to handle more requests) and the protocol overhead. Besides, segment boundaries usually need to be video random access points (typically I frames), so the minimum segment length is the minimum intra-picture period which, due to coding efficiency, is rarely under 0.5 seconds. Therefore, in HAS scenarios all the decisions are taken with a time resolution of at least 0.5 seconds, and typically much longer than that. Beyond this point, there is no solution to handle low-buffering systems. This section proposes a solution to improve the behavior of HAS under low-delay constraints.

5.3.1 Description of the solution

The idea proposed here is to arrange the information contained in each segment (the audio and video frames) in priority order, instead of in the original temporal order. Companion metadata included in the segment carries all the information required to restore the original order.
Depending on the delivery architecture this metadata may be part of the transport level or a separate information package added to the segment. Once the information has been arranged this way, if there is a drop in network quality and, for example, only 80% of the segment can be received by the client on time, instead of dropping the last 20% of the segment duration, the client will drop the 20% that is least important. For a 10-second segment, for example, the difference is between completely dropping at least the last 2 seconds (without our solution) and dropping, for instance, some frames along the whole segment (which also affects the Quality of Experience, but much less aggressively).
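This tail-drop versus distributed-drop difference can be illustrated with a small simulation. The GOP pattern, the priority values, and the 50% cut-off below are invented for the example; the thesis assigns priorities with the model of section 5.2 instead.

```python
# Illustrative priorities (assumption for this example): lower = more important.
PRIORITY = {"I": 0, "a": 1, "P": 2, "B": 3}

def played_kinds(transmission_order, fraction):
    """Keep the first `fraction` of the transmitted fragments (the rest is
    lost) and rebuild what can be played, in temporal (seq) order."""
    n = int(len(transmission_order) * fraction)
    return [kind for _, kind in sorted(transmission_order[:n])]

# One segment as (seq, kind) fragments in temporal order: video I/P/B, audio a.
temporal = [(0, "I"), (1, "a"), (2, "B"), (3, "B"), (4, "P"), (5, "a"),
            (6, "B"), (7, "B"), (8, "P"), (9, "a"), (10, "B"), (11, "B")]
prioritized = sorted(temporal, key=lambda f: (PRIORITY[f[1]], f[0]))

regular = played_kinds(temporal, 0.5)     # the whole tail of the segment is lost
ours = played_kinds(prioritized, 0.5)     # only the B frames are lost
```

With the temporal order, everything after the cut-off is missing (a freeze); with the priority order, only B frames are missing, so the segment still plays end to end at a reduced frame rate and with all the audio.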
The priority assigned to a fragment of the segment depends on the effect that the loss of that fragment has on the Quality of Experience. That is, the more important the fragment is for the Quality of Experience of the segment when played for a user (i.e. the higher the loss of quality perceived by the user if the fragment is lost), the higher its priority. The solution can be implemented in the following steps. First, on the multimedia content server side, each multimedia content segment (original segment) of the HAS stream is divided into fragments (the length of each fragment is a design parameter), with the following requirements:
- Each fragment contains homogeneous data: that is, a fragment does not contain data from more than one elementary stream (video, audio or data) or from more than one access unit (frame or field) of a video elementary stream. Fragments may be smaller than a video frame (i.e. each frame may be divided into several fragments), as this gives the solution more granularity and therefore better performance, although it is not mandatory.
- Each fragment has associated information (metadata) that allows restoring its original (temporal) position in the segment; for instance, a sequence number.
- Each fragment is assigned a priority. The packet prioritization model described in section 5.2 will be used here.
- If all the fragments are concatenated according to the associated information (e.g. in sequence number order), the result is the original segment, or a valid segment of the multimedia stream equivalent to the original segment from the decoding point of view (i.e. it can be reproduced by a normal multimedia decoder with the same result as reproducing the original segment), possibly with a different frame order. This segment is called the recovered segment.
Secondly, a prioritized segment is created by concatenating fragments in priority order.
This segment includes the metadata for each fragment: priority and sequence number. Fragments with the same priority value may be ordered, for example, by sequence number. This prioritized segment is sent from the multimedia server to the end device; in other words, the fragments are sent to the end device by the multimedia server in priority order. Finally, the prioritized segment is retrieved by the end device and stored in a memory buffer. When the client starts receiving the prioritized segment via HTTP, it starts
Figure 5.8: Priority-based HTTP Adaptive Streaming segment structure
extracting the fragments into the buffer. Each fragment is put at its right position using the associated information (sequence number), therefore rebuilding the recovered segment. The buffer may be consumed at normal pace by the client (no special buffering policy is needed). If the segment is consumed before the whole segment has arrived, there will be gaps in the buffer; but they will occur in the least important positions (the ones with lowest priority and therefore least impact on QoE). Late arrivals are discarded. Figure 5.8 shows a schematic diagram of the solution in an exemplary scenario. The top line (both left and right) describes a typical segment transmission and presentation. The bottom line describes a segment transmission and presentation using our solution. The segment has video frames (I, P, B) and audio (a) frames or, more generally, access units (AUs). For this explanation, in order to simplify the figure, it is assumed that one fragment contains exactly one AU, although each AU can be divided into smaller fragments if needed. The left part shows the structure of the segment for transmission. In our solution (prioritized segment), the fragments have been re-ordered in priority order; but both segments (top line and bottom line) represent the same content. Note that the prioritized segment contains the same AUs as the regular one, but in a different order. Now the segment is transmitted but, for some reason, the download (streaming) has to stop in the middle (i.e. all the data under the highlighted square are lost, because they have not been received by the end device) and the segment has to be sent to play-out (this is the presentation part, on the right-hand side of the image). In the regular segment case (top), the answer is simple: the client plays out the first half of the segment, and then stops (black or frozen video, and no audio either).
In the prioritized segment case (bottom), the situation is different: as the packets carry a sequence number, the client (end device) re-orders them and displays them in their right position, and only the least important packets are lost. To simplify:
we have all the I and P frames, plus all the audio. The result is that the segment is played out completely, although at a lower frame rate (33%), and with all the audio. Of course, dropping the frame rate and keeping the audio is much better than losing several seconds completely. According to the subjective assessment tests described in Appendix A.2 (see also [27]), there could be a difference of 1 to 3 points on a MOS scale (1 to 5) between both approaches. It is important to note that the creation of the prioritized segment is a decision that can be taken without knowledge of the network status between the content server and the end device. In other words, the prioritized segment is generated once in the server, and all the end devices download and play it. If there is no network congestion, the experience will be the same as with the original segment: it will be correctly and completely displayed. However, if there is a sudden network QoS drop, the end device will have its prioritized segment available without having to do anything special on the server side. The solution thus has the following advantages:
- It allows recovering from buffer underrun in HAS in an optimal way. That is, smaller HAS buffers can be used, therefore reducing the latency of the whole HAS solution.
- It works passively, in the sense that neither the server nor the client has to change its default behavior when facing network congestion. Besides, it provides a mechanism to mitigate the effect of high network rate variations.
- More generally, it makes it possible to use in HAS all the QoE enhancement technology that has been developed for real-time RTP delivery, such as video preparation for Fast Channel Change, unequal loss protection or selective scrambling; that is, this solution allows using QoE enhancement techniques in a different environment (HTTP delivery).
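A minimal end-to-end sketch of the steps enumerated in this subsection, covering server-side fragmentation metadata and client-side recovery. The Fragment fields mirror the metadata described in the text (sequence number and priority), but the concrete on-wire encoding is left open; the payload values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    seq: int        # metadata that restores the original temporal position
    priority: int   # lower value = more important for the segment's QoE
    payload: bytes  # homogeneous data: at most one AU of one elementary stream

def build_prioritized_segment(original):
    """Server side: concatenate fragments in priority order, ties by seq."""
    return sorted(original, key=lambda f: (f.priority, f.seq))

def recover_segment(received):
    """Client side: rebuild the recovered segment in temporal order; gaps are
    left where the lowest-priority (late or missing) fragments were."""
    return sorted(received, key=lambda f: f.seq)

original = [Fragment(0, 0, b"I"), Fragment(1, 2, b"a"),
            Fragment(2, 3, b"B"), Fragment(3, 1, b"P")]
prioritized = build_prioritized_segment(original)
recovered = recover_segment(prioritized)
```

If the download completes, the recovered segment is exactly equivalent to the original one; if it is interrupted, only the tail of the prioritized order (the lowest-priority fragments) is missing.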
5.4 Selective Scrambling
The concept of selective scrambling means that, when cryptographically protecting a multimedia asset or stream, only a (typically small) fraction of the data is scrambled, while the rest is distributed in clear. The reasons for such an approach are twofold: on the one hand, by leaving some specific information unscrambled, intermediate video-processing systems can access the part of the data that they require to work
correctly (the rich transport data); on the other hand, keeping a reduced bitrate of scrambled packets can be the only feasible solution for decoding devices with limited computing power, such as user terminals. Addressing the former problem is relatively simple, as the specific data headers required by the network processors are typically well known. The latter is more interesting, as it is necessary to find a good balance between scrambling rate and protection effectiveness.
5.4.1 Problem statement and requirements
A user who is watching a partially scrambled content asset without being entitled to it (and therefore without the appropriate keys to descramble it) will experience the same effect as a user who loses (for example, due to network errors) exactly the packets that are scrambled in the stream. From this point of view, selective scrambling can be seen as a reverse rate-distortion optimization (RDO) problem. Unlike in the typical RDO problem, however, the aim here is to maximize the final distortion for a specific rate of scrambled packets. In an ideal case, the resulting distortion should be so high that no useful data can be extracted from the content. However, for many practical applications, it may be enough that the resulting video quality is bad enough to discourage the potential user from watching it. The underlying idea here is that techniques for Quality of Experience analysis can be used to find a good selective scrambling algorithm. Nevertheless, the design of selective scrambling schemes must be aware of the reasons why such an algorithm is required: processing the scrambled video in the network, and low computing power required in the descrambler. Besides, using a lightweight scheme also on the scrambler side would broaden the applicability of the scheme. Hence the requirements for the selective scrambling algorithm are to: 1.
Be transparent to video servers by leaving rich transport data in clear,
2. Scramble only a (low) percentage of the video packets,
3. Be implementable with low computational cost, and
4. Maximize the distortion introduced by the encrypted packets (i.e., make it impossible to recover the video sequence from the unscrambled packets without heavy impairment).
5.4.2 Algorithms
Most existing commercial CAS/DRM solutions fulfill requirement 1. However, they typically rely on the encryption of the full stream. There are several solutions in the literature that address the partial encryption of the video stream. A description of the state of the art can be found in the work of Massoudi et al. [69], who describe a set of encryption techniques that allow good visual degradation of encrypted video while scrambling only part of the packets. However, all of them either require deep analysis of the video stream (thus not satisfying requirement 3) or scramble the video headers to make the video impossible to decode (not meeting requirement 1). Fan et al. propose encoding the most important data with higher security and the least important with lower security (and complexity) [20]. Shi et al. divide H.264 video elements into different classes, which are provided with different protection [100]. In the work of Zou et al., different encryption levels can be reached by analyzing the entropy coding of the H.264 stream [125]. These methods satisfy requirement 2, but all of them require analyzing H.264 down to, at least, macroblock level, which might be computationally expensive (especially when CABAC entropy coding is used, as in most IPTV streams). The approach we propose is to exploit the error resilience characteristics of video coding standards such as, but not limited to, H.264, where video frames are divided into slices. It has been shown that, when a fragment of a video slice gets lost, the rest of the slice becomes almost impossible to decode [86]. Therefore, by scrambling a small set of data in each slice, it is possible to achieve very high video degradation. This solution is especially suitable for multimedia deployments because: Commercial encoders use a low number of slices per frame (typically one in SDTV, 4-8 in HDTV; see section 2.4.1).
Thus the fraction of video packets to encrypt (the scrambling rate) is kept low. The information required to process video in a video server (i.e., stream- and picture-level information) is contained in the headers of the slices and in other H.264 syntax elements (called NALUs, Network Abstraction Layer Units) which are not slices. The only analysis of the video stream required for this solution is detecting the type of the NAL units, detecting slices and slice headers, and reading the coding type of each frame. This can be performed in the H.264 Network Abstraction Layer, i.e., it does not require analyzing anything beyond slice header level. This makes processing much simpler than in any other selective scrambling algorithm.
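A sketch of the lightweight analysis this approach needs: scanning an Annex-B byte stream for start codes, reading nal_unit_type (the low five bits of the byte following the start code, with types 1 and 5 being coded slices), and selecting a few bytes after each slice as scrambling targets. The 16-byte window and the "just after the NAL header byte" shortcut are simplifying assumptions; a real implementation would skip the actual slice header and work over TS packetization.

```python
SLICE_NAL_TYPES = {1, 5}  # H.264 non-IDR (1) and IDR (5) coded slices

def find_nal_units(data):
    """Yield (offset, nal_unit_type) for every NAL unit in an Annex-B stream."""
    i = 0
    while True:
        i = data.find(b"\x00\x00\x01", i)
        if i < 0 or i + 3 >= len(data):
            return
        yield i + 3, data[i + 3] & 0x1F  # type = low 5 bits of the NAL header
        i += 3

def scramble_targets(data, nbytes=16):
    """First scrambling layer (sketch): (offset, length) spans of `nbytes`
    right after each slice NAL header byte."""
    return [(off + 1, min(nbytes, len(data) - off - 1))
            for off, t in find_nal_units(data) if t in SLICE_NAL_TYPES]

# Tiny synthetic stream: an SPS (type 7), an IDR slice (5), a non-IDR slice (1).
stream = (b"\x00\x00\x01\x67" + b"\xaa" * 4 +
          b"\x00\x00\x01\x65" + b"\xbb" * 20 +
          b"\x00\x00\x01\x41" + b"\xcc" * 8)
```

Only the slice NALUs are targeted; parameter sets and other non-VCL units stay in clear, so network elements that read stream-level information keep working.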
Table 5.4: Minimum scrambling rate required to completely impair the video signal, as subjectively assessed by expert viewers in the laboratory, for several content assets.

Content      Resolution  Bitrate   %Scrambled (selective)  %Scrambled (uniform)
Advertising  576i        2.7 Mbps  1%                      15%
News         576i        2.7 Mbps  1.5%                    15%
Movie        1080i       8.5 Mbps  0.8%                    10%
Movie        1080i       15 Mbps   0.3%                    5%

With these premises, we propose a scrambling scheme based on two layers [83]:
1. In each slice, scramble a small set of data just after the slice header. This protects the video from real-time decoding with a very low scrambling rate.
2. After that, scramble at random some sets of data of the rest of the VCL units, as well as of other streams (e.g. audio). This second layer meets two aims: first, audio streams are also scrambled, so that they are heavily impaired for non-descrambling receivers; secondly, eliminating redundancy in the video stream makes it impossible to decode, even for sophisticated offline error concealment methods.
5.4.3 Results
This algorithm has been tested with different scrambling rates and several contents encoded in H.264. The processed video has then been played without correctly descrambling the packets, so that the scrambled packets behave as packet losses. The video has then been watched by expert viewers in the laboratory in order to assess the minimum scrambling rate at which it was impossible to extract any information from it (i.e. the image was completely impaired). This value has been compared with the minimum scrambling rate required to obtain an equivalent result by randomly encrypting video packets. Detailed results are shown in Table 5.4. Even with the limitations of the experiment, it can be seen that, by encrypting only up to 2% of the transport packets, it is possible to impair the video quality so much that the resulting video is useless. All the video samples under study used only one slice per picture.
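The selective-scrambling figures in Table 5.4 are consistent with a back-of-the-envelope estimate. The sketch below assumes 188-byte TS packets, one slice per frame, and one scrambled packet per slice; these constants are assumptions for the illustration, not values taken from the experiment.

```python
def selective_scrambling_rate(bitrate_bps, fps, slices_per_frame=1,
                              scrambled_packets_per_slice=1,
                              ts_packet_bytes=188):
    """Estimated fraction of TS packets scrambled by the per-slice layer."""
    packets_per_second = bitrate_bps / 8 / ts_packet_bytes
    scrambled_per_second = fps * slices_per_frame * scrambled_packets_per_slice
    return scrambled_per_second / packets_per_second

# SDTV case from Table 5.4: 2.7 Mbps at 25 fps gives roughly 1.4%, in line
# with the 1-1.5% measured values.
sd_rate = selective_scrambling_rate(2.7e6, 25)
```

The slices_per_frame parameter shows how the rate scales for multi-slice sequences: with more slices per picture, proportionally more packets must be scrambled.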
For video sequences with N slices per picture, these values would have to be multiplied by a factor K·N. Even in that case, for most typical scenarios, the required scrambling rate would remain relatively small.
5.5 Fast Channel Change
As shown in section 4.6.2, the channel change time can be modeled as

    T_CC = T_term + T_net + T_buf + T_RAP + T_dec    (5.3)

where T_term is the response time of the user terminal software, T_net is the network response time, T_buf is the dejitter buffering time in the terminal, T_RAP is the time needed to reach a Random Access Point, and T_dec is the decoding start-up time, according to the buffering model imposed by the encoder. These factors are normally neither optimized nor easily optimizable in real deployments. For this reason, specific solutions have been proposed to address this issue. The most common one is the so-called Rapid Acquisition of Multicast Stream (RAMS), described also as unicast-based Fast Channel Change in DVB-IPTV [19]. This solution is based on Fast Channel Change servers deployed as edge servers in the network, which provide the following functionality:
1. When the user requests a channel change, the user terminal, instead of joining a new multicast stream, requests a unicast stream from the FCC server. This changes, and usually reduces, T_net.
2. The FCC server then sends a unicast stream to the user terminal. This stream starts from a Random Access Point in the past, reducing T_RAP to virtually zero.
3. The stream is sent at a higher bitrate than that of the multicast stream, so that at some point it catches up with the multicast. This point is signaled by the FCC server so that the user terminal can switch to the multicast stream seamlessly.
The application of the standard solution, however, only solves part of the problem (T_net and T_RAP). The user terminal can also set T_buf to a minimum value (T_buf-fcc ≈ 0) to further reduce the channel change time.
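Equation 5.3 and the effect of the FCC burst can be sketched numerically. All the component values and bitrates below are invented for illustration; they are not measurements from the thesis.

```python
def channel_change_time(t_term, t_net, t_buf, t_rap, t_dec):
    """Equation 5.3: T_CC as the sum of its five components (seconds)."""
    return t_term + t_net + t_buf + t_rap + t_dec

def buffer_refill_time(t_buf, r_nominal, r_burst):
    """Wall-clock time to accumulate t_buf seconds of media while decoding,
    when the unicast arrives at r_burst but is consumed at r_nominal (bps)."""
    if r_burst <= r_nominal:
        raise ValueError("the burst rate must exceed the nominal rate")
    return t_buf / (r_burst / r_nominal - 1.0)

plain = channel_change_time(0.1, 0.2, 0.3, 1.0, 0.6)  # plain multicast join
fcc = channel_change_time(0.1, 0.05, 0.0, 0.0, 0.6)   # RAMS: T_RAP ~ 0, T_buf ~ 0
refill = buffer_refill_time(0.3, 4e6, 5e6)            # refill 0.3 s at 25% overspeed
```

With 25% overspeed, rebuilding each second of buffer takes four seconds of wall-clock time, which is why the terminal can start with a near-empty buffer and refill it in the background while the video is already playing.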
Since the unicast is received at a rate higher than the nominal one, but is only consumed at the nominal one, the excess bitrate can be used to fill the buffer to its desired T_buf value once the video has started to be decoded. There is still a relevant component of the channel change time, T_dec, which has not been addressed so far. It presents additional difficulty for two reasons: it is imposed by the video encoder, and it is different for each of the streams (video, audio and subtitles). To reduce it to a minimum, we propose the following solution:
- At the beginning of the unicast session, re-multiplex the different elementary streams in the FCC server so that they have similar T_dec values at the beginning of the unicast stream. The easiest way is to increase the T_dec of the audio and subtitling streams. Since all the elementary streams have been separated into different RTP packets by the rewrapper processing (see section 3.5.2), this can be done simply by reordering the RTP packets in the stream.
- Besides, reduce the T_dec of all the elementary streams together by re-stamping the value of the PCR in the Transport Stream.
- Finally, during the unicast session, recover the original arrangement of the stream by gradually undoing the re-multiplexing so that, when the session is switched to multicast, the unicast and multicast streams are equal and the switchover can be done seamlessly.
With this procedure, T_dec can be reduced down to about 100 ms. Since T_buf, T_net and T_RAP have also been reduced, almost instantaneous channel change times can be obtained (in the range of 200 to 300 ms), provided that T_term, which depends basically on the software design of the user terminal application, can also be optimized.
5.6 Application to 3D Video
In recent years the popularity of 3D video has increased strongly, mainly due to the availability of last-generation stereoscopic displays both in cinemas and in consumer television sets, as well as to the production of several successful films using this technology. As a result, today it is possible to buy a 3D television set and watch Blu-ray 3D content at home at affordable prices. The next challenge is delivering 3D content through the different types of multimedia delivery services, from traditional television broadcasting to over-the-top content distribution.
Virtually all the 3D multimedia content available for this kind of services is encoded as stereoscopic pairs of images: two different video frames, one for each eye of the viewer. The resulting content is therefore composed of two different video streams (called views), which represent the same scene from two slightly separated points of view. These two views can be encoded and transmitted in several ways [10]:
- As two different video streams (simulcast).
- Multiplexed in a single video stream. The most typical way to do this is side-by-side (each half of the image, left and right, contains one view, and the player is able to separate them).
- Using specific standards which make use of the redundancy between views, such as H.264 MVC.
In all cases the video is encoded either in AVC or in MVC (which is an extension of AVC as well), and therefore the approach presented in this thesis remains completely valid. Since the basics of the coding structure are the same as in 2D video, the most relevant errors in a 3D video delivery service will again be video macroblocking, audio losses, quality drops, outages... These artifacts, however, may have a different impact on the Quality of Experience, as the viewing experience is completely different. To test this, the subjective quality assessment methodology described in section 3.4 has been applied to assess stereoscopic video subject to network errors [24, 25, 28]. The results show that these errors have a similar impact in 3D video to the one they have in standard 2D video. The macroblocking effect seems to be more annoying in 3D video due to visual rivalry (mismatch between left and right views). The other artifacts, however, seem to be slightly more tolerable in 3D than in 2D video, maybe because they are somehow masked by the added value provided by the stereoscopic experience. The other relevant difference is that, in the coding schemes where each view is encoded in different frames (i.e. simulcast and MVC), there is a new dimension in terms of scalability. In other words, one of the views can be signaled with a lower priority than the other so that, in case of an error-prone channel or network congestion, all the errors are concentrated in a single view. In these situations, dropping one view and switching to 2D video is an option that can complement the drops in bit or frame rate that have been discussed in section 4.4 [27].
Chapter 6
Conclusions
The market of multimedia content distribution is in rapid and continuous evolution, which started with the standardization of digital television broadcasting in the 1990s, continued with the deployment of triple-play offers, with the special relevance of interactive IPTV, in the 2000s, and is moving towards global multi-screen OTT services in the 2010s. The service offer is increasingly rich and tends to be more personalized and focused on the expectations of individual users. New players and business models appear in the marketplace while the distribution of video traffic over communication networks increases its relative weight in the total amount of transported data. Within this complexity, the underlying technology has a common definition: the delivery of digitally-encoded multimedia streams (from short advertisement clips to unbounded television channels) over a packet network. A very restricted set of technologies (MPEG codecs and multiplexers, and IP transport) is used for a good fraction of the present and upcoming services. Therefore, as seen in section 3.2, it is possible to define a generic architecture which can be applied to model the most relevant service scenarios. The same effort to establish a general architecture can be made for the problem of monitoring the multimedia Quality of Experience (QoE) in such multimedia services. In section 3.3 we propose QuEM (Qualitative Experience Monitoring): a monitoring framework aimed at obtaining significant descriptions of the impairments present in the service [84, 85], which can be used as a replacement for pure Packet Loss Rate (PLR) based monitoring systems. Each measure introduced in the framework must work under real conditions (lightweight processing and bitstream-based) and must be repeatable, in the sense that it must be possible to artificially generate the error conditions measured.
The output of the measurement block (called QuID, Quality Impairment Detector) can afterwards be mapped to a severity value through a user-defined Severity
Transfer Function (STF). Measures in the QuEM system are proposed to cover the most relevant artifacts present in existing multimedia delivery services. The QuEM approach has also inspired a novel methodology for subjective assessment tests, described in section 3.4. It is aimed at reproducing as much as possible the viewing conditions of the final user of the services. Therefore the content shown is intended to be meaningful for the viewer and, more relevantly, it is displayed in a nearly continuous way. This methodology can be used for the validation and calibration of QuIDs [25, 28, 85]. Besides, section 3.5 introduces new features that simplify the management of the multimedia content in the transport network by using rich transport data, which make the network aware of part of the video information. Among them we can cite the homogenization of interfaces in video processing elements [87], the introduction of metadata synchronized with the video stream [9], the intelligent re-wrapping of video into transport packets, and the processing of video at the edge of the network [108]. The most relevant source of impairments in a real deployment is the loss of video packets. Section 4.2 describes a proposed metric to predict the effect of video packet losses (PLEP, Packet Loss Effect Prediction) [86]. By monitoring the Network Abstraction Layer (NAL) of H.264 video and following the chain of references, it provides a reasonably reliable description of the extent and duration of the artifacts associated with the loss of video packets. Experiments show that it clearly outperforms simple PLR monitoring, while still being applicable to the monitoring of real multimedia services. The PLEP metric is complemented with other measures, such as the monitoring of audio packet losses (section 4.3), video coding quality (section 4.4) and outages (section 4.5), to cover the relevant impairments.
In all cases, subjective assessments suggest that the monitoring of basic parameters is enough to contribute to the QuEM system in a significant way. In the specific case of video coding quality, bit and frame rates are taken as proxy metrics. Finer monitoring of quality within the same bitrate cannot be reliably addressed with No-Reference quality metrics [81]. For completeness, section 4.6 studies the latency-related measures that can affect QoE: end-to-end lag and channel change time. Both can be analytically obtained by measuring times in the network and the timing and buffering information present in the multiplexing headers, as well as by knowing the (constant) additional buffer introduced by the encoder and decoder elements. The comparison of the different impairments (section 4.7) shows the importance of controlling the effect of packet losses: the impact of losing packets in video non-reference frames is much less aggressive than the loss of audio packets, for instance. Therefore,
with the right knowledge of the effect of network events on QoE, it should be possible to design network systems whose policies are optimized towards the final perceived quality. Following this idea, section 5.2 proposes applying a simplification of the PLEP model to packet prioritization for Unequal Error Protection: error correction and congestion control [82]. The solution specifically addresses short-term protection decisions, where the error correction system has to decide which packets to protect (or which ones to drop) within a short window of time, based on the potential impact of their loss. Thus it is especially suitable for real-time multimedia transmissions. This principle can also be applied to HTTP Adaptive Streaming (section 5.3) by sorting the packets within a segment in priority order. This way, if the segment download has to be interrupted for any reason, the impact on the final QoE will be minimized [88]. The same concept can be reversed by searching, under certain conditions, for the packets whose loss has a stronger impact on the quality. In other words: it is possible to use the PLEP model to maximize the effect on the QoE for a given loss rate. Section 5.4 describes how to apply this idea to a selective scrambling environment. By encrypting a small number of packets in the video stream, it is possible to make the signal virtually impossible to decode from the remaining packets [83, 109]. In section 5.5 we propose a solution to reduce the zapping time in IPTV channels. The analysis of the decoding and buffering processes makes it possible to accelerate the start-up of the stream decoding after a channel change, reaching zapping times below 500 ms. Finally, the metrics and applications described can also be applied to stereoscopic multimedia delivery (and, more specifically, 3DTV) [10].
Section 5.6 discusses this, with special reference to the application of the subjective quality assessment methodology to 3DTV environments [28], both in IPTV [24, 25, 26] and OTT/HAS [27]. In summary, this thesis proposes a comprehensive approach to the monitoring and management of Quality of Experience in IP multimedia delivery services. With the appropriate framework and a deep knowledge of the rich transport data it is possible to enhance the quality monitoring, minimize the effect of losses, maximize the power of encryption systems, or improve the zapping time of the service without significantly increasing its complexity or cost. The proposed approach is also transparent in the information it provides and the processing it does: any fine tuning can be done by the user of the system, and there is no a priori dependence on empirical parameters or training data. These properties make it perfectly suitable for the needs of multimedia service providers and, in fact, some of the proposals of this thesis are already present in several IPTV deployments around the world. And we believe that, even for scenarios where
our proposals are not directly applicable (for technical or commercial reasons), the information contained in this work can be helpful to anyone who has to address the complex problem of modeling, monitoring and managing the Quality of Experience provided by a multimedia delivery service over IP. Future work is envisioned in three complementary directions. Firstly, the enhancement of the QuEM model with additional experiments which can provide better calibration for the impairment detectors, as well as outline models to evaluate the effect of multiple artifacts (either simultaneous or sequential) over a short period of time. Providing simple and robust mechanisms to aggregate quality data is still a challenge which needs to be solved in the context of multimedia delivery services. Secondly, continuing the application of the ideas discussed here to the field of 3DTV and stereoscopic video. And finally, expanding the applicability of the model to HTTP Adaptive Streaming environments. With the popularization of this technology as the basis for OTT delivery services, there is an opportunity to develop new applications that can make optimal use of the network resources and manage the Quality of Experience of the end-to-end service.
Appendix A
Experimental setup
A.1 Introduction
This appendix describes the most relevant experiments used in this thesis. Section A.2 details the subjective quality assessment tests performed with the methodology described in section 3.4, focused on the evaluation and calibration of the QuIDs. Their results have been used in several subsections of chapter 4. Section A.3 describes a previous set of subjective quality assessment tests, aimed at evaluating the video quality produced by video encoders from different manufacturers. These tests have been used in section 4.4. Finally, section A.4 describes the set of contents used for several objective quality experiments. Those contents have been taken from IPTV field deployments (or laboratory trials) and have therefore been selected as target contents for the developed algorithms to work with.
A.2 Subjective Assessment based on the QuEM approach
This section describes the details of the specific set of tests done to calibrate the Quality Impairment Detectors under study, whose most relevant results have been presented in chapter 4, together with the description of the QuIDs. The tests were performed following the methodology described in section 3.4.
A.2.1 Selection and preparation of content
According to the test methodology, selected content must be representative of what a multimedia service usually offers, as well as significant to the viewers. An important
target of the test methodology is reproducing the experience of a user watching video at home. To achieve that, it is important that the user perceives the video stream as meaningful, and not just as a simple evaluation sequence whose contents are irrelevant. Three content sources were selected for the tests, each one with a duration of 5 minutes and 30 seconds:

A movie sequence: a cut from Avatar. It is a film with detailed image information (making it suitable for subjective tests), and it was reasonably popular at the time when the tests were made. As an added value, it was released in 3D and could be used to compare 2D vs. 3D impairments easily.

A sports sequence. In particular, a cut was selected from the extra time of the final match of the 2010 FIFA World Cup, including the goal which resulted in the victory for the Spanish team. It was probably the most relevant sports content available.

A documentary sequence. A high-quality video was selected. Documentaries are relevant in these tests because their audiovisual features (length of scenes, type of camera movement...) are quite different from those of sports and cinema. The documentary was also available in 3D.

The sources were compressed using a professional H.264 video encoder. Selected resolutions and bitrates are described in Table A.1. Lower bitrate versions of the sequences were also generated to simulate bitrate drops.

Table A.1: Video test sequences: bitrate and resolution

Source        Format             Bitrate
Movie         1920x1080p 24fps   8 Mbps
Sports        720x576p 25fps     4 Mbps
Documentary   720x576p 25fps     4 Mbps

The resulting streams were then chunked into 12-second segments for the tests and processed by a rewrapper. Impairments were introduced in the first half of each of the segments.
A.2.2 Selection of impairments

The selection of impairments was done to cover a sufficient range of error cases related to the metrics that were going to be evaluated and calibrated (the ones defined in chapter 4).
A.2.2.1 Bitrate drops

To simulate the effect of a bandwidth drop, the first half of the segment was re-encoded using a different bitrate, which was a fraction of the original one. Two different impairments were defined (called R1 and R2), as detailed in Table A.2.

Table A.2: Bitrate drops

Test                       R1    R2
Bitrate (% of reference)   50%   25%

A.2.2.2 Frame rate drops

In these test cases, the first half of the segment is transmitted using a lower frame rate, which is a fraction of the original one. Frame rate reduction is achieved by discarding some B frames from the original stream. Two different impairments were defined, as detailed in Table A.3.

Table A.3: Frame rate drops

Test                          F1    F2
Frame Rate (% of reference)   50%   25%

A.2.2.3 Audio losses

These impairments are implemented by discarding audio packets in the middle of the first half of the segment. The shortest loss length, achieved by dropping a single audio packet, produced a silence of about 200 ms. Longer lengths were achieved by dropping consecutive packets. Test cases A5 and A6 introduced a sequence of several short losses separated by approximately 1 second. Impairments are detailed in Table A.4. The total duration represents the time from the beginning of the first audio mute to the end of the last one.

A.2.2.4 Video losses: macroblocking

The macroblocking effect caused by a transmission loss can be roughly characterized using three parameters:
Table A.4: Audio losses

Test                 A1    A2    A3   A4   A5    A6
Loss length (s)      0.2   0.5   2    6    0.2   0.2
Loss events          1     1     1    1    3     7
Total duration (s)   0.2   0.5   2    6    2     6

The fraction of the picture affected (position of the loss within the frame).

The duration of the artifact due to error propagation (position of the loss within the GOP).

The loss pattern (i.e., the effect of losing several packets in several frames).

To simplify the test cases, the following restrictions were imposed:

There would be at most one packet loss in each GOP.

Loss patterns would be established by introducing the same type of packet loss in several consecutive GOPs.

Impairments are detailed in Table A.5. MIN means that the impairment occurred in a non-reference frame, and therefore its effect did not propagate through the GOP.

Table A.5: Macroblocking errors

Test             E1    E2   E3   E4    E5   E6   E7   E8
% of Frame       100   25   50   100   50   50   50   50
% of GOP         MIN   90   90   90    90   90   25   25
Number of GOPs   1     1    1    1     3    5    3    5

The rationale for this selection of impairments is the following:

E1: Verify that the loss of isolated non-reference frames has no effect on the perceived quality.

E2-E4: Analyze the effect of single packet losses.

E5-E8: Analyze the effect of multiple packet losses.
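As an illustration of how these parameters combine, the following sketch (not the thesis implementation; the simple propagation model, the function name, and the 12-frame GOP are assumptions for illustration) estimates how many frames show macroblocking for a given test case:

```python
def propagation_frames(gop_length, gop_fraction, num_gops, is_reference=True):
    """Rough count of frames showing macroblocking for a loss pattern.

    gop_fraction: fraction of the GOP through which the loss propagates
    (the '% of GOP' column in Table A.5). A loss in a non-reference frame
    (the MIN case) affects a single frame and does not propagate.
    """
    if not is_reference:
        return num_gops  # one impaired frame per GOP, no propagation
    return num_gops * round(gop_length * gop_fraction)

# E6: 90% of a hypothetical 12-frame GOP, repeated in 5 consecutive GOPs
print(propagation_frames(gop_length=12, gop_fraction=0.9, num_gops=5))  # 55
# E1: loss in a non-reference frame, a single GOP
print(propagation_frames(gop_length=12, gop_fraction=0.0, num_gops=1,
                         is_reference=False))  # 1
```

Under this rough model, the difference between E4 (one GOP) and E6 (five GOPs) is a fivefold increase in impaired frames, which is what the multiple-loss test cases are designed to expose.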
A.2.2.5 Video freezing

Video freezing was achieved by the loss of a single I frame (or its header), so that the whole picture remains still until the beginning of the next GOP. The lengths of the freezes were selected as multiples of the GOP length (half a second), as shown in Table A.6.

Table A.6: Video freezing

Test                  V1    V2   V3
Freeze duration (s)   0.5   2    6

A.2.2.6 Impairment sets

The selected impairments were structured in impairment sets: groups of related impairments, as described in Table A.7. N represents a hidden reference (no impairment). AV is the combination of A4+V3 (6-second audio mute and video freeze, i.e., a 6-second full outage).

Table A.7: Impairment sets

Impairment Set    Freq.   Impairments   Description
Rate Drop         3       R1 R2 F1 F2   Reaction to bandwidth changes
Audio Loss 1      3       A1 A2 A3 A4   Audio mute length
Audio Loss 2      3       A3 A4 A5 A6   Continuous vs. periodic mutes
Macroblocking 1   3       E1 E1 N N     Detectability of non-reference loss
Macroblocking 2   3       E3 E4 E5 E6   Impairment duration
Macroblocking 3   3       E5 E6 E7 E8   Effect of % of GOP affected
Single Loss       5       V1 E2 E3 E4   Effect of a single video packet loss
Outage 1          1       V2 V3 A3 A4   Audio vs. video outages
Outage 2          1       V3 A4 AV AV   Audio vs. video vs. both

The Freq. (frequency) label indicates the number of times that each impairment set appears in each test sequence. The sum of all the frequencies is 25, which means that 25 different impairments were introduced in each test sequence: one impairment every 12 seconds. For each of the three video test sequences (movie, sports, and documentary), the following steps were followed:

1. Each segmented sequence was replicated 4 times, to create 4 different variants.
Figure A.1: Structure of the content streams in the subjective assessment test session

2. The 25 occurrences of the impairment sets were randomized, as well as the 4 different impairments within each set. This way, 4 different sequences of impairments were generated, each one having 25 impairments.

3. Each sequence of impairments was applied to each of the variants, i.e., impairments were introduced in the first halves of the segments accordingly.

The resulting sequences have the structure shown in Figure A.1, where the impairments introduced in each of the evaluation periods Ti belong to the same impairment set. Table A.8 shows an example of some of them: the first 13 impairments introduced in each of the variants of the sports sequence in the final tests.

Table A.8: Example of a sequence of impairments

Variant   T1   T2   T3   T4   T5   T6   T7   T8   T9   T10   T11   T12   T13   ...
A         A4   V4   E8   E4   E1   A5   F2   A4   E5   A3    V1    A6    F1    ...
B         A6   AV   E7   E3   E1   A6   F1   V2   E6   A1    E3    A5    R1    ...
C         A5   V3   E5   E6   N    A4   R2   A3   E7   A4    E2    A3    F2    ...
D         A3   AV   E6   E5   N    A3   R1   V3   E8   A2    E4    A4    R2    ...
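The randomization procedure above can be sketched as follows. This is a minimal illustration (the function and variable names are our own, not from the thesis tooling), using the frequencies and set members from Table A.7:

```python
import random

# Frequencies and members from Table A.7 (frequencies sum to 25 slots).
IMPAIRMENT_SETS = {
    "Rate Drop":       (3, ["R1", "R2", "F1", "F2"]),
    "Audio Loss 1":    (3, ["A1", "A2", "A3", "A4"]),
    "Audio Loss 2":    (3, ["A3", "A4", "A5", "A6"]),
    "Macroblocking 1": (3, ["E1", "E1", "N", "N"]),
    "Macroblocking 2": (3, ["E3", "E4", "E5", "E6"]),
    "Macroblocking 3": (3, ["E5", "E6", "E7", "E8"]),
    "Single Loss":     (5, ["V1", "E2", "E3", "E4"]),
    "Outage 1":        (1, ["V2", "V3", "A3", "A4"]),
    "Outage 2":        (1, ["V3", "A4", "AV", "AV"]),
}

def build_variants(rng=random):
    """Build the 4 impairment schedules (one per variant A-D)."""
    # Step 2: randomize the order of the 25 impairment-set occurrences...
    slots = [imps for freq, imps in IMPAIRMENT_SETS.values() for _ in range(freq)]
    rng.shuffle(slots)
    variants = {v: [] for v in "ABCD"}
    for imps in slots:
        # ...and permute the 4 impairments within each set, so each variant
        # receives a different member of the set in that 12-second slot.
        order = rng.sample(imps, len(imps))
        for variant, imp in zip(variants, order):
            variants[variant].append(imp)
    return variants

variants = build_variants(random.Random(7))
assert all(len(seq) == 25 for seq in variants.values())
```

With this construction, every evaluation period Ti holds the four members of one impairment set spread across the four variants, which is the structure visible in Table A.8.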
Figure A.2: Summary of the subjective quality assessment test results

A.2.3 Test sessions

Tests were carried out in the laboratories of the Universidad Politécnica de Madrid. The viewing room was set up to have correct light conditions, according to international standard recommendations for home environment tests. Specifically, a 42-inch Full HD Panasonic television was used, and the observers were placed at a viewing distance of 3 times the height of the TV set. A total number of 42 observers, 35 male and 7 female, took part in the experiment. All of them had normal visual acuity and color vision. The ages of the subjects ranged between 20 and 48 years old, with an average age of 27. A maximum of 4 people took part in each of the assessment sessions. In each session, the viewers were shown one variant of each of the three test sequences (movie, sports, and documentary). This way, each variant was assessed by at least 10 different viewers. Figure A.2 shows a summary of the results for each impairment and content stream.
A.3 Subjective quality assessment of H.264 video encoders

This section describes a set of subjective quality assessment tests, performed as part of a comparative study of the coding quality of several IPTV H.264 encoders. The results of these tests have been used as a benchmark to evaluate NR and RR objective assessment metrics in section 4.4.1. In this test set, 7 different encoder implementations from 7 different manufacturers were analyzed, using several bitrates and several source video sequences. All the source sequences are cuts of television programs in contribution quality. This way, the only impairment introduced in the tests is the one generated by the compression process of the encoder. Tests were Single Stimulus (SS), and they followed the recommendation ITU-R BT.500 [42]. The viewer was presented with a 10-second sequence, whose quality had to be judged using a MOS scale (from 1, "bad", to 5, "excellent"). Four 10-second sequences (two from a football match, two from a live music show) were encoded with seven different implementations of H.264 encoders (from different vendors), each one at five different bitrates: 1.4, 1.7, 2.0, 2.3 and 2.6 Mbps. They were SDTV sequences encoded at Main Profile, level 3. A hidden reference was included as well. The selected content assets came from real contributions of an IPTV network and present demanding coding requirements (movement, textures, capture in interlaced format...). The target bitrates represented the range of real IPTV deployments and the encoder configuration was provided by each vendor. Thus, the environment was as close as possible to a real commercial service. For the tests, 20 non-expert observers, balanced in age and gender, were selected and divided into 4 sessions of 5 participants each. They were presented the sequences, in random order, and asked to evaluate their quality with a MOS scale (1 to 5), according to the specifications in [42].
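MOS values like the ones plotted in Figure A.3 are obtained by averaging the raw votes, usually together with a 95% confidence interval as recommended in ITU-R BT.500. A minimal sketch (the vote values below are made up for illustration, not real test data):

```python
import statistics

def mos_with_ci(votes):
    """Mean opinion score with a 95% confidence interval
    (normal approximation: delta = 1.96 * s / sqrt(n))."""
    n = len(votes)
    mean = statistics.fmean(votes)
    delta = 1.96 * statistics.stdev(votes) / n ** 0.5 if n > 1 else 0.0
    return mean, delta

# Hypothetical votes (1-5 scale) for one encoder at one bitrate
mos, ci = mos_with_ci([4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

One point per (encoder, bitrate) pair computed this way yields the kind of rate-quality curves compared in Figure A.3.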
Additionally, 6 stabilizing cuts were added at the beginning of each viewing session, whose votes were not taken into account for the final results. Figure A.3 shows the results of one of the sequences for all the H.264 encoders. It is worth noting that the different codec implementations obtain quite different marks. This should prevent us from generalizing the behavior of the H.264 standard when only one implementation is used. In other words, there is no common AVC quality for a given content and bitrate; it depends on the specific encoder implementation.
Figure A.3: Subjective MOS for a football video test sequence. Each color represents a different encoder. The original sequence was ranked with MOS=4.2

A.4 Test sequences from IPTV deployments

This section describes the set of sequences used to test some of the algorithms and applications that have been presented in this thesis. The main target of those tests is developing techniques which have to be applicable in real multimedia delivery services. For that reason, test sequences have been selected from streams used to validate services in the field: all of them are captures either from a real field deployment or from a validation laboratory in an IPTV service. There is therefore more interest in the way the sequences are encoded than in the specific content that was shown at that moment (which is something that can rarely be selected when doing the capture). The properties of the different sequences are described in Table A.9, and their source content is the following:

1. Sequence A is a scene from an action movie (Die Hard 4).

2. Sequence B is a documentary.

3. Sequences C and D are advertisements (the same source sequence with different encoding settings).

The following clarifications can be made about the table:
Table A.9: Test sequences

Sequence               A         B         C         D
TS Bitrate (Mb/s)      2.8       2.5       2.7       2.7
Video                  H.264     H.264     H.264     H.264
Video Bitrate (Mb/s)   2.3       2.0       2.0       2.0
Video Profile          Main      Main      Main      Main
Video Level            3.0       3.0       3.0       3.0
Video Resolution       720x576   544x576   480x576   480x576
Aspect Ratio           16/9      4/3       4/3       4/3
Picture Rate           50i       50i       50i       50i
IDRs                   Yes       No        Yes       Yes
Slices per picture     1         1         1         1
GOP length             100       24        24        12
P frame period         4         4         3         3
Hierarchical GOP       Yes       Yes       No        No
No. of audio streams   2         2         1         1
Audio Format           MP1L2     MP1L2     MP1L2     MP1L2
Audio Bitrate (kb/s)   192       192       192       192

A P frame period of 4 means that there are 3 B frames between consecutive P or I frames (i.e., the structure is IBBBP). Similarly, a P period of 3 represents an IBBP structure.

A hierarchical GOP structure (...IBBBP...) is like the one discussed in section 2.4.1 and depicted in Figure 2.4 on page 31.

As mentioned in section 4.2, in IPTV scenarios it is frequent that some streams use I frames which are not IDRs. This is the case of sequence B.
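The relation between the P frame period and the resulting GOP structure can be illustrated with a short sketch. This is our own illustration (in display order, ignoring the hierarchical-GOP case); the function name is an assumption:

```python
def gop_pattern(gop_length, p_period):
    """Display-order frame types for one GOP: an I frame, then a P frame
    every p_period pictures, with B frames in between."""
    return "".join(
        "I" if i == 0 else ("P" if i % p_period == 0 else "B")
        for i in range(gop_length)
    )

# P period 3 -> IBBP structure (sequences C and D)
print(gop_pattern(12, 3))   # IBBPBBPBBPBB
# P period 4 -> IBBBP structure (sequences A and B)
print(gop_pattern(12, 4))   # IBBBPBBBPBBB
```

The same function also makes the GOP-length differences in Table A.9 concrete: sequence A repeats this pattern over 100 pictures per I frame, while sequence D does so over only 12.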
Bibliography

[1] K. Ahmad and A. C. Begen. IPTV and video networks in the 2015 timeframe: The evolution to medianets. IEEE Communications Magazine, 47(12):68-74, December 2009.

[2] ANSI T1.801.02-1996. American National Standard for Telecommunications - digital transport of video teleconferencing/video telephony signals - performance terms, definitions, and examples, 1996.

[3] A. C. Begen, C. Perkins, and J. Ott. On the use of RTP for monitoring and fault isolation in IPTV. IEEE Network, 24(2):14-19, March-April 2009.

[4] Brix Networks. Video quality measurement algorithms: Scaling IP video services for the real world, 2006.

[5] Broadband Forum. TR-176. ADSL2Plus configuration guidelines for IPTV v3.0, September 2008.

[6] G. Cermak, M. Pinson, and S. Wolf. The relationship among video quality, screen resolution, and bitrate. IEEE Transactions on Broadcasting, 57(2):258-262, June 2011.

[7] G. W. Cermak. Consumer opinions about frequency of artifacts in digital video. IEEE Journal of Selected Topics in Signal Processing, 3(2):336-343, April 2009.

[8] P. Coverdale, S. Moller, A. Raake, and A. Takahashi. Multimedia quality assessment standards in ITU-T SG12. IEEE Signal Processing Magazine, 28(6):91-97, November 2011.

[9] J. M. Cubero, A. M. Sanz, E. Estalayo, P. Perez, F. Jaureguizar, J. Cabrera, and J. J. Ruiz. Gestión y aplicación de metadatos asociados al tráfico multimedia en videoconferencia 3D. In XX Jornadas Telecom I+D, September 2010. Valladolid, Spain.
[10] J. M. Cubero, J. Gutierrez, P. Perez, E. Estalayo, J. Cabrera, F. Jaureguizar, and N. Garcia. Providing 3D video services: The challenge from 2D to 3DTV quality of experience. Bell Labs Technical Journal, 16(4):115-134, March 2012.

[11] N. Degrande, K. Laevens, D. Vleeschauwer, and R. Sharpe. Increasing the user perceived quality for IPTV services. IEEE Communications Magazine, 46(2):94-100, February 2008.

[12] C. Diaz, J. Cabrera, F. Jaureguizar, and N. Garcia. A video-aware FEC-based unequal loss protection scheme for RTP video streaming. In IEEE Int. Conf. on Consumer Electronics, ICCE 2011, January 2011. Las Vegas (NV), United States.

[13] R. Dosselmann and X. Yang. A comprehensive assessment of the structural similarity index. Signal, Image and Video Processing, 5:81-91, March 2011.

[14] M. Ellis and C. Perkins. Packet loss characteristics of IPTV-like traffic on residential links. In IEEE Consumer Communications and Networking Conference, CCNC 2010, January 2010. Las Vegas (NV), United States.

[15] U. Engelke and H.-J. Zepernick. Perceptual-based quality metrics for image and video services: a survey. In Conf. on Next Generation Internet Networks, May 2007. Trondheim, Norway.

[16] U. Engelke, T. M. Kusuma, and H.-J. Zepernick. Perceptual quality assessment of wireless video applications. In Int. Symposium on Turbo Codes & Related Topics, April 2006. Munich, Germany.

[17] B. Erman and E. P. Matthews. Analysis and realization of IPTV service quality. Bell Labs Technical Journal, 12(4):195-212, February 2008.

[18] ETSI TS 101 154 v1.10.1. Digital Video Broadcasting (DVB); specification for the use of video and audio coding in broadcasting applications based on the MPEG-2 transport stream, 2011.

[19] ETSI TS 102 034 v1.4.1. Digital Video Broadcasting (DVB); transport of MPEG-2 based DVB services over IP based networks, 2009.

[20] Y. Fan, J. Wang, T. Ikenaga, Y. Tsunoo, and S. Goto.
An unequal secure encryption scheme for H.264/AVC video compression standard. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 91(1):12-21, January 2008.

[21] M. C. Q. Farias and S. K. Mitra. No-reference video quality metric based on artifact measurement. In IEEE Int. Conf. on Image Processing, ICIP 2005, September 2005. Genoa, Italy.
[22] T. Friedman, R. Caceres, and A. Clark. RTP Control Protocol Extended Reports (RTCP XR). RFC 3611 (Proposed Standard), November 2003.

[23] F. Gabin, M. Kampmann, T. Lohmar, and C. Priddle. 3GPP mobile multimedia streaming standards [standards in a nutshell]. IEEE Signal Processing Magazine, 27(6):134-138, November 2010.

[24] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Subjective assessment of the impact of transmission errors in 3DTV compared to HDTV. In IEEE 3DTV Conference, May 2011. Antalya, Turkey.

[25] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Subjective evaluation of transmission errors in IPTV and 3DTV. In IEEE Visual Communications and Image Processing, VCIP 2011, November 2011. Tainan, Taiwan.

[26] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Monitoring packet loss impact in IPTV and 3DTV receivers. In IEEE Int. Conf. on Consumer Electronics, ICCE 2012, January 2012. Las Vegas (NV), United States.

[27] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Subjective study of adaptive streaming strategies for 3DTV. In IEEE Int. Conf. on Image Processing, ICIP 2012, October 2012. Orlando (FL), United States.

[28] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Validation of a novel approach to subjective quality evaluation of conventional and 3D broadcasted video services. In Int. Workshop on Quality of Multimedia Experience, QoMEX 2012, July 2012. Yarra Valley, Australia.

[29] H. Ha, J. Park, S. Lee, and A. C. Bovik. Perceptually unequal packet loss protection by weighting saliency and error propagation. IEEE Transactions on Circuits and Systems for Video Technology, 20(9):1187-1199, September 2010.

[30] R. Haimi-Cohen. Prioritized retransmission of internet protocol television (IPTV) packets, December 2008. US Patent Proposal US 2010/0138885 A1.

[31] D. S. Hands. A basic multimedia quality model.
IEEE Transactions on Multimedia, 6(6):808-816, December 2004.

[32] S. Hawley and G. Schultz. IPTV Video Quality: QoS & QoE. Quarterly Technology and Content Report. Multimedia Research Group, Inc., February 2007.

[33] S. S. Hemami and A. R. Reibman. No-reference image and video quality estimation: Applications and human-motivated design. Signal Processing: Image Communication, 25(7):469-481, August 2010.
[34] O. Hohlfeld, R. Geib, and G. Hasslinger. Packet loss in real-time services: Markovian models generating QoE impairments. In IEEE Int. Workshop on Quality of Service, IEEE IWQoS 2008, June 2008. Enschede, the Netherlands.

[35] Q. Huynh-Thu, M.-N. Garcia, F. Speranza, P. Corriveau, and A. Raake. Study of rating scales for subjective quality assessment of high-definition video. IEEE Transactions on Broadcasting, 57(1):1-14, March 2011.

[36] ISO/IEC 13818-1:2007. Information technology - generic coding of moving pictures and associated audio information: Systems, 2007.

[37] ISO/IEC 13818-2:2000. Information technology - generic coding of moving pictures and associated audio information: Video, 2000.

[38] ISO/IEC 14496-10:2012. Information technology - coding of audio-visual objects - Part 10: Advanced Video Coding, 2012.

[39] ISO/IEC 23009-1:2012. Information technology - dynamic adaptive streaming over HTTP (DASH) - Part 1: Media presentation description and segment formats, 2012.

[40] O. Issa, W. Li, H. Liu, F. Speranza, and R. Renaud. Quality assessment of high definition TV distribution over IP networks. In IEEE Int. Symposium on Broadband Multimedia Systems and Broadcasting, BMSB 2009, May 2009. Bilbao, Spain.

[41] ITU-R Tech. Rec. BS.1387. Method for objective measurements of perceived audio quality, 2001.

[42] ITU-R Tech. Rec. BT.500-11. Methodology for the subjective assessment of the quality of television pictures, 2002.

[43] ITU-T Tech. Rec. G.1080. Quality of experience requirements for IPTV services, 2008.

[44] ITU-T Tech. Rec. G.1081. Performance monitoring points for IPTV, 2008.

[45] ITU-T Tech. Rec. J.144. Objective perceptual video quality measurement techniques for digital cable television in the presence of a full reference, 2004.

[46] ITU-T Tech. Rec. J.147. Objective picture quality measurement method by use of in-service test signals, 2002.

[47] ITU-T Tech. Rec. J.247.
Objective perceptual multimedia video quality measurement in the presence of a full reference, 2008.
[48] ITU-T Tech. Rec. J.249. Perceptual video quality measurement techniques for digital cable television in the presence of a reduced reference, 2010.

[49] ITU-T Tech. Rec. J.341. Objective perceptual multimedia video quality measurement of HDTV for digital cable television in the presence of a full reference, 2011.

[50] ITU-T Tech. Rec. J.342. Objective multimedia video quality measurement of HDTV for digital cable television in the presence of a reduced reference signal, 2011.

[51] ITU-T Tech. Rec. P.863. Perceptual objective listening quality assessment, 2011.

[52] ITU-T Tech. Rec. P.862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001.

[53] ITU-T Tech. Rec. P.910. Subjective video quality assessment methods for multimedia applications, 2008.

[54] ITU-T Tech. Rec. P.911. Subjective audiovisual quality assessment methods for multimedia applications, 1998.

[55] ITU-T Tech. Rec. Y.1910. IPTV functional architecture, 2008.

[56] S. H. Jumisko, V. P. Ilvonen, and K. A. Vaananen-Vainio-Mattila. Effect of TV content in subjective assessment of video quality on mobile devices. In Proc. SPIE, Multimedia on Mobile Devices, volume 5684, pages 243-254, March 2005.

[57] S. Jumisko-Pyykko and J. Korhonen. Unacceptability of instantaneous errors in mobile television: from annoying audio to video. In 8th Conf. on Human-computer interaction with mobile devices and services, September 2006. Espoo, Finland.

[58] S. Kanumuri, S. Subramanian, P. Cosman, and A. Reibman. Predicting H.264 packet loss visibility using a generalized linear model. In IEEE Int. Conf. on Image Processing, ICIP 2006, September 2006. Atlanta (GA), United States.

[59] M. Knee. The picture appraisal rating (PAR) - a single-ended picture quality measure for MPEG-2. In Int. Broadcasting Convention, September 2000. Amsterdam, the Netherlands.

[60] R. Kooij, K. Ahmed, and K. Brunnström. Perceived quality of channel zapping. In IASTED Int. Conf. Commun. Sys. and Networks, August 2006. Palma de Mallorca, Spain.
[61] K. Kunert, E. Uhlemann, and M. Jonsson. Enhancing reliability in IEEE 802.11 based real-time networks through transport layer retransmissions. In Int. Symposium on Industrial Embedded Systems, July 2010. Trento, Italy.

[62] Y. Kuszpet, D. Kletsel, Y. Moshe, and A. Levy. Post-processing for flicker reduction in H.264/AVC. In Picture Coding Symposium, PCS 2007, November 2007. Lisbon, Portugal.

[63] P. Le Callet, C. Viard-Gaudin, and D. Barba. A convolutional neural network approach for objective video quality assessment. IEEE Transactions on Neural Networks, 17(5):1316-1327, September 2006.

[64] A. Leontaris and A. R. Reibman. Comparison of blocking and blurring metrics for video compression. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP 2005, March 2005. Philadelphia (PA), United States.

[65] Y. Liang, J. Apostolopoulos, and B. Girod. Analysis of packet loss for compressed video: does burst-length matter? In IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP 2003, April 2003. Hong Kong, China.

[66] T. L. Lin, S. Kanumuri, Y. Zhi, D. Poole, P. C. Cosman, and A. R. Reibman. A versatile model for packet loss visibility and its application to packet prioritization. IEEE Transactions on Image Processing, 19(3):722-735, March 2010.

[67] A. A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, Y. Zhang, and Q. Zhao. Towards automated performance diagnosis in a large IPTV network. ACM SIGCOMM Computer Communication Review, 39:231-242, August 2009.

[68] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi. A no-reference perceptual blur metric. In IEEE Int. Conf. on Image Processing, ICIP 2002, September 2002. Rochester (NY), United States.

[69] A. Massoudi, F. Lefebvre, C. De Vleeschouwer, B. Macq, and J. Quisquater. Overview on selective encryption of image and video: challenges and perspectives. EURASIP Journal on Information Security, 2008, December 2008.

[70] R. Mekuria, P. Cesar, and D. Bulterman.
Digital TV: the effect of delay when watching football. In 10th European Conf. on Interactive TV and Video, July 2012. Berlin, Germany.

[71] V. Miguel, J. Cabrera, F. Jaureguizar, and N. Garcia. High-definition video distribution in 802.11g home wireless networks. In IEEE Int. Conf. on Consumer Electronics, ICCE 2011, pages 213-214, Las Vegas (NV), United States, January 2011.
[72] M.-J. Montpetit, T. Mirlacher, and M. Ketcham. IPTV: An end to end perspective (invited paper). Journal of Communications, 5(5):358-373, August 2010.

[73] BBC News. Olympics bring 55 million visits to BBC Sport online, August 2012. http://www.bbc.com/news/technology-19242083.

[74] T. Oelbaum, C. Keimel, and K. Diepold. Rule-based no-reference video quality evaluation using additionally coded videos. IEEE Journal of Selected Topics in Signal Processing, 3(2):294-303, April 2009.

[75] Open IPTV Forum. Release 2 specification volume 2a - HTTP adaptive streaming v2.1, 2011.

[76] Open IPTV Forum. Release 2 specification volume 2 - media formats v2.1, 2011.

[77] J. Ott, S. Wenger, N. Sato, C. Burmeister, and J. Rey. Extended RTP Profile for Real-time Transport Control Protocol (RTCP)-Based Feedback (RTP/AVPF). RFC 4585 (Proposed Standard), July 2006.

[78] T. N. Pappas and R. J. Safranek. Perceptual criteria for image quality evaluation. In Handbook of Image and Video Processing, pages 669-684. Academic Press, 2000.

[79] R. R. Pastrana-Vidal and C. Colomes. Perceived quality of an audio signal impaired by signal loss: psychoacoustic tests and prediction model. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP 2007, April 2007. Honolulu (HI), United States.

[80] W. Pattara-Atikom, S. Banerjee, and P. Krishnamurthy. Predicting the quality of video transmission over best effort network service. In Int. Conf. on Computer Communications and Networks, ICCCN 2003, October 2003.

[81] P. Perez. Calidad de experiencia en IPTV. Master's thesis, Universidad Politecnica de Madrid, September 2008. Trabajo de Investigación en Tecnologias y Sistemas de Comunicaciones.

[82] P. Perez and N. Garcia. Lightweight multimedia packet prioritization model for unequal error protection. IEEE Transactions on Consumer Electronics, 57(1):132-138, February 2011.

[83] P. Perez and J. J. Ruiz.
Encryption procedure and device for an audiovisual data stream, April 2011. European Patent Application EP 2,309,745 (Published). [84] P. Perez, J. J. Ruiz, and N. Garcia. Calidad de experiencia en servicios multimedia sobre IP. In XX Jornadas Telecom I+D, September 2010. Valladolid, Spain.
[85] P. Perez, J. Gutierrez, J. J. Ruiz, and N. Garcia. Qualitative monitoring of video over a packet network. In IEEE Int. Symposium on Multimedia, December 2011. Dana Point (CA), United States.

[86] P. Perez, J. Macias, J. J. Ruiz, and N. Garcia. Effect of packet loss in video quality of experience. Bell Labs Technical Journal, 16(1):91-104, June 2011.

[87] P. Perez, J. J. Ruiz, A. Villegas, K. V. Damme, C. V. Boven, J. Dupont, and P. A. Molina-Salmeron. Multi-vendor video headend convergence solution. Bell Labs Technical Journal, 17(1):185-200, June 2012.

[88] P. Perez, A. Villegas, and J. J. Ruiz. Method, system and devices for improved adaptive streaming of media content, January 2012. European Patent Application No. 12382006.0 (Filed).

[89] M. H. Pinson and S. Wolf. A new standardized method for objectively measuring video quality. IEEE Transactions on Broadcasting, 50(3):312-322, September 2004.

[90] M. H. Pinson, W. Ingram, and A. Webster. Audiovisual quality components. IEEE Signal Processing Magazine, 28(6):60-67, November 2011.

[91] F. Porikli, A. Bovik, C. Plack, G. AlRegib, J. Farrell, P. Le Callet, Q. Huynh-Thu, S. Moller, and S. Winkler. Multimedia quality assessment [DSP Forum]. IEEE Signal Processing Magazine, 28(6):164-177, November 2011.

[92] A. Raake, J. Gustafsson, S. Argyropoulos, M. Garcia, D. Lindegren, G. Heikkila, M. Pettersson, P. List, and B. Feiten. IP-based mobile and fixed network audiovisual media services. IEEE Signal Processing Magazine, 28(6):68-79, November 2011.

[93] A. R. Reibman and D. Poole. Characterizing packet-loss impairments in compressed video. In IEEE Int. Conf. on Image Processing, ICIP 2007, September 2007. San Antonio (TX), United States.

[94] A. R. Reibman and A. R. Wilkins. Video outage detection: Algorithm and evaluation. In Picture Coding Symposium, PCS 2009, May 2009. Chicago (IL), United States.

[95] A. R. Reibman, V. A. Vaishampayan, and Y. Sermadevi.
Quality monitoring of video over a packet network. IEEE Transactions on Multimedia, 6(2):327-334, April 2004.
[96] D. C. Robinson and A. Villegas. Intelligent wrapping of video content to lighten downstream processing of video streams, June 2009. European Patent Application 2,071,850 (Published).

[97] S. H. Russ and S. Haghani. 802.11g packet-loss behavior at high sustained bit rates in the home. IEEE Transactions on Consumer Electronics, 55(2):788-791, May 2009.

[98] S. Saha and R. Vemuri. An analysis on the effect of image features on lossy coding performance. IEEE Signal Processing Letters, 7(5):104-107, May 2000.

[99] W. B. P. Schallauer. Studies in Computational Intelligence: Multimedia Semantics - The Role of Metadata, chapter Metadata in the Audiovisual Media Production Process, pages 65-84. Springer Berlin / Heidelberg, 2008.

[100] T. Shi, B. King, and P. Salama. Selective encryption for H.264/AVC video coding. In Proc. SPIE, Electronic Imaging, volume 6072, page 607217, 2006.

[101] D. Singer and H. Desineni. A General Mechanism for RTP Header Extensions. RFC 5285 (Proposed Standard), July 2008.

[102] C. W. Snyder, U. K. Sarkar, and D. Sarkar. Effects of cell loss on MPEG video: analytical modeling and empirical validation. In IEEE Int. Conf. on Multimedia and Expo, ICME 2002, volume 2, pages 457-460. IEEE, 2002.

[103] B. V. Steeg, A. Begen, T. V. Caenegem, and Z. Vax. Unicast-Based Rapid Acquisition of Multicast RTP Sessions. RFC 6285 (Proposed Standard), June 2011.

[104] T. Stockhammer. Dynamic adaptive streaming over HTTP: standards and design principles. In ACM Conf. on Multimedia Systems, February 2011. San Jose (CA), United States.

[105] S. Süsstrunk and S. Winkler. Color image quality on the internet. In Proc. SPIE, IS&T Internet Imaging, pages 118-131, January 2004.

[106] M. Tagliasacchi, G. Valenzise, M. Naccari, and S. Tubaro. A reduced-reference structural similarity approximation for videos corrupted by channel errors. Multimedia Tools and Applications, 48:471-492, 2010.

[107] M. Verhoeyen, D. De Vleeschauwer, and D. Robinson.
Content storage architectures for boosted IPTV service. Bell Labs Technical Journal, 13(3):29 43, September 2008. [108] A. Villegas, K. Chow, C. V. Boven, and P. Perez. Content delivery method, June 2011. European Patent Application EP 2,538,629 (Published).
[109] A. Villegas, P. Perez, J. M. Cubero, E. Estalayo, and N. Garcia. Network assisted content protection architectures for a connected world. Bell Labs Technical Journal, 16(4):85–96, March 2012.
[110] T. Vlachos. Detection of blocking artifacts in compressed video. Electronics Letters, 36(13):1106–1108, June 2000.
[111] VQEG. Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment, Phase II, 2003.
[112] VQEG. Validation of reduced-reference and no-reference objective models for standard definition television, Phase I. Technical report, 2009.
[113] VQEG. Monitoring of audiovisual quality by key indicators, 2012. Draft available online at http://www.its.bldrdoc.gov/vqeg/.
[114] Z. Wang and E. P. Simoncelli. Reduced-reference image quality assessment using a wavelet-domain natural image statistic model. In Proc. SPIE, Human Vision and Electronic Imaging, volume 5666, pages 149–159, 2005.
[115] Z. Wang, A. C. Bovik, and B. L. Evans. Blind measurement of blocking artifacts in images. In IEEE Int. Conf. on Image Processing, ICIP 2000, September 2000. Vancouver, Canada.
[116] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004.
[117] A. A. Webster, C. T. Jones, M. H. Pinson, S. D. Voran, and S. Wolf. An objective video quality assessment system based on human perception. In Proc. SPIE, Human Vision, Visual Processing, and Digital Display IV, pages 15–26, 1993.
[118] J. Welch and J. Clark. A Proposed Media Delivery Index (MDI). RFC 4445 (Informational), April 2006.
[119] S. Winkler. Video quality measurement standards - current status and trends. In Int. Conf. on Information, Communications and Signal Processing, ICICS 2009, December 2009. Macau, China.
[120] S. Winkler. Digital Video Quality: Vision Models and Metrics. John Wiley & Sons, January 2005.
[121] S. Winkler and P. Mohandas. The evolution of video quality measurement: From PSNR to hybrid metrics. IEEE Transactions on Broadcasting, 54(3):660–668, September 2008.
[122] H. R. Wu and M. Yuen. A generalized block-edge impairment metric for video coding. IEEE Signal Processing Letters, 4(11):317–320, November 1997.
[123] F. Yang, S. Wan, Y. Chang, and H. R. Wu. A novel objective no-reference metric for digital video quality assessment. IEEE Signal Processing Letters, 12(10):685–688, October 2005.
[124] F. You, W. Zhang, and J. Xiao. Packet loss pattern and parametric video quality model for IPTV. In IEEE/ACIS Int. Conf. on Computer and Information Science, June 2009. Shanghai, China.
[125] Y. Zou, T. Huang, W. Gao, and L. Huo. H.264 video encryption scheme adaptive to DRM. IEEE Transactions on Consumer Electronics, 52(4):1289–1297, November 2006.