MPEG-4 Technology Strategy Analysis


Mihai Burlacu, Sonja Kangas
Research Seminar on Telecommunications Business II
Telecommunications Software and Multimedia Laboratory, Helsinki University of Technology
VTT Information Technology
Last modified on 31 March

Abstract

Low bit-rate multimedia streaming over IP networks is the next natural market demand after the success of MPEG-2 with DVD and satellite digital broadcasting. Support for transfer over the Internet and for on-demand systems has made MPEG-4 a very successful tool, and its future forecasts are promising. In this paper we present the technical description as well as the current business status of MPEG-4. Initially, general aspects concerning MPEG-4 are presented, followed by more detailed technical explanations of the actual implementations. A brief general comparison with the other members of the MPEG family is discussed, and then the specific technical requirements are studied. The new concepts characteristic of the MPEG-4 system are presented in more detail. The paper assumes the reader is fairly familiar with multimedia-specific concepts and modes of operation. A light market analysis of the MPEG-4 standard will reveal how the market reacts to and adopts this new standard. The business area is studied from the point of view of market presence, as well as the future trends and opportunities that the standard will bring to consumers. An analysis of the solutions developed by leading companies in the segment is also presented.

Contents

Abstract
Contents
List of central acronyms
Introduction
MPEG-4 Specification Organizations
MPEG Standard Family
  Other video encoding standards: H.261, H.263 and H.264
MPEG-4
  MPEG-4 Structure
    Transmission/Storage Medium
    Delivery Layer
    Sync Layer
    Multimedia Layer
  Audio - video coding
    Video codecs
    Key MPEG-4 extensions
Business strategies
  Business model for video delivery to wireless terminals
  Business applications examples
    Envivio case model
    DivX networks case
    Open Video System
Results and discussion
Summary
References

List of central acronyms

3GPP - 3rd Generation Partnership Project
AFX - Animation Framework eXtension
BIFS - Binary Format for Scenes (a description and interactive control language)
DMIF - Delivery Multimedia Integration Framework
IETF - Internet Engineering Task Force
ISMA - Internet Streaming Media Alliance
ISO/IEC - International Organization for Standardization / International Electrotechnical Commission
ITU - International Telecommunications Union
MPEG - Moving Picture Experts Group
MPEG-4 AVC - MPEG-4 Advanced Video Coding (H.264)
M4IF - MPEG-4 Industry Forum (a non-profit organization)
VRML - Virtual Reality Modeling Language
WMF - Wireless Multimedia Forum

Introduction

Rich multimedia, including visual communication and video capabilities, has been said to be one of the key development areas in digital mobile media. The anywhere-anytime 1 idea supports the advancement of mobile multimedia delivery. In the future, wireless bandwidth will be valuable: delivering the best audio and image quality at the lowest bandwidth will be a key part of developing services around digital media. Besides its usage in the mobile area, enhanced video compression technologies will also play a central role in the development of digital television, real-time interactive applications, synthetic content creation and animated chats, just to name a few areas of implementation. MPEG-4 also plays a central role as an application development platform; it is expected to be a core tool for applications in 3G handsets and other video-capable wireless devices. MPEG-4 is not just another digital standard that improves on earlier standards by bringing higher-performance elements into its architecture. Compared to previous digital multimedia standards (like MPEG-1 or MPEG-2), it brings new concepts into the picture. One major step forward is the possibility to incorporate object-oriented techniques. Obviously, backward compatibility must be preserved in order for the system to remain usable when dealing with old content versions. MPEG-1 and MPEG-2 were targeted mainly at natural content transmission (especially digital television, video disks, DVDs etc.). MPEG-1 was designed to deal mainly with local storage media (like video disks), while MPEG-2 extended its capabilities with better handling of error-prone transmission media. Object-oriented handling of multimedia streams became more and more necessary, and supporting synthetic content was a natural step for MPEG-4. As we shall see later, the handling of synthetic content was "inspired" by the Virtual Reality Modeling Language (VRML).
The MPEG-4 standard is built on the needs of authors, service providers and users to get higher flexibility and to take more advantage of the fast-deploying Internet. For end users, MPEG-4 brings a higher level of interaction with content. It also brings multimedia to new networks and devices, including relatively low bit-rate

1 The idea of anywhere implies that the technology for these products must be portable and operate under low-power, mobile conditions: on a train, in the airport, in your car. Anytime means that the consumer will decide when to use this technology, when to communicate, or when to access information, according to his or her own schedule.

Burlacu & Kangas 4

and mobile ones. For authors, MPEG-4 allows easy manipulation of data: for instance, many MPEG-4 feeds can be combined together and edited on the fly. MPEG-4 allows content providers to encode (compress) once and deliver everywhere; a single stream can be delivered via cable, satellite and wireless, and can be provided over multiple bit rates. MPEG-4 enables the production of reusable, flexible content, and the issue of IPR 2 has been taken under consideration more carefully. For network service providers it offers transparent information, which can be interpreted and translated into the appropriate native signaling messages of each network with the help of the relevant standards bodies. MPEG-4 uses improved concepts for visual and audio content to handle both the natural and the synthetic nature of objects: media objects. The description of the composition of media objects introduces many new concepts with respect to the old versions of MPEG. Multiplexing 3 and synchronization of the data associated with media objects are available for transport over network channels providing QoS appropriate to the nature of the specific media objects, and interaction with the audiovisual scene generated at the receiver's end is possible. In short, MPEG-4 is a set of specifications for:

1) Representing aural, visual and audiovisual content, called media objects. These objects can be natural or synthetic.
2) Delivering these independent, layered media objects over heterogeneous networks by using streaming protocols (composition of media objects).
3) Rendering and presenting scenes dynamically (compared to the web).
4) Making it possible for the receiver to interact with the audiovisual scene generated at the decoder.

2 IPR - Intellectual property rights can be defined as the rights given to people over the creations of their minds. They usually give the creator an exclusive right over the use of his/her creations for a certain period of time.
3 Multiplexing, or muxing, when speaking of video and video editing, basically means a process where separate parts of the video (or 'streams', as they are called in video terminology) are joined together into one file.
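The object-based model in points 1)-4) can be sketched in a few lines of Python. This is an illustration only; the class and field names below are ours, not part of the MPEG-4 specification. The point is that a scene is a composition of independent media objects, each natural or synthetic, that can be added, removed or manipulated individually.

```python
from dataclasses import dataclass, field

@dataclass
class MediaObject:
    # Hypothetical model of an MPEG-4 media object: each object is
    # carried in its own elementary stream(s) and can be natural
    # (camera video, recorded audio) or synthetic (3D mesh, speech).
    name: str
    kind: str        # e.g. "video", "audio", "3d-mesh", "synthetic-speech"
    natural: bool

@dataclass
class Scene:
    # The audiovisual scene composes independent media objects;
    # the receiver can add, remove or reposition them individually.
    objects: list = field(default_factory=list)

    def add(self, obj: MediaObject):
        self.objects.append(obj)

    def remove(self, name: str):
        self.objects = [o for o in self.objects if o.name != name]

scene = Scene()
scene.add(MediaObject("news_anchor", "video", natural=True))
scene.add(MediaObject("background", "video", natural=False))
scene.add(MediaObject("voice", "audio", natural=True))
scene.remove("background")               # objects are manipulated independently
print([o.name for o in scene.objects])   # ['news_anchor', 'voice']
```

In contrast to frame-based MPEG-1/MPEG-2, where the picture is one opaque unit, here the synthetic background can be dropped or replaced without touching the other streams.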

This report contains the general picture of the MPEG-4 standard and presents some of its qualities. Because MPEG-4 is the next level of development from MPEG-1 and MPEG-2, we shall compare MPEG-4 to them, but also to other video compression standards like H.261 and H.263. In the latter half of this paper we concentrate on the value web and current commercial applications of MPEG-4. Our focus will also be on wireless applications, as they are one of the imminent areas of development. At the end of this report we picture the future development of technical standards and business solutions.

MPEG-4 Specification Organizations

MPEG-4 is targeted at a wide range of applications, involving cooperation with a multitude of parties. The development of the MPEG-4 standard was not done separately from the elements it was targeted for, so the involvement of other standardization parties was needed for MPEG-4's interoperability success. Several companies have also been active in the development of the standard. This report is based on reports of several MPEG-4 organizations. The central one is MPEG (Moving Picture Experts Group), which is also the name of a family of standards used for coding audiovisual information (e.g. video, audio, movies) in digital form. There are also several other active parties, e.g. 3GPP (The 3rd Generation Partnership Project), a global cooperation between six organizational partners (ARIB, CWTS, ETSI, T1, TTA and TTC), the world's major standardization bodies from Asia, Europe and the USA. 3GPP produces technical specifications and reports for a 3rd Generation Mobile System based on evolved GSM core networks and the radio access technologies that they support. The Internet Streaming Media Alliance (ISMA) is a consortium of companies developing specifications and products that use parts of MPEG-4 as well as non-standard extensions, and the Internet Engineering Task Force (IETF) is an international community of network designers, operators, vendors and researchers; they develop audio-video payloads and the Real-time Transport Protocol, just to name a few of their activities. The focus of M4IF (MPEG-4 Industry Forum) is to establish MPEG-4 as an accepted and widely used standard among application developers, service providers, content creators and users. The Wireless Multimedia Forum (WMF), in turn, aims at establishing technology consensus around a set of protocols suitable for streaming multimedia over a wireless network.

MPEG Standard Family

The Moving Picture Experts Group (MPEG) was created to develop video compression standards. At the moment there are three MPEG standards available and two in process: MPEG-1 (ISO-11172), MPEG-2 (ISO-13818) and MPEG-4 (ISO-14496) are available, while the multimedia content description interface MPEG-7 (ISO-15938) and the multimedia framework standard MPEG-21 (ISO-21000) are under development. MPEG-1 was released first. It is a standard for storable multimedia, mostly for CD-ROM systems; the idea was to get VHS-level image and sound at the speed of a CD-ROM drive. MPEG-1 was designed for compressing video, but it has lately received a lot of attention for its good sound compression abilities: MPEG-1 Audio Layer 3 (MP3) is one example of this development, and is currently one of the most widely spread sound compression formats. The typical transfer speed of MPEG-1 is that of VHS-quality video (1.5 Mbit/s). The MPEG-1 standard defined the following parts:

1. Sound and picture synchronization into one data stream by time stamps
2. Video data coding
3. Sound data coding
4. Testing methods for conformance with the standard
5. A technical report on the implementation of the first three parts of the standard

The next step was the release of MPEG-2. It was originally designed for digital television due to the need for better image and sound quality. Because of the increased demand for image quality, MPEG-2 was designed to handle larger transfer speeds. The MPEG-2 standard consists of:

1. Systems (a basic element of MPEG)
2. Video (-"-)
3. Audio (-"-)
4. Compliance testing
5. Software simulation
6. Protocols handling the control of the bit streams

7. Specification for advanced multichannel sound coding (an improved version of the MPEG-1 sound compression format)
8. Support for video coding beyond 8-bit sampling accuracy (development was stopped because of the lack of need for this)
9. Definition of the real-time interface for decoders transferring the bit stream over the network

The MPEG-2 standard is used for several purposes. Lately the evolution of Finnish digital television has raised new needs for video encoding, and MPEG-2 has been widely used in R&D cases. For the wide variety of usages, specific profiles and levels have been created. Profiles are basically groups of tools that exist at a number of levels: the profiles define how and what tools should be used for creating the bit stream, while the levels define what values every parameter can have in a particular profile. There are five profiles: 1) Simple 2) Main 3) SNR scalable 4 4) Spatially scalable 5) High. The basic structure of MPEG-2 is very similar to MPEG-1; many features have just been improved. The most important differences are the higher resolution and the improved transfer speeds. After MPEG-2, a so-called HDTV standard, MPEG-3, was also developed, but it was then dropped and included in the MPEG-2 standard. MPEG-4 was released in 1998 and another version in 2000 (with MPEG-4 AVC following in 2002). In MPEG-4 the audiovisual presentation consists of separate and independent media objects that can be sound, video, still images, 3D objects and so on.

4 SNR - Signal-to-Noise Ratio. The ratio of the amplitude of the desired signal to the amplitude of noise signals at a given point in time.
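The profile/level mechanism described above can be illustrated with a short sketch. Note that the tool sets and numeric limits below are placeholder values we invented for illustration; they are not the normative MPEG-2 tables.

```python
# Sketch of the profile/level idea: a profile selects a set of coding
# tools, a level bounds the parameter values. The entries below are
# illustrative placeholders, NOT the normative MPEG-2 limits.
PROFILES = {
    "Simple": {"tools": {"I-frames", "P-frames"}},
    "Main":   {"tools": {"I-frames", "P-frames", "B-frames"}},
}
LEVELS = {
    "Main": {"max_width": 720, "max_height": 576, "max_bitrate_mbps": 15},
}

def conforms(stream, profile, level):
    """Check that a bitstream uses only the profile's tools and stays
    within the level's parameter bounds."""
    p, l = PROFILES[profile], LEVELS[level]
    if not stream["tools"] <= p["tools"]:   # tool outside the profile?
        return False
    return (stream["width"] <= l["max_width"]
            and stream["height"] <= l["max_height"]
            and stream["bitrate_mbps"] <= l["max_bitrate_mbps"])

s = {"tools": {"I-frames", "P-frames"}, "width": 704, "height": 576, "bitrate_mbps": 6}
print(conforms(s, "Simple", "Main"))   # True
```

A decoder advertising "Main profile at Main level" promises to handle any stream for which such a check passes.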

Figure 1: The MPEG-4 format can be divided into several profiles, and an MPEG-4 device or application can support only the subset it needs (Koenen 98).

MPEG-4 is notably enhanced compared to MPEG-1 and MPEG-2. The idea of compression has shifted to object-oriented compression of image and sound: the objects in the image (cat, dog, fish) are identified and handled separately. Audiovisual object (often named media object in the literature) types can be: natural video, sound, compressed speech, synthesized speech, polyphonic music (containing MIDI and so on), 2-D and 3-D objects, and animated 3-D faces which can be synchronized with the synthesized speech (compare to Koenen 2002).

Other video encoding standards: H.261, H.263 and H.264

The standards H.261 and H.263 are primarily used for teleconferencing purposes. Although MPEG is mainly used for motion pictures, its new enhancements allow its usage in videoconferencing applications, so a brief description of them is in place in this report. The picture below is from the March Networks technology report (see March Networks 2001).

Figure 2: The evolution of video compression (March 2001)

H.261 is a video coding standard published by the International Telecommunication Union 5. It was designed for data rates that are multiples of 64 kbit/s and is sometimes called p x 64 kbit/s (p is in the range 1-30). These data rates suit ISDN lines, for which this video codec was designed. H.261 is meant for low-quality links with a

5 The International Telecommunication Union (ITU) has several tasks. One important one is to cooperate with regional intergovernmental organizations and those non-governmental organizations concerned with telecommunications. ITU acts actively on standardization. It has three primary sectors: radio communication, standardization and development (trends, industry work).

reasonably high error rate (while MPEG is meant for networks with a low bit error rate). In H.261 the coding algorithm is a hybrid of inter-picture prediction, transform coding and motion compensation. The data rate of the coding algorithm was designed to be adjustable between 40 kbit/s and 2 Mbit/s. Inter-picture prediction removes temporal redundancy, transform coding removes spatial redundancy, and motion vectors are used to help the codec compensate for motion. To remove any further redundancy in the transmitted bit stream, variable-length coding is used. H.261 supports two resolutions, QCIF 6 (Quarter Common Interchange Format) and CIF 7 (Common Interchange Format) (AXIS 2002).

H.263 was designed for low bit-rate communication. Early drafts specified data rates less than 64 kbit/s, but this limitation has since been removed, and it is expected that the standard will be used for a wide range of bit rates, not just low bit-rate applications. It is expected that H.263 will replace H.261 in many applications. The coding algorithm of H.263 is similar to that of H.261, but it contains improvements and changes that enhance performance and error recovery. Half-pixel precision is used for motion compensation (in H.261, full-pixel precision was used, combined with a loop filter). There are four negotiable options included to improve performance: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and forward and backward frame prediction similar to MPEG's P- and B-frames. H.263 supports five resolutions: QCIF, CIF, SQCIF, 4CIF and 16CIF. SQCIF is approximately half the resolution of QCIF, while 4CIF and 16CIF are 4 and 16 times the resolution of CIF respectively (ITU 2003). The support of 4CIF and 16CIF means the codec can compete with other higher bit-rate video coding standards such as the MPEG standards.
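The p x 64 kbit/s rate family of H.261 and the CIF resolution family of H.263 can be made concrete with a small computation (function and variable names are ours; the figures follow the PAL resolutions quoted in the text):

```python
# H.261 operates at p x 64 kbit/s for p in 1..30; the CIF family
# scales from QCIF up to 16CIF.
def h261_rates_kbps(p_max=30):
    """All p x 64 kbit/s rates offered by H.261."""
    return [p * 64 for p in range(1, p_max + 1)]

CIF_W, CIF_H = 352, 288                 # CIF, PAL variant
resolutions = {
    "QCIF":  (CIF_W // 2, CIF_H // 2),  # quarter the pixels of CIF
    "CIF":   (CIF_W, CIF_H),
    "4CIF":  (CIF_W * 2, CIF_H * 2),    # 4x the pixels of CIF
    "16CIF": (CIF_W * 4, CIF_H * 4),    # 16x the pixels of CIF
}

rates = h261_rates_kbps()
print(rates[0], rates[-1])   # 64 1920  (64 kbit/s up to ~1.9 Mbit/s)
print(resolutions["QCIF"])   # (176, 144)
```

This makes visible why H.261 fits ISDN so well: every offered rate is an integer number of 64 kbit/s B-channels.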
The next-generation H.264 (MPEG-4 AVC, or Part 10) standard was released at the end of the year. It has been developed to improve the quality of video pictures over wireless networks towards DVD sharpness and to make better use of the bandwidth of digital video services. This is good news for parties developing 3G wireless applications. The standard defines the syntax for encoding the video bit stream together with the method of decoding this bit stream. It does not define a codec, but coding

6 Quarter Common Intermediate Format is an old video resolution name, 1/4 of the CIF video resolution. Standard sizes: 176x144 (PAL) and 176x120 (NTSC).
7 CIF = Common Intermediate Format. This acronym comes from video conferencing tools in the late 1980s and early 1990s. Nowadays the term CIF is used to mean a specific video resolution: 352x288 in PAL and 352x240 in NTSC. CIF is 1/4th of "full resolution" TV (also called D1) and is best known because the Video CD standard uses this resolution.

functionalities like prediction, transformation, quantization and entropy encoding. There are small but notable differences to previous standards; the most important changes are in the details of each functional element (JVT 2002). For the comparison table below, we have used only the information on the earlier version of MPEG-4.

Summing up the comparison of the MPEG standards, H.261 and H.263:

- MPEG-1: compressed bit rate 1.5 Mbit/s; VHS image quality; stereo CD audio; used for CD-ROM video recording; compression: DCT 8, motion compensation, variable-length coding.
- MPEG-2: compressed bit rate 2-15 Mbit/s; visually lossless image quality; surround-sound audio; used for digital TV, HDTV, DVD and broadcast; compression: DCT, motion compensation, variable-length coding.
- MPEG-4: compressed bit rate 8-64 kbit/s; visually lossless image quality; speech, stereo CD and surround-sound audio; used for video conferencing, interactive applications and mobile multimedia; compression methods selected per object, with motion estimation (MVQ).
- H.261: greatest resolutions 176 x 144 (QCIF) and 352 x 288 (CIF); compressed bit rate 500 kbit/s; high-quality wireline video conferencing; speech audio; fixed data rate.
- H.263: greatest resolutions 176 x 144 (QCIF) and 352 x 288 (CIF); compressed bit rate 56 kbit/s (7.5 KB/s); videophone and desktop conferencing; speech audio; flexible data rate.

8 In the Discrete Cosine Transform (DCT) method, the sharp edges of reconstructed images are clearly degraded.

MPEG-4

First, it is worth mentioning that MPEG-4 deals with "media objects", which are a generalization of visual and audio content. These media objects are used together to form audiovisual scenes. The main parts of MPEG-4 are Systems, Visual, Audio and DMIF. The basis is formed by Systems (presentation, demux and buffering) and by Audio and Visual (decoding); DMIF (Delivery Multimedia Integration Framework) is the transport interface between the application and the network (or storage). As stated before, the previous versions of MPEG dealt mainly with the compression and transmission of natural content. As synthetic content is also intended to be efficiently supported, a general framework for dealing with all types of content needed to be defined. Detailed explanations of the coding of media objects and their transmission possibilities will be presented in a layered manner: based on the ISO layer model, the specific format and logic of MPEG-4 will be presented. So, practically, natural content is to be treated as a particular instance of a media object. In such a way, dealing with general media objects allows layer separation under the new generalization. Thus, the lower layers are meant for transport purposes; the transport part does not need to be aware of what kind of media format it is transmitting. Of course, the transport transparency requirement also implies the need for adaptation layers to map the new concepts onto the traditional transport platforms. Briefly, the coding/decoding of media content must not be tied to any particular transmission medium; in such a way portability and flexibility are ensured. Traditionally, transmission over IP networks must be supported. The possibility to use the MPEG-2 Transport Stream is also desirable (note that MPEG-2 Program Stream support is not recommended), and a file storage format must naturally be supported. ATM transport is mentioned here because the high quality achieved by high bit rates can still be used.
MPEG-4 is targeted at delivering content mainly remotely over the network. Obviously, MPEG-2 content was also designed to be transmitted over the network, but the possible choices of networks were rather restricted: high-performance networks, high data-rate requirements and so on made the transport necessities very demanding. MPEG-4 tries to loosen these restrictions, allowing the possibility to transport streams over PSTN networks and "normal"-speed Internet IP connections. In this case, support for lower data rates at decent quality must be possible. Unfortunately, lower data rates mean high compression for natural content, and so multimedia content quality is traded off. At this point, when lower data rates are required, synthetic content may "help".

Since the majority of targeted users have limited possibilities to use high data rates, the focus must be on low data rates. Natural benefits like the possibility for the receiver to interact with the audiovisual scene, and the possibility of composing and rendering objects as in "computer graphics" to form audiovisual scenes, are just a few of the advantages that make MPEG-4 a very powerful tool. The fact that the standard imposes the semantics and syntax, but does not restrict how the data is actually generated, provides great adaptability on a wide range of platforms, starting with mobile terminals and ending with high-power supercomputers. The core areas in the MPEG-4 standard are: video, audio, file format, transport protocols and digital rights management. Video decoding has been defined for rendering and playback: MPEG-4 has specified four different versions of video compression, defining capabilities grouped into 19 distinct profile groupings with various level degrees in each. Audio decoding is likewise defined for rendering and playback: MPEG-4 defines several audio codecs, including MPEG-4 AAC, multiple speech codecs, and synthetic audio. The file container is defined, as well as how information is organized at the byte level in stored files; the MPEG-4 (MP4) file container is based on, but not compatible with, Apple QuickTime. Regarding profiles and levels, sets of capabilities, data representation formats, video resolutions and content delivery data rates are defined, among other things. MPEG-4 also defines how content is delivered over networks (transport protocols). The increasing concern over digital rights has also been taken into account in the MPEG-4 standard: the MPEG-4 initiative does not have DRM defined today, but it does have hooks to proprietary DRM systems. This development shall no doubt continue.
The MPEG-4 specification also defines other areas, such as object-based video coding, hybrid coding of synthetic and natural content, face animation parameters, synthetic audio and the Binary Format for Scenes (BIFS). The standard is very extensive, and our intention in this report is only to present its central features.

Some advantages (compared to previous standards):

- independent image and video coding
- more efficient video, image and texture compression
- support for very large images and textures
- better error correction
- small buffer lag
- global motion compensation (GMC)
- content-independent texture scaling
- flexibility of the standard
- allows object-oriented compression methods

And disadvantages (compared to previous standards):

- a very deep (detailed) standard
- takes a lot of computing power at the presentation stage (e.g. 3D graphics, VRML and speech synthesis take a lot of resources)
- it is difficult to guarantee that the same data flow looks the same on the different terminals it is presented on
- differences in 3D support depending on the 3D program used
- differences in MIDI support
- problems with patented techniques (compared to GIF)
- static nature of the standard

4.1 MPEG-4 Structure

For the structural presentation of MPEG-4, the ISO model is deployed. A detailed description of the protocols will be presented, and the interfaces between the protocols will be analyzed in more depth.

Transmission/Storage Medium

The Transmission/Storage Medium layer is very isolated from the actual logic behind MPEG-4. It specifies the physical layer, digital storage requirements and so on, but at this level the data is still regarded as raw data. Normally it is the task of the upper network layer (like UDP/IP, ATM, or the MPEG-2 Transport Stream) to handle the actual physical network properties.

Delivery Layer

Media objects are transported in a streaming manner. Multiple elementary streams are used to convey a media object. These streams can contain a number of different kinds of information: audiovisual object data, scene description information, control information in the form of object descriptors, as well as meta-information that describes the content or associates intellectual property rights with it. The elementary streams themselves are created in the upper layers, and at this layer their meaning is not important: the task of the delivery layer is to handle and relay these elementary streams.

Figure 3: Layered description of the MPEG-4 standard

The sync layer passes the elementary streams to the delivery layer through the DMIF Application Interface (DAI). It permits isolation of the sync layer from the actual transport layer: the sync layer does not need to be aware of whether the peer is a remote interactive peer, a broadcast, or local storage media.

The DAI defines procedures for initializing an MPEG-4 session and obtaining access to the various elementary streams contained in it. Initially, at the start-up phase, a session is set up with the remote part. Streams are selected and the delivery layer sends a request to stream them; the remote peer returns the information needed to connect to the actual streams that need fetching. The connections for the streams with the local host are also established at this point. The DMIF Application Interface (DAI) is also used for bringing in broadcast material and local files. In such a way a single, uniform interface is defined to access multimedia from a multitude of delivery technologies: interactive network technology, broadcast technology and disk technology are brought under uniform handling. An important feature offered by the DAI is also the possibility given to the user to send commands with acknowledgements. FlexMux is an optional tool present in the delivery layer. It is used to group elementary streams with a low Quality of Service profile, or to multiplex multiple slower elementary streams onto a network channel that permits faster speeds. In such a way it is possible to reduce the network end-to-end delay and the number of network connections; this especially simplifies the interfaces to the network layer, since multiple useless network connections can create undesired overhead. The content provider also needs Quality of Service; the DAI allows the user application to specify it for the necessary streams, and it is then up to the layer protocol implementation to ensure that the requirements are fulfilled. The streams are then packetized and delivered to the network to be transported. Several network transport possibilities/storage media are listed next:

- RTP/UDP/IP
- PES/MPEG-2 TS
- AAL2/ATM
- H.323/PSTN
- Local file
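The FlexMux idea of sharing one channel among several slow elementary streams can be sketched as follows. This is a much-simplified illustration of the principle, not the normative FlexMux packet syntax: each packet carries a channel number and a length, so the receiver can demultiplex the interleaved streams again.

```python
import struct

def mux(packets):
    """packets: list of (channel, payload bytes) -> one interleaved byte string.
    Each packet is prefixed with a 1-byte channel number and a 2-byte length."""
    out = b""
    for channel, payload in packets:
        out += struct.pack("!BH", channel, len(payload)) + payload
    return out

def demux(data):
    """Recover the (channel, payload) pairs from the shared channel."""
    packets, i = [], 0
    while i < len(data):
        channel, length = struct.unpack_from("!BH", data, i)
        i += 3
        packets.append((channel, data[i:i + length]))
        i += length
    return packets

shared = mux([(1, b"audio-frame"), (2, b"video-frame"), (1, b"audio-frame2")])
print(demux(shared)[1])   # (2, b'video-frame')
```

Two elementary streams travel over a single network connection, which is exactly the overhead reduction the text describes.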

Sync Layer

This layer receives the content from the media layer, processes the data and passes the result to the delivery layer. No matter what type of data is conveyed in each elementary stream, it is important that they use a common mechanism for conveying timing and framing information. The sync layer handles the synchronization of elementary streams and also provides the buffering. This is achieved through time stamping within the elementary streams. Clock recovery information must be provided; it can be retrieved from the time stamps, but also, as in traditional systems, through clock references. Independent of the media type streamed, this layer allows identification of the type of media transported (e.g. video or audio, scene description commands) from the elementary streams. It provides/recovers the timing information of the media object (or scene description), and it is responsible for the synchronization of the elementary streams belonging to a particular presentation scene. The syntax of this layer is configurable and very flexible. The elementary streams are transported via the DAI interface described above. The sync layer does not contain information for frame demarcation; i.e., the sync layer header does not store the packet length, because it is assumed that the delivery layer that processes sync layer packets will provide this information. In this way duplication of fragmentation information is avoided, which provides gains in the form of less encoding/decoding overhead. It is also possible to operate without any clock information, meaning the data is processed as soon as it is received; this is suitable for non-streaming application requirements, like PowerPoint slide presentations.

Multimedia Layer

The type of information in each stream must be identified at the decoder (and, respectively, the encoder must provide it). For this purpose object descriptors are used.
These descriptors tie a group of elementary streams to one media object (no matter whether it is an audio object, a visual object, a scene description stream, or even a pointer to an object descriptor stream). Briefly, the descriptors are the way the decoder identifies the content being delivered to it. It is compulsory that each elementary stream is associated with an object descriptor. Object descriptors are transported in dedicated

elementary streams, called object descriptor streams, that make it possible to associate timing information with a set of object descriptors. Each media object (audio, video) is carried in its own elementary streams. There should also be at least one stream allocated for the scene description information that characterizes the media objects transported in the other streams. The scene description information defines the spatial and temporal position of the media objects and their behavior over time; briefly, the scene description tells where to position the video objects and how to move them (or remove them, or add new elements) as a function of time. Interactivity features are made available to the user through interaction with the scene description. The scene description contains references (sometimes called pointers in the literature) to object descriptors when it refers to a particular audiovisual object. Note that the alternative "pointer" term used in the literature may be a little misleading if interpreted in its basic sense: in MPEG-4 the term must be interpreted in a broader sense, as the information provided to locate the media objects related to the actual scene described by the scene description. In their turn, the media objects contain object descriptors that refer to the information needed to fetch the object (actually, the information needed to locate and stream the elementary streams associated with the media object). A key feature of the scene description is that, since it is carried in its own elementary streams, it can contain full timing information. This implies that the scene can be dynamically updated over time, a feature that provides considerable power to content creators. In fact, the scene description tools provided by MPEG-4 also make it possible to modify parts of the scene description in order to effect animation; this can be done with a separate stream that provides only the parameters that need to be updated.
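The idea of a scene description updated by a separate parameter stream can be sketched as follows. This is purely illustrative (the node names, the "OD_n" descriptor labels and the update function are ours, not BIFS syntax): the scene is sent once, and later timed updates carry only the parameters that change.

```python
# The scene description: each node references its object descriptor
# ("OD_n" labels are hypothetical) and holds positioning parameters.
scene = {
    "logo":  {"descriptor": "OD_3", "pos": (0, 0),   "scale": 1.0},
    "video": {"descriptor": "OD_7", "pos": (10, 20), "scale": 1.0},
}

def apply_update(scene, timestamp, node, **params):
    """Apply a timed scene update: change only the given parameters,
    instead of resending the whole scene description."""
    scene[node].update(params)
    return timestamp, node

apply_update(scene, timestamp=40, node="logo", pos=(5, 5))    # move the logo
apply_update(scene, timestamp=80, node="video", scale=2.0)    # zoom the video
print(scene["logo"]["pos"], scene["video"]["scale"])          # (5, 5) 2.0
```

Each update is a small message tied to a timestamp, which is how an animation can be driven by a low bit-rate side stream.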
Finally, the scene description information is combined with the decoded media object data to form the final scene. Scene composition is described next. The root of scene composition is a binary language for scene description called BIFS (BInary Format for Scenes), which is used to describe scene composition information. BIFS is essentially an improved language based on VRML; the basic difference is that BIFS is a binary format while VRML is textual. BIFS provides an efficient binary representation of the scene graph information and is designed for streaming operating environments: the scene can be sent as an initial scene followed by timestamped modifications to it. Note that an extended version of BIFS, called Extended BIFS, is available in MPEG-4 Version 2.

Figure 4: Media Object hierarchical example

An MPEG-4 scene follows a hierarchical structure, which can be represented as a directed acyclic graph. The nodes in the graph represent the media objects. Note that the tree structure changes over time: the parameters of the nodes, such as positioning parameters and scaling factors, can be changed, while nodes can be added, replaced, or removed. In the MPEG-4 model, audiovisual objects are characterized both in the time domain and in the space domain. Each media object has an associated coordinate system, which is used to manipulate the media object in space and time. Media objects are positioned in a scene by specifying a coordinate transformation from the object's local coordinate system into a global coordinate system defined by one or more parent scene description nodes in the tree.

4.2 Audio - video coding

The next sections describe the tools for shape coding, motion estimation and compensation, texture coding, error resilience and scalability used in natural and static texture encoding/decoding.

MPEG-4 has an extensive set of audio features. It provides separate codecs for low-bit-rate speech and general-purpose audio. MP3 was one of the key elements of MPEG-1, but it seems unlikely that MPEG-4 audio (MP4) will become as important a file format for consumers as MP3, because the needs of consumers were already well covered by MP3.

For dynamic image coding, two coding models are used: Intra-Mode and Inter-Mode. In Intra-Mode, only information from the picture itself is used, so every frame can be decoded independently; both spatial redundancy and irrelevancy are exploited with block-based DCT coding, quantization, run-length and Huffman coding. In Inter-Mode, a prediction of the current picture is formed from previously decoded pictures, and the predicted image is subtracted from the original image. The resulting difference picture is DCT coded, quantized and VLC coded. The motion vectors describing the motion of the blocks in the picture are necessary side information for the decoder and are also encoded with VLC.
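The intra/inter distinction above can be made concrete with a tiny numerical sketch: the difference picture between two similar frames carries far less energy than a full frame, which is why interframe coding is so much cheaper. The frames here are invented 4x4 grayscale arrays, purely for illustration.

```python
# Intra- vs inter-mode in miniature: code the full picture, or only the
# difference against a prediction (here, simply the previous frame).

frame1 = [[10, 10, 12, 12],
          [10, 11, 12, 13],
          [10, 11, 13, 13],
          [10, 10, 12, 14]]
# frame2 is frame1 with a few small changes, mimicking slight motion/noise
frame2 = [[10, 10, 12, 13],
          [10, 11, 12, 13],
          [10, 12, 13, 13],
          [10, 10, 13, 14]]

def energy(img):
    # Sum of absolute sample values: a crude proxy for the information
    # a transform coder would have to spend bits on.
    return sum(abs(v) for row in img for v in row)

# The difference picture that inter-mode would DCT-code and quantize
residual = [[b - a for a, b in zip(r1, r2)] for r1, r2 in zip(frame1, frame2)]

print(energy(frame2))    # → 186  (intra-mode codes the full picture)
print(energy(residual))  # → 3    (inter-mode codes only the difference)
```

The gap between the two numbers is the source of the roughly 10 to 20% bit-rate figure quoted below for interframe versus intraframe coding.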

Figure 5: MPEG-4 video coding scheme

Interframe coding usually requires a much lower bit rate than intraframe coding (only about 10 to 20%), because there is less information in the difference picture than in the original picture (Fraunhofer 2002c). A more detailed description of the steps (including some generalizing concepts) follows.

The basic requirement of supporting variable media bit streams within a wide range of bandwidths led to the approach of defining the video objects in a layered manner. The idea is to have a base layer and then, depending on the rate available, to transmit enhancement layers as well. Recall that each object is characterized in the temporal and spatial domains by its shape, motion and texture. Detailed information on shape, motion, texture and static textures follows next.

Shape coding can be of two types: binary and grayscale. A matrix associated with the object is defined; it is also referred to as a binary mask. The pixels belonging to the object's projection in that matrix are coded with 1, and the pixels outside the projection are coded with 0. Grayscale coding is an enhancement of binary shape coding in which, instead of only 0 and 1, the "transparency" information of the object is also available.
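The layered approach above, a base layer plus optional enhancement layers chosen by the available rate, can be sketched as a simple selection rule. The layer names and bit rates here are invented for illustration.

```python
# Layered (scalable) delivery: the base layer is always needed, and
# enhancement layers are added only while the channel budget allows.
# Layer names and kbit/s figures are made up for this example.

layers = [
    ("base",          64),   # minimum usable quality
    ("enhancement-1", 64),
    ("enhancement-2", 128),
]

def select_layers(available_kbps):
    chosen, used = [], 0
    for name, rate in layers:
        if used + rate <= available_kbps:
            chosen.append(name)
            used += rate
        else:
            break  # layers build on each other, so stop at the first that doesn't fit
    return chosen

print(select_layers(100))   # only the base layer fits
print(select_layers(300))   # base plus both enhancement layers fit
```

Because each enhancement layer refines the one below it, the selection must stop at the first layer that exceeds the budget rather than skipping ahead.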

The binary mask is divided into 16x16 binary alpha blocks (BABs). The BABs are encoded separately with respect to the current video object. For example, if a 16x16 BAB is fully contained in the object, all the pixel values in the BAB are 1. If the BAB is completely outside the video object, all the elements are 0. If the block lies on the border, the mask follows the contour of the object contained in the 16x16 block.

Motion estimation/compensation is used to compress video sequences by taking advantage of the temporal redundancies between frames. Motion estimation is done only for the 16x16 BABs on the contour of the video object in question. If a BAB lies entirely within a video object, motion estimation is performed in the usual way, based on block matching of 16x16 macroblocks. MPEG-4 supports Intra (I) coded frames as well as temporally predicted (P) and bidirectionally (B) predicted objects (frames); it is up to the application to choose which coding to use.

1. A video object may be encoded independently of any other time variant of the same object. In this case the encoding is called an "Intra" VO (I-VO).

2. A video object may be predicted (using motion compensation) based on a previous version (in time) of the object. These are usually called Predicted VOs (P-VO).

3. A VO may be predicted based on past as well as future VOs. Such VOs are called Bi-directionally Interpolated VOs (B-VO). B-VOs may only be interpolated based on I-VOs or P-VOs.

Higher compression can be achieved using B-VOs, but with the drawback that over highly lossy transmission media the performance degrades quickly. The motion vectors are coded directly using variable length coding, since little further processing can be done on them. In the case of an I-VO, the texture information resides directly in the object (image) itself.
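The three BAB cases described above (fully inside, fully outside, on the border) amount to a simple classification of each block of the binary mask. A minimal sketch, using 4x4 blocks in place of the real 16x16 BABs:

```python
# Classify a binary alpha block (BAB): all 1s means the block lies
# entirely inside the video object, all 0s means entirely outside,
# and a mix means a boundary block whose contour must be coded.

def classify_bab(bab):
    values = {v for row in bab for v in row}
    if values == {1}:
        return "inside"
    if values == {0}:
        return "outside"
    return "boundary"

inside   = [[1] * 4 for _ in range(4)]
outside  = [[0] * 4 for _ in range(4)]
boundary = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [1, 1, 1, 0],
            [1, 1, 1, 1]]

print(classify_bab(inside))    # inside
print(classify_bab(outside))   # outside
print(classify_bab(boundary))  # boundary
```

Only the boundary blocks carry real shape information; the other two cases can be signalled with a single flag, which is what makes block-based shape coding cheap.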
If we are dealing with motion-compensated VOs, the texture information is contained only in the residual error obtained after motion compensation has been performed. For texture information encoding, the DCT (on 8x8 blocks) is used. To encode an arbitrarily shaped VO, a grid of 8x8 blocks is superimposed on the VO. 8x8 blocks that are internal to the VO are encoded without modification. Boundary blocks (8x8 blocks intersecting the border of the video object) are treated differently from internal blocks. The transformed blocks are quantized, and prediction of individual coefficients from neighboring blocks can be used to further reduce the entropy of the coefficients. Next follows a scanning of the coefficients, to reduce the average run length between two coded coefficients. Scanning of the coefficients can be zig-zag, alternate horizontal, or alternate vertical. Finally, the coefficients are encoded with variable length encoding.

The static texture coding technique takes advantage of the powerful properties of the wavelet transform. Texture information is separated into sub bands by applying a discrete wavelet transform to the original data, yielding four new sub matrices (sub bands): a DC sub band and the AC sub bands, which are coded separately. The transform is applied recursively on the obtained sub bands, generating a spanning tree of sub bands; this provides the layers of resolution needed for variable bit rate applications. The wavelet coefficients are quantized and encoded, and the inverse discrete wavelet transform is applied on the sub bands to recover the original information.

Note that the error resilience described in this chapter is different from the resilience of the delivery layer! Error resilience is needed for error detection, data recovery and re-synchronization. Re-synchronization is achieved by inserting special marks into the bit stream; in case of an error, the information between the marks is ignored. The "time distance" between the marks is configurable, but the user has rather little interaction with this marker information.
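The zig-zag scan of DCT coefficients mentioned above orders an 8x8 block from low to high frequency, so the many zero high-frequency coefficients cluster at the end of the scan and run-length coding gets long runs. A short sketch generating that order:

```python
# Generate the zig-zag scan order for an n x n coefficient block by
# walking the anti-diagonals, alternating the traversal direction.

def zigzag_order(n=8):
    order = []
    for s in range(2 * n - 1):
        # all (row, col) pairs on anti-diagonal s, top row first
        diagonal = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # odd diagonals run top-right to bottom-left, even ones the reverse
        order.extend(diagonal if s % 2 else reversed(diagonal))
    return order

# First few positions of the standard 8x8 scan:
print(zigzag_order()[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

The DC coefficient at (0, 0) always comes first, and every coefficient position appears exactly once, which is what a scan order requires.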
Separating the texture information from the motion information is useful for building the elementary streams, and it also means that in case of an error, say in the texture information, the motion information continues to do its job while the particular frame is corrupted. There will be a temporary fault in one video element, but the scene overall remains intact and the impact of the failure is much attenuated.

Of great importance is the use of reversible variable length codes: code words that can be decoded forwards as well as backwards. If an error occurs and the bit stream is skipped until the next resynchronization mark, it is still possible to decode portions of the corrupted bit stream in reverse order, recovering as much data from the corrupted zone as possible.

The MPEG-4 audio coder includes coding tools from several different coding paradigms, such as parametric audio coding, synthetic audio, speech coding and sub band/transform coding. Within this comprehensive "tool box", the high-quality MPEG-4 audio functions are covered by the so-called General Audio (GA) coders. In a GA coder, the input signal is first decomposed into a time/frequency (t/f) spectral representation by means of an analysis filter bank, which is then quantized and coded. Its natural features include bit rate scalability, allowing a bit stream to be parsed into bit streams of lower bit rate, and bandwidth scalability, where a part of the bit stream representing a part of the frequency spectrum can be discarded during transmission or encoding. Encoder and decoder complexity scalability allows encoders of different levels to generate meaningful and valid bit streams, and decoders to work at different levels of complexity. Error robustness tools include error resilience and protection tools, and Low Delay Audio Coding allows high-quality coding of general audio signals with short delays.

4.3 Video codecs

Video codecs are the key building blocks for a host of new multimedia applications and services for streaming video on the Internet, digital television and wireless solutions. Codec stands for Coder/Decoder.
A codec is a piece of software or a driver that adds support for a certain video/audio format to the operating system. With a codec, the system recognizes the format the codec is built for and can play the audio/video file (decode) or, in some cases, convert another audio/video file into that format (encode) (see also WI 2003). For example, when Windows is installed on a computer, it automatically installs the most commonly used codecs, so the user does not have to download them separately from their vendors. Not all widely used codecs come with the operating system, however; one good and notable example is DivX ;-).

The idea of image encoding is to compress the image data into a less memory-intensive form and to decode it without visible losses. Often this is done by having the codec remove information from the picture that is not visible to the human eye. Naturally, the more the image or sound information is compressed, the more problems with quality will appear. MPEG contains the codec in itself, but there are formats (e.g. MOV, AVI) that do not contain the codec, and the information can instead be packed with a codec chosen by the user.

There are three different approaches to implementing an MPEG-4 codec in a mobile terminal: 1) software (QuickTime DV, Canopus DV, Adaptec DVSoft), 2) a combination of software and hardware, and 3) hardware (using the camera's or recorder's codec for full-image video) (Hantro 2002).

MPEG-4 shares many features with MPEG-1 and MPEG-2, such as DCT (Discrete Cosine Transform) encoding and I-, P- and B-frames that are tied together inside a GOP (Group of Pictures). It also has many enhancements, especially for low-data-rate use, including better motion estimation and a de-blocking filter. Its quality at web data rates is better than MPEG-1 and generally competitive with other web video solutions (MPEG 2000). Unlike most video codecs, MPEG-4 has full support for interlaced content. The MPEG-4 video codec natively supports alpha channels, so video can be composed in real time over a background, resulting in a more flexible and higher-quality image. Many different vendors provide MPEG-4 codecs; for example, Toshiba released an MPEG-4 codec for mobile applications last year. The best-known codecs are QuickTime (Sorenson V2, V3), Sorenson ISO MPEG-4, ISO MPEG-4, Microsoft Windows Media Video (7, 8, 9) and RealSystem 8.
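The I/P/B frame structure inside a GOP mentioned above has one practical consequence worth illustrating: because a B-frame is interpolated from references on both sides, the forward reference must be decoded (and transmitted) before the B-frames that depend on it, so decode order differs from display order. A minimal sketch, with an example GOP pattern chosen for illustration:

```python
# Reorder a GOP from display order to decode order: I- and P-frames
# (references) come first, and each run of B-frames follows the forward
# reference it depends on. This ignores real-codec details on purpose.

def decode_order(gop):
    out, pending_b = [], []
    for frame in gop:
        if frame == "B":
            pending_b.append(frame)   # can't decode yet, reference not seen
        else:
            out.append(frame)         # decode the reference first...
            out.extend(pending_b)     # ...then the B-frames that needed it
            pending_b = []
    return "".join(out)

display_order = "IBBPBBP"             # an example GOP pattern
print(decode_order(display_order))    # IPBBPBB
```

This reordering is also why B-frames add latency: the decoder must buffer frames until the future reference arrives.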
The MPEG video compression algorithm was developed with the H.261 standard in view and retains a large degree of commonality with it. Other codecs are as follows. MJPEG encodes video frames using JPEG compression. There are several versions of the Indeo codec available (e.g. r3.2, r4.3 and 5.11); needless to say, the newer the version, the better the image quality and encoding, and the higher the demands on the processor. Indeo r3.2 can be found on many multimedia computers. Radius Cinepak is also available on almost every Windows and Mac multimedia computer. There are also codecs developed for specific purposes, like Microsoft RLE 8-bit, which is designed with animation and multimedia in mind. There are also several other codecs and formats, like GIF, AIFF, WAV, DivX, SBC, XviD, Real, Apple, Flash, VRML/3D and math formats, depending on the information to be compressed.

4.4 Key MPEG-4 extensions

There are several visual and system-level extensions under development. One notable extension is MPEG-4 part 10 (or MPEG-4 AVC). It has been prepared by the Joint Video Team and will be published simultaneously by ITU-T and ISO/IEC. Other MPEG-4 extensions cover the areas of animation, multi-user world communication on the Internet, extensions to the audio capabilities and bandwidth extensions. Work on MP4, scene description and the ISO Media File Format is also under way (Fraunhofer 2002d). The extensive scope of MPEG-4 is one of its values, but the ever-growing number of extensions makes it difficult to handle all the issues covered by MPEG-4. A short description of the extensions currently under development follows.

Security issues are a very high priority in the MPEG consortium, as they are in the digital world in general. IPR generally bestows on its owners the right to exclude others (with certain limited exceptions) from the use or re-use of their intellectual property without a license from the IPR owner. Any intellectual property that is delivered, either freely or for commercial gain, through an MPEG-4 application should be afforded such protection; the IPMP (Intellectual Property Management and Protection) extension addresses this. In addition, an animation framework (AFX) is under development. The AFX is meant to provide an integrated toolbox for building synthetic MPEG-4 environments. Its values include technical abilities for higher-level animation, enhanced rendering possibilities, and interactivity at the user level, scene level and client-server session level, an issue that has been characteristic of the whole evolution of MPEG-4. Compression techniques for animated paths and models are also being improved, enabling better storage efficiency, and low-bit-rate animations have been studied. The AFX framework defines a collection of interoperable tool categories that collaborate to produce a reusable architecture for interactive animated content.


More information

Wireless Video Best Practices Guide

Wireless Video Best Practices Guide Wireless Video Best Practices Guide Using Digital Video Manager (DVM) with the OneWireless Universal Mesh Network Authors: Annemarie Diepenbroek DVM Product Manager Soroush Amidi OneWireless Product Manager

More information

Compressing Moving Images. Compression and File Formats updated to include HTML5 video tag. The DV standard. Why and where to compress

Compressing Moving Images. Compression and File Formats updated to include HTML5 video tag. The DV standard. Why and where to compress Compressing Moving Images Compression and File Formats updated to include HTML5 video tag. Moving versus still images: Temporal as well as spatial compression Data transfer rates are critical How much

More information

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29. Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

More information

MISB EG 0802. Engineering Guideline. 14 May 2009. H.264 / AVC Coding and Multiplexing. 1 Scope. 2 References

MISB EG 0802. Engineering Guideline. 14 May 2009. H.264 / AVC Coding and Multiplexing. 1 Scope. 2 References MISB EG 0802 Engineering Guideline H.264 / AVC Coding and Multiplexing 14 May 2009 1 Scope This H.264/AVC (ITU-T Rec. H.264 ISO/IEC 14496-10) Coding and Multiplexing Engineering Guide provides recommendations

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

Introduzione alle Biblioteche Digitali Audio/Video

Introduzione alle Biblioteche Digitali Audio/Video Introduzione alle Biblioteche Digitali Audio/Video Biblioteche Digitali 1 Gestione del video Perchè è importante poter gestire biblioteche digitali di audiovisivi Caratteristiche specifiche dell audio/video

More information

Media - Video Coding: Motivation & Scenarios

Media - Video Coding: Motivation & Scenarios Media - Video Coding 1. Scenarios for Multimedia Applications - Motivation - Requirements 15 Min 2. Principles for Media Coding 75 Min Redundancy - Irrelevancy 10 Min Quantization as most important principle

More information

Chapter 6: Broadcast Systems. Mobile Communications. Unidirectional distribution systems DVB DAB. High-speed Internet. architecture Container

Chapter 6: Broadcast Systems. Mobile Communications. Unidirectional distribution systems DVB DAB. High-speed Internet. architecture Container Mobile Communications Chapter 6: Broadcast Systems Unidirectional distribution systems DAB DVB architecture Container High-speed Internet Prof. Dr.-Ing. Jochen Schiller, http://www.jochenschiller.de/ MC

More information

Zarządzanie sieciami telekomunikacyjnymi

Zarządzanie sieciami telekomunikacyjnymi What Is an Internetwork? An internetwork is a collection of individual networks, connected by intermediate networking devices, that functions as a single large network. Internetworking refers to the industry,

More information

CHANGE REQUEST. Work item code: MMS6-Codec Date: 15/03/2005

CHANGE REQUEST. Work item code: MMS6-Codec Date: 15/03/2005 3GPP TSG-SA #27 Tokyo, Japan 14 17 March 2005 CHANGE REQUEST SP-050175 CR-Form-v7.1 26.140 CR 011 rev 2 - Current version: 6.1.0 For HELP on using this form, see bottom of this page or look at the pop-up

More information

Introduction to image coding

Introduction to image coding Introduction to image coding Image coding aims at reducing amount of data required for image representation, storage or transmission. This is achieved by removing redundant data from an image, i.e. by

More information

Audio Coding Algorithm for One-Segment Broadcasting

Audio Coding Algorithm for One-Segment Broadcasting Audio Coding Algorithm for One-Segment Broadcasting V Masanao Suzuki V Yasuji Ota V Takashi Itoh (Manuscript received November 29, 2007) With the recent progress in coding technologies, a more efficient

More information

Communication Networks. MAP-TELE 2011/12 José Ruela

Communication Networks. MAP-TELE 2011/12 José Ruela Communication Networks MAP-TELE 2011/12 José Ruela Network basic mechanisms Network Architectures Protocol Layering Network architecture concept A network architecture is an abstract model used to describe

More information

REIHE INFORMATIK 7/98 Efficient Video Transport over Lossy Networks Christoph Kuhmünch and Gerald Kühne Universität Mannheim Praktische Informatik IV

REIHE INFORMATIK 7/98 Efficient Video Transport over Lossy Networks Christoph Kuhmünch and Gerald Kühne Universität Mannheim Praktische Informatik IV REIHE INFORMATIK 7/98 Efficient Video Transport over Lossy Networks Christoph Kuhmünch and Gerald Kühne Universität Mannheim Praktische Informatik IV L15, 16 D-68131 Mannheim Efficient Video Transport

More information

Terminal, Software Technologies

Terminal, Software Technologies What's Hot in R&D Terminal, Software Technologies Terminal technologies for ubiquitous services and software technologies related to solution businesses. Contents H-SW-1 H-SW-2 H-SW-3 H-SW-4 Professional

More information

Managing video content in DAM How digital asset management software can improve your brands use of video assets

Managing video content in DAM How digital asset management software can improve your brands use of video assets 1 Managing Video Content in DAM Faster connection speeds and improved hardware have helped to greatly increase the popularity of online video. The result is that video content increasingly accounts for

More information

How To Test Video Quality With Real Time Monitor

How To Test Video Quality With Real Time Monitor White Paper Real Time Monitoring Explained Video Clarity, Inc. 1566 La Pradera Dr Campbell, CA 95008 www.videoclarity.com 408-379-6952 Version 1.0 A Video Clarity White Paper page 1 of 7 Real Time Monitor

More information

Proactive Video Assurance through QoE and QoS Correlation

Proactive Video Assurance through QoE and QoS Correlation A Complete Approach for Quality and Service Assurance W H I T E P A P E R Introduction Video service providers implement new technologies to maximize the quality and diversity of their entertainment program

More information

An Introduction to VoIP Protocols

An Introduction to VoIP Protocols An Introduction to VoIP Protocols www.netqos.com Voice over IP (VoIP) offers the vision of a converged network carrying multiple types of traffic (voice, video, and data, to name a few). To carry out this

More information

District of Columbia Courts Attachment 1 Video Conference Bridge Infrastructure Equipment Performance Specification

District of Columbia Courts Attachment 1 Video Conference Bridge Infrastructure Equipment Performance Specification 1.1 Multipoint Control Unit (MCU) A. The MCU shall be capable of supporting (20) continuous presence HD Video Ports at 720P/30Hz resolution and (40) continuous presence ports at 480P/30Hz resolution. B.

More information

Network Security Systems Fundamentals for ITS Professionals

Network Security Systems Fundamentals for ITS Professionals Network Security Systems Fundamentals for ITS Professionals Chris Adesanya Sr. Systems Engineer Panasonic System Solutions Company adesanyac@us.panasonic.com BICSI Southeast Regional Meeting Dulles, VA

More information

MPEG-2 Transport vs. Program Stream

MPEG-2 Transport vs. Program Stream MPEG-2 Transport vs. Program Stream White Paper What is the difference between Program Stream and Transport Stream, and why do we currently only support the Transport Stream format? Well, this topic is

More information

4. H.323 Components. VOIP, Version 1.6e T.O.P. BusinessInteractive GmbH Page 1 of 19

4. H.323 Components. VOIP, Version 1.6e T.O.P. BusinessInteractive GmbH Page 1 of 19 4. H.323 Components VOIP, Version 1.6e T.O.P. BusinessInteractive GmbH Page 1 of 19 4.1 H.323 Terminals (1/2)...3 4.1 H.323 Terminals (2/2)...4 4.1.1 The software IP phone (1/2)...5 4.1.1 The software

More information

(Refer Slide Time: 4:45)

(Refer Slide Time: 4:45) Digital Voice and Picture Communication Prof. S. Sengupta Department of Electronics and Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 38 ISDN Video Conferencing Today we

More information

PERFORMANCE ANALYSIS OF VIDEO FORMATS ENCODING IN CLOUD ENVIRONMENT

PERFORMANCE ANALYSIS OF VIDEO FORMATS ENCODING IN CLOUD ENVIRONMENT Suresh Gyan Vihar University Journal of Engineering & Technology (An International Bi Annual Journal) Vol. 1, Issue 1, 2015, pp 1 5 ISSN: 2395 0196 PERFORMANCE ANALYSIS OF VIDEO FORMATS ENCODING IN CLOUD

More information

DVB-S2 and DVB-RCS for VSAT and Direct Satellite TV Broadcasting

DVB-S2 and DVB-RCS for VSAT and Direct Satellite TV Broadcasting Hands-On DVB-S2 and DVB-RCS for VSAT and Direct Satellite TV Broadcasting Course Description This course will examine DVB-S2 and DVB-RCS for Digital Video Broadcast and the rather specialised application

More information

1Multimedia Networking and Communication: Principles and Challenges

1Multimedia Networking and Communication: Principles and Challenges 1Multimedia Networking and Communication: Principles and Challenges Mihaela van der Schaar and Philip A. Chou In case you haven t noticed, multimedia communication over IP and wireless networks is exploding.

More information

THE EMERGING JVT/H.26L VIDEO CODING STANDARD

THE EMERGING JVT/H.26L VIDEO CODING STANDARD THE EMERGING JVT/H.26L VIDEO CODING STANDARD H. Schwarz and T. Wiegand Heinrich Hertz Institute, Germany ABSTRACT JVT/H.26L is a current project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC

More information

We are presenting a wavelet based video conferencing system. Openphone. Dirac Wavelet based video codec

We are presenting a wavelet based video conferencing system. Openphone. Dirac Wavelet based video codec Investigating Wavelet Based Video Conferencing System Team Members: o AhtshamAli Ali o Adnan Ahmed (in Newzealand for grad studies) o Adil Nazir (starting MS at LUMS now) o Waseem Khan o Farah Parvaiz

More information

Application Note. Introduction. Video Basics. Contents. IP Video Encoding Explained Series Understanding IP Video Performance.

Application Note. Introduction. Video Basics. Contents. IP Video Encoding Explained Series Understanding IP Video Performance. Title Overview IP Video Encoding Explained Series Understanding IP Video Performance Date September 2012 (orig. May 2008) IP networks are increasingly used to deliver video services for entertainment,

More information

Clearing the Way for VoIP

Clearing the Way for VoIP Gen2 Ventures White Paper Clearing the Way for VoIP An Alternative to Expensive WAN Upgrades Executive Overview Enterprises have traditionally maintained separate networks for their voice and data traffic.

More information

How To Decode On A Computer Game On A Pc Or Mac Or Macbook

How To Decode On A Computer Game On A Pc Or Mac Or Macbook INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO ISO/IEC JTC1/SC29/WG11 N2202 Tokyo, March 1998 INFORMATION

More information

Native ATM Videoconferencing based on H.323

Native ATM Videoconferencing based on H.323 Native Videoconferencing based on H.323 Rodrigo Rodrigues, António Grilo, Miguel Santos and Mário S. Nunes INESC R. Alves Redol nº 9, 1 Lisboa, Portugal Abstract Due to the potential of videoconference

More information

FAQs. Getting started with the industry s most advanced compression technology. when it counts

FAQs. Getting started with the industry s most advanced compression technology. when it counts FAQs Getting started with the industry s most advanced compression technology when it counts AVC-Intra Frequently Asked Questions 1. What is AVC-Intra? AVC-Intra, the industry s most advanced compression

More information

NETWORK ISSUES: COSTS & OPTIONS

NETWORK ISSUES: COSTS & OPTIONS VIDEO CONFERENCING NETWORK ISSUES: COSTS & OPTIONS Prepared By: S. Ann Earon, Ph.D., President Telemanagement Resources International Inc. Sponsored by Vidyo By:S.AnnEaron,Ph.D. Introduction Successful

More information

1. Public Switched Telephone Networks vs. Internet Protocol Networks

1. Public Switched Telephone Networks vs. Internet Protocol Networks Internet Protocol (IP)/Intelligent Network (IN) Integration Tutorial Definition Internet telephony switches enable voice calls between the public switched telephone network (PSTN) and Internet protocol

More information

Communication Networks. MAP-TELE 2011/12 José Ruela

Communication Networks. MAP-TELE 2011/12 José Ruela Communication Networks MAP-TELE 2011/12 José Ruela Network basic mechanisms Introduction to Communications Networks Communications networks Communications networks are used to transport information (data)

More information

PRODUCING DV VIDEO WITH PREMIERE & QUICKTIME

PRODUCING DV VIDEO WITH PREMIERE & QUICKTIME This article contains guidelines and advice on producing DV video for the screen using a DV camcorder, Adobe Premiere and QuickTime. PRODUCING DV VIDEO WITH PREMIERE & QUICKTIME PRODUCING DV VIDEO WITH

More information

Sommario [1/2] Vannevar Bush Dalle Biblioteche ai Cataloghi Automatizzati Gli OPAC accessibili via Web Le Biblioteche Digitali

Sommario [1/2] Vannevar Bush Dalle Biblioteche ai Cataloghi Automatizzati Gli OPAC accessibili via Web Le Biblioteche Digitali Introduzione alle Biblioteche Digitali Sommario [1/2] Cenni storici Vannevar Bush Dalle Biblioteche ai Cataloghi Automatizzati Gli OPAC accessibili via Web Le Biblioteche Digitali Cos è una Biblioteca

More information

M3039 MPEG 97/ January 1998

M3039 MPEG 97/ January 1998 INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND ASSOCIATED AUDIO INFORMATION ISO/IEC JTC1/SC29/WG11 M3039

More information

IP-Telephony Real-Time & Multimedia Protocols

IP-Telephony Real-Time & Multimedia Protocols IP-Telephony Real-Time & Multimedia Protocols Bernard Hammer Siemens AG, Munich Siemens AG 2001 1 Presentation Outline Media Transport RTP Stream Control RTCP RTSP Stream Description SDP 2 Real-Time Protocol

More information

VoIP QoS. Version 1.0. September 4, 2006. AdvancedVoIP.com. sales@advancedvoip.com support@advancedvoip.com. Phone: +1 213 341 1431

VoIP QoS. Version 1.0. September 4, 2006. AdvancedVoIP.com. sales@advancedvoip.com support@advancedvoip.com. Phone: +1 213 341 1431 VoIP QoS Version 1.0 September 4, 2006 AdvancedVoIP.com sales@advancedvoip.com support@advancedvoip.com Phone: +1 213 341 1431 Copyright AdvancedVoIP.com, 1999-2006. All Rights Reserved. No part of this

More information

Video over IP WHITE PAPER. Executive Summary

Video over IP WHITE PAPER. Executive Summary Video over IP Executive Summary Thinking as an executive, there are pressures to keep costs down and help a company survive in this challenging market. Let us assume that company A has 10 locations and

More information

MMGD0203 Multimedia Design MMGD0203 MULTIMEDIA DESIGN. Chapter 3 Graphics and Animations

MMGD0203 Multimedia Design MMGD0203 MULTIMEDIA DESIGN. Chapter 3 Graphics and Animations MMGD0203 MULTIMEDIA DESIGN Chapter 3 Graphics and Animations 1 Topics: Definition of Graphics Why use Graphics? Graphics Categories Graphics Qualities File Formats Types of Graphics Graphic File Size Introduction

More information

Digital Audio Compression: Why, What, and How

Digital Audio Compression: Why, What, and How Digital Audio Compression: Why, What, and How An Absurdly Short Course Jeff Bier Berkeley Design Technology, Inc. 2000 BDTI 1 Outline Why Compress? What is Audio Compression? How Does it Work? Conclusions

More information

Template-based Eye and Mouth Detection for 3D Video Conferencing

Template-based Eye and Mouth Detection for 3D Video Conferencing Template-based Eye and Mouth Detection for 3D Video Conferencing Jürgen Rurainsky and Peter Eisert Fraunhofer Institute for Telecommunications - Heinrich-Hertz-Institute, Image Processing Department, Einsteinufer

More information

QOS Requirements and Service Level Agreements. LECTURE 4 Lecturer: Associate Professor A.S. Eremenko

QOS Requirements and Service Level Agreements. LECTURE 4 Lecturer: Associate Professor A.S. Eremenko QOS Requirements and Service Level Agreements LECTURE 4 Lecturer: Associate Professor A.S. Eremenko Application SLA Requirements Different applications have different SLA requirements; the impact that

More information