Web Radio: Technology and Performance


Web Radio: Technology and Performance

Paula Cortés Camino

LITH-ISY-EX


Avdelning, Institution (Division, Institution): Bildkodning (Image Coding Group), Institutionen för systemteknik (Division of Electrical Engineering)
Datum (Date): augusti 2002 (August 2002)
Språk (Language): Engelska (English)
Rapporttyp (Report category): Examensarbete (Master thesis)
Serietitel och serienummer (Title of series, numbering): LiTH-ISY-EX
Titel (Title): Web-radio: Tekniker och prestanda (Web-radio: Technology and Performance)
Författare (Author): Paula Cortés Camino
Sammanfattning (Abstract): We review some popular techniques involved in web radio, from the compression methods (both open and proprietary) to the network protocols. The implementation of a web radio station is also addressed.
Nyckelord (Keywords): Web-radio, streaming audio, windows media, real audio, shoutcast, MP3, AAC, WMA, audio coding, RTP, RTSP


Web Radio: Technology and Performance

Master thesis in the Image Coding Group
Linköping Institute of Technology

by Paula Cortés Camino
LITH-ISY-EX

Supervisor: Dr. Robert Forchheimer, Image Coding Group
Examiner: Dr. Robert Forchheimer, Image Coding Group

Linköping, August 2002


Acknowledgements

I would like to express my sincere gratitude to Dr. Robert Forchheimer for accepting me into the Image Coding Group. He found the right project for me and helped me throughout with his suggestions and corrections. The first time we met I got an excellent impression of him, not only as a professor but also as a person, and it has kept improving with the time and talks we have had. Thanks also to Peter for agreeing to be my opponent, and to Jonas for his corrections. Thanks to all the people who have made my time in Linköping so special and unforgettable, my friends from the university and my host family. There are many people I have to be grateful to for being so close to me in spite of the distance: all my friends from Castillejo del Romeral, Guadalajara, my university, all my family... the list would be endless. I want to dedicate this thesis to my parents and my mormor (grandmother). They have always been there, listening to me when I have needed it, and also here visiting me. Also to Jesús, who has been with me during the summer; without his support this thesis (and many other things) would never have been possible. I love all of you so much. TACK SÅ MYCKET! MUCHAS GRACIAS!

Linköping, August 2002


Abstract

The march of electronic technology and the explosive growth of the Internet are changing how audio communications are delivered. For decades radio was delivered through analog signals sent over the airwaves, but this is being transformed. Today, streaming technologies allow more and more people to listen to their favorite radio station through the Internet. The development of audio broadcasting via the web is probably the biggest revolution in broadcasting since the advent of FM. Web radio is radio with a lot of potential: stations of all formats, musical genres, and political and social orientations can be heard from every part of the world. Web radio also differs from radio heard by other means in terms of quality, quantity, and variety. It is becoming more and more popular, and in the future it may even replace traditional radio. The aim of this thesis is to review the techniques involved in web radio, from the compression methods (both open and proprietary) to the network protocols. The implementation of a web radio station is also addressed. Before the audio gets to the listeners, it undergoes many transformations and is subject to jitter, delays, etc. that degrade its quality. At the end of the thesis the main performance parameters and factors affecting the quality of real-time audio are reviewed, and some measurements of quality parameters are analyzed.

Key words: Audio, Compression, Bandwidth, Streaming, Broadcasting, Web Radio Station, RTSP, RTP, Streaming Server, Player, Encoder, Web Server, Audio File Format, QoS, Multicasting.


Contents

1 Introduction
    Disposition of the Thesis
    Audio Overview
2 Audio Compression
    Classification of Compression Techniques
        Lossless Compression Techniques
        Lossy Compression Techniques
    Audio Compression on the Internet
        MPEG
        Windows Media Audio
        RealAudio
3 Compression Techniques Comparison
    Coders Comparison
    Formats Comparison
    Final Conclusions
4 Internet Audio Transmission
    Introduction
    Streaming Audio Overview
    Streaming Improvement
    Broadcasting
    Multimedia Protocols
        Physical/Data Link Layer ... 51

        4.4.2 Internet Layer
        Transport Layer
        Session/Presentation/Application Layer
        Asynchronous Transfer Mode (ATM)
    Media Servers
        Web Servers
        Streaming Servers
        Web Servers vs. Streaming Servers
    Standards Organizations
5 Web Radio Stations
    Introduction
    How do they work?
6 Implementing a Web Radio Station
    Elements
    Steps
    Commercial Tools
        Windows Media
        Real Networks
        SHOUTcast and Icecast
    Comparisons
7 Performance ... 89
8 Conclusions ... 97
Appendix: Audio File Formats ... 99
Acronyms ... 101
References ... 115


1 Introduction

In this chapter, the thesis is presented and an overview of audio processing and transmission is given.

1.1 Disposition of the Thesis

The march of electronic technology and the explosive growth of the Internet [119] are changing how audio communications are delivered. For decades radio was delivered through analog signals sent over the airwaves, but this is being transformed. Today, streaming technologies allow more and more people to listen to their favorite radio station through the Internet. Web radio is radio with a lot of potential: stations of all formats, musical genres, and political and social orientations can be heard from every part of the world. Web radio also differs from radio heard by other means in terms of quality, quantity, and variety [123]. In the future it may even replace traditional radio.

This thesis presents an overview of Internet radio systems. Chapter 1 reviews the processing of audio, from analog sound to packets ready to be transmitted through the Internet. Chapter 2 emphasizes the necessity of compression in Internet audio transmission, classifies compression techniques and presents the main techniques used nowadays. In Chapter 3 both audio formats and the encoders available for each format are compared. Chapter 4 lists the main audio file formats found on the Internet and their main characteristics. In Chapter 5 the actual transmission of the audio packets through the Internet is addressed, introducing the concept of streaming and the protocols used. The chapter ends by taking a look at the standards organizations for audio compression and transmission. Chapter 6 introduces web radio and explains the general operation of a web radio station. Chapter 7 covers the process of creating an Internet radio station with three of

the main streaming systems, SHOUTcast/Icecast, Windows Media and Real Networks. Factors affecting the QoS of web radio stations are studied in Chapter 8, and some tests regarding the performance are analyzed. The thesis finishes with a conclusion about the current state and the future of Internet radio, together with future work.

1.2 Audio Overview

Sounds are rapid pressure variations in the atmosphere, such as those produced by many natural processes or man-made systems. The human ear responds to atmospheric pressure variations in the frequency range between 20 Hz and about 20 kHz, which is called the audio bandwidth. Sound is analog by nature, but it is best converted into a digital signal before transmission. The conversion from analog to digital is done by an ADC (Analog-Digital Converter), which filters, samples, quantizes and encodes the audio. In this digitizing process, some things should be taken into account to get high audio quality. In the sampling stage, a higher sampling frequency increases fidelity. The sampling frequency is the number of times the audio is sampled within a given time period. It is related to the audio signal bandwidth by the Nyquist limit, which states that the sampling frequency should be at least two times the maximum frequency of the signal (or its bandwidth). When recording music, the choice of the sampling frequency is crucial, since musical instruments produce a wide range of frequencies. The sampling frequency should be above 40 kHz, giving a highest reproduced frequency of 20 kHz, approximately the top of the human hearing range [2]. When recording speech, however, a lower sampling frequency is enough [3]. Table 1.1 summarizes some sampling frequencies and their common use [2].
Sampling frequency   Common use
11.1 kHz             Minimum quality currently used on personal computers
– kHz                Very common in computer sound file formats
24 kHz               Minimum acceptable quality needed for speech recognition
44.1 kHz             The standard for audio compact discs and high quality personal computer sound (CD quality)

Table 1.1: Sampling frequencies and their common use.
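The Nyquist limit can be illustrated numerically: a tone above half the sampling frequency folds back ("aliases") to a lower frequency. The following is a minimal sketch of that folding rule (the function name is ours, not from the thesis):

```python
def aliased_frequency(f_signal, f_sample):
    """Frequency at which a pure tone appears after sampling.

    Tones up to f_sample / 2 are reproduced faithfully; higher
    tones fold back into the band below the Nyquist frequency.
    """
    f = f_signal % f_sample
    return f if f <= f_sample / 2 else f_sample - f
```

For example, a 25 kHz tone sampled at 40 kHz is indistinguishable from a 15 kHz tone, which is why music must be sampled above 40 kHz to capture the full 20 kHz hearing range.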

Quantization refers to the process of approximating the continuous set of values in the data with a finite set of values. The instantaneous amplitude of the analog signal at each sampling instant is rounded off to the nearest of several predetermined levels. The number of levels is usually a power of 2, so that they can be represented by three, four, five, or more bits when encoding. There are two types of quantization, scalar and vector quantization. In scalar quantization, each input symbol is treated separately to give the output, while in vector quantization the input symbols are combined into groups called vectors, which are then processed to give the output [4]. Obviously, as this is a process of approximation, it introduces quantization error. By increasing the word length (the number of bits used to encode each sample), the quantization error can be reduced. In large-amplitude signals the correlation between the signal and the quantization error is small; the error is random and sounds like analog white noise. In low-level signals, however, the error becomes correlated with the signal, which leads to distortion [5]. To decorrelate this error, a technique called dithering is used. Dithering distributes the error across the entire spectrum by adding some noise prior to quantization. This method raises the noise floor equally at all frequencies; however, since the ear is not equally sensitive to all frequencies, it makes sense to push the majority of this dither noise to frequencies where the ear is least sensitive, and remove noise where the ear is most sensitive. This is done by another technique called noise shaping [6]. The complete approach to digitizing is called PCM (Pulse Code Modulation) [7]. The digital audio is then stored in files in a standard format. However, the size of the files can be quite large [3], so the files usually need to be compressed.
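The scalar quantization and dithering just described can be sketched in a few lines. This is an illustrative toy, not a production ADC model: the dither here is a simple triangular (TPDF-style) dither of one step's amplitude, and the noise-shaping stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits, dither=False):
    """Uniformly quantize samples in [-1, 1) to 2**bits levels.

    Without dither, each sample is rounded to the nearest level,
    so the error is bounded by half a quantization step. With
    dither, triangular noise of one step is added first, which
    decorrelates the error from low-level signals.
    """
    step = 2.0 / 2 ** bits
    if dither:
        x = x + (rng.random(x.shape) - rng.random(x.shape)) * step
    return np.clip(np.round(x / step) * step, -1.0, 1.0 - step)
```

Increasing `bits` halves the step per extra bit, which is the word-length effect on quantization error mentioned above.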
Once the audio file is compressed, it is divided into packets before being transmitted over the Internet. The conventional process of transferring and listening to an audio file involves first transferring the file from one computer to another, or downloading it from a server via FTP (File Transfer Protocol). Another option is to listen to an audio file while it is being transferred, in real time, using a technique called streaming.


2 Audio Compression

In this chapter the necessity of audio compression on the Internet is emphasized and audio compression techniques are introduced. Compression techniques can be classified into lossy and lossless; typical examples of both are presented. At the end of the chapter three of the main approaches in audio compression for the Internet (MPEG, WMA and RealAudio) and their main techniques are analyzed and explained in more detail.

Storing a 3 minute song on your hard disk with CD quality (44.1 kHz, stereo, 16 bits per sample) takes up:

44,100 samples/s * 2 channels (for stereo) * 2 bytes/sample * 3 * 60 s = around 30 MBytes of storage space.

Downloading it over the Internet, given an average 56K modem, would then take:

31,752,000 bytes * 8 bits/byte / (56,000 bits/s * 60 s/min) = around 75 minutes.

Sound Quality          Stereo 16 bit   Stereo 8 bit   Mono 16 bit   Mono 8 bit
CD (44.1 kHz)          10 MBytes       5 MBytes       5 MBytes      2.5 MBytes
FM Radio (22.05 kHz)   5 MBytes        2.5 MBytes     2.5 MBytes    1.25 MBytes
AM Radio (8 kHz)       1.8 MBytes      900 KBytes     900 KBytes    450 KBytes

Table 2.1: Approximate file sizes of a one minute sound file [3].
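The arithmetic in the chapter's example can be packaged into two small helper functions (the names are ours):

```python
def pcm_size_bytes(seconds, rate_hz=44_100, channels=2, bytes_per_sample=2):
    """Uncompressed PCM size of a recording, defaulting to CD quality."""
    return seconds * rate_hz * channels * bytes_per_sample

def download_minutes(size_bytes, link_bps=56_000):
    """Transfer time over a link, defaulting to a 56K modem."""
    return size_bytes * 8 / link_bps / 60

song = pcm_size_bytes(3 * 60)     # 31,752,000 bytes, about 30 MBytes
minutes = download_minutes(song)  # 75.6 minutes over a 56K modem
```

Varying `rate_hz`, `channels` and `bytes_per_sample` reproduces the rows and columns of Table 2.1.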

Looking at the previous example and also at Table 2.1, the need for compression is clear. Audio compression (also called audio coding) reduces the amount of memory required to store an audio file, which also reduces the time required to transfer it and the bandwidth needed for the transmission. The term bitrate is used to describe the strength of the compression [9]. Bitrate denotes the average number of bits that one second of audio data will consume. For PCM, the bitrate in bps is fsampling (Hz) * word length (bits/sample); e.g. for a digital audio signal from a CD, the bitrate is 1411.2 Kbps (44.1 K * 2 channels * 16 bits). The first proposals to compress audio followed those for speech coding, but speech and music have different properties. Furthermore, speech can be coded very efficiently because a speech production model is available, whereas nothing similar exists for general audio signals [10]. High coding efficiency is achieved with algorithms exploiting signal redundancies. These redundancies can be [11]:

Spatial: exploits correlation between neighboring data items.
Spectral: uses the frequency domain to exploit relationships between frequencies of change in data.
Psychoacoustic: exploits perceptual properties of the human auditory system.
Temporal.

2.1 Classification of Compression Techniques

Compression can be categorized in two ways [11, 8, 4]:

Lossless compression: the compressed data can be reconstructed (uncompressed) without loss of information. It is also referred to as reversible compression.
Lossy compression: the aim is to obtain the best possible fidelity for a given bitrate, or to minimize the bitrate needed to achieve a given fidelity. The compression is not reversible; the decompressed file is not the same as the original file.
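The PCM bitrate formula given earlier can be checked with a one-liner (the function name is ours):

```python
def pcm_bitrate_bps(rate_hz, channels, bits_per_sample):
    """bitrate = f_sampling * channels * word length, per the formula above."""
    return rate_hz * channels * bits_per_sample

cd = pcm_bitrate_bps(44_100, 2, 16)  # 1,411,200 bps = 1411.2 Kbps for CD audio
```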

2.1.1 Lossless Compression Techniques

These methods are fairly straightforward to understand and implement. Their simplicity is also their downfall in terms of attaining the best compression ratios. Some lossless compression techniques are:

Simple Repetition Suppression: replaces a sequence of n successive identical tokens with a single token and a count of the number of occurrences. This technique is used when there is silence in sound files.
Pattern Substitution: substitutes a frequently repeated pattern with a code shorter than the pattern [11].
Entropy Encoding: based on information-theoretic techniques [11, 8], e.g.:
    Huffman Coding: uses fewer bits to encode the data that occurs more frequently. The basic Huffman algorithm has been extended to Adaptive Huffman Coding, because the basic algorithm requires statistical knowledge which is often not available (e.g. in live audio).
    Arithmetic Coding: maps entire messages to real numbers based on statistics.
    LZW (Lempel-Ziv-Welch): a dictionary-based compression method. It maps a variable number of symbols to a fixed-length code.

2.1.2 Lossy Compression Techniques

Traditional lossless compression methods don't work well on audio because their compression ratios are not high enough. Lossy compression methods use source encoding techniques that may involve transform encoding, differential encoding or vector quantization. Perceptual techniques (based on psychoacoustics) are also used (e.g. in the MPEG standards), achieving higher compression. The sensitivity of the human auditory system varies over the frequency domain: it is high for frequencies between 2.5 and 5 kHz and decreases above and below this band. Therefore, some tones are masked by others and are then inaudible. There are two main masking effects:

Frequency masking: tones nearby a loud tone are masked.

Temporal masking: after hearing a loud sound, it takes a little while until humans can hear a soft tone nearby.

There is a threshold in quiet, and any tone below this threshold won't be perceived. For every tone in the audio signal a masking threshold can be calculated [see Figure 2.1]. Tones lying below this masking threshold can be eliminated by the encoder, because they will be masked and are therefore irrelevant for human perception [9]. The following are some of the lossy methods applied to audio compression:

Silence Compression: detects silence, similar to Simple Repetition Suppression.
ADPCM (Adaptive Differential Pulse Code Modulation): a derivative of DPCM (Differential Pulse Code Modulation). It encodes the difference between two consecutive samples, and adapts the quantization so that fewer bits are used when the value is smaller [7]. Used in CCITT G.721 at 16 or 32 Kbps and in G.723 at 24 and 40 Kbps.
LPC (Linear Predictive Coding): fits the signal to a speech model and then transmits the parameters of the model [7].
CELP (Code Excited Linear Predictor): does LPC, but also transmits an error term [7].
ITU-T G.711, mu-law and A-law: mu-law is an encoding commonly used in North America and Japan for digital telephony. mu-law samples are logarithmically encoded in 8 bits, but their dynamic range corresponds to 14 bits of linear data. A-law is similar to mu-law and is used as a European telephony standard. This encoding comes out to 64 Kbps at 8 kHz.
Transform coding, e.g. frequency domain coders: The spectral characteristics of the source signal and the masking properties of the human ear are exploited to reduce the transmitted bitrate [12]. The time-domain audio signal is transformed to the frequency domain before quantization [13]. The reason for transforming the signal is that the input samples are highly correlated, and the time-to-frequency transform produces coefficients that are less correlated.
There are also more coefficients with values near zero, which can be coded as zero without introducing great distortion. The spectrum is split into frequency bands that are quantized separately. Therefore the quantization noise associated with a particular band is contained within that band.
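Of the waveform coders listed above, mu-law is simple enough to sketch directly. The following is the continuous companding curve, not the exact segmented G.711 codec; the function names are ours:

```python
import math

MU = 255  # mu-law parameter used in North American / Japanese telephony

def mulaw_compress(x):
    """Map a sample in [-1, 1] logarithmically into [-1, 1].

    Small amplitudes are expanded before 8-bit quantization, which is
    how 8 transmitted bits cover roughly a 14-bit linear dynamic range.
    """
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

A quiet sample such as 0.01 is mapped to roughly 0.23 before quantization, so it survives 8-bit rounding far better than it would under linear quantization.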

Figure 2.1: Masking effects in the human ear [9].

The total number of bits available for quantizing (usually fixed by design) is distributed by dynamic bit allocation over a number of signal-component quantizers, so that the audibility of the quantization noise is minimized. The number of bits used to encode each frequency component varies: components that are subjectively more important are quantized more finely, while components that are subjectively less important have fewer bits allocated, or may not be encoded at all [12]. This results in the highest possible audio quality for a given number of bits [14]. Transform coders differ in the strategies used for quantizing the spectral components and masking the resulting coding errors [15]. Some examples are:

SBC (Subband Coding): eliminates information about frequencies which are masked, according to psychoacoustic models. It is used in ISO/MPEG Audio Coding Layers I and II.
Adaptive transform coding: used in Dolby AC-2 coding and AT&T's Perceptual Audio Coder.
Hybrid (subband/transform) coding: a combination of discrete transform and filter bank implementations. It is used by Sony's Adaptive Transform Acoustic Coding (used in MiniDisc) and by MPEG-1 Layer III.

Other transformations can also be applied, such as the DCT (Discrete Cosine Transform), DST (Discrete Sine Transform), and DWT (Discrete Wavelet Transform). There are also nested approaches, or multilevel data compression techniques, such as applying the DCT several times [16].

Finally, coding methods can also be divided into CBR (constant bitrate) and VBR (variable bitrate). CBR techniques vary the quality level in order to ensure a consistent bitrate throughout an encoded file. Difficult passages (e.g. passages containing a relatively wide

stereo separation) may be encoded with fewer than the optimum number of bits, and easy passages (passages containing silence or a relatively narrow stereo separation) are encoded using more bits than necessary. Consequently, difficult passages may suffer a decrease in quality, while easy passages may include unused bits [17]. VBR techniques ensure consistently high audio quality throughout an encoded file, at the cost of a variable bitrate. Difficult passages in the audio source are allocated additional bits and easy passages fewer bits, thus reducing unused bits. VBR encoding produces an overall higher quality level than CBR encoding, and should be used when consistent audio quality is the top priority and a constant or predictable encoded file size is not critical [17].

2.2 Audio Compression on the Internet

When the Internet was young, the only way to deliver audio in an acceptable time was to apply some reduction scheme before compression: reducing the sampling rate, converting from stereo to mono, reducing the resolution from 16 down to 8 bits per sample, or all of the above. But every reduction in these parameters resulted in lower quality sound [3]. Not long after, however, groups began to create highly complex algorithms that allowed a reduction in the size of the file while retaining the highest quality possible. An early effort came from the MPEG (Moving Picture Experts Group), which set out to develop a compression scheme oriented first to video and then also to audio [3]. The next step was creating an acceptable compression scheme for streaming technology [3]. For the first time, hour-long sound files could be played on a computer with a 28.8 Kbps modem. The higher the bandwidth available to users, the better the sound quality they got.
The following is a review of the MPEG audio compression standards and two proprietary audio compression formats, Windows Media Audio and RealAudio [111].

2.2.1 MPEG

MPEG is a working group of ISO/IEC in charge of the development of standards for the coded representation of digital audio and video. Established in 1988, the group has produced MPEG-1, the standard on which such products as Video CD and MP3 are based; MPEG-2, the standard on which digital television set-top boxes and DVD are based; and MPEG-4, the standard for multimedia [18]. Furthermore, the MPEG audio compression algorithm is being considered by the European Broadcasting Union as a standard for use in digital audio broadcasting [19]. The group has developed further standards that are not reviewed here, since they are not related to audio.

MPEG-1 (ISO/IEC 11172): Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbps

MPEG-1 is a standard for efficient storage and retrieval of audio and video on CD, consisting of several parts: part 1 (Systems), part 2 (Video), part 3 (Audio), part 4 (Conformance Testing), and part 5 (Reference Software). The Systems part provides multiplexing and synchronization support to elementary audio and video streams. The Audio part provides lossy encoding of stereo audio with transparency (subjective quality similar to the original stereo) at some predefined bitrates [see Table 2.2], and also provides a free bitrate mode to support fixed bitrates other than the predefined ones. From Layer I to Layer III, the codec complexity increases, the overall codec delay increases [see Table 2.2], and the performance increases. The sampling frequencies used in the three layers are 32, 44.1 and 48 kHz.

Layer             Bitrate (compression)   Minimum coder delay
Layer I           384 Kbps (1:4)          < 50 ms
Layer II          192 Kbps (1:6-1:8)      < 100 ms
Layer III (MP3)   128 Kbps (1:10-1:12)    < 150 ms

Table 2.2: Bitrates and theoretical minimum coder delay in the MPEG-1 layers [7].

MPEG-1 Audio Layer I (MP1)

The Layer I encoder is the basic MPEG encoder [see Figure 2.2]. The audio stream passes through a PQMF (Polyphase Quadrature Mirror Filterbank) that divides the input into 32 equal-width frequency subbands, which does not reflect the human auditory model, where critical bands are not of equal width [20]. Each subband has 12 frequency samples, obtained by calculating the DFT (Discrete Fourier Transform) of 12 input PCM samples. This means that a Layer I frame has 384 (12 * 32) PCM audio samples. In Layer I the DFT is calculated with 512 points. In addition, the filter bank also determines the maximum amplitude of the 12 subband samples in each subband.
This value is known as the scaling factor for the subband, and it is passed both to the psychoacoustic model and, together with the set of frequency samples in each subband, to the corresponding quantizer and coding block [7]. The input audio stream simultaneously passes through a psychoacoustic model that determines the ratio of the signal energy to the masking threshold for each subband. The quantizer and coding block use a bit allocation algorithm that takes the signal-to-mask ratios and decides how to distribute the total number of bits available for the quantization of

the subband signals, so as to minimize the audibility of the quantization noise and maximize the compression at the same time.

Figure 2.2: MPEG basic audio coder and decoder [21].

Finally, the last block takes the representation of the quantized subband samples and formats this data and side information (bit allocation, scale factors) into a coded bitstream [see Figure 2.3].

Figure 2.3: MPEG-1 Layer I frame structure [21]: valid for 384 PCM audio input samples. Duration: 8 ms at a sampling rate of 48 kHz.
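The frame durations quoted in the figure captions follow directly from the samples-per-frame counts:

```python
def frame_duration_ms(samples_per_frame, sampling_rate_hz):
    """Duration of one MPEG audio frame in milliseconds."""
    return 1000.0 * samples_per_frame / sampling_rate_hz

layer1 = frame_duration_ms(384, 48_000)   # 8.0 ms  (12 samples x 32 subbands)
layer2 = frame_duration_ms(1152, 48_000)  # 24.0 ms (Layers II and III)
```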

The decoder is the reverse of this process, except that no psychoacoustic model is required. The bitstream is unpacked and passed through the filter bank to reconstruct the time-domain PCM samples. Layer I is the same as the PASC (Precision Adaptive Sub-band Coding) compression used in Digital Compact Cassettes. Typical applications of Layer I include digital recording on tapes, hard disks, or magneto-optical disks, which can tolerate the high bitrate [21].

MPEG-1 Audio Layer II (MP2)

The Layer II algorithm is a straightforward enhancement of Layer I. It codes the audio data in larger groups (1152 PCM samples) and imposes some restrictions on the possible bit allocations for values from the middle and higher subbands. It also represents the bit allocation, the scale factor values and the quantized samples with more compact code. Layer II gets better audio quality by saving bits in these areas, so that more code bits are available to represent the quantized subband values [20]. The resulting frame structure is represented in Figure 2.4. Typical applications of Layer II include audio broadcasting, television, consumer and professional recording, and multimedia [21].

Figure 2.4: MPEG-1 Layer II frame structure [21]: valid for 1152 PCM audio input samples. Duration: 24 ms at a sampling rate of 48 kHz.

MPEG-1 Audio Layer III (MP3)

The Layer III algorithm is a much more refined approach, derived from ASPEC (Audio Spectral Perceptual Entropy Coding). Although based on the same filter bank found in the other layers (for reasons of compatibility), Layer III compensates for some filter bank deficiencies by processing the filter outputs with an MDCT (Modified Discrete Cosine Transform) [see Figure 2.5] [22]. It can be said that the filter bank used in MPEG Layer III is a hybrid filter bank consisting of a polyphase filter bank and an MDCT [20].

Unlike the polyphase filter bank, the MDCT transformation is lossless in the absence of quantization. The MDCT further subdivides the subband outputs in frequency to provide better spectral resolution [20].

Figure 2.5: Block structure of the ISO/MPEG audio encoder and decoder, Layer III [15].

Besides the MDCT processing, other enhancements over the Layer I and II algorithms include the following [20]:

Alias reduction: Layer III specifies a method of processing the MDCT values to remove some artifacts caused by the overlapping bands of the polyphase filter bank.
Non-uniform quantization.
Scale-factor bands: These bands cover several MDCT coefficients and have approximately critical-band widths. In Layer III, scale factors serve to color the quantization noise to fit the varying frequency contours of the masking threshold. Values for these scale factors are adjusted as part of the noise-allocation process.

Entropy coding of data values: To get better data compression, Layer III uses variable-length Huffman codes to encode the quantized samples. The Huffman code tables assign shorter words to more frequent values. If the number of bits resulting from the coding operation exceeds the number of bits available to code a given block of data, this is corrected by adjusting the global gain to give a larger quantization step size, leading to smaller quantized values. This operation is repeated with different quantization step sizes until the resulting bit demand for Huffman coding is small enough. This loop is called the rate loop, because it modifies the overall coder rate until it is small enough [23].
Use of a "bit reservoir": The coder encodes at a different bitrate when needed. If a frame is easy, it is assigned fewer bits and the unused bits are put into a reservoir buffer. When a frame comes along that needs more than the average number of bits, the reservoir is tapped for extra capacity.
Ancillary data: held in a separate buffer and gated onto the output bit stream using some of the bits allocated for the reservoir buffer when they are not required for audio.
Noise allocation: The encoder iteratively varies the quantizers in an orderly way, quantizes the spectral values, counts the number of Huffman code bits required to code the audio data, and calculates the resulting noise. If, after quantization, some scale-factor bands still have more than the allowed distortion, the encoder amplifies the values in those scale-factor bands, effectively decreasing the quantizer step size for those bands. Then the process repeats.
The process stops if any of the following three conditions is true: none of the scale-factor bands has more than the allowed distortion; the next iteration would cause the amplification of the bands to exceed the maximum allowed rate; or the next iteration would require all the scale factors to be amplified. As with Layer II, Layer III processes the audio data in frames of 1152 samples. The arrangement of the bit fields in the bitstream is: header (32 bits), CRC (0 or 16 bits), side information (136 or 256 bits) and main data. Layer III is intended for applications where a critical need for a low bitrate justifies the expensive and sophisticated encoding system. It allows high quality results at bitrates as low as 64 Kbps. Typical applications are in telecommunications and professional audio, such as commercially published music and video [20].
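The rate loop described above can be sketched schematically. This is only an illustration of the control flow, not the reference encoder: the `huffman_bits` and `quantize` callables stand in for the real Huffman tables and non-uniform quantizer, and the step multiplier is arbitrary.

```python
def rate_loop(spectral_values, bit_budget, huffman_bits, quantize, start_step=1.0):
    """Enlarge the global quantizer step until the Huffman-coded
    frame fits within the available bit budget (the Layer III
    'rate loop' in caricature)."""
    step = start_step
    while huffman_bits(quantize(spectral_values, step)) > bit_budget:
        # Coarser quantization -> smaller quantized values -> shorter codes.
        step *= 2 ** 0.25
    return step
```

In the real encoder this loop is nested inside the noise-allocation (distortion-control) loop, which adjusts the per-band scale factors between runs of the rate loop.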

MPEG-2 (ISO/IEC 13818): Generic coding of moving pictures and associated audio information

MPEG-2 has 9 parts. Part 1 addresses the combining of one or more elementary streams of video and audio, as well as other data, into single or multiple streams suitable for storage or transmission. This is specified in two forms, the Program Stream (PS) and the Transport Stream (TS); each is optimized for a different set of applications. Part 2 builds on the powerful video compression capabilities of the MPEG-1 standard to offer a wide range of coding tools. Part 3, the Audio part, supports encoding of multichannel audio as a backwards compatible multichannel extension of MPEG-1 Audio, using the same family of audio codecs: Layers I, II and III. The new audio features of MPEG-2 are:

a low sample rate extension to address very low bitrate applications with limited bandwidth requirements; the new sampling frequencies are 16, 22.05 or 24 kHz, and the bitrates extend down to 8 Kbps;
a multichannel extension to address surround sound applications with up to 5 main audio channels (left, center, right, left surround, right surround) and optionally one extra low frequency enhancement (LFE) channel;
a multilingual extension to allow the inclusion of up to 7 more audio channels [25].

Parts 4 and 5 correspond to parts 4 and 5 of MPEG-1 [23]. Part 6, DSM-CC (Digital Storage Media Command and Control), provides protocols for session setup across different networks and for remote control of a server containing MPEG-2 content. Part 7, Advanced Audio Coding (AAC), provides a new multichannel audio coding that is not backwards compatible with MPEG-1 Audio. It will be explained within the MPEG-4 standard, which defines an improvement of AAC.
Part 8 was intended to support video coding when samples are represented with an accuracy of more than 8 bits, but its development was discontinued when the interest of the industry that had requested it did not materialize. Part 9, Real Time Interface, provides a standard interface between an MPEG-2 Transport Stream and a decoder. MPEG-2 provides broadcast-quality audio and video at higher data rates. Parts 1, 2 and 3 are used in digital television set-top boxes and DVDs (Digital Versatile Discs). Some MPEG-2 encoders are very costly professional equipment, and some are inexpensive PC boards sold with video editing software. AAC has been adopted by Japan for a national digital television standard and by several manufacturers of secure digital music. As the MPEG-2 standard defines 16 kHz as the lowest sample rate, a further extension has been introduced, again dividing the low sample rates of MPEG-2 by 2: 8, 11.025 and 12

kHz [23]. This extension is named MPEG 2.5, but it is not part of the official ISO standard.

MPEG-1, MPEG-2 Audio Frame

An MPEG audio file is built up from smaller independent parts called frames, each one with its own header and audio information. There is no file header, and therefore any part of an MPEG file can be cut and played correctly [26, 27]. To get the information about an MPEG file, it is usually enough to find the first frame, read its header and assume that the other frames are the same. However, this may not always be the case: VBR MPEG files may use bitrate switching, which means that the bitrate changes according to the content of each frame. Layer III decoders must support this method, and Layer I & II decoders may support it. The frame header is constituted by the first four bytes in a frame. Here is a graphical presentation of the header content; the characters A to M indicate the different fields, and Table 2.3 shows the details about the content of each field [26].

AAAAAAAA AAABBCCD EEEEFFGH IIJJKLMM

Sign  Length (bits)  Position (bits)  Description
A     11             (31-21)          Frame sync (all bits set): used for synchronization. To avoid false frame sync, 2 or more frames in a row must be checked.
B     2              (20,19)          MPEG Audio version ID
                                      00 - MPEG Version 2.5
                                      01 - reserved
                                      10 - MPEG Version 2 (ISO/IEC 13818-3)
                                      11 - MPEG Version 1 (ISO/IEC 11172-3)
C     2              (18,17)          Layer description
                                      00 - reserved
                                      01 - Layer III
                                      10 - Layer II
                                      11 - Layer I
D     1              (16)             Protection bit
                                      0 - Protected by CRC (16 bit CRC follows header)
                                      1 - Not protected
E     4              (15-12)          Bitrate index

      bits  V1,L1  V1,L2  V1,L3  V2,L1  V2,L2&L3
      0000  free   free   free   free   free
      0001  32     32     32     32     8
      0010  64     48     40     48     16
      0011  96     56     48     56     24
      0100  128    64     56     64     32
      0101  160    80     64     80     40
      0110  192    96     80     96     48
      0111  224    112    96     112    56
      1000  256    128    112    128    64
      1001  288    160    128    144    80
      1010  320    192    160    160    96
      1011  352    224    192    176    112
      1100  384    256    224    192    128
      1101  416    320    256    224    144
      1110  448    384    320    256    160
      1111  bad    bad    bad    bad    bad

NOTES: All values are in kbps. V1 = MPEG Version 1, V2 = MPEG Version 2 and 2.5; L1 = Layer I, L2 = Layer II, L3 = Layer III.

For Layer II there are some combinations of bitrate and mode which are not allowed. Here is a list of the allowed combinations:

bitrate  allowed modes
free     all
32       single channel
48       single channel
56       single channel
64       all
80       single channel
96       all
112      all
128      all
160      all
192      all
224      stereo, intensity stereo, dual channel
256      stereo, intensity stereo, dual channel
320      stereo, intensity stereo, dual channel
384      stereo, intensity stereo, dual channel
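The fixed header fields described so far can be decoded with a few shifts and masks. A minimal sketch, restricted to the MPEG-1 Layer II bitrate column to keep the lookup table short; the function and variable names are mine:

```python
# Sketch: extract fields A-E from a 32-bit MPEG audio frame header word.
# Bitrate column for MPEG Version 1, Layer II (kbps);
# index 0 = "free" (here 0), index 15 = "bad" (here None).
V1_L2_BITRATES = [0, 32, 48, 56, 64, 80, 96, 112,
                  128, 160, 192, 224, 256, 320, 384, None]

def parse_header_start(header: int):
    """Return (version_id, layer, protection, bitrate_index) from a
    big-endian 32-bit header word; raise if the sync bits are wrong."""
    sync        = (header >> 21) & 0x7FF  # A: frame sync, 11 bits, all set
    version_id  = (header >> 19) & 0x3   # B: MPEG audio version ID
    layer       = (header >> 17) & 0x3   # C: layer description
    protection  = (header >> 16) & 0x1   # D: 0 = 16-bit CRC follows header
    bitrate_idx = (header >> 12) & 0xF   # E: bitrate index
    if sync != 0x7FF:
        raise ValueError("not a frame header: sync bits not all set")
    return version_id, layer, protection, bitrate_idx
```

For example, a header word built as MPEG Version 1 (11), Layer II (10), not CRC-protected (1), bitrate index 8 decodes to (3, 2, 1, 8), and index 8 maps to 128 kbps in the Version 1 / Layer II column.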

F  2  (11,10)  Sampling rate frequency index (values are in Hz)
               bits  MPEG1    MPEG2    MPEG2.5
               00    44100    22050    11025
               01    48000    24000    12000
               10    32000    16000    8000
               11    reserv.  reserv.  reserv.
G  1  (9)      Padding bit
               0 - frame is not padded
               1 - frame is padded with one extra slot
H  1  (8)      Private bit. It may be freely used for the specific needs of an application, e.g. if it has to trigger some application-specific event.
I  2  (7,6)    Channel mode
               00 - Stereo
               01 - Joint stereo (Stereo)
               10 - Dual channel (Stereo)
               11 - Single channel (Mono)
J  2  (5,4)    Mode extension (only if Joint stereo). The mode extension is used to join information that is of no use for the stereo effect, thus reducing the needed resources. These bits are dynamically determined by an encoder in Joint stereo mode. The complete frequency range of an MPEG file is divided into 32 subbands. For Layers I & II these two bits determine the frequency range (bands) where intensity stereo is applied. For Layer III these two bits determine which type of joint stereo is used (intensity stereo or M/S stereo); the frequency range is determined within the decompression algorithm.

               value  Layer I & II    Intensity stereo  MS stereo
               00     bands 4 to 31   off               off
               01     bands 8 to 31   on                off
               10     bands 12 to 31  off               on
               11     bands 16 to 31  on                on

K  1  (3)      Copyright
               0 - Audio is not copyrighted
               1 - Audio is copyrighted
L  1  (2)      Original

               0 - Copy of original media
               1 - Original media
M  2  (1,0)    Emphasis
               00 - none
               01 - 50/15 ms
               10 - reserved
               11 - CCITT J.17

Table 2.3: MPEG-1 and -2 frame header [26].

Frames usually have a CRC check after the header, and other fields in different formats, as has been shown for each layer [see Figures 2.3 and 2.4]. Then comes the audio data, and after the data follows tag information, used to describe the MPEG audio file. The structure of the MPEG Audio Tag ID3v1 is:

AAABBBBB BBBBBBBB BBBBBBBB BBBBBBBB BCCCCCCC CCCCCCCC CCCCCCCC CCCCCCCD DDDDDDDD DDDDDDDD DDDDDDDD DDDDDEEE EFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFG

Sign  Length (bytes)  Position (bytes)  Description
A     3               (0-2)             Tag identification. Must contain 'TAG' if the tag exists and is correct.
B     30              (3-32)            Title
C     30              (33-62)           Artist
D     30              (63-92)           Album
E     4               (93-96)           Year
F     30              (97-126)          Comment
G     1               (127)             Genre (1 = Classic Rock, 2 = Country, 3 = Dance, ...)

Table 2.4: MPEG Audio Tag ID3v1 [26].

The ID3v1 tag has some limitations and drawbacks. It has a fixed size of 128 bytes and supports only a few fields of information, which are limited to 30 characters, making it impossible to correctly describe many titles and authors. Furthermore, since the position of

the ID3v1 tag is at the end of the audio file, it will also be the last thing to arrive when the file is being transmitted. ID3v2 is a newer tagging system designed to be expandable and more flexible. It improves on ID3v1 by enlarging the tag to up to 256 MBytes and dividing it into smaller pieces, called frames, which can contain any kind of information and data, such as title, lyrics, images, web links etc. It still keeps files small by being byte-conservative and by being able to compress data. What is more, the tag resides at the beginning of the audio file, thus making it suitable for streaming.

MPEG Audio stereo redundancy coding

MPEG Audio works with both mono and stereo signals. It supports two types of stereo redundancy coding: intensity and MS (middle/side) stereo coding. All layers support intensity stereo coding; Layer III also supports MS stereo coding. Both forms of redundancy coding exploit psychoacoustics: above 2 kHz and within each critical band, the human auditory system bases its perception of stereo more on the temporal envelope of the audio signal than on its temporal fine structure. In intensity stereo mode the encoder replaces the left and right signals by a single representative signal plus directional information [38]. The MS stereo mode encodes the left and right channel signals in certain frequency ranges as middle (sum of left and right, L+R) and side (difference of left and right, L-R) channels. A technique called joint stereo coding is used in Layer III to achieve a more efficient combined coding of the left and right channels of a stereophonic audio signal. It takes advantage of the redundancy in stereo material: the encoder switches dynamically from discrete L/R to a matrixed L+R/L-R mode, depending on the material.

MPEG-4 (ISO/IEC 14496): Coding of audio-visual objects

The first 6 parts of the standard correspond to those of MPEG-2, and it is backwards compatible with MPEG-1 and MPEG-2.
There are, however, a number of significant differences in content. MPEG-4 enables the coding of individual objects, which means that the video information doesn't need to be of rectangular shape as in MPEG-1 and MPEG-2 Video. The same applies to audio: MPEG-4 provides all the tools to encode speech and audio at rates from 2 to 64 kbps and with different functionality, including MPEG-4 AAC, an extension of MPEG-2 AAC [34]. Part 5 is a complete software implementation of both encoders and decoders. Compared with the reference software of MPEG-1 and MPEG-2, whose value is purely informative, the MPEG-4 Reference Software has the same normative value as the textual parts of the

standard. The software may also be used for commercial products, and the copyright of the software is licensed at no cost by ISO/IEC for products conforming to the standard. So far the industry has enthusiastically adopted MPEG-4 Video, which has been selected by several industries for setting standards for next-generation mobile communication and is being utilized to develop solutions for video on demand and related applications. Some people call it the future global multimedia language [29]. MPEG-4 Audio provides several profiles to allow the optimal use of MPEG-4 in different applications. At the same time, the number of profiles is kept as low as possible in order to maintain maximum interoperability. Some of the profiles that MPEG-4 offers are Speech Audio, Synthesis Audio, Main Audio, High Quality Audio, Low Delay Audio, Natural Audio, Mobile Audio Internetworking, etc. [36]. MPEG-4 Audio has two work items underway for improving audio coding efficiency [35]: bandwidth extension, for both general audio signals and speech signals, and parametric coding, to extend the capabilities currently provided by HILN (Harmonic and Individual Lines plus Noise) [35].

MPEG-2/4 AAC (Advanced Audio Coding)

The AAC system is the highest-performance coding method within MPEG [37]. AAC works with a wide range of sampling rates from 8 to 96 kHz, bitrates from 16 to 576 kbps (achieving indistinguishable audio quality at 96 kbps per channel), and from 1 to 48 audio channels. Because it uses a modular approach, users may pick and choose among the component tools to make a product with an appropriate performance/complexity ratio [37]. Due to its high coding efficiency, AAC is a prime candidate for any digital broadcasting system and for delivery of high-quality music via the Internet [29].
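The AAC operating ranges just quoted can be wrapped in a small validity check. This is purely illustrative, under the assumption that the three ranges are independent; real encoders constrain the combinations much further:

```python
# Toy check of the AAC parameter envelope stated in the text:
# 8-96 kHz sampling, 16-576 kbps, 1-48 audio channels.

def in_aac_envelope(sample_rate_hz, bitrate_kbps, channels):
    return (8_000 <= sample_rate_hz <= 96_000
            and 16 <= bitrate_kbps <= 576
            and 1 <= channels <= 48)
```

So a 48 kHz stereo stream at 96 kbps falls inside the envelope, while a 4 kHz sampling rate does not.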
MPEG-2/4 AAC has no backward compatibility with MPEG-1 and -2, but as it is built on a structure similar to Layer III, it retains some of MP3's powerful features: redundancy reduction using Huffman encoding, the bit reservoir, non-uniform quantization, ancillary data and the joint stereo mode [37]. However, it improves on MP3 in many details and uses new coding tools [see Figure 2.6]. The crucial differences are [29]:

- Filter bank: AAC uses a plain MDCT (Modified Discrete Cosine Transform) [29] and an increased window length (2048 instead of 1152).
- TNS (Temporal Noise Shaping): shapes the distribution of quantization noise in time by prediction in the frequency domain, and transmits the prediction residual of the spectral coefficients instead of the coefficients themselves [29].
- Prediction: benefits from the fact that certain types of audio signals (stationary or semi-stationary) are easy to predict [29, 30]; then, instead of repeating such information for sequential windows, a simple repeat instruction can be passed.

- Quantization: makes a finer control of the resolution possible, using an iteration method.

AAC is a block-oriented, VBR coding algorithm, but rate control can be used in the encoder such that the output bitrate is averaged to a predetermined rate (as for CBR). Each block of AAC-compressed bits is called a raw data block and can be decoded stand-alone (without knowledge of information in prior bitstream blocks), which facilitates encoder and decoder synchronization; if any packet is lost, this doesn't affect the decodability of adjacent packets [32]. The syntax of an AAC bitstream is as follows:

<AAC_bitstream> => <raw_data_block> <AAC_bitstream>
<raw_data_block> => [<element>] <END> <PAD>

Here [] indicates one or more occurrences, <END> indicates the end of a raw_data_block, and <PAD> forces the total length of a raw_data_block to be an integral number of bytes. An <element> is a string of bits of varying length, indicating whether the represented data is from a single audio channel, stereo, multi-channel, user data, etc. [32].

Figure 2.6: MPEG-2 AAC audio coding [1].
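The raw_data_block grammar above can be read as: consume elements until END, then pad to a whole number of bytes. A toy token-level sketch of that rule; the token names and bit widths are invented, since the real syntax operates at the bit level:

```python
# Toy walk over the raw_data_block production [<element>] <END> <PAD>.
# tokens: list of (name, bit_length) pairs, terminated by an ("END", n) token.

END = "END"

def read_raw_data_block(tokens):
    """Collect element names up to <END>, then apply <PAD> so the block
    length becomes an integral number of bytes. Returns (elements, bits)."""
    elements, bits = [], 0
    for name, length in tokens:
        bits += length
        if name == END:
            break
        elements.append(name)
    pad = (8 - bits % 8) % 8  # <PAD>: round the length up to a byte boundary
    return elements, bits + pad
```

For instance, two hypothetical elements of 30 and 61 bits plus a 3-bit END marker give 94 bits, which padding rounds up to 96 bits (12 bytes).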

The standard defines two example formats for the transport of audio data [33]: ADIF (Audio Data Interchange Format) puts all the data controlling the decoder (sampling frequency, mode, etc.) into a header preceding the audio stream. This is useful for file exchange, but not for streaming. ADTS (Audio Data Transport Stream) packs the AAC data into frames with headers (like the MPEG-1, -2 format), which is more suitable for streaming.

MPEG-2/4 AAC-LD

The MPEG-4 AAC-LD (Low Delay Audio Coder) is designed to combine the advantages of perceptual audio coding with the low delay necessary for two-way communication. The codec is closely derived from MPEG-2 AAC [41], but the contributors to the delay (such as frame length or window shape) have been addressed and modified [37].

MPEG-4 General Audio coder

The MPEG-4 General Audio coder is derived from MPEG-2 AAC, and is backward compatible with it, but adds several enhancements:

- PNS (Perceptual Noise Substitution): PNS is based on the observation that one noise sounds much like another; the actual fine structure of a noise signal is of minor importance for its subjective perception. Consequently, instead of transmitting the actual spectral components of a noisy signal, the bitstream just signals that the frequency region is noise-like and gives some additional information on the total power in that band [38].
- LTP (Long Term Prediction): LTP is an efficient tool for reducing the redundancy of a signal between successive coding frames. It is especially effective for the parts of a signal which have a clear pitch property [38].

The structure of the coder is represented in Figure 2.7. The same building blocks are present in the decoder implementation, performing the inverse processing steps. To increase coding efficiency for musical signals at very low bitrates (below 16 kbps per channel) [38, 41], TwinVQ-based coding tools are part of MPEG-4 General Audio [40].
The basic idea is to replace the conventional encoding of scalefactors and spectral data used in MPEG-4 AAC by an interleaved vector quantization applied to a normalized spectrum [38]. The input signal vector (spectral coefficients) is interleaved into subvectors that are then quantized using vector quantizers [38]. The rest of the processing chain remains identical, as can be seen in Figure 2.8.

Figure 2.7: Building Blocks of the MPEG-4 General Audio Coder [38].

Figure 2.8: Weighted interleaved vector quantization [38].
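The interleaving step described for TwinVQ can be sketched in two lines: coefficient i goes to subvector i mod k. The actual vector quantization and weighting are omitted; this only shows the subvector split and its inverse:

```python
# Interleave a spectral coefficient vector into k subvectors (TwinVQ-style
# split only; the quantizers themselves are not modelled here).

def interleave(coeffs, k):
    # Subvector j gets coefficients j, j+k, j+2k, ...
    return [coeffs[j::k] for j in range(k)]

def deinterleave(subvectors):
    # Inverse mapping: coefficient i came from subvectors[i % k][i // k].
    k = len(subvectors)
    n = sum(len(s) for s in subvectors)
    return [subvectors[i % k][i // k] for i in range(n)]
```

Interleaving [0, 1, 2, 3, 4, 5] into two subvectors yields [0, 2, 4] and [1, 3, 5], and deinterleaving restores the original order.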

Windows Media Audio

Microsoft Windows Media Audio (WMA) delivers audio for streaming and download. According to Microsoft, WMA offers [47]:

- near-CD quality at 48 kbps, and CD quality at 64 kbps,
- high scalability, with bandwidths from 5 kbps to 192 kbps and sampling rates from 8 kHz to 48 kHz for high-quality stereo music [47]. This allows users to choose the best combination of bandwidth and sampling rate for their content.

Microsoft claims that the WMA codec is very resistant to degradation due to packet loss, which makes it excellent for use with streaming content. In addition, by using an improved encoding algorithm, this codec encodes and decodes much faster than others. Windows Media audio files can also support ID3 metadata: if a source .mp3 file is encoded using the Windows Media Audio codec, any ID3 properties are included in the Windows Media audio file [47]. As it is a proprietary format, no information can be obtained about the coder and the compression technique used; all the information presented here was obtained from Microsoft.

RealAudio

RealAudio is a proprietary encoding format created by RealNetworks. It was the first compression format to support live audio over the Internet and thus gained considerable support, but it requires proprietary server software in order to provide the real-time playback facility. It was first designed for voice applications on the web, but later it was developed into music and video algorithms as well [48]. RealAudio uses a lossy compression scheme that provides high audio quality from source material with high sampling rates (11 kHz, 22 kHz and 44 kHz). The RealAudio Encoder compression scheme works by making educated guesses about what is most important in the sound file [48]: it knows how much room there is in the destination stream and fills that available bandwidth with as much sound information as it can. Any sound information that doesn't fit is lost.
The user can help the encoder with this task by emphasizing the most important parts of the recording.
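The "fill the available bandwidth with the most important information" behaviour described above can be modelled as a greedy selection. Everything here (the components, their importance scores and sizes) is invented for illustration; RealAudio's actual selection criteria are proprietary and not public:

```python
# Greedy toy model of lossy bandwidth filling: keep the most important
# sound components that fit the bit budget, drop the rest.

def fill_stream(components, budget_bits):
    """components: (importance, size_bits) pairs.
    Returns (kept components, bits used)."""
    kept, used = [], 0
    for importance, size in sorted(components, key=lambda c: -c[0]):
        if used + size <= budget_bits:  # component fits the remaining budget
            kept.append((importance, size))
            used += size
        # anything that doesn't fit is simply lost (lossy compression)
    return kept, used
```

With components of importance/size (1, 40), (3, 50), (2, 30) and an 80-bit budget, the two most important components are kept and the least important one is dropped.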


More information

STUDY OF MUTUAL INFORMATION IN PERCEPTUAL CODING WITH APPLICATION FOR LOW BIT-RATE COMPRESSION

STUDY OF MUTUAL INFORMATION IN PERCEPTUAL CODING WITH APPLICATION FOR LOW BIT-RATE COMPRESSION STUDY OF MUTUAL INFORMATION IN PERCEPTUAL CODING WITH APPLICATION FOR LOW BIT-RATE COMPRESSION Adiel Ben-Shalom, Michael Werman School of Computer Science Hebrew University Jerusalem, Israel. {chopin,werman}@cs.huji.ac.il

More information

(Refer Slide Time: 2:10)

(Refer Slide Time: 2:10) Data Communications Prof. A. Pal Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture-12 Multiplexer Applications-1 Hello and welcome to today s lecture on multiplexer

More information

high-quality surround sound at stereo bit-rates

high-quality surround sound at stereo bit-rates FRAUNHOFER Institute For integrated circuits IIS MPEG Surround high-quality surround sound at stereo bit-rates Benefits exciting new next generation services MPEG Surround enables new services such as

More information

White paper. An explanation of video compression techniques.

White paper. An explanation of video compression techniques. White paper An explanation of video compression techniques. Table of contents 1. Introduction to compression techniques 4 2. Standardization organizations 4 3. Two basic standards: JPEG and MPEG 4 4. The

More information

The Telephone Network. An Engineering Approach to Computer Networking

The Telephone Network. An Engineering Approach to Computer Networking The Telephone Network An Engineering Approach to Computer Networking Is it a computer network? Specialized to carry voice Also carries telemetry video fax modem calls Internally, uses digital samples Switches

More information

Web-Conferencing System SAViiMeeting

Web-Conferencing System SAViiMeeting Web-Conferencing System SAViiMeeting Alexei Machovikov Department of Informatics and Computer Technologies National University of Mineral Resources Mining St-Petersburg, Russia amachovikov@gmail.com Abstract

More information

TCOM 370 NOTES 99-6 VOICE DIGITIZATION AND VOICE/DATA INTEGRATION

TCOM 370 NOTES 99-6 VOICE DIGITIZATION AND VOICE/DATA INTEGRATION TCOM 370 NOTES 99-6 VOICE DIGITIZATION AND VOICE/DATA INTEGRATION (Please read appropriate parts of Section 2.5.2 in book) 1. VOICE DIGITIZATION IN THE PSTN The frequencies contained in telephone-quality

More information

Basics of Digital Recording

Basics of Digital Recording Basics of Digital Recording CONVERTING SOUND INTO NUMBERS In a digital recording system, sound is stored and manipulated as a stream of discrete numbers, each number representing the air pressure at a

More information

Analog vs. Digital Transmission

Analog vs. Digital Transmission Analog vs. Digital Transmission Compare at two levels: 1. Data continuous (audio) vs. discrete (text) 2. Signaling continuously varying electromagnetic wave vs. sequence of voltage pulses. Also Transmission

More information

Creating Content for ipod + itunes

Creating Content for ipod + itunes apple Apple Education Creating Content for ipod + itunes This guide provides information about the file formats you can use when creating content compatible with itunes and ipod. This guide also covers

More information

Trigonometric functions and sound

Trigonometric functions and sound Trigonometric functions and sound The sounds we hear are caused by vibrations that send pressure waves through the air. Our ears respond to these pressure waves and signal the brain about their amplitude

More information

Network Traffic #5. Traffic Characterization

Network Traffic #5. Traffic Characterization Network #5 Section 4.7.1, 5.7.2 1 Characterization Goals to: Understand the nature of what is transported over communications networks. Use that understanding to improve network design Characterization

More information

Implementing an In-Service, Non- Intrusive Measurement Device in Telecommunication Networks Using the TMS320C31

Implementing an In-Service, Non- Intrusive Measurement Device in Telecommunication Networks Using the TMS320C31 Disclaimer: This document was part of the First European DSP Education and Research Conference. It may have been written by someone whose native language is not English. TI assumes no liability for the

More information

Ideal CD player and FM tuner for use with other 301 Reference Series components also supports RDS and USB memory playback

Ideal CD player and FM tuner for use with other 301 Reference Series components also supports RDS and USB memory playback Reference301 Series PD-301 CD Player /FM Tuner Ideal CD player and FM tuner for use with other 301 Reference Series components also supports RDS and USB memory playback Main functions High-precision slot-in

More information

White Paper: An Overview of the Coherent Acoustics Coding System

White Paper: An Overview of the Coherent Acoustics Coding System White Paper: An Overview of the Coherent Acoustics Coding System Mike Smyth June 1999 Introduction Coherent Acoustics is a digital audio compression algorithm designed for both professional and consumer

More information

Multichannel stereophonic sound system with and without accompanying picture

Multichannel stereophonic sound system with and without accompanying picture Recommendation ITU-R BS.775-2 (07/2006) Multichannel stereophonic sound system with and without accompanying picture BS Series Broadcasting service (sound) ii Rec. ITU-R BS.775-2 Foreword The role of the

More information

RF Measurements Using a Modular Digitizer

RF Measurements Using a Modular Digitizer RF Measurements Using a Modular Digitizer Modern modular digitizers, like the Spectrum M4i series PCIe digitizers, offer greater bandwidth and higher resolution at any given bandwidth than ever before.

More information

CDMA TECHNOLOGY. Brief Working of CDMA

CDMA TECHNOLOGY. Brief Working of CDMA CDMA TECHNOLOGY History of CDMA The Cellular Challenge The world's first cellular networks were introduced in the early 1980s, using analog radio transmission technologies such as AMPS (Advanced Mobile

More information

Information Paper. FDMA and TDMA Narrowband Digital Systems

Information Paper. FDMA and TDMA Narrowband Digital Systems Information Paper FDMA and TDMA Narrowband Digital Systems Disclaimer Icom Inc. intends the information presented here to be for clarification and/or information purposes only, and care has been taken

More information

4. H.323 Components. VOIP, Version 1.6e T.O.P. BusinessInteractive GmbH Page 1 of 19

4. H.323 Components. VOIP, Version 1.6e T.O.P. BusinessInteractive GmbH Page 1 of 19 4. H.323 Components VOIP, Version 1.6e T.O.P. BusinessInteractive GmbH Page 1 of 19 4.1 H.323 Terminals (1/2)...3 4.1 H.323 Terminals (2/2)...4 4.1.1 The software IP phone (1/2)...5 4.1.1 The software

More information

Super Video Compact Disc. Super Video Compact Disc A Technical Explanation

Super Video Compact Disc. Super Video Compact Disc A Technical Explanation Super Video Compact Disc Super Video Compact Disc A Technical Explanation 1 MPEG2 on COMPACT DISC There is a market need for a standardized full digital Compact Disc based video reproduction system. The

More information

Study and Implementation of Video Compression Standards (H.264/AVC and Dirac)

Study and Implementation of Video Compression Standards (H.264/AVC and Dirac) Project Proposal Study and Implementation of Video Compression Standards (H.264/AVC and Dirac) Sumedha Phatak-1000731131- sumedha.phatak@mavs.uta.edu Objective: A study, implementation and comparison of

More information

Introduzione alle Biblioteche Digitali Audio/Video

Introduzione alle Biblioteche Digitali Audio/Video Introduzione alle Biblioteche Digitali Audio/Video Biblioteche Digitali 1 Gestione del video Perchè è importante poter gestire biblioteche digitali di audiovisivi Caratteristiche specifiche dell audio/video

More information

Clearing the Way for VoIP

Clearing the Way for VoIP Gen2 Ventures White Paper Clearing the Way for VoIP An Alternative to Expensive WAN Upgrades Executive Overview Enterprises have traditionally maintained separate networks for their voice and data traffic.

More information

Video and Audio Codecs: How Morae Uses Them

Video and Audio Codecs: How Morae Uses Them Video and Audio Codecs: How Morae Uses Them What is a Codec? Codec is an acronym that stands for "compressor/decompressor." A codec is an algorithm a specialized computer program that compresses data when

More information

Video Conferencing Glossary of Terms

Video Conferencing Glossary of Terms Video Conferencing Glossary of Terms A Algorithm A step-by-step problem-solving procedure. Transmission of compressed video over a communications network requires sophisticated compression algorithms.

More information

Sampling Theorem Notes. Recall: That a time sampled signal is like taking a snap shot or picture of signal periodically.

Sampling Theorem Notes. Recall: That a time sampled signal is like taking a snap shot or picture of signal periodically. Sampling Theorem We will show that a band limited signal can be reconstructed exactly from its discrete time samples. Recall: That a time sampled signal is like taking a snap shot or picture of signal

More information

FREE TV AUSTRALIA OPERATIONAL PRACTICE OP- 59 Measurement and Management of Loudness in Soundtracks for Television Broadcasting

FREE TV AUSTRALIA OPERATIONAL PRACTICE OP- 59 Measurement and Management of Loudness in Soundtracks for Television Broadcasting Page 1 of 9 1. SCOPE This Operational Practice is recommended by Free TV Australia and refers to the measurement of audio loudness as distinct from audio level. It sets out guidelines for measuring and

More information

Information, Entropy, and Coding

Information, Entropy, and Coding Chapter 8 Information, Entropy, and Coding 8. The Need for Data Compression To motivate the material in this chapter, we first consider various data sources and some estimates for the amount of data associated

More information

Born-digital media for long term preservation and access: Selection or deselection of media independent music productions

Born-digital media for long term preservation and access: Selection or deselection of media independent music productions Submitted on: 29.07.2015 Born-digital media for long term preservation and access: Selection or deselection of media independent music productions Trond Valberg Head Curator, Music Section, National Library

More information

4 Digital Video Signal According to ITU-BT.R.601 (CCIR 601) 43

4 Digital Video Signal According to ITU-BT.R.601 (CCIR 601) 43 Table of Contents 1 Introduction 1 2 Analog Television 7 3 The MPEG Data Stream 11 3.1 The Packetized Elementary Stream (PES) 13 3.2 The MPEG-2 Transport Stream Packet.. 17 3.3 Information for the Receiver

More information

Networked AV Systems Pretest

Networked AV Systems Pretest Networked AV Systems Pretest Instructions Choose the best answer for each question. Score your pretest using the key on the last page. If you miss three or more out of questions 1 11, consider taking Essentials

More information

Network administrators must be aware that delay exists, and then design their network to bring end-to-end delay within acceptable limits.

Network administrators must be aware that delay exists, and then design their network to bring end-to-end delay within acceptable limits. Delay Need for a Delay Budget The end-to-end delay in a VoIP network is known as the delay budget. Network administrators must design a network to operate within an acceptable delay budget. This topic

More information

Voice Over IP Per Call Bandwidth Consumption

Voice Over IP Per Call Bandwidth Consumption Over IP Per Call Bandwidth Consumption Interactive: This document offers customized voice bandwidth calculations with the TAC Bandwidth Calculator ( registered customers only) tool. Introduction Before

More information

Starlink 9003T1 T1/E1 Dig i tal Trans mis sion Sys tem

Starlink 9003T1 T1/E1 Dig i tal Trans mis sion Sys tem Starlink 9003T1 T1/E1 Dig i tal Trans mis sion Sys tem A C ombining Moseley s unparalleled reputation for high quality RF aural Studio-Transmitter Links (STLs) with the performance and speed of today s

More information

Analog Representations of Sound

Analog Representations of Sound Analog Representations of Sound Magnified phonograph grooves, viewed from above: The shape of the grooves encodes the continuously varying audio signal. Analog to Digital Recording Chain ADC Microphone

More information

How To Test Video Quality With Real Time Monitor

How To Test Video Quality With Real Time Monitor White Paper Real Time Monitoring Explained Video Clarity, Inc. 1566 La Pradera Dr Campbell, CA 95008 www.videoclarity.com 408-379-6952 Version 1.0 A Video Clarity White Paper page 1 of 7 Real Time Monitor

More information

APTA TransiTech Conference Communications: Vendor Perspective (TT) Phoenix, Arizona, Tuesday, 3.19.13. VoIP Solution (101)

APTA TransiTech Conference Communications: Vendor Perspective (TT) Phoenix, Arizona, Tuesday, 3.19.13. VoIP Solution (101) APTA TransiTech Conference Communications: Vendor Perspective (TT) Phoenix, Arizona, Tuesday, 3.19.13 VoIP Solution (101) Agenda Items Introduction What is VoIP? Codecs Mean opinion score (MOS) Bandwidth

More information

Quality Estimation for Scalable Video Codec. Presented by Ann Ukhanova (DTU Fotonik, Denmark) Kashaf Mazhar (KTH, Sweden)

Quality Estimation for Scalable Video Codec. Presented by Ann Ukhanova (DTU Fotonik, Denmark) Kashaf Mazhar (KTH, Sweden) Quality Estimation for Scalable Video Codec Presented by Ann Ukhanova (DTU Fotonik, Denmark) Kashaf Mazhar (KTH, Sweden) Purpose of scalable video coding Multiple video streams are needed for heterogeneous

More information

Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT)

Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT) Page 1 Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT) ECC RECOMMENDATION (06)01 Bandwidth measurements using FFT techniques

More information

www.aticourses.com Boost Your Skills with On-Site Courses Tailored to Your Needs

www.aticourses.com Boost Your Skills with On-Site Courses Tailored to Your Needs Boost Your Skills with On-Site Courses Tailored to Your Needs www.aticourses.com The Applied Technology Institute specializes in training programs for technical professionals. Our courses keep you current

More information

Voice over IP. Abdus Salam ICTP, February 2004 School on Digital Radio Communications for Research and Training in Developing Countries

Voice over IP. Abdus Salam ICTP, February 2004 School on Digital Radio Communications for Research and Training in Developing Countries Voice over IP Abdus Salam ICTP, February 2004 School on Digital Radio Communications for Research and Training in Developing Countries Ermanno Pietrosemoli Latin American Networking School (Fundación EsLaRed)

More information

1 Multi-channel frequency division multiplex frequency modulation (FDM-FM) emissions

1 Multi-channel frequency division multiplex frequency modulation (FDM-FM) emissions Rec. ITU-R SM.853-1 1 RECOMMENDATION ITU-R SM.853-1 NECESSARY BANDWIDTH (Question ITU-R 77/1) Rec. ITU-R SM.853-1 (1992-1997) The ITU Radiocommunication Assembly, considering a) that the concept of necessary

More information

Calculating Bandwidth Requirements

Calculating Bandwidth Requirements Calculating Bandwidth Requirements Codec Bandwidths This topic describes the bandwidth that each codec uses and illustrates its impact on total bandwidth. Bandwidth Implications of Codec 22 One of the

More information

Alarms of Stream MultiScreen monitoring system

Alarms of Stream MultiScreen monitoring system STREAM LABS Alarms of Stream MultiScreen monitoring system Version 1.0, June 2013. Version history Version Author Comments 1.0 Krupkin V. Initial version of document. Alarms for MPEG2 TS, RTMP, HLS, MMS,

More information

A TOOL FOR TEACHING LINEAR PREDICTIVE CODING

A TOOL FOR TEACHING LINEAR PREDICTIVE CODING A TOOL FOR TEACHING LINEAR PREDICTIVE CODING Branislav Gerazov 1, Venceslav Kafedziski 2, Goce Shutinoski 1 1) Department of Electronics, 2) Department of Telecommunications Faculty of Electrical Engineering

More information

Appendix C GSM System and Modulation Description

Appendix C GSM System and Modulation Description C1 Appendix C GSM System and Modulation Description C1. Parameters included in the modelling In the modelling the number of mobiles and their positioning with respect to the wired device needs to be taken

More information

(Refer Slide Time: 4:45)

(Refer Slide Time: 4:45) Digital Voice and Picture Communication Prof. S. Sengupta Department of Electronics and Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 38 ISDN Video Conferencing Today we

More information