COMP VoIP 17/2/2013

University of Manchester School of Computer Science
Comp18112: Foundations of Distributed Computing 2013
Voice over Internet Protocol (VoIP)
(An application of Distributed Systems)
Barry Cheetham

1. Introduction

A Distributed System (DS) may be described as a hardware and software system with more than one processing or storage element, concurrent processes, or multiple programs running collaboratively. A personal computer (PC) with video card, sound card and other processes running collaboratively in parallel is therefore a distributed system. The processes are interconnected by wires and may be synchronised to a common LOCAL clock. In practice, the processes are often not all synchronised to a single local clock; there may be several clocks.

The Internet and e-mail are examples of a DS where the functionality is split into parts that run simultaneously on multiple computers communicating over packetised network links. Varying delays and unpredictable failures occur, and the possibility of synchronising all processes to a common clock is not assumed.

Communications and synchronisation are vital aspects of distributed systems. Where computer networks provide the interconnect, delays and unpredictable failures must be anticipated, and ways must be found of co-ordinating actions that may be occurring tens of metres or thousands of miles apart.

E-mail transmits and receives streams of text and is therefore a form of stream oriented communications. E-mail uses asynchronous transmission mode, which means that there are essentially no delay constraints. If an e-mail arrives many seconds, many minutes or even many hours late, it is normally accepted and still useful. We do not normally reject messages just because they arrive late. However, e-mail messages must be reliable in that there must be absolutely no errors. The naming and location of users is needed, as described in earlier lectures. Computer networks are ideal for e-mail.

Voice over IP (VoIP) telephony generates streams of voice samples, normally two streams (one in each direction), so it is also stream oriented. It employs synchronous transmission mode, which means that a maximum allowed delay is imposed by the application. Samples that arrive beyond that maximum allowed delay are useless and must be discarded. VoIP telephony imposes quality of service (QoS) demands on the interconnect and is therefore often described as a QoS application and also a real time application. The same considerations apply to audio/visual telephony and video-conferencing.

This section on VoIP telephony exemplifies the problems that occur when distributed systems are required to process and distribute information that is real time and sensitive to excessive delay. All telephony will soon be VoIP, despite the fact that IP is not really ideal for this application.

2. Voice

Voice, like music, is sound. Sound is variation in air pressure which travels as a wave from speaker to listener at approximately 340 metres per second, or about 1200 km/hour. Voice sounds are produced either by vocal cords vibrating as air is forced through them by the lungs of a human
speaker or by turbulent air flow as air is forced through the glottis and the mouth. When it reaches a human listener, the wave causes an ear-drum to vibrate in sympathy to produce vibration patterns that are processed by the brain. A microphone converts the same pressure variation to voltage variation, and this copy or analogue of the pressure variation may then be conveyed along wires. A sound waveform is a graph of voltage (representing air pressure) against time as illustrated below:

[Figure: sound waveform, volts against time t]

This is similar to the type of waveform produced by vibrating vocal cords, i.e. voiced speech which would be heard as a vowel (A, E, I, O, U, etc.). It is not a sine-wave. It is the sum of many sine-waves of frequency f, 2f, 3f, 4f, 5f, etc., where f is known as the fundamental frequency in Hertz (Hz or cycles/second). If f = 500 Hz, we would have sine-waves of frequency 500 Hz, 1 kHz, 1.5 kHz, 2 kHz, 2.5 kHz, and so on. We may ask how many sine-waves there are in total, or how high in frequency they go. How many could we actually hear?

3. Bandwidth

It is generally assumed that humans can hear up to 20 kHz. Recorded music on CDs has bandwidth 50 Hz to 20 kHz and, in principle, speech has the same bandwidth. Telephone quality (narrow-band) speech is filtered to the range 300 Hz to 3.4 kHz with some loss of naturalness but not intelligibility. Wideband speech filtered to the range 50 Hz to 7.2 kHz sounds better. Bandwidth means frequency range, not bit-rate!

4. Plain old-fashioned telephone system (POTS)

Originally, analogue transmission was used for all telephone speech. Wires carried speech voltage waveforms, and were connected manually, by switch-board operators, to form a circuit between two users. The use of analogue transmission meant that the waveforms travelled along the wires with negligible delay, i.e. approximately 1 ms per 100 miles (1 ms = 1/1000 second). The circuit thus set up remained connected until the end of the call.
This was circuit switching and it was connection orientated. Voice digitisation and exchange-to-exchange digital transmission began in the 1960s, and nowadays most telephone speech is transmitted digitally. However, the concept of circuit switching with low delay has remained in telephony to the present day. Analogue transmission is still used for the "last mile" (or the "first mile") between local telephone exchanges and homes.

5. Digitisation of speech & music

The voltage waveforms produced by speech or music are digitised by taking regular samples and converting them to binary numbers of an appropriate word-length. A famous theorem known as the Sampling Theorem tells us that if the signal bandwidth is 0 to B Hz, we must sample at more than 2B samples/second (Hz) to obtain a faithful representation of it. Music (50 Hz to 20 kHz) is sampled at 44.1 kHz with 16 bits per sample (stereo) to obtain CD quality recordings. Narrowband telephone speech (300 Hz to 3.4 kHz) is often sampled at 8 kHz with 8 bits per sample to obtain 64 kb/s log-PCM (ITU-G711). This is the basis of pulse code modulation (PCM) with logarithmic compression. The logarithmic compression means that small samples are digitised more accurately than larger ones. ITU-G711 is a very famous standard for speech digitisation and it is universally used in wired telephony and VoIP.

6. Other standards for speech digitisation

The 64 kb/s required by ITU-G711 is too high for mobile telephony and often for VoIP. Other ITU standards exist, for example G726 (32 kb/s), G728 (16 kb/s), G729 (8 kb/s) and G723.1 (5.3 kb/s). Speech compression and decompression is applied by distributed processing. The compression is lossy (unlike zip or rar). Mobile phones use a 9.6 kb/s speech digitisation standard. VoIP often uses the same ITU standards.

7. PC to PC voice link

Assume 2 PCs are linked by an ideal connection allowing voice samples to be sent in either direction. Assume that on each PC, an A to D converter (on a sound card) samples speech from a microphone to provide a single 16-bit sample when requested by the CPU. The CPU requests samples at intervals of 1/8000 second, and compresses and sends off each sample via the connection. Both CPUs do this simultaneously. Each CPU receives samples, at intervals of 1/8000 second, from the other side. It decompresses them and sends them directly to the D to A converter to produce sound.
This is simple, but probably not viable because it relies on the CPU at either end to control the timing of the sampling and D to A conversion processes. It is impractical because the CPUs have many parallel tasks to perform and cannot be relied upon to be available at any specific point in time. At the precise instant when a new sample arrives, the CPU might be busy doing something else. This applies to normal operating systems such as Linux, Windows, etc.

8. Buffers

In practice, sound cards control their own sampling rates using independent crystal controlled clocks. The sound processing is therefore distributed between intelligence on the sound card and the CPU. Sound cards must have buffers to store sections of their inputs and outputs. A buffer is an array or block of storage. On a sound card, an output buffer would be filled periodically by the CPU and emptied at a regular rate of, say, 8000 samples per second as required to feed the digital to analogue converter. This is like a leaky bucket, with a hole in the bottom, being filled periodically by turning a water tap on and off.

[Figure: output buffer as a leaky bucket on the sound card. The CPU supplies the water and controls the tap, turning it on when the CPU is available and off when it is busy. The bucket empties into the D-A converter and speaker at a regular rate, say 8000 drops/s. Being a little late or early is not critical now, but the bucket must not empty or overflow.]

Also, an input buffer would be filled at a regular rate by the analogue to digital converter and emptied periodically by the CPU. This is like a bucket being filled by a constantly dripping water source and being emptied from time to time by the CPU.

[Figure: input buffer as a bucket filling with water from the microphone and A-D converter at a regular rate, say 8000 drops/s. The CPU turns the outlet tap on when it can receive water, and off otherwise.]

In both cases, the CPU being a little late or early to fill or empty a bucket is not so critical now. However, the output buffer must never be allowed to become completely empty, otherwise the D to A converter would have no samples to convert and would therefore fail catastrophically. Also the
input and output buffers must never become so full that they overflow and bits are lost. Imagine a low water mark and a high water mark drawn on the water bucket.

9. Problems caused by lack of synchronisation

Now consider the accuracy of the sampling rates of the sound cards at either end of a voice communication session. The sampling rates are controlled independently by crystals accurate to about 0.01%. A sampling rate that is intended to be 8000 Hz could actually be 8001 Hz or 7999 Hz. We would never hear the difference when listening to speech with such a slight inaccuracy in the sampling rate. So why do we have to worry about it? What problems could be caused by this inaccuracy?

Suppose one sound card samples at 8001 Hz and the other at 7999 Hz. The host with the slower clock then receives two more samples per second than it can play out. The extra samples gradually fill up the output buffer and will eventually cause it to overflow. After ten minutes, 1200 extra samples will have been received that cannot be output to the D to A converter. Also, as the buffer fills up there will be increasing delay, which becomes 300 ms after 20 minutes and is likely to be unacceptable for speech telephony. The host with the faster clock will be receiving fewer samples than it is sending to its D to A converter and will eventually run out of samples (buffer underflow).

This is a fundamental problem with distributed systems when real time processing is required. It is solvable by monitoring buffer levels and intelligently discarding single samples, or creating extra ones, now and then to keep buffer levels between their low and high water marks. Discarding or duplicating a single sample during periods of quiet speech is likely to be imperceptible, whereas doing this for a whole block of samples is likely to be very annoying.

A similar problem occurs with atomic clocks because the Earth's rotation is slowing down. About 300 million years ago, there were (according to Tanenbaum DS p. 245) 400 days per year.
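
Returning to the sampling-rate mismatch described in this section, the quoted figures (1200 extra samples after ten minutes, 300 ms of delay after twenty) can be checked with a few lines of Python. This is only an illustrative sketch: the function names `drift_report` and `water_mark_action`, and the water-mark values used in the example, are assumptions made here and do not come from any sound-card API.

```python
def drift_report(nominal_hz, sender_hz, receiver_hz, seconds):
    """Return (surplus samples, extra delay in ms) built up at the receiver.

    sender_hz is the rate at which the far end produces samples and
    receiver_hz is the rate at which the local D to A converter consumes them.
    """
    surplus = (sender_hz - receiver_hz) * seconds
    delay_ms = 1000.0 * surplus / nominal_hz
    return surplus, delay_ms


def water_mark_action(level, low_mark, high_mark):
    """Hypothetical policy: keep the buffer level between the two marks."""
    if level > high_mark:
        return "discard one sample"
    if level < low_mark:
        return "duplicate one sample"
    return "no action"


print(drift_report(8000, 8001, 7999, 10 * 60))  # (1200, 150.0)
print(drift_report(8000, 8001, 7999, 20 * 60))  # (2400, 300.0)
print(water_mark_action(900, low_mark=100, high_mark=800))
```

The single-sample adjustment is chosen because, as noted above, dropping or repeating one sample during quiet speech is unlikely to be heard.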
Universal co-ordinated time (UTC, not UCT) counts ticks from highly accurate atomic clocks, but to accommodate the slowing down, it needs to add approximately 1 ms every 13 hours. In practice a leap second is added about every 18 months. Network Time Protocol (NTP) sends time to your PC, accurate to within 1 to 50 ms, but it has to be quite clever to achieve this accuracy.

10. Voice connection via network

Replacing the ideal connection by a network link introduces new problems. Computer networks convey data in packets. Delays and imperfections in the links must be expected. The person to be contacted may be far away, and we may not even know where. We need a way of setting up and maintaining communications with acceptable voice quality and round trip delay. These are the problems of VoIP. Firstly, let's revisit the concept of protocol layers.

11. TCP/IP Protocol Layers

The transmission and receiving of data over networks is arranged in layers, the principle being to separate the complexity involved into independent units that are easier to design and understand. Assume that an application on one host wishes to convey data to a corresponding application on another host that may be many miles away. The application uses software provided by a lower layer (within the operating system) which sends messages to a corresponding lower layer on the other machine according to an agreed method or protocol. The agreement includes the structure or format of the packet that will convey the data. The agreed format consists of a header and the data itself. Instead of communicating directly with the other host, the lower layer uses software in even lower layers, each with its own agreed protocol and packet format.

The principle is best illustrated by looking at the most widely used structure of layers, which is known as the TCP/IP layered model. This may be thought of as having five layers, as shown below, but the lower two layers are often combined into one and described as the host-to-network (H to N) layer.

[Figure: TCP/IP protocol stacks. Host 1 and Host 2 each have Application, Transport, Network (IP), Data Link and Physical layers; the routers between them have only IP, Data Link (DLL) and Physical (Phy) layers.]

12. The 7-layer OSI reference model

This is an alternative definition of network layers seen in many textbooks.

7) Application Layer
6) Presentation Layer
5) Session Layer
4) Transport Layer
3) Network Layer
2) Data Link Layer
1) Physical Layer

TCP/IP compared with the OSI Reference Model:

OSI Reference Model      TCP/IP Reference Model
Application Layer        Application Layer
Presentation Layer       Application Layer
Session Layer            Application Layer
Transport Layer          Transport Layer
Network Layer            Internet (IP) Layer
Data Link Layer          Host-to-Network or Link Layer
Physical Layer           Host-to-Network or Link Layer

13. Examples of protocols in each layer

Application layer: HTTP, POP3, SMTP, DHCP, DNS, IMAP4, TELNET, FTP, SIP.
Transport layer: TCP, UDP, RTP, RTCP.
Network layer: IP (Versions 4 & 6).
Data-link layer: Ethernet, IEEE 802.11, etc.
Physical layer: Ethernet for wired LANs, IEEE 802.11 for wireless LANs, PPP for modems, RS232, etc.

Looking at the packet formats defined by the TCP/IP model, moving downwards from the application, each layer adds an extra header to what came before, as shown. The DLL layer adds a trailer as well.

Application layer:     Message (AL)
Transport layer:       T | Message (AL)
Network (IP) layer:    N | T | Message (AL)
Data link layer (DLL): D | N | T | Message (AL) | D
Physical (Phy) layer:  P | D | N | T | Message (AL) | D

To study this stack of layers, packet formats and protocols, instead of starting at the top or the bottom, it is convenient to start with the most important and fundamental layer, which is the network (or IP) layer.

14. TCP/IP network layer

TCP/IP network layer protocols deal with addressing and routing of IP packets (often referred to as datagrams), which have the following agreed format:

IP Header (20 bytes minimum) | IP Data (variable length)

Header fields, in order: Vers, IHL, Type, Length, ID, Frag, TTL, Prot, Check, Source, Dest.

The IP header contains the following information:

Vers: IP version number (4 or 6).
IHL: Header length (in 32-bit words).
Type: Can choose between fast delivery & reliability (look this up if you are interested).
Length: Overall datagram length.
ID & Frag: Allow long packets to be broken up into smaller ones (fragmentation) and the fragments to be identified and recombined. (Don't worry about this for now.)
TTL: Time to live (8 bits), see later.
Prot: Specifies the transport layer process required (TCP, UDP, etc.).
Check: A check-sum for the header (16 bits), see later.
Source, Dest: Source and destination IP addresses (32 bits each).

Note, first of all, that IP addresses originate at this layer (unsurprisingly). The host's IP address (source) and the destination IP address are placed in each IP packet. The TTL and Check fields
deserve some brief explanation, but the other fields are either self-explanatory or need not be studied at this stage.

The time-to-live (TTL) is an 8-bit binary number which is decremented by one each time the datagram is read by a router. If this number ever reaches zero, the datagram is discarded. This time-to-live mechanism is widely used in distributed systems, and here it eliminates the possibility of a datagram being passed endlessly among a number of routers due to some error in the routing procedure.

Traditional Internet Protocol (IP) provides connectionless communication where IP datagrams are conveyed independently by routers towards their destination IP addresses. IP is the fundamental interconnect mechanism of the Internet and many private networks. The service is unreliable as there are no guarantees about correct delivery. Datagrams may be delayed, damaged, lost or arrive out of order. Different routes may be taken to the same destination. The IP packet has no sequence numbers and no time-stamp, and there are no port numbers in the IP header. These items must be provided by higher layers to allow transmission problems to be detected and corrected, and to identify applications.

15. Check-sum

A check-sum is a number of extra bits included in a bit-stream to allow the receiver to detect bit-errors that may have occurred during transmission. The term checksum sometimes means cyclic redundancy check (CRC), and it could also mean the number of 1s in a bit-stream. The simplest example of a check-sum is the parity-bit that is often placed at the end of a binary number to make the total number of ones even (parity = 0) or odd (parity = 1). If the parity is made even at a transmitter and is found to be odd at a receiver, there has clearly been an odd number of bit-errors; an even number of bit-errors would not be detected.
A similar conclusion may be drawn if the parity is made odd at the transmitter and found to be even at the receiver. The parity of any sequence of bits, $b_1, b_2, \dots, b_N$ say, is just the exclusive-or of all the bits, i.e. $b_1 \oplus b_2 \oplus b_3 \oplus \dots \oplus b_N$.

In IP packet headers, a 16-bit checksum is used and is defined as "the ones' complement inverse of the ones' complement sum of all 16-bit words". What on earth could this mean? Consider three 8-bit numbers whose normal sum is a nine-bit number, i.e. an 8-bit number with a carry-out bit of 1. Add the carry-out bit back in to the 8-bit number, then produce the ones' complement inverse by inverting all the bits. The result is the required 8-bit checksum. The same idea is used for the IP header check-sum, except that we have 16-bit words. The significance of ones' complement arithmetic can be disregarded here.

The important point is that if we apply the same checksum procedure at the receiver as we have just performed at the transmitter and get a different answer, we know there has been a transmission error. If we get the same answer, we cannot be sure the data is correct, as many combinations of 16-bit numbers can produce the same sum. But we have some confidence that it is correct.

The 16-bit checksum allows errors in the IP header to be detected at routers and at the receiver. If bit-errors occur, the datagram is discarded. Errors are surprisingly rare in wired networks but they do occur. Bit-errors occur frequently in wireless networks. Note that the header is actually changed by each router when it decrements the time-to-live counter. Other changes to the header can also be made by routers. If the header changes, its check-sum must be recalculated. Note that the IP packet format has no checksum for bit-error checking of the payload; only the header is protected.
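
The parity and ones' complement procedures described in this section can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names and the three 8-bit example values are chosen here purely for demonstration and are not taken from the notes. Setting `bits=16` gives the word size used for the IP header.

```python
def parity(bit_sequence):
    """Exclusive-or of all the bits: 0 for an even number of ones, 1 for odd."""
    result = 0
    for bit in bit_sequence:
        result ^= bit
    return result


def ones_complement_checksum(words, bits=16):
    """Ones' complement inverse of the ones' complement sum of the words."""
    mask = (1 << bits) - 1
    total = 0
    for word in words:
        total += word & mask
        # Add any carry-out bit back in (end-around carry).
        total = (total & mask) + (total >> bits)
    return ~total & mask  # invert all the bits


# Hypothetical 8-bit example values (assumed for illustration only).
words = [0b10110101, 0b01101100, 0b11000011]
checksum = ones_complement_checksum(words, bits=8)
print(format(checksum, "08b"))  # 00011010

# A receiver repeats the procedure; a different answer reveals an error.
corrupted = [0b10110100, 0b01101100, 0b11000011]
print(ones_complement_checksum(corrupted, bits=8) == checksum)  # False
print(parity([1, 0, 1, 1]))  # 1
```

For these example values the normal sum is the nine-bit number 111100100; adding the carry-out back in gives 11100101, and inverting all the bits gives 00011010.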

A cyclic redundancy check (CRC), often also called a checksum, is more powerful. To illustrate the concept, suppose we need to transmit the decimal number 139. Divide it by 7 in integer arithmetic and express the remainder in binary. We obtain 19 with remainder 6, or 110 in binary. Use 110 as the check-bits. The same division is done at the receiver, and if we get a different remainder, we know that a bit-error has occurred. Note that exactly 3 check-bits are always produced when we divide by 7. The generator number 7 is agreed in advance and carefully chosen. Again, not all combinations of bit-errors are detectable by this method: any combination that adds or subtracts a multiple of 7 is not detected. In practice, CRCs use a much higher number than 7 and do not divide in normal decimal arithmetic, but in exclusive-or arithmetic. But the idea is similar. A parity check is a 1-bit CRC capable of detecting the occurrence of an odd number of bit-errors. Sixteen and 32 bit CRCs are commonly used.

16. Data-link Layer (DLL)

The DLL has its own agreed packet formats and protocols for sending and receiving DLL packets using the services of the physical (Phy) layer below. The Phy layer sends voltage pulses representing 1s and 0s along wires or across wireless connections, and here is the source of the bit-errors that make links unreliable. The data link layer has the responsibility for detecting, and where possible correcting, bit-errors, and also for medium access control (MAC) when connection channels are shared among many users. Ethernet can share one wire between many users, using an elegant carrier sensing multiple access (CSMA) mechanism implemented within the data-link layer. Medium access control (MAC) requires collision detection, collision avoidance and randomised back-off times when collisions occur (as explained by Steve in a previous lecture).
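
The divide-by-7 illustration of a CRC given earlier, and the exclusive-or division that real CRCs use in its place, can both be sketched in a few lines of Python. The function names are illustrative, and the 4-bit generator 1011 and the data word are small example values assumed here; practical CRCs such as CRC32 use much longer generators.

```python
def decimal_check_bits(value, generator=7, width=3):
    """Remainder after ordinary integer division, written as check-bits."""
    return format(value % generator, "0{}b".format(width))


def mod2_remainder(value, generator):
    """Remainder after division in exclusive-or (modulo-2) arithmetic."""
    gen_len = generator.bit_length()
    remainder = value
    while remainder.bit_length() >= gen_len:
        # Align the generator under the leading 1 and exclusive-or it away.
        remainder ^= generator << (remainder.bit_length() - gen_len)
    return remainder


def crc_bits(data, generator):
    """Check-bits for data: shift to make room, then take the remainder."""
    return mod2_remainder(data << (generator.bit_length() - 1), generator)


print(decimal_check_bits(139))      # 110
print(decimal_check_bits(139 + 7))  # 110, so this error goes undetected

generator = 0b1011                  # assumed example generator
data = 0b11010011101100             # assumed example data
codeword = (data << 3) | crc_bits(data, generator)
print(mod2_remainder(codeword, generator))             # 0: accepted
print(mod2_remainder(codeword ^ 0b100, generator) != 0)  # True: error detected
```

The receiver's test is simply that the whole codeword, data plus check-bits, divides by the agreed generator with zero remainder.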
The IEEE 802.11 Wi-Fi data-link layer has a MAC mechanism, very similar to that used by Ethernet, for sharing the capacity of a single wireless channel. The Ethernet (IEEE 802.3) DLL packet format is as follows:

Field       Size
MacAddr1    48 bits
MacAddr2    48 bits
Len/Type    16 bits
IP data     Variable
CRC32       32 bits

This format has 48-bit source & destination MAC (Phy) addresses, the length or type of packet, and a 32-bit cyclic redundancy check (CRC32), which is a sort of checksum allowing bit-errors to be detected. The IEEE 802.11 DLL format for Wi-Fi is similar except that it has 4 MAC addresses. In both cases, the data can include extra bits for forward error correction (FEC).

17. Physical layer

The Phy layer sends voltage pulses representing 1s & 0s by wire or radio. A preamble consisting of seven bytes of 10101010 is first sent to allow the receiver to synchronise to the transmission, and then a start of frame (SOF) code, 10101011, is sent to allow the start of the payload data to be identified. The data then follows as illustrated below:

Preamble | SOF | DLL data

When communicating over wire, Ethernet uses Manchester Coding as illustrated below, instead of straightforward rectangular pulses. A 1 is represented by a positive followed by a negative voltage pulse, whereas a 0 is represented by a negative followed by a positive voltage pulse. This
really was invented here at Manchester University. What do you consider to be the advantages and disadvantages of this type of signalling in comparison to straightforward rectangular pulses?

[Figure: Manchester-coded waveform, volts against time t]

18. Transport layer

Looking upwards from the network layer now, the transport layer has protocols which use the IP layer below to achieve packetised data transfer in a way which is suitable for particular application layer protocols. The most important are:

- TCP: Transmission Control Protocol
- UDP: User Datagram Protocol

Two others, strongly related to UDP but adapted to real time applications such as VoIP, are:

- RTP: Real-time Transport Protocol
- RTCP: RTP Control Protocol

(RTP and RTCP are sometimes considered to be in the application layer.)

19. Transport layer protocol: TCP

TCP makes use of IP to provide connection-oriented reliable transmission. This protocol is suited to data which cannot tolerate any bit-errors but can tolerate some delay. TCP introduces port numbers for distinguishing data streams, sequence numbers and an over-all check-sum within its 20-byte header. Reliability is achieved by a mechanism for acknowledging correct receipt and re-transmitting packets when necessary. Since this incurs delay and increases congestion, TCP is not ideally suited to VoIP. The format of a TCP packet (or "segment") is as follows:

TCP header field   Meaning
Port1, Port2       Source and destination port numbers (16 bits each)
SEQ, ACK           Sequence no. & Ack no.
Length             Header length
Flag               Urg, Ack, Psh, Rst, Syn, Fin
WS                 Window size (for flow control)
Check              Checksum over all
Etc.               Other stuff (variable)

The header is followed by the application layer (AL) payload, which is of variable length.

TCP is a two-way protocol capable of efficiently conveying data in both directions between two hosts. The acknowledgement mechanism is quite clever in that each TCP packet may be both an acknowledgement packet and a data carrying packet. Host 1 can send data to Host 2, then Host 2 can acknowledge the data and at the same time send some data back to Host 1. Setting a one-bit
flag, called the Ack-flag, to 1 rather than 0 makes the TCP packet an acknowledgement packet which may, or may not, carry data. There are other one-bit flags, including the Syn flag, which is set to 1 to request (Ack=0) or acknowledge and accept (Ack=1) the setting up of a TCP connection. A Fin flag is set to 1 to request the release of a TCP connection. Other TCP fields are as follows:

SEQ: The 32-bit index of the first byte in this packet (TCP numbers every byte).

ACK: A 32-bit index for acknowledging receipt of one or more packets. When the one-bit Ack-flag is set to make the packet an acknowledgement packet as well as a data packet, the TCP packet is sending the following message: "I acknowledge receipt of all bytes up to, but not including, this ACK index, so please send the next packet." Note that the ACK index and the Ack-flag are not the same thing.

WS: Window-Size specifies how many bytes the host is willing to accept before sending an acknowledgement packet. This is flow control. If WS were set for just 1 packet, the other host would have to wait for an acknowledgement before sending another packet. This one-packet-at-a-time acknowledgement scheme would work, but it would be slower than necessary. It is better to make WS larger, so that many packets are acknowledged at once.

Port numbers

Port numbers are introduced by transport layer protocols such as TCP. They are addresses which identify protocols & applications. They allow the receiver to distinguish between different data streams. Examples are:

Port   Protocol   Application
21     FTP        File transfer
23     Telnet     Remote log-in
25     SMTP       Send e-mail
143    IMAP       Receive e-mail
80     HTTP       WWW
443    HTTPS      Secure web
110    POP-3      Receive e-mail

Three types of addresses

We have seen 3 types of addresses now. Remember that:

IP addresses (32 bits) are global.
MAC (Phy) addresses (48 bits) are local.
PORT numbers (16-bit addresses) identify applications.

Consider the analogy shown below.

[Figure: your message, in nested envelopes labelled with the port no., my IP address and the MAC address of your router, travels from your LAN through routers (R) to my LAN. It is received at my MAC address and opened with the application identified by the port no.]

You wish to send me a message, and it is written using an App such as a word-processor. The message is put in a green (Transport Layer) envelope with a port address written on it (to identify the type of App that will be required to open it). The green envelope (with its pink contents) is then put into a yellow (Network Layer) envelope with my IP address written on the front. You may have to consult a DNS server to find my IP address. Now put the yellow envelope into a blue (Data Link Layer) envelope and write on it the MAC address of your router.

The blue DLL envelope is opened by your router to find my IP address, and your router then decides which intermediate router it should send the envelope to in order to send it on its way to me. It writes the MAC address of the intermediate router on the blue envelope. The intermediate router opens the blue envelope, finds my IP address and re-addresses the blue envelope to another intermediate router, and so on from router to router until it reaches the router on my local area network (LAN). When it reaches the router on my LAN, it is re-addressed and sent to my MAC address. I can then open the yellow envelope, read the port number on the green envelope, and send the pink message inside to the right App for reading it.

How does TCP achieve reliable & connection orientated transmission?

A short answer is that reliability is achieved by a mechanism for acknowledging correct receipt of packets (sometimes called segments) and retransmitting them when necessary. Connectivity is achieved by requiring users to exchange state information, such as sequence numbers, which is remembered at both ends of a connection until the connection is terminated. Let's now fill in a bit of detail.

TCP communication mechanism

[Figure: TCP packets exchanged between Host 1 and Host 2, with time running down the page]

Connect:
  Host 1 to Host 2: Syn=1, Ack=0, SEQ=N
  Host 2 to Host 1: Syn=1, Ack=1, SEQ=K, ACK=N+1
  Host 1 to Host 2: Syn=0, Ack=1, SEQ=N+1, ACK=K+1 (with data)
Synchronised (exchanging data, not necessarily packet by packet):
  Host 2 to Host 1: Syn=0, Ack=1, SEQ=K+1, ACK=N+2
  Etc.
Release:
  Each host sends a Fin, which the other acknowledges with a Fin Ack, and then releases.

Host 2 may not want to release until he is sure Host 1 has released. Has his Fin Ack been received? This is the two army problem.

Each byte in a TCP payload is numbered, so the header has a 32-bit sequence number which actually specifies the sequence number of the first byte in the payload. There is also a 32-bit acknowledgement number whose purpose we shall now illustrate by considering an exchange of TCP packets between two hosts.

Host 1 sends a packet with the index of the first payload byte as its sequence number, N say, and then sets a timer. If Host 2 receives the packet, it sends a return packet (with or without data in its payload) with the Ack flag set to say it is an acknowledgement. It puts a binary number, M+1 say, in its 32-bit acknowledgement number to say: "I've got bytes up to and including byte M." It also puts a 32-bit sequence number, K say, for the first data byte of its payload (assuming it is returning some data) into the header. TCP is a 2-way communication protocol, designed for sending data in both directions between two users.

Host 1 waits for an acknowledgement from Host 2. If the acknowledgement is not received in time, it resends the packet. (Remember: packets may be lost or duplicated.) This mechanism will work fine, but it will be slow if Host 1 has to wait quite a long time for each acknowledgement from Host 2. We can speed up the communication by allowing Host 1 to send several packets in succession without waiting for an acknowledgement of each one. Host 2 could then be allowed to acknowledge several packets at once. The procedure above caters for this already with its use of the 32-bit acknowledgement number.
Host 2 just has to acknowledge receipt of all bytes up to a certain byte number, which could cover several packets. Clever! TCP also does congestion control and flow control. For example, a host can say "Please don't send me any more packets for a while!" Detail omitted for now.
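
The cumulative acknowledgement idea can be sketched in Python as follows. This is a simplified illustration, not TCP itself: the function name `cumulative_ack` is invented here, segments are represented as hypothetical (sequence number, length) pairs, and it is assumed that segments never partially overlap and that sequence numbers do not wrap around.

```python
def cumulative_ack(next_expected, segments):
    """Return the ACK number after receiving the given segments.

    next_expected is the index of the first byte not yet received.
    segments is an iterable of (seq, length) pairs in arrival order.
    The result is one more than the last byte received without a gap.
    """
    received = {}
    for seq, length in segments:
        if length > 0:
            # A duplicate of an earlier segment changes nothing.
            received[seq] = max(length, received.get(seq, 0))
    ack = next_expected
    while ack in received:
        ack += received[ack]
    return ack


# Three 100-byte segments arrive, but the one starting at byte 1200 is lost.
print(cumulative_ack(1000, [(1000, 100), (1100, 100), (1300, 100)]))  # 1200

# After retransmission, one acknowledgement covers all four segments.
print(cumulative_ack(1000, [(1000, 100), (1100, 100), (1300, 100), (1200, 100)]))  # 1400
```

In the first case the receiver keeps saying "I've got bytes up to and including byte 1199", which tells the sender exactly where the gap is once its timer expires.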

The Fin mechanism for tearing down a connection is straightforward, but what happens if the Fin acknowledgement is lost in the network? The problem of catering for this eventuality is often illustrated by the "Two Army Problem" (see the Computer Networks ed4 textbook by A Tanenbaum (pub 2003)).

Two army problem

A blue army has two equal divisions camped at either side of a valley occupied by a white army. The blue army has superior numbers overall, but must send messengers unreliably across the valley to decide when to attack the white army. (The messengers may be captured, poor chaps.) If only half the blue army attacks, because the other half has not received an appropriate message or acknowledgement, it will be defeated. How can each half be sure the other half has received a message to attack? How can each host be sure it's time to shut down (attack) the connection when messages and acknowledgements cannot be guaranteed? You may like to think about this.

20. Transport layer protocol: UDP

UDP is simpler than TCP, connectionless and unreliable. It is a "fire and forget" protocol. It simply encapsulates the following UDP datagram within the payload of an IP datagram:

Field             Size
Source port no.   16 bits
Dest port no.     16 bits
Length            16 bits
Check             16 bits
Data              Variable

UDP is widely used for applications which do not need or cannot wait for acknowledgements and retransmissions, or where the increased congestion caused by such things would be unacceptable. The unreliability of UDP is not such a problem for VoIP because voice is not quite as sensitive as data to bit-errors and lost packets. An occasional bit-error might hardly be noticed in a voice stream. Also, the loss of a whole packet can be concealed, to a degree, by filling in the gap with a waveform segment that looks and sounds approximately right, perhaps because it resembles the previous segment. This is packet loss concealment (PLC), and is possible because speech is, to a degree, predictable, especially when people speak slowly.
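
A very crude form of packet loss concealment, which fills each gap with a quieter copy of the previous segment, can be sketched as below. This is only an assumed illustration of the idea: the function name, the representation of a lost packet as `None` and the fade factor of 0.5 are choices made here, and real speech codecs use considerably more elaborate methods.

```python
def conceal_losses(frames, frame_len, fade=0.5):
    """Replace each lost frame (None) by an attenuated copy of the last one.

    frames is a list of frames, each a list of frame_len integer samples.
    Consecutive losses fade further towards silence.
    """
    output = []
    previous = [0] * frame_len  # silence if the very first frame is lost
    for frame in frames:
        if frame is None:
            frame = [int(sample * fade) for sample in previous]
        output.append(frame)
        previous = frame
    return output


received = [[100, -100], None, None, [50, 60]]
print(conceal_losses(received, frame_len=2))
# [[100, -100], [50, -50], [25, -25], [50, 60]]
```

Fading the repeated segment is one way of avoiding an obviously "stuck" sound when several packets in a row are lost.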
The UDP datagram header also has port numbers, like TCP. There is also a length indication for the whole UDP packet, and a 16-bit checksum over the whole packet (rarely used in wired links in practice).

21. Real-time Transport Protocol (RTP)

UDP still has some problems for VoIP, which are remedied by the real-time transport protocol (RTP). Since UDP datagrams may be lost, damaged or re-ordered, the receiver must know when this happens. RTP was designed to allow this information to be determined. Given a block of speech samples to be sent, RTP adds a time-stamp, a sequence number, and some other things to this data and places the resulting bit-stream in the payload of a UDP packet. The UDP packet is then encapsulated inside an IP datagram as usual, and the IP datagram is eventually transmitted across the IP network. At the receiver, the sequence numbers introduced by RTP allow the application to re-order any datagrams received out of sequence, and to recognise the need for
packet-loss concealment when a datagram is found to be missing. Duplicate packets can also be recognised and eliminated by the same mechanism.

[Figure: an RTP packet carried in the UDP payload, after the 64-bit UDP header. RTP fields: Info, P-Type, SEQ, TimeStamp, Source Id, etc., followed by the video/audio stream.]

The time-stamp is for the first sample in the RTP packet and specifies the number of clock ticks from the start of the conversation as observed by the transmitter. It is not an actual time read from our PC clock and there is no attempt to synchronise clocks at transmitter and receiver. Absolute time-stamps have no meaning; only differences are important. The number of ticks/second is equal to the sampling frequency, which may be, say, 8000 Hz. Therefore there will be 8000 ticks/second for VoIP telephony with G.711 (A-law PCM). The receiver cannot measure one-way delay from time-stamps.

The RTP time-stamp allows different voice and video streams to be synchronised since we will know exactly when each packet was generated according to the transmitter's clock. Note that the receiver's clock will probably not correspond exactly to the transmitter's clock, but this does not matter as it is only the synchronisation of data from the transmitter that is needed.

Remember that in distributed systems, clocks at different locations will usually show slightly different times. If they did try to become synchronised by accessing one single clock at a common reference point, the geographical distance between the reference point and each location would be different. Therefore the delay in receiving a synchronising signal from the common reference point would be different for each location. So each location would set its clock slightly differently and the synchronisation would not work.

The RTP SEQuence number just numbers each RTP packet: 1, 2, 3, ... So the receiver will know if one or more packets are lost, or if packets are received out of order.
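As a rough illustration of how a receiver might use the sequence numbers, the following Python sketch takes sequence numbers in order of arrival and works out the play-out order, the missing packets (candidates for packet-loss concealment), the duplicates and the late arrivals. The function name and the example numbers are assumptions made for this sketch, and wrap-around of the sequence counter is ignored for simplicity.

```python
def analyse_sequence(received):
    """Classify arrivals by sequence number (wrap-around is ignored here)."""
    seen = set()
    duplicates = []   # numbers that arrived more than once
    reordered = []    # numbers that arrived after a higher number
    highest = None
    for seq in received:
        if seq in seen:
            duplicates.append(seq)
            continue
        if highest is not None and seq < highest:
            reordered.append(seq)
        seen.add(seq)
        highest = seq if highest is None else max(highest, seq)
    if not seen:
        return [], [], [], []
    playout_order = sorted(seen)
    lost = [s for s in range(playout_order[0], playout_order[-1] + 1)
            if s not in seen]
    return playout_order, lost, duplicates, reordered


# Made-up arrival order: 3 arrives late and twice, 5 never arrives, 6 arrives late.
order, lost, dups, late = analyse_sequence([1, 2, 4, 3, 3, 7, 6])
```

In a real receiver a packet can only be declared lost once the application has waited as long as it can afford to, so "lost" here really means "not arrived in time".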
Even when it is not feasible to acknowledge every single datagram or packet, it is useful for a transmitter to know how many of its datagrams are getting through the network and with what delay variation. RTP's sister protocol, RTCP (real-time transport control protocol), conveys this information at the expense of generating a small number of extra datagrams. It is described below, is specified in detail in RFC 1889, and is summarised in the Tanenbaum textbook.

22. Real-time Transport Control Protocol (RTCP)

RTP cannot acknowledge every packet. It is fire & forget like IP and UDP. It is useful for the transmitter to know how many of its packets are getting through and with what delay variation. RTCP is sister to RTP (why not brother?). It sends feedback reports to the transmitter periodically, specifying:
- Average round trip delay
- Percentage of lost packets
- Average or worst case jitter introduced by the network
- Other measurements
These are quality of service (QoS) measurements made over the short period of time (typically about five seconds) between reports.
Average round trip delay is the time it would take for a packet of information to reach its destination, be instantaneously copied into a returning packet, and be received by the sender. If a VoIP user says "hello", round trip delay determines the amount of time the user must wait to hear the other person say "hello", if the other person responds as soon as the first "hello" is heard. Round trip delay is easily measured because the time of sending and the time of receiving are both recorded with reference to the same clock, i.e. the transmitter's clock. End to end or one-way delay is difficult to measure because of the need to accurately calibrate clocks at each end of a link. Jitter means delay variation, or the variation in one-way delay. If we can't measure one-way delay, how can we hope to measure variation in one-way delay, or jitter? Actually it's easy, by examining the time-stamps within RTP or RTCP packets.

23. Measuring timing jitter

Assume we receive two packets with consecutive sequence numbers (this is important) and their time-stamps, converted from clock ticks to seconds, are respectively: Assume a timer at the receiver reads 99 seconds when the first packet arrives. If the receive-time for the second packet is seconds, the delay variation, or jitter, is clearly zero. Now assume the second packet is received at seconds. If the one-way delay for the first packet is D, the delay for the 2nd packet is D The delay variation is therefore seconds and we have done this calculation without knowing the one-way delay D.

In general, if the time-stamps are S1 and S2 and the receive times are R1 and R2 respectively, the delay variation is

|(R2 - S2) - (R1 - S1)| = |(R2 - R1) - (S2 - S1)| seconds.

This is the absolute value of the receive-time difference minus the time-stamp difference. RTCP will average over the number of packets received between reports or take the worst case time variation over this period. [ref Stallings].
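The general formula can be tried out with a few lines of Python. This is a minimal sketch: the function names are invented here, and the time-stamps and receive times are made-up illustrative values, not the figures of the worked example above. The average and worst case are taken over one reporting period, as described in the text; practical RTP implementations may use a smoothed running estimate instead.

```python
def delay_variation(s1, s2, r1, r2):
    """|(R2 - R1) - (S2 - S1)| for two consecutive packets, in seconds."""
    return abs((r2 - r1) - (s2 - s1))


def ticks_to_seconds(ticks, ticks_per_second=8000):
    """Convert an RTP time-stamp in clock ticks to seconds (8000 Hz as with G.711)."""
    return ticks / ticks_per_second


def jitter_report(send_times, recv_times):
    """Average and worst-case delay variation over one reporting period."""
    variations = [
        delay_variation(send_times[i - 1], send_times[i],
                        recv_times[i - 1], recv_times[i])
        for i in range(1, len(send_times))
    ]
    return sum(variations) / len(variations), max(variations)


# Made-up values: three packets sent 160 ticks (0.02 s) apart.
send = [ticks_to_seconds(t) for t in (0, 160, 320)]
recv = [99.0, 99.02, 99.07]   # receiver's own clock, not synchronised to the sender
avg, worst = jitter_report(send, recv)
```

Here the second packet arrives exactly 0.02 s after the first, so it contributes no jitter, while the third arrives 0.05 s after the second and contributes about 0.03 s. The one-way delay D never appears in the calculation.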
Measurements of jitter can be useful in predicting the future behaviour of the network link since jitter tends to increase when the network starts to become congested.

24. Reminder of a few key points

UDP introduces port numbers, a check-sum and little else to IP. UDP, like IP, is not connection oriented and not reliable. RTP introduces a sequence number, a time-stamp and other information into the UDP payload, but is still neither reliable nor connection oriented. IP, UDP and RTP are all "fire and forget" protocols. RTP is useful for real time QoS applications and has a brother protocol, RTCP, for measuring and reporting the current QoS being obtained from a network link. TCP is more complicated and powerful, and is said to be connection orientated even over traditional IP, which is not connection orientated. With TCP, connection is made at the transport layer, and the connectivity and reliability of TCP are at the expense of additional delay and network usage. Any of these protocols may be used over a Virtual Circuit (see later) to gain the benefits of improved QoS. Let's ask a direct question about TCP:

24.1 Connectivity at the network layer

TCP provides reliability and connectivity at the transport layer at the expense of latency. VoIP would like circuit switched connectivity as provided by traditional telephony, but not at the cost of additional latency. What is needed is connectivity at the network layer. Traditional IP provides connectionless service where IP datagrams are conveyed independently by routers towards their
destination IP addresses. They may take different routes & arrive out of order. However, the demand for connectivity based network layer service led to:
- Asynchronous transfer mode (ATM) networks
- Multi-protocol label switching (MPLS)
- Enhancements to IP
Network layer connectivity is different from transport layer connectivity as provided by TCP. To provide network layer connectivity, routing is modified to establish and maintain fixed paths. The network links are still not reliable. However, disordering of datagrams is eliminated and delay variation (jitter) is reduced. The choice of routing can try to minimise delay.

25. Virtual circuits (VC)

A connection orientated network link is a virtual circuit. Such a circuit may be established by sending set-up packets and causing routers to remember the route taken. With MPLS, an extra header (a "label") is added to each IP packet to specify the route. VCs must be set up, maintained while needed and then torn down. This is not a fire & forget mechanism as used by traditional IP and UDP. It can offer improved quality of service while not being totally reliable. VCs have clear advantages in providing QoS for VoIP. But this is at a cost, as summarised by the comparison below.

26. Comparison of traditional IP network with a VC

Issue                     Trad IP                              Virtual Circuit
Set up/tear down          Not needed                           Needed
Addressing                Each packet contains full addresses  Only VC number needed
Router memory states      None for connections                 Each VC stored
Routing                   Independent/variable                 Fixed
Effect of router failing  Not catastrophic                     Many VCs may fail
QoS                       Not good                             Much better
Congestion control        Difficult                            Much easier

27. Quality of Service Requirements of Common Applications

Remember that the quality of service (QoS) of a network is defined in terms of:
- Degree of unreliability (no. of damaged/lost packets)
- Delay (latency)
- Jitter (variation of delay)
- Bandwidth
Consider the QoS requirements of a number of applications:
Application   Reliability  Delay         Jitter        Bandwidth
Email         high         Can be high   Can be high   Not high
Web access    high         Not too high  Not too high  Medium
Streamed MM   lower        Can be high   Can be high   Medium
VoIP Teleph   lower        Must be low   Must be low   Low
VideoConf     lower        Must be low   Must be low   High

28. Setting up and maintaining a VoIP session

[Figure: Barry sends Invite to his proxy server, which consults the location server (Look, Reply) and forwards Invite to Alvaro's proxy server, which sends Invite to Alvaro. Multiple client-server (C/S) connections. All messages could be sent by TCP.]

B and A register names and IP addresses with proxy servers. To call A, B sends an invite with A's name to B's server. B's server looks up the address of A's server, then sends the invite to it. A's server sends the invite to A. Hence B invites A, via the proxy server that looks up A's IP address. A can accept B's invitation by returning a message, and B must then acknowledge this acceptance. Now A and B may start to interchange RTP/RTCP packets. They could do this via the proxy server, as in the second laboratory exercise, Instant Messaging. But this is not a good idea, especially if A and B are in the same building and the server is miles away. Instead, establish a direct RTP/RTCP link between B and A. This is a peer-to-peer connection. A and B negotiate the bit-rate for the speech coding by monitoring network congestion. They do this by examining the contents of each RTCP packet. If the round trip delay, jitter and the number of lost packets start to increase, this is a strong indication of congestion building up, and a good reason for changing to a speech coder with more compression.
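The idea of switching coder when RTCP reports get worse can be sketched in Python as follows. This is only an illustration: the `RtcpReport` class is a hypothetical container rather than the real RTCP packet format, the report values are made up, and the rule "step down when all three measurements have increased" is one simple reading of the paragraph above, not a prescribed algorithm. The coders listed are standard ITU-T speech coders with their nominal bit-rates.

```python
from dataclasses import dataclass


@dataclass
class RtcpReport:
    """Hypothetical summary of one RTCP feedback report."""
    round_trip_s: float
    loss_percent: float
    jitter_s: float


# Speech coders ordered from least to most compression (name, kbit/s).
CODECS = [("G.711", 64), ("G.726", 32), ("G.729", 8)]


def next_codec_index(current: int, previous: RtcpReport, latest: RtcpReport) -> int:
    """Move to a coder with more compression if congestion seems to be building."""
    congestion_building = (
        latest.round_trip_s > previous.round_trip_s
        and latest.jitter_s > previous.jitter_s
        and latest.loss_percent > previous.loss_percent
    )
    if congestion_building and current < len(CODECS) - 1:
        return current + 1
    return current


# Made-up reports from two successive reporting periods.
earlier = RtcpReport(round_trip_s=0.10, loss_percent=0.5, jitter_s=0.01)
later = RtcpReport(round_trip_s=0.18, loss_percent=2.0, jitter_s=0.03)
chosen = CODECS[next_codec_index(0, earlier, later)]
```

A practical system would also need a rule for moving back to a higher bit-rate when conditions improve, which this sketch leaves out.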
[Figure: Accept and Ack messages pass between Barry and Alvaro via the two proxy servers and the location server. RTP/RTCP packets then flow directly between Barry and Alvaro (P2P).]

29. Proxy and location servers

The location server is just like a domain name server (DNS), as covered in Steve's first lecture. In general, a proxy server (PS) is intermediate software that acts as both server and client to make requests on behalf of other clients. Proxy servers are useful for routing, and as gatekeepers for enforcing policy such as admission control (i.e. determining who has permission to make VoIP calls), preventing congestion and having efficient access to the location server. Since UDP datagrams are sometimes blocked by firewalls, a PS may use tunnelling, encapsulating them within TCP segments to get them through. It could also provide convenient access to a gateway between the IP network and the public switched telephone network (PSTN). Encryption is not widely used in VoIP yet, but it is not difficult to apply to the voice stream with encryption keys agreed in advance via the proxy server. Without encryption it is very easy for an eavesdropper to intercept a VoIP packet stream and listen in on a call.

30. Session Initiation Protocol (SIP)

SIP is an application layer protocol defined by the IETF (Internet Engineering Task Force) for setting up, maintaining and tearing down VoIP calls.
- SIP uses Ports 5060 &
- Addresses are URLs such as
- Negotiates suitable voice compression, and other things
- H.323 is a similar protocol, but older and more complex.
There are currently more than 100 different VoIP systems and most use SIP. Windows Live Messenger is based on SIP. Skype is not, but does similar things. SIP messages (Connect, Accept, Ack, etc.) may be conveniently sent by TCP and conform to a client/server mechanism. VoIP voice-streams are best communicated by RTP/RTCP (based on UDP) and conform to a peer-to-peer (P2P) mechanism.

31. Quality of Service in VoIP

A buffer at each receiver allows for variation in delay.
This is often called a jitter buffer. Remember that jitter is variation in one-way delay. Assume, for example, the one-way delay varies between 0.02 & 0.12 seconds. A jitter-buffer of size 0.1 seconds (i.e. 800 bytes with 64