Multichannel Voice over Internet Protocol Applications on the CARMEL DSP

1 Introduction

Multichannel DSP applications continue to demand increasing numbers of channels and correspondingly greater DSP performance. Infineon Technologies' high-performance CARMEL DSP allows designers to implement Voice over Internet Protocol (VoIP) solutions efficiently. This application note shows system designers how to maximize the number of channels per CARMEL DSP core in VoIP applications. The solution presented is modular, cost-effective, and easily scaled to multiple-core implementations, which enable even higher numbers of channels.

2 VoIP Systems

In VoIP applications, the DSP is located in a special voice gateway (Figure 1). This gateway translates the analog voice signal into a compressed bitstream, packetizes it, adds the network protocol, and transmits it over the packet-switched Public Data Network (PDN) to a receiving gateway. The receiving gateway performs the same processing in reverse order.

Figure 1: VoIP signal flow

A complete VoIP gateway is normally expected to handle fax and data (analog modem) transmissions correctly by relaying these signals to the gateway on the other side of the PDN. These functions are sometimes referred to as Fax over Internet Protocol (FoIP) and Data over Internet Protocol (DoIP), respectively. In this document we shall refer to the complete solution, including fax and data functions, as VoIP.

The typical role of the DSP processor in a full-duplex VoIP system is to convert the analog signal to a bitstream on the transmitting side and to regenerate an analog signal from the bitstream on the receiving side (Figure 2). A host microprocessor, also located in the gateway, handles data packetization and manages the Internet Protocol (IP) stack.

Figure 2: Software Architecture

This application note focuses on the role of the DSP processor and its software in implementing a full-duplex VoIP system. The PCM interface, the host processor, and the host processor's software modules are outside the scope of this application note.

3 Functional Blocks

The functional DSP requirements for a VoIP solution are:

  Line echo cancellation
  Voice/fax/data classification
  Voice compression and expansion
  Silence compression and expansion
  Lost-frame compensation
  Fax relay
  Signaling relay: DTMF, MF, R1, and R2
  Analog modem relay
Figure 3: Transmission Path (Single Channel)

Processing in the transmission path proceeds as follows. The echo canceller processes the signal received by the PCM interface and passes it to the voice/fax/data classifier, which forwards it to the appropriate software module for further compression (Figure 3). The modem demodulator and the fax demodulator extract the payload from the data and fax channels, respectively. The resulting output is forwarded to the packet network as a bitstream.

The voice activity detector forwards the signal from the voice channels to one of several speech coder modules or, during intervals of silence, to the comfort noise encoder, which provides very high compression ratios for optimal bandwidth utilization. The DTMF relay preserves any tone signaling superimposed on the voice.

Figure 4: Receive Path (Single Channel)

Simultaneously (Figure 4), data from the host interface are processed to reconstruct the original signal along the receiving path. Data and fax channels are remodulated before being relayed to the PCM interface. Speech decoders decompress the voice channels, and the comfort noise generator decodes the silence intervals. A bad-frame handler compensates for any lost voice packets to minimize the disturbance at the receiving end.

4 CARMEL Implementation

The CARMEL VoIP solution presented here uses standard ITU vocoders to process voice input. The following section lists the algorithms used for this task and their main features.

4.1 Functional Specifications

Voice Compression:
  G.711 packetized PCM (64 kbps)
  G.722 SB-ADPCM (64 kbps, WB/audio)
  G.723.1 Dual Rate Coder (5.3, 6.3 kbps)
  G.726/G.727 ADPCM (16-40 kbps)
  G.729 Annex A, B CS-ACELP (8 kbps)

Silence Compression:
  G.723.1 Annex A
  G.729A Annex B

Error Concealment for voice coders

Group 3 Fax Relay:
  V.17 (7.2, 9.6, 12, 14.4 kbps)
  V.27ter (4.8 kbps)
  V.29 (9.6 kbps)
  CNG, CED (detection/generation)

Modem Relay:
  V.21 (300 bps)
  V.22bis (2.4 kbps)
  V.32bis (14.4 kbps)

Gain Control:
  Programmable per channel
  Tx: -24 to +6 dB, 1 dB step
  Rx: -20 to +10 dB, 1 dB step

Adaptive Line Echo Canceller:
  Standard compliance to G.165/G.168
  Echo path length: 16 msec
  ERLE: > 30 dB
  Convergence rate: better than G.165 requires
End-to-End Processing Delay:
  < 50 msec

Tone Signaling:
  DTMF relay (TIA/EIA 464A)
  MF-R1
  MF-R2

Diagnostics:
  1 kHz tone injection
  Local loop: near-end loopback
  Remote loop: far-end loopback

4.2 Task scheduling

A real-time operating system is required to handle the different tasks of this multichannel VoIP application. The proposed non-preemptive operating system is application-oriented rather than general-purpose (Figure 5). It schedules tasks with different priorities, depending on their individual requirements. Because the operating system is optimized for a specific application, it does not need full context switching and minimizes the dynamic allocation of memory. Thus, tasks do not need to be reentrant and can be coded to minimize CPU load and memory requirements.

Figure 5: OS Architecture

A task is a channel-related instance of processing, such as create session, encode frame, or decode packet. The task-related events are frame-length sensitive, and each task runs with a minimum of interruptions. When a frame is ready to be processed, the operating system creates a task to handle the frame and inserts it in the task schedule.

The decision on when a frame is ready depends, among other things, on the direction of the communication (transmission or reception) and on the type of vocoder used for that channel. The task schedule queue contains tasks for each active channel in both directions. In addition, it handles administrative tasks, such as statistics-collection commands sent by the host processor. Priorities are assigned to each task in the queue, depending on how time-critical the task response is at any given time. This flexible priority mechanism allows tasks to be dynamically reprioritized when conditions change.

A channel manager tracks the state of the session for each active channel and manages the channel status and a number of channel parameters, such as the session type, the vocoder type, and the packet length. The host processor sends commands through the host interface to create sessions. A protocol defines the commands that can be exchanged between the CARMEL DSP and the host processor to support the VoIP application.

4.3 Algorithm Load

The processing overhead of the operating system is as low as 0.6 CARMEL MIPS per channel. The load of the voice/fax/data classifier is also proportional to the number of channels. The G.723.1 vocoder algorithm has the greatest computational intensity and determines the critical path for processor load (marked in bold in the original table). For the CARMEL DSP, the critical path has a computational load of 10.0 CARMEL MIPS.

  Algorithm                        MIPS
  G.711 packetized PCM              0.2
  G.723.1 vocoder                   7.5
  G.726 vocoder                     5.0
  G.727 vocoder                     5.0
  G.729A vocoder                    5.0
  V.17 fax relay                    8.0
  V.32bis modem relay               8.5
  Line Echo Canceller               1.5
  DTMF relay                        0.5
  Voice/fax/data classifier         0.4
  Real-time executive overhead      0.6
  Maximum load per channel         10.0

Thus a 250 MHz CARMEL-based single-core VoIP system is capable of processing up to 24 channels,
which is equivalent to the capacity of one T1 trunk. This solution also allows several single- or multi-core VoIP devices to be combined to serve as many channels as required.

5 Chip Reference Design

The following chip design describes the multichannel, single-core solution that might serve as a reference platform for multi-core VoIP devices (Figure 6). The VoIP chip's architecture is composed of the following main parts: the CARMEL DSP core with on-chip data SRAM (A/B) and program SRAM; an external memory; three DMA engines; and interfaces for the connection to the external memory, the PCM interface, and the host processor.

Figure 6: Single-Core System Architecture (Shaded blocks show multi-core solution)

The use of external memory is a key feature of the flexible and scalable VoIP chip solution proposed here. Therefore, the DMAs are also key elements of this architecture. The three DMAs transfer data between:

  the PCM interface and the external memory
  the external memory and the CARMEL core
  the host interface and the external memory

Management of External Memory

To minimize the amount of on-chip memory required, the CARMEL VoIP reference chip keeps most of the data for its 24 channels in external memory. The relatively small on-chip data A/B SRAM is used as a temporary scratchpad for the current computations, whereas a much larger external RAM stores most of each channel's static data. The DMA transfers I/O data between the interfaces and the external memory without requiring intervention from the CARMEL DSP. The DMA is also in charge of moving the data that is ready to be processed from the external memory into the internal memory, and of moving the results back out of the internal memory into the external memory.

This memory management mechanism offers two main advantages:

  Minimal chip size (large memories increase the chip size considerably)
  Scalability (more channels can be served by increasing the amount of external memory)

The overall memory requirement:

  Program ROM                                              300 KB
  Data ROM                                                  60 KB
  Dual-port data RAM (temp)                                 16 KB
  Single-port data RAM (temp)                               20 KB
  External single-port data RAM (4 KWords per channel)     192 KB

The program code can be located in ROM or RAM, depending on how stable the code is when the chip is fabricated. The program memory might be larger or smaller than the one chosen, depending on the number of processing modules (e.g., vocoders) required. The data ROM holds coefficient tables and constants for all algorithms.

The 16-KB dual-port data memory is divided into two 8-KB sections. One holds the data of a channel while that channel is being processed, while the DMA uses the other for swapping data with the external memory. Eight KB of external RAM per channel, totaling 192 KB, are required for a 24-channel implementation.
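
The memory figures above can be cross-checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of the CARMEL software: the constant and function names are hypothetical, and the values are those quoted in this section (4 KWords of 16-bit static data per channel, 24 channels, and a 16-KB dual-port RAM split into two sections).

```python
# Illustrative memory-budget check. All names are hypothetical; only the
# figures quoted in this application note are used as inputs.
BYTES_PER_WORD = 2               # 16-bit data words
WORDS_PER_CHANNEL = 4 * 1024     # 4 KWords of static data per channel
CHANNELS = 24                    # one T1 trunk on a single core
DUAL_PORT_RAM_BYTES = 16 * 1024  # on-chip dual-port data RAM


def channel_state_bytes(words_per_channel: int = WORDS_PER_CHANNEL) -> int:
    """Size in bytes of one channel's static data."""
    return words_per_channel * BYTES_PER_WORD


def external_ram_bytes(channels: int = CHANNELS) -> int:
    """External RAM needed to hold the static data of all channels."""
    return channels * channel_state_bytes()


def sections_fit(dual_port_bytes: int = DUAL_PORT_RAM_BYTES) -> bool:
    """True if two channel-sized sections fit in the dual-port RAM."""
    return 2 * channel_state_bytes() <= dual_port_bytes


if __name__ == "__main__":
    print(channel_state_bytes() // 1024, "KB per channel")
    print(external_ram_bytes() // 1024, "KB of external RAM for", CHANNELS, "channels")
    print("two sections fit in the dual-port RAM:", sections_fit())
```

Under the same per-channel assumption, external_ram_bytes(96) evaluates to 768 KB. That value is an extrapolation from the per-channel figure and is not stated in this note.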
The 20-KB single-port memory is used by the running task and the operating system as a temporary scratchpad. The stack is also located in this memory.

Memory Swapping

As noted, the 16-KB dual-port RAM is divided into two 8-KB regions. The CARMEL DSP core uses the first of the two 8-KB RAM regions to process the currently active channel. This region is referred to as the active section. Simultaneously, the DMA transfers the static data of the next channel scheduled to be processed to the second 8-KB section, which is referred to as the inactive section.

After processing of the current channel is completed, the DMA copies the 8 KB of static channel data from the active section to the external RAM. This section then becomes inactive and will be used for the data of the next channel. At the same time, the processing of the next channel starts on the previously inactive section, which now becomes active.

The DMA must be able to transfer 8 KB from external to internal memory, and again 8 KB from internal to external memory (for a total of 8192 16-bit word transfers), while one channel frame is being processed. The application running on the CARMEL DSP core alternates addressing between the two 8-KB sections; therefore, no hardware switching support is necessary.

To offer the necessary memory access bandwidth, the 16-KB data section is implemented in dual-port SRAM. This allows a maximum of 2 accesses per cycle to each of the A and B memories, independent of the address. By contrast, a single-port memory would yield this bandwidth only if half of the addresses are even and the other half are odd. The goal is to ensure that the DMA operates in cycle-stealing mode (i.e., it waits for the requested CARMEL data memory bus to be free before performing an access to memory) and does not cause any delay to the core.

Only the core accesses the rest of the on-chip memory, and the external data memory. Thus, these memories can be implemented as single-port memories to minimize chip size.

Of course it must be verified that real-time constraints are not violated by the required DMA accesses. As mentioned above, 8 K 16-bit words must be transferred during the processing time of one frame, which is 10 msec. With 24 channels served on one core, the maximum processing time per frame is:

  t_process = 10 msec / 24 = 416 µsec = 104,166 CARMEL cycles

The DMA must transfer 8 K words on the A/B memory buses during t_process. Because these 8192 transfers occupy about 8% of the 104,166 available cycles, this is possible if the CARMEL DSP core does not use more than 92% of the A/B bus bandwidth. It is straightforward to show that this condition is met in the VoIP application described. The resulting data stream is:

  24 ch * 100 frames/sec/ch * 8 K words/frame = 19.7 Mwords/sec

The DMA is connected to the external RAM via the Flexible Peripheral Interconnect (FPI) bus. The FPI bus operates at a frequency of 125 MHz (one-half of the CARMEL core frequency) and can transfer one 16-bit word every cycle. Thus, only 15.7% of the FPI bus's total gross bandwidth of 125 Mwords/sec is necessary for memory swapping.

The external memory must support an access bandwidth of at least 19.7 Mwords/sec. A 40-nsec, 16-bit-wide SRAM has a gross access bandwidth of:

  1 word (16 bit) / 40 nsec = 25 Mwords/sec

This type of memory is more than sufficient for the 24-channel single-core VoIP solution presented here. A multi-core device that supports more channels might require the use of faster and/or wider external memory.

6 Multi-core VoIP Implementation

Whenever 24 channels are not enough for a planned design, two or more CARMEL-based VoIP subsystems can be combined in a multi-core system. Thanks to the scalability of the solution proposed, this can be achieved with minor modifications to the reference design. Figure 7 illustrates a single-chip, multi-core, multichannel solution for VoIP, which serves 96 channels in parallel. Thus, this multi-core VoIP solution processes four T1 or three E1 trunks.

The 96 channels are split between 4 CARMEL-based, single-core VoIP subsystems. The data swapping between each core's
inactive memory and the external memory occurs via the Bus Interface Unit (BIU). The Bus Control Unit (BCU) arbitrates every FPI access of the four CARMEL subsystems. Of course, more peripherals can be connected to the FPI bus if necessary. Note that there is no hardware switching of active and inactive memory blocks in this architecture, because the switching is performed in software by toggling between two scratchpad memory addresses. The DMA updates the inactive section via the FPI bus and the BIU.

Figure 7: Multi-core VoIP Solution

As shown for the single-core VoIP chip, the DMA transfer for each core must be finished within the processing time of one frame, and the core must not use more than 92% of the A/B memory bus bandwidth. These constraints remain valid in the multi-core environment. Within 416 µsec, 8 K words must be transferred four times. The resulting data bandwidth is:

  4 * 24 ch * 100 frames/sec/ch * 8 K words/frame = 78.6 Mwords/sec

The DMAs are connected to the external memory via only one FPI bus (Figure 7). Thus, for memory swapping the maximum load of the FPI bus is 62.8% (four times the single-core load of 15.7%).

The data access bandwidth available for the external memory is a very important factor that affects system performance. In a multi-core solution like this, the external memory bandwidth might become a bottleneck. In this example, wider or faster memory must be used to ensure that the processing is not stalled. A 12-nsec, 16-bit-wide SRAM offers an access bandwidth of:

  1 word (16 bit) / 12 nsec = 83.3 Mwords/sec

It can be seen that this faster memory is sufficient for the CARMEL DSP core-based multi-core VoIP solution for 96 channels. A smaller number of channels will of course reduce the memory bandwidth requirements. As shown previously, a 40-nsec memory suffices for a single-core, 24-channel VoIP solution. For a 48-channel solution using 2 cores, a memory bandwidth of 19.7 Mwords/sec/core * 2 cores = 39.4 Mwords/sec is required, which can be satisfied using a 25-nsec memory.

7 Conclusion

This application note describes how multichannel VoIP solutions can be implemented as single- or multi-core devices with the CARMEL DSP core. System designers will have to develop the design solution that best fits their objectives. The bandwidth of the external memory and of the FPI bus must be taken into account, particularly in multi-core implementations. A possible solution for very high channel density is to use two FPI buses, as described in [1]. In summary, the CARMEL DSP core offers a very flexible platform for VoIP applications, as well as outstanding performance, achievable with a very short time-to-market.

8 References

[1] Data Memory Concept for Multicore and Multichannel Applications (Infineon CARMEL Application Note)
[2] Data Transfer between external and internal memory (Infineon CARMEL Application Note)

All rights are reserved. Reproduction in whole or in part is prohibited without the prior written consent of the copyright owner. The information presented in this document does not form part of any quotation or contract, is believed to be accurate and reliable, and may be changed without notice. The publisher will accept no liability for any consequence of its use. Publication thereof does not convey nor imply any license under patent or other industrial or intellectual property rights.