SIP Overload Control IETF Design Team Summary
Eric Noel, Carolyn Johnson, AT&T Labs
© 2007 AT&T Knowledge Ventures. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Knowledge Ventures.
Introduction
The IETF has chartered a design team to perform research on SIP overload controls and lay the groundwork for possible standardization:
The design team was chaired by Alcatel-Lucent,
It included members from AT&T Labs, Bell Labs/Alcatel-Lucent, Columbia University, Sonus Networks, Nortel, and British Telecom.
The team has developed four independent simulation tools and has conducted extensive simulations to:
Confirm the problem with the current SIP specification,
Evaluate potential solutions in steady-state and transient simulations.
Close collaboration between AT&T Labs (Eric Noel, Carolyn Johnson), Alcatel-Lucent (Volker Hilt, Fangzhe Chang), Columbia University (Charles Shen) and Sonus Networks (Ahmed Abdelal).
SIP Overload Control Design Team
Team members:
Eric Noel, Carolyn Johnson (AT&T Labs)
Volker Hilt, Fangzhe Chang (Bell Labs/Alcatel-Lucent)
Charles Shen, Henning Schulzrinne (Columbia University)
Ahmed Abdelal, Tom Phelan (Sonus Networks)
Mary Barnes (Nortel)
Jonathan Rosenberg (Cisco)
Nick Stewart (British Telecom)
Four independent simulation tools: AT&T Labs, Bell Labs/Alcatel-Lucent, Columbia University, Sonus Networks.
Bi-weekly conference calls.
Motivation
Users expect VoIP/SIP networks to perform as well as the existing PSTN, so efficient handling of overload conditions is crucial to cope with, for instance:
Call floods: emergencies, media-stimulated calling,
Decreases in processing capacity: component failures,
Avalanche restart: recovery from failures (e.g., power outage), provisioning errors.
[Figure: goodput vs. offered load (cps) with stateful 503 rejection; the AT&T, Columbia University, and Bell Labs Sim3 goodput curves all exhibit congestion collapse.]
Researchers have demonstrated that VoIP/SIP networks can experience congestion collapse, with the SIP retransmission algorithm and limited overload control capability as the main causes. Therefore a SIP overload control mechanism that provides high throughput during times of overload is needed.
Benchmark Model
[Figure: benchmark topology — UAs attached to 5 edge proxies establishing calls across 2 core proxies; queue thresholds Wh, Wl; service rates μ1, μ2.]
All proxies were modeled as a queuing system: message rate of 5/sec for accepted messages and 3/sec for rejected INVITEs; queue size: 5 messages. Internal overload control was considered; media path congestion was not.
We focused on server-to-server controls; overload controls to throttle UAs should be very different from the ones to throttle servers.
Our benchmark network consisted of UAs establishing calls with one another across a network of 5 edge proxies and 2 core proxies.
We used the standard 7-message SIP call model.
All proxies were assumed to be stateful, and signaling messages within the same call traverse the same set of proxies.
To evaluate our server-to-server controls and eliminate other sources of interaction, UAs and edge proxies were assumed to have infinite capacity.
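The internal overload control in this proxy model can be pictured as a finite message queue that drops messages on overflow and rejects new INVITEs with a 503 once a fill threshold is crossed. The following is a minimal sketch with hypothetical class and parameter names, not the team's simulation code:

```python
from collections import deque

class ProxyQueue:
    """Toy model of a proxy message queue with internal overload control.
    Capacity and threshold values are placeholders, not the benchmark's."""

    def __init__(self, capacity, reject_threshold):
        self.capacity = capacity                  # max queued messages
        self.reject_threshold = reject_threshold  # fill level that triggers 503s
        self.queue = deque()

    def offer(self, msg, is_new_invite=False):
        """Return 'queued', 'rejected' (503 sent), or 'dropped' (overflow)."""
        if len(self.queue) >= self.capacity:
            return "dropped"                      # queue full: message is lost
        if is_new_invite and len(self.queue) >= self.reject_threshold:
            return "rejected"                     # internal control: 503 the INVITE
        self.queue.append(msg)
        return "queued"
```

Rejecting only new INVITEs above the threshold lets in-progress calls complete, which is why stateful rejection preserves goodput better than blind drops.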
SIP Protocol Implementation
[Figure: SIP INVITE transaction ladder diagram — client and server transaction state machines between UAC, proxies, and UAS, showing INVITE, 1xx/2xx, and 3xx–6xx message flows.]
SIP is an IETF-standardized application-layer signaling protocol for creating, modifying, and terminating sessions with one or more entities.
SIP uses timers for application-level retransmission and message timeouts:
Most retransmission timers are used for UDP only,
The short timer, initialized to T1 (500 msec), is doubled on every retransmission, while the long timer is proportional to T1 (64×T1 = 32 sec).
SIP provides the 503 response (Service Unavailable) for overload control. After receiving a 503 response, a proxy will:
Stop sending requests to that server for the number of seconds given in the Retry-After header (if honored by the recipient),
Retry at an alternative proxy if one is available, or reject the request back to the UAC.
Each simulation model included a detailed implementation of the SIP INVITE and non-INVITE client and server transaction state machines.
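The doubling of the short timer under the 64×T1 cap can be made concrete with a small sketch. T1 and the timeout are the RFC 3261 defaults quoted above; the function name is ours:

```python
T1 = 0.5           # base retransmission interval, seconds (RFC 3261 default)
TIMEOUT = 64 * T1  # transaction timeout: 64 x T1 = 32 s

def invite_retransmit_schedule(t1=T1, timeout=TIMEOUT):
    """Return the times (seconds after the first send) at which an
    unanswered INVITE over UDP is retransmitted before the timeout fires."""
    times, interval, elapsed = [], t1, 0.0
    while elapsed + interval < timeout:
        elapsed += interval
        times.append(elapsed)
        interval *= 2           # short timer doubles on each retransmission
    return times

# An unanswered INVITE is resent at 0.5, 1.5, 3.5, 7.5, 15.5 and 31.5 s.
```

Six retransmissions per unanswered INVITE is exactly why an overloaded server sees its offered load multiply: every request it fails to answer in time comes back again.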
Controls Test Scenarios and Evaluation Metrics
[Figure: transient test profile — offered load steps from L1 = 0.8·C up to L2, with capacity C = 143 cps; ideal vs. actual carried load over time at step times T1, T2, T3.]
Overload control algorithms were evaluated under steady-state conditions:
For a given offered load, simulations ran until steady state was achieved,
Sensitivity analysis on transmission impairments (message loss and delay), number of proxies, and traffic matrix.
Transient analysis was also considered, in which the offered load initially consisted of a step function.
Metrics considered were: goodput or carried load, core proxy queue and CPU utilization, convergence speed, and post-dial delay.
Simulation Models Calibration (no controls)
[Figure: goodput vs. offered load (cps) without controls for three configurations — Sim1, stateless 503, and stateful 503 — comparing the AT&T, Columbia University, Bell Labs, and Sonus simulators.]
Within the team, two modes of operation were suggested: rejecting new INVITEs statelessly (minimal processing penalty) or statefully.
The stateless mode of operation uncovered a state-transition inconsistency in RFC 3261 that was resolved in parallel elsewhere (draft-sparks-sipinvfix-.txt).
Simulation calibration between the different teams turned out to be challenging, as each interpreted RFC 3261 somewhat differently; it took ~6 months for the models to show a reasonable level of agreement.
Results for Steady-State Conditions
[Figures: goodput vs. offered load (cps) — AT&T Sim3 with no control, rate control, and window control (baseline and optimal CP); Sonus Networks receiver-based vs. sender-based overload control.]
AT&T rate & window controls:
Core proxies' internal overload control is based on queue fill,
The rate and window controls are tuned to activate prior to the internal control limits,
In the rate control, core proxies estimate a control rate based on the queue delay deviation from a target and distribute allowed rates to edge proxies via SIP responses,
Edge proxies throttle using a percent-blocking algorithm,
In the window control, edge proxies maintain a window of unacknowledged INVITEs based on core proxy responses (503s shrink the window; 1xx and 2xx responses open it).
Sonus Networks receiver- & sender-based controls:
Core proxies' internal overload control adjusts the allowed call rate based on the CPU utilization deviation from a target value; excess calls are rejected via 503 Service Unavailable,
Core proxies distribute individual allowed rates to upstream proxies via SIP messages, and upstream proxies adjust their throttles based on the received allowed rates,
In the sender-based control, upstream proxies dynamically adjust their rates based on overload notifications from downstream proxies (503 Service Unavailable) such that the overload notification rate converges close to a target rate.
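The window control's adaptation rule — 503s shrink the window of unacknowledged INVITEs, 1xx/2xx responses open it — can be sketched as below. The class, the halving step, and the W_MIN/W_MAX bounds are our own illustrative assumptions, not the tuned AT&T algorithm:

```python
W_MIN, W_MAX = 1, 50    # illustrative window bounds (assumption)

class InviteWindow:
    """Edge-proxy window of unacknowledged INVITEs toward one core proxy."""

    def __init__(self, size=10):
        self.size = size        # currently allowed unacknowledged INVITEs
        self.in_flight = 0      # INVITEs sent but not yet answered

    def can_send(self):
        return self.in_flight < self.size

    def on_send(self):
        self.in_flight += 1

    def on_response(self, status):
        self.in_flight = max(0, self.in_flight - 1)
        if status == 503:
            self.size = max(W_MIN, self.size // 2)   # overload: shrink window
        elif 100 <= status < 300:
            self.size = min(W_MAX, self.size + 1)    # 1xx/2xx: open window
```

The multiplicative-decrease/additive-increase shape mirrors TCP-style congestion control: the window collapses quickly when the core signals overload and recovers gradually as responses flow again.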
Results for Steady-State Conditions
[Figure: Columbia University overload control results — goodput vs. load (cps) for CU-WIN-I, CU-WIN-II, and CU-WIN-III across Sim1, Sim2, and Sim3, compared to the theoretical curve.]
Columbia University window-based overload control:
SIP session as the control unit, dynamically estimated from processed SIP messages,
The receiver (core proxy) dynamically computes the available window, with feedback piggybacked in responses/requests,
Three different window adaptation algorithms:
CU-WIN-I: keep current estimated sessions below the total allowed sessions given a target delay,
CU-WIN-II: open up the window after a new session is processed,
CU-WIN-III: discrete version of CU-WIN-I, divided into control intervals.
Bell Labs/Alcatel-Lucent loss-based overload control simulation:
Feedback loop between receiver (core proxy) and sender (edge proxy),
Receiver-driven control algorithm (estimates current processor utilization, compares it to a target processor utilization, and applies multiplicative increase and decrease of the loss rate to reach the target utilization),
The sender adjusts the load it sends to the receiver based on the feedback received, using percent blocking,
Overload control algorithms: occupancy algorithm (OCC), ARO algorithm (ARO), improved ARO (IAR).
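One receiver-driven feedback step of the loss-based scheme can be sketched as a multiplicative update of the requested loss (blocking) fraction toward the utilization target. This is our own simplification with assumed names and constants; the actual OCC/ARO/IAR algorithms differ in detail:

```python
TARGET_UTIL = 0.9   # illustrative target processor occupancy (assumption)

def update_loss_rate(loss_rate, measured_util, target=TARGET_UTIL):
    """One feedback step: scale the admitted fraction (1 - loss_rate) by
    target/measured so the sender's load moves toward the utilization
    target, then clamp the result to a valid probability."""
    if measured_util <= 0:
        return 0.0                     # idle receiver: admit everything
    keep = (1.0 - loss_rate) * target / measured_util
    return min(max(1.0 - keep, 0.0), 1.0)

# e.g. measured utilization 1.2 against a 0.9 target raises the requested
# loss from 0 to about 0.25; at the target, the loss rate stays put.
```

The sender then applies the returned fraction as its percent-blocking probability on new INVITEs; because the update rescales rather than increments, it converges in few steps once the utilization estimate is stable.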
Results for Sensitivity to Transmission Impairments
[Figures: AT&T rate control goodput vs. delay between edge and core proxies (msec) and vs. packet loss (%) at several offered loads; Bell Labs/Alcatel-Lucent loss-based control (OCC) goodput vs. message loss rate (%).]
Results for Transient Conditions
[Figures: goodput vs. time (sec) after a load step — AT&T window control and rate (queue delay) control vs. theoretical; Bell Labs/Alcatel-Lucent rate and loss feedback controls; Columbia University rate and window controls.]
Conclusion
With the increasing size of VoIP/SIP deployments, efficient handling of overload conditions is crucial for VoIP/SIP service providers.
IETF design team simulation results showed that overload control algorithms produce stable VoIP/SIP traffic behavior and maintain high throughput under various overloads.
Additional info:
IETF draft draft-hilt-soc-overload-design (V. Hilt, coordinator),
ITC 20 '07 and ITC 21 '09 (E. Noel, C. Johnson),
ICNP '08 (V. Hilt, I. Widjaja),
IPTComm '08 (C. Shen, H. Schulzrinne, E. Nahum),
IEEE CCNC '11 (A. Abdelal, W. Matragi).