Joint ITU-T/IEEE Workshop on Carrier-class Ethernet
Quality of Service for unbounded data streams
Reactive Congestion Management (proposals considered in IEEE 802.1Qau)
Hugh Barrass (Cisco)
IEEE 802.1Qau Congestion Management TG: Background
- Significant growth in data center networking demand
  - New applications, growth in usage
  - Also driven by server virtualization
- Interest in all areas of datacenter networking
  - Including Ethernet, Fibre Channel, InfiniBand, etc.
- Customers asking for I/O consolidation
  - Ethernet performance needs to match its rivals
  - Not just raw b/w: goodput requires a QoS approach
- Ethernet's biggest weakness is loss due to congestion
Area of Task Group study
- Bounded scope encourages an effective solution
- Focus on datacenter networks
  - Limited network diameter (b/w * delay product)
  - Single administrative domain
- Assume moderate to very large scale datacenter
  - 1000s of nodes, all within a small locale, highly meshed
- Layer 2, edge to edge
- No backward interoperability (only compatibility)
  - Systems with new functions benefit
  - Legacy systems in the same network are unchanged
CM cloud
- Compliant devices in the cloud, edge behavior at the boundary
- Bigger cloud = better performance
[Figure: a CM cloud containing compliant bridges and compliant end-stations, with non-compliant bridges and non-compliant end-stations outside; edge behavior applies where they meet]
Reactive Congestion Management
- Why reactive?
  - Prescriptive mechanisms (i.e. traffic management) are not scalable and require significant expertise
  - They need predictable data flows
  - Not conducive to ad-hoc (plug and play) networks
- What target application?
  - Block data transfers (apparently random), e.g. ftp, tftp, RDMA, iSCSI, etc.
  - Logically meshed topology (any source to any destination)
  - Life of flow >> network latency
    - Otherwise the reaction is ineffective
    - Buffering requirement proportional to delay * b/w product
Control theoretical approach
- Use feedback from congestion points
- Control loop operates between congestion point (in network) and reaction point (at edge)
  - As many control loops as entry points
- Either queue length or data rate is the control target
  - i.e. keep queue at a fixed length during congestion, or keep net data rate at CP egress speed (see the sketch below)
- Traffic entry rate is the dependent variable
  - Rate limiter structures reduce offered load from edge devices
- Algorithm & stability analysis using control theory
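As a rough illustration of the second option (data rate as the control target), the sketch below scales every entry point's rate limiter against the congestion point's egress speed. The names, gain, and uniform scaling are assumptions made for illustration, not any proposal's algorithm.

```python
# Illustrative only: one correction step of a rate-targeted control loop.
# EGRESS, GAIN and the uniform scaling are assumptions, not from 802.1Qau.

EGRESS = 10e9   # congestion point egress speed (bits/s)
GAIN = 0.25     # how aggressively each loop corrects oversubscription

def control_step(arrival_rate: float, entry_rates: list[float]) -> list[float]:
    """CP measures its total arrival rate; each reaction point scales its
    rate limiter by the same factor, giving one loop per entry point."""
    error = (arrival_rate - EGRESS) / EGRESS   # > 0 when oversubscribed
    scale = max(0.0, 1.0 - GAIN * error)
    return [r * scale for r in entry_rates]
```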
Sample system (for description)
- 1 congestion point, 1 reaction point defined
- Other data sources assumed equivalent
[Figure: network cloud with many-to-many connectivity; an edge data source (with limiter control state machine) sends through the congestion point to an edge data destination; control packets return from the congestion point to the source; other destinations and other data sources attach to the same cloud]
Reaction Point
- Located at edge where flows enter the network
- New queue, with rate limiter mechanism
  - Multi-path (run-around) may be needed
- State preserved, based on notifications received
  - Granularity dependent on implementation
  - Could be SA/DA/PRI, DA/PRI, PRI-only, or entire link
- Suggest multiple rate limiters, with fall-back (sketched below)
  - React to multiple congestion points
  - If # congestion points exceeds # rate limiters, fall back to coarser granularity
  - More than 2 or 3 simultaneous congestion points is unlikely
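A minimal sketch of the fall-back idea, assuming a small fixed budget of limiters; the key levels mirror the granularities listed above, but the function names, limiter budget, and retry policy are hypothetical.

```python
# Hedged illustration of rate-limiter granularity fall-back; MAX_LIMITERS
# and the retry policy are assumptions, not from any 802.1Qau proposal.

MAX_LIMITERS = 4
limiters = {}   # key -> rate limit (bits/s)

def limiter_key(sa, da, pri, level):
    """Granularity levels, fine to coarse: SA/DA/PRI, DA/PRI, PRI-only."""
    return ((sa, da, pri), (da, pri), (pri,))[level]

def install_limiter(sa, da, pri, rate):
    """Try the finest granularity first; if the limiter budget is spent,
    fall back to coarser keys so one limiter covers more flows."""
    for level in range(3):
        key = limiter_key(sa, da, pri, level)
        if key in limiters or len(limiters) < MAX_LIMITERS:
            limiters[key] = min(rate, limiters.get(key, rate))
            return key
    limiters["link"] = rate   # last resort: limit the entire link
    return "link"
```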
Reaction Point: logical architecture (similar for all proposals)
[Figure: input queue feeding flow identification (DA/SA/PRI, DA/PRI or PRI); matching traffic is steered into one of several rate limiters while non-matching traffic bypasses them; a limiter control state machine, driven by received control packets, sets the limiter rates; all paths merge at the output]
Forward probing: FECN proposal
- Rate based
- Source at edge sends probes at fixed intervals
  - Reflected at destination
  - All potential congestion points amend control packets
  - Returns worst b/w constriction in path
- Source reduces rate if b/w oversubscribed
- Constant overhead regardless of congestion
- Faster reaction (before queue starts growing)
- No unfairness; solves the parking-lot problem
- Some scalability concerns
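The probe mechanics can be captured in a few lines; the Probe fields and hop hook below are hypothetical names for illustration, not the proposal's wire format.

```python
# Hedged sketch of FECN-style forward probing; frame layout and hop API
# are my own assumptions.

from dataclasses import dataclass

@dataclass
class Probe:
    allowed_rate: float   # bits/s; worst constriction recorded so far

def amend_at_hop(probe: Probe, available_rate: float) -> None:
    """Each potential congestion point lowers the field to the rate it
    can currently offer this flow."""
    probe.allowed_rate = min(probe.allowed_rate, available_rate)

def on_reflection(probe: Probe) -> float:
    """Destination reflects the probe; the source then sets its rate
    limiter to the tightest constriction reported along the path."""
    return probe.allowed_rate
```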
Backward notification: ECM proposal
- Queue length based
- As queue grows beyond threshold, CP generates BCN
  - Contains queue length & delta for 2nd-order control
- BCNs generated randomly, sent to source address of sampled packet
  - Sample interval key for reactivity & fairness
- Low overhead; no tagging required unless congested
- May cause micro-unfairness
- Best throughput in the nearly-congested case
- Hybrid approaches also considered
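A hedged sketch of the 2nd-order reaction, loosely in the style of BCN-family schemes: feedback combines the queue's offset from its target with its rate of growth. The constants and the exact law below are illustrative assumptions, not values from the ECM proposal.

```python
# Illustrative BCN-style 2nd-order feedback; Q_EQ, W, GD, RU are assumed.

Q_EQ = 16_000    # equilibrium queue length (bytes)
W = 2.0          # weight on the queue-growth (delta) term
GD = 1 / 64      # multiplicative-decrease gain
RU = 1e6         # additive-increase step (bits/s)
MIN_RATE = 1e5   # floor so the limiter never closes completely (bits/s)

def bcn_feedback(q_len: int, q_delta: int) -> float:
    """Computed at the congestion point and carried in the BCN: negative
    when the queue is above target and/or still growing."""
    return -((q_len - Q_EQ) + W * q_delta)

def on_bcn(rate: float, fb: float) -> float:
    """Reaction point: multiplicative decrease on negative feedback,
    additive increase on positive feedback."""
    if fb < 0:
        return max(MIN_RATE, rate * (1 + GD * fb))   # fb < 0 shrinks rate
    return rate + RU
```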
Preliminary results
- Simulations based on artificial hotspots
  - Heavily loaded network node; suddenly reduce egress bandwidth
  - Mimics a burst of higher-priority traffic
- Throughput massively better than with no CM
  - With reasonably sized network switches
- Some packet loss if no LL-FC (PAUSE)
  - Much less than with no CM, but still not acceptable for the datacenter
- Near perfect throughput with LL-FC
  - Throughput matches b/w of the congested link
  - No impact on innocent flows
- Mechanism scales to the largest data center
  - Tops out when delay * bandwidth > buffering (see the estimate below)
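To make the scaling limit concrete, a back-of-envelope estimate with assumed (not simulated) numbers:

```python
# Illustrative arithmetic only; link rate and RTT are assumptions.

link_rate = 10e9    # bits/s
rtt = 50e-6         # edge-to-edge round trip in a datacenter (s)

bdp_bytes = link_rate * rtt / 8
print(f"delay*bandwidth product: {bdp_bytes / 1e3:.1f} kB")
# ~62.5 kB in flight per control loop: once this exceeds the switch
# buffering, the loop reacts too late to prevent loss, so CM tops out.
```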
Still under consideration (i): Per-priority PAUSE
- Simulations show benefits of link-level flow control
  - Modeled with IEEE 802.3x (PAUSE): only a single priority level
  - Real deployments would need priority-aware PAUSE
  - Non-CM traffic and control traffic not affected by PAUSE
- Further study required before a new project proposal
  - Prove that PAUSE is required; demonstrate its benefits
  - Prevent or mitigate congestion spreading and/or deadlock
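One way to picture priority-aware PAUSE semantics; this is a hypothetical sketch whose frame format and state model are my assumptions, not a standard's definition.

```python
# Hedged sketch: pausing one priority leaves the others (e.g. control
# and non-CM traffic) flowing. Not an actual standardized frame format.

N_PRIORITIES = 8
paused = [False] * N_PRIORITIES   # per-priority pause state at egress

def on_pause_frame(priority_mask: int, quanta: list[int]) -> None:
    """Assumed format: a bit mask selecting priorities plus a pause
    timer value per priority; zero quanta means resume."""
    for p in range(N_PRIORITIES):
        if priority_mask & (1 << p):
            paused[p] = quanta[p] > 0

def may_transmit(priority: int) -> bool:
    return not paused[priority]
```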
Still under consideration (ii): Priority scheduling
- Update to IEEE 802.1Q priority queuing
  - Currently only strict priorities explicitly defined
- Virtual Lanes model
  - Separated traffic classes aid I/O consolidation
  - WRR (or similar) queuing to divide the b/w
- Common management definition for draining
  - MIB model allows interoperability
  - Testable definitions for queuing algorithms
- May fit with Residential Ethernet project requirements
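As a sketch of the kind of testable draining definition this implies, a frame-count weighted round robin; the weights, class count, and queue model are assumptions for illustration.

```python
# Hedged WRR illustration; weights and frame model are assumed. Byte-based
# DRR would divide b/w more exactly when frame sizes vary.

from collections import deque

queues = {0: deque(), 1: deque(), 2: deque()}   # class -> FIFO of frames
weights = {0: 1, 1: 2, 2: 4}                    # relative b/w shares

def wrr_round():
    """One round: serve up to `weight` frames from each backlogged class,
    so bandwidth divides roughly in proportion to the weights."""
    served = []
    for cls, q in queues.items():
        for _ in range(weights[cls]):
            if not q:
                break
            served.append(q.popleft())
    return served
```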
Q and A