Scaling IP Multicast on Datacenter Topologies. Xiaozhou Li, Mike Freedman
IP Multicast Applications: publish-subscribe services, clustered application servers, distributed caching infrastructures
IP Multicast Applications: network virtualization overlays. Emerging standards (VXLAN, NVGRE) encapsulate L2 MAC frames into a UDP header, and virtualize broadcast within each virtual network using IP multicast in the physical network.
Problems with IP Multicast. Reliability: NAK / gossip / error correction. Stability: rate limiting / multicast congestion control. Scalability: number of supported groups.
Why is IP multicast hard to scale? Control plane and data plane.
Challenges with scaling the control plane (IGMP + PIM): switches maintain information about all groups and periodically send queries about group membership, incurring communication and memory complexity.
Challenges with scaling the data plane: switches must maintain per-group forwarding rules, and multicast addresses cannot be aggregated by IP prefix. Multicast forwarding table sizes are limited, O(100s to 1000s) of entries on commodity switches. Prior work scaled up the number of groups per switch via compression: FRM [SIGCOMM 06], LIPSIN [SIGCOMM 09], ESM [ToN 10]; or via multicast-to-unicast translation: Dr. Multicast [EuroSys 10]. Insufficient for large-scale datacenter networks.
Our Approach: leverage the unique topology of datacenter networks to scale out
Datacenter multi-rooted tree topologies: core, aggregation, and edge layers, with switches grouped into pods
The topology simplifies multicast tree construction
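Why the topology helps: in a multi-rooted tree, once a core switch is chosen for a group, the path from that core down to any edge switch is unique, so the multicast tree is simply the union of core-to-member paths. A minimal sketch (helper names and the pod/edge labeling are illustrative, not from the talk):

```python
# Sketch: multicast tree construction on a multi-rooted (fat-tree) topology.
# Once a core switch is fixed, the route from the core to each edge switch
# is unique, so the tree is just the union of core -> agg -> edge -> host
# paths for the group's members.

def build_multicast_tree(core, members_by_edge, agg_for_pod):
    """Return the set of links in the multicast tree rooted at `core`.

    members_by_edge: {(pod, edge_switch): [member hosts]}.
    agg_for_pod: maps (core, pod) -> the aggregation switch that connects
                 this core into that pod (unique in a fat tree).
    """
    links = set()
    for (pod, edge), hosts in members_by_edge.items():
        agg = agg_for_pod(core, pod)
        links.add((core, agg))       # core -> aggregation (one per member pod)
        links.add((agg, edge))       # aggregation -> edge
        for host in hosts:
            links.add((edge, host))  # edge -> member host
    return links
```

Because no path choices remain after picking the core, no spanning-tree protocol or per-tree computation beyond this union is needed.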
Our contributions for scalable DC multicast: 1. Partition and distribute the multicast address space, increasing the number of groups at the core and aggregation layers. 2. Enable local multicast address aggregation, increasing the number of groups in each pod. 3. Handle network failures with fast local rerouting, for quick response to topology changes.
Partitioning the multicast address space: address blocks (e.g. prefixes 00/, 01/, 10/, 11/) are distributed across the core, aggregation, and edge switches
Partitioning the multicast address space: there are many switches to distribute multicast addresses across at the core layer, but fewer switches to distribute them across in each pod
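The partitioning idea can be sketched as a simple prefix-block mapping: each core switch owns an equal-sized block of the multicast address space and only stores state for groups in its block. (The bit widths and block assignment below are illustrative assumptions, not the paper's exact scheme.)

```python
# Sketch of prefix-based partitioning of the multicast address space.
# With num_cores equal-sized blocks, the top bits of a group address
# select the owning core switch, so each core holds state for only
# 1/num_cores of the groups.

def owner_core(group_addr, num_cores, addr_bits=16):
    """Map a multicast group address to the core switch owning its prefix block."""
    block_size = (1 << addr_bits) // num_cores
    return group_addr // block_size

# Example: with 4 cores and a 4-bit address space, the blocks correspond
# to the prefixes 00/, 01/, 10/, 11/ shown on the slide.
```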
A pod's address capacity is the bottleneck
Reduce the number of entries in the bottleneck switch. Example: groups w, x, y, z with addresses 000, 001, 010, 011.
Upper layer (core switches C6, C10): 000 to A2; 001 to A2; 010 to A2; 011 to A2.
Bottleneck switch A2: 000 to E0, E2; 001 to E1, E2, E3; 010 to E1, E2, E3; 011 to E0, E2.
Lower layer: edge switches E0, E1, E2, E3.
Local address translation and aggregation. The upper layer (C6, C10) translates addresses as it forwards: 000 to A2 (rewrite to 100); 001 to A2 (rewrite to 110); 010 to A2 (rewrite to 111); 011 to A2 (rewrite to 101).
Bottleneck switch A2 then needs only two aggregated rules: 10* to E0, E2 (groups w, z); 11* to E1, E2, E3 (groups x, y).
Lower layer translates back: 100 to 000, 101 to 011.
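The translation step above can be sketched as: renumber groups that share the same downstream port set into one contiguous address block, so a single prefix rule covers the whole block. A minimal sketch mirroring the slide's example (the renumbering order and rule representation are simplified assumptions):

```python
# Sketch: local address translation to enable prefix aggregation at a
# bottleneck switch. Groups with identical output-port sets are renumbered
# into contiguous blocks; each block is then covered by one aggregated rule.

def aggregate(groups):
    """groups: {old_addr: frozenset(output_ports)}.

    Returns (translation, rules):
      translation: {old_addr: new_addr} applied by upstream switches,
      rules: {(base_addr, block_len): ports} aggregated table entries.
    """
    by_ports = {}
    for addr, ports in sorted(groups.items()):
        by_ports.setdefault(ports, []).append(addr)

    translation, rules = {}, {}
    next_addr = 0
    for ports, addrs in by_ports.items():
        base = next_addr
        for old in addrs:
            translation[old] = next_addr
            next_addr += 1
        rules[(base, len(addrs))] = ports  # one rule covers the whole block
    return translation, rules
```

With the slide's four groups (two distinct port sets), the four per-group entries at A2 collapse to two block rules, matching the 10*/11* aggregation in the example.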
Is group aggregation easy to compute? The optimization problem: given a fixed number of aggregated groups, minimize network overhead. This is the NP-hard channelization problem. Our approach: local aggregation at the bottlenecks, yielding independent local sub-problems whose computations can be distributed, reducing network and computational overhead.
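Since exact channelization is NP-hard, a practical local solver can be greedy. The heuristic below is an illustrative sketch (not the paper's algorithm): repeatedly merge the two rules whose output-port sets differ least, trading a little extra traffic on the unwanted ports for fewer table entries.

```python
# A greedy sketch of local group aggregation at a bottleneck switch.
# Merging two rules forces traffic of each group onto the union of both
# port sets; the symmetric difference counts the ports that receive
# unwanted traffic, i.e. the overhead of the merge.

def greedy_aggregate(port_sets, capacity):
    """port_sets: list of frozensets of output ports, one per rule.
    Merge lowest-overhead pairs until at most `capacity` rules remain."""
    rules = [set(s) for s in port_sets]
    while len(rules) > capacity:
        best = None
        for i in range(len(rules)):
            for j in range(i + 1, len(rules)):
                overhead = len(rules[i] ^ rules[j])  # ports gaining unwanted traffic
                if best is None or overhead < best[0]:
                    best = (overhead, i, j)
        _, i, j = best
        rules[i] |= rules[j]  # merged rule forwards to the union of ports
        del rules[j]
    return rules
```

Because each bottleneck switch aggregates only its own rules, these sub-problems stay small and can run in parallel across pods.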
Putting it all together...
Fault tolerance. Fast path: reroute traffic through other paths. Slow path: reconstruct the multicast tree. (See paper.)
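The fast path exploits the same topology property as tree construction: in a multi-rooted tree, any live upper-layer neighbor still reaches the rest of the network, so a switch below a failed link can deflect traffic locally while the slow path recomputes the tree. A minimal sketch under that assumption (function and naming are illustrative, not the paper's mechanism):

```python
# Sketch of the "fast path" reaction to a failed next hop: deflect through
# any other live upper-layer neighbor, which in a multi-rooted tree still
# has a path toward the tree's core. The slow path later reinstalls the
# recomputed optimal rules.

def fast_reroute(failed_next_hop, live_upper_neighbors):
    """Pick a local detour when the tree's next hop is unreachable."""
    candidates = [n for n in live_upper_neighbors if n != failed_next_hop]
    if not candidates:
        raise RuntimeError("no local detour; fall back to tree reconstruction")
    return candidates[0]
```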
Managing multicast using SDN. A multicast network operating system (serving apps #1, #2) does three things. Sense: VMs' group subscriptions and the network topology. Compute: local aggregation and multicast tree generation. Control: proactively install multicast forwarding rules into switches.
Evaluation. How well do our techniques help a datacenter: support a greater number of multicast groups? Handle common multicast group dynamics? Survive moderate network failures? (See paper.)
Simulation environment. 3-tiered fat tree with 48-port switches (27,648 end hosts). Multi-tenant environment: the number of VMs per tenant follows an exponential distribution. VM placement: either on hosts near one another, or distributed uniformly at random across the network. Multicast communication environment: most groups are small, and a few groups contain most servers; generated from a trace of IBM WebSphere Virtual Enterprise. Group sizes: mean = 51, min = 5, median = 11, max = 5000; group sizes uniform at random also evaluated (see paper).
Local aggregation reduces bottleneck limits. [Figure: number of multicast addresses on a switch at the core (C), aggregation (A), edge (E), and edge with local aggregation (E la) layers, for 100,000 groups (group size: mean = 51, max = 5000), with each tenant's VMs placed on nearby hosts.] Low traffic overhead (see paper).
Local aggregation increases the maximum number of groups in the datacenter network. [Figure: number of groups supported (group size: mean = 51, max = 5000) with no aggregation vs. local aggregation, at switch capacity = 1000, under nearby vs. random VM placement.] Most entries have one outport (60% in aggregation and 80% in edge switches).
SDN performance is sufficient for this dynamism. Performance of commodity network controllers and switches: controller, 1.6 million requests per second [HotICE 2012]; switch, 600-1000 updates per second [NEC Jan 2012]. Average switch update rates under group dynamics, assuming each pod has 1000 join/leave events per second (updates per switch per second, on average):

  VM placement | Edge | Agg  | Core
  nearby       | 42   | 31   | 0.83
  random       | 42   | 78   | 133
Conclusion. Goal: support a large number of IP multicast groups in datacenters, reducing the barrier to adopting IP multicast. Contributions: leveraged the multi-rooted topology to scale out by dividing the multicast address space across multiple switches; introduced local aggregation algorithms to overcome bottlenecks in pods; proposed mechanisms for fast failover and multicast tree management, practical with today's SDN.