RouteBricks: A Fast, Software-Based, Distributed IP Router



1 RouteBricks: A Fast, Software-Based, Distributed IP Router Brad Karp UCL Computer Science (with thanks to Katerina Argyraki of EPFL for slides) CS GZ03 / M th November, 2009

2 One-Day Room Change! On Monday the 23rd of November, we will meet in Chandler House G10 at the usual time! This room change will be one time only; we will revert to this room thereafter.

3 Distributed Systems Context Many models of consistency:» sequential consistency (Ivy)» close-to-open consistency (NFS)» eventual consistency/conflict resolution despite disconnection (Bayou)» atomic appends for multiple writers (GFS) Cluster as platform: Ivy, GFS Today: distributing an IP router over a cluster of PCs, for capacity and programmability» highly parallel workload: packets forwarded independently (apart from flow ordering constraint)» distributed within a single PC: multiple multi-core CPUs

4 Router = N packet processing + switching Fast = high R, N Programmable = any processing

5 Why programmable routers? New ISP services» intrusion detection, application acceleration Monitoring» measure link latency, track down traffic New protocols» IP traceback, Trajectory Sampling, ...

6 Today: fast OR programmable Fast hardware routers» processing by specialized hardware» not programmable» aggregate throughput: Tbps Programmable software routers» processing by general-purpose CPUs» aggregate throughput < 10 Gbps

7 RouteBricks A router out of off-the-shelf PCs» familiar programming environment» large-volume manufacturing (cheap, widely available, growing in CPU and I/O with PCs) Can we build a Tbps router out of PCs?

8 A hardware router N linecards Processing at rate ~R per linecard

9 A hardware router N linecards, switch fabric Processing at rate ~R per linecard Switching at rate NR by switch fabric

10 RouteBricks N servers, commodity interconnect Processing at rate ~R per server Switching at rate ~R per server

11 RouteBricks N servers, commodity interconnect Per-server processing rate: cR

12 Outline Interconnect Server optimizations Performance An application Conclusions

14 Requirements N servers, commodity interconnect Internal link rates < R Per-server processing rate: cR Per-server fanout: constant

15 Straw Man: Direct Full Mesh

16 Straw Man: Direct Full Mesh N external links of capacity R N² internal links of capacity R

17 Better: Valiant Load Balancing [Valiant, Brebner, 1981] (figure: N servers, internal links labeled R/N)

18 Valiant Load Balancing Per-server processing rate: 3R» processing overhead: 50% (2R without VLB) N² links of capacity 2R/N

19 VLB with Uniform Traffic (figure: internal links labeled R/N)

20 Direct Valiant LB: Optimize for Uniform Traffic Load (figure: internal links labeled R/N)

21 Direct Valiant LB If uniform traffic matrix: 2R

22 VLB Summary Worst-case per-server processing rate: 3R If uniform traffic matrix: 2R
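The provisioning figures quoted on the VLB slides can be tabulated with a few lines of Python. This is only a sketch: R is the external line rate per port, N the number of ports, links are counted as N² as on the slides, and the names `Budget`, `direct_full_mesh`, and `valiant_lb` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    internal_links: int        # counted as N^2, following the slides
    link_capacity_gbps: float  # capacity each internal link must provide
    server_rate_gbps: float    # processing rate each server must sustain


def direct_full_mesh(n: int, r_gbps: float) -> Budget:
    """Straw man: each internal link may have to carry the full rate R."""
    return Budget(n * n, r_gbps, 2 * r_gbps)


def valiant_lb(n: int, r_gbps: float, uniform: bool = False) -> Budget:
    """VLB: links of capacity 2R/N; servers process 3R (2R if traffic is uniform)."""
    factor = 2 if uniform else 3
    # Links stay provisioned at 2R/N; uniformity only lowers the processing load.
    return Budget(n * n, 2 * r_gbps / n, factor * r_gbps)


if __name__ == "__main__":
    for name, budget in [
        ("full mesh", direct_full_mesh(32, 10.0)),
        ("VLB worst case", valiant_lb(32, 10.0)),
        ("VLB uniform", valiant_lb(32, 10.0, uniform=True)),
    ]:
        print(name, budget)
```

The point of the comparison is that VLB trades a 50% increase in per-server processing for internal links that are N/2 times slower than in the straw-man mesh.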

23 Per-Server Fanout? Connectivity degree: N per server What if not enough network ports?

24 Solution #1: Increase Server Capacity e.g., double # external ports per server Doubles data rate on internal links Cuts fanout by half (figure labels: 2R/N)

25 Solution #1: Increase Server Capacity e.g., double # external ports per server Doubles data rate on internal links, processing rate per server Cuts fanout by half (figure labels: 2R, 2R/N)

26 Per-Server Fanout?

27 Solution #2: Add Intermediate Servers k-degree, n-stage butterfly (k-ary n-fly) Per-server fanout: k Stages: n = log_k N

28 The RouteBricks Interconnect: Combination Assign max external ports per server Full mesh if fanout allows Extra servers otherwise

29 Example Assuming current servers» 5 NICs, 2 x 10G ports or 8 x 1G ports» 1 external port per server N = 32 ports: full mesh» 32 servers N = 1024 ext. ports: 16-ary 4-fly» 3072 servers (2 extra servers per port)

30 Recap Valiant Load Balancing + full mesh or k-ary n-fly Per-server processing rate: 2R-3R

31 Outline Interconnect Server optimizations Performance An application Conclusions

32 First Try: a Shared-Bus Server (figure: ports, north bridge, memory, cores, Click; 64-byte packets)» Clovertown architecture» FSB: 2 x 1.33 GHz» CPUs: 2 x Xeon 2.4 GHz 4-core» NICs: 2 x Intel XFSR 2x10 Gbps

33 First Try: a Shared-Bus Server 64-byte packets: 820 Mbps

34 Problem #1: the Shared Bus FSB address bus saturated Multi-core alone is not enough

35 Solution: NUMA architecture (figure: ports, I/O hub, cores, two memory banks)» Nehalem architecture» QuickPath interconnect» CPUs: 2 x Xeon 2.4 GHz 4-core» NICs: 2 x Intel XFSR 2x10 Gbps

36 Solution: NUMA architecture 820 Mbps → 1.3 Gbps

37 Problem #2: Per-Packet Overhead Bookkeeping operations» moving packet descriptors between NIC and memory» updating descriptor rings

38 Solution: Batching Poll-driven batching» poll multiple packets at a time» reduces updates on descriptor rings» Click already supported it NIC-driven batching» relay multiple packet descriptors at a time» reduces transactions on PCIe and I/O buses» changed NIC driver
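A toy model shows why poll-driven batching cuts bookkeeping: if the descriptor ring is updated once per poll rather than once per packet, the number of updates falls by the batch size. The sketch below is not the RouteBricks driver; the `deque` stands in for a receive ring, and `poll_batch` and `drain` are hypothetical names.

```python
from collections import deque


def poll_batch(rx_ring: deque, batch_size: int) -> list:
    """Take up to batch_size packet descriptors off the ring in one poll."""
    batch = []
    while rx_ring and len(batch) < batch_size:
        batch.append(rx_ring.popleft())
    return batch


def drain(rx_ring: deque, batch_size: int) -> tuple:
    """Empty the ring and count ring updates (assumed: one update per poll)."""
    delivered = []
    ring_updates = 0
    while rx_ring:
        delivered.extend(poll_batch(rx_ring, batch_size))
        ring_updates += 1
    return delivered, ring_updates


if __name__ == "__main__":
    for size in (1, 16):
        ring = deque(range(64))
        packets, updates = drain(ring, size)
        print(f"batch={size}: {len(packets)} packets, {updates} ring updates")
```

NIC-driven batching applies the same amortization one level down, to the transactions that move descriptors across the PCIe and I/O buses.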

39 Solution: Batching 820 Mbps → 1.3 Gbps → 3 Gbps

40 Problem #3: Queue Access (figure: ports, I/O hub, cores, memory)

41 Problem #3: Queue Access (figure: ports and cores)

42 Problem #3: Queue Access Rule 1: one core per port

43 Problem #3: Queue Access Rule 1: one core per port Rule 2: one core per packet (figure labels: 0.6 Gbps, 1.2 Gbps, 1.7 Gbps)

44 Problem #3: Queue Access Can we always enforce both rules? What about when receiving on a single port? Rule 1: one core per port Rule 2: one core per packet (figure labels: 0.6 Gbps, 1.7 Gbps)

45 Problem #3: Queue Access Rule 1: one core per port Rule 2: one core per packet

46 Problem #3: Queue Access Rule 1: one core per port Rule 2: one core per packet

47 Solution: Multi-Queue NICs Rule 1: one core per port Rule 2: one core per packet
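With multi-queue NICs, each port exposes several receive queues, so every core can own a queue on every port and handle each packet from start to finish without locking. The Python sketch below shows one way to picture the assignment. It is an illustration only: CRC32 stands in for whatever hash the NIC hardware applies, and `rx_queue_for` and `owner_core` are hypothetical names.

```python
import zlib


def rx_queue_for(flow: tuple, num_queues: int) -> int:
    """Map a flow identifier (e.g. a 5-tuple) to a receive queue index."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_queues


def owner_core(queue_index: int) -> int:
    """Assumed allocation: queue i on every port is polled only by core i."""
    return queue_index


if __name__ == "__main__":
    num_cores = 8  # one receive queue per core on each port
    flows = [
        ("10.0.0.1", "10.0.0.2", 6, 1234, 80),
        ("10.0.0.3", "10.0.0.4", 17, 5353, 53),
    ]
    for flow in flows:
        queue = rx_queue_for(flow, num_cores)
        print(flow, "-> queue", queue, "-> core", owner_core(queue))
```

Because a queue has exactly one owner, no two cores contend for it, and because the owner carries each packet through to transmission, both rules hold even when all traffic arrives on a single port.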

48 Single-Server Performance 820 Mbps → 1.3 Gbps → 3 Gbps → 9.7 Gbps

49 Recap State-of-the-art hardware» NUMA architecture» multi-queue NICs Wrote NIC driver» batching» lock-free queue access Careful queue-to-core allocation» one core per queue» one core per packet

50 Outline Interconnect Server optimizations Performance An application Conclusions

51 Single-Server Performance (figure: throughput in Gbps for forwarding, IP routing, and IPsec; real packet trace vs. small packets)

52 Feasible Router Line Rate (figure: throughput in Gbps for forwarding, IP routing, and IPsec; real packet trace vs. small packets) Per-server processing rate: 2R-3R Real packet mix: R = 8-12 Gbps Small packets: R = 2-3 Gbps
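The feasible line rate follows from inverting the VLB requirement: if each server must process between 2R and 3R, then a server that sustains throughput T supports a line rate R between T/3 and T/2. The sketch below does that arithmetic. The 24 Gbps input is an assumed round number chosen for illustration, not a measurement from the slides, and `feasible_line_rate` is a hypothetical name.

```python
def feasible_line_rate(server_throughput_gbps: float) -> tuple:
    """Return (worst-case R, uniform-traffic R) for a given server throughput.

    VLB needs 3R of processing per server in the worst case and 2R when the
    traffic matrix is uniform, so R lies between T/3 and T/2.
    """
    return server_throughput_gbps / 3, server_throughput_gbps / 2


if __name__ == "__main__":
    # Assumed illustrative throughput; substitute a measured value.
    low, high = feasible_line_rate(24.0)
    print(f"R between {low:.1f} and {high:.1f} Gbps")
```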

53 Bottlenecks Small packets: CPU» buses far from saturation Real packet-size mix: none» limited PCIe lanes» only for the prototype Expected evolution?» next Nehalem: 4 x 8 = 32 cores» 4-8 PCIe 2.0 slots

54 Projected Single-Server Performance (figure: projected throughput in Gbps for forwarding and IP routing; real packet trace vs. small packets)

55 Projected Router Line Rate (figure: projected throughput in Gbps for forwarding and IP routing; real packet trace vs. small packets) Real packet mix: R = ... Gbps Small packets: R = ... Gbps

56 RB4 Prototype N = 4 external ports» 1 server per port» full mesh Real packet mix: 4 x 8.75 = 35 Gbps» expected R = 8-12 Gbps Small packets: 4 x 3 = 12 Gbps» expected R = 2-3 Gbps

57 What About Packet Order? TCP cuts sending rate by ½ if packets are ever reordered by more than 3 positions But VLB sprays packets randomly across intermediate nodes! RouteBricks partial solution:» Assign packets on same flow to same receive queue» During any 100 ms interval, VLB forwards packets from same flow to same next hop 0.15% of packets reordered with this mechanism; 5.5% without it
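One way to picture the reordering fix is a small table that remembers, per flow, which intermediate server was last chosen and when. The sketch below is an interpretation, not the RouteBricks code: it reuses the previous choice whenever the flow's last packet was seen less than 100 ms ago and otherwise picks a fresh random intermediate. The class `StickyVLB` and its method are hypothetical names.

```python
import random


class StickyVLB:
    """Pick a VLB intermediate per packet, keeping bursts of a flow together."""

    def __init__(self, num_servers: int, window_s: float = 0.100, rng=None):
        self.num_servers = num_servers
        self.window_s = window_s          # 100 ms, as on the slide
        self.rng = rng or random.Random()
        self._last = {}                   # flow -> (intermediate, last_seen)

    def pick_intermediate(self, flow, now: float) -> int:
        entry = self._last.get(flow)
        if entry is not None and now - entry[1] < self.window_s:
            hop = entry[0]                # recent packet: stay on the same path
        else:
            hop = self.rng.randrange(self.num_servers)  # idle flow: re-spray
        self._last[flow] = (hop, now)
        return hop


if __name__ == "__main__":
    vlb = StickyVLB(num_servers=8, rng=random.Random(0))
    flow = ("10.0.0.1", "10.0.0.2", 6, 1234, 80)
    print([vlb.pick_intermediate(flow, t) for t in (0.00, 0.05, 0.09, 0.50)])
```

Packets that arrive close together follow one path and so cannot overtake each other, while a flow that has gone idle can be moved to another intermediate, which keeps the load balanced over time.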

58 Latency One server: 24 microseconds Three servers: 66.4 microseconds

59 RouteBricks Summary RouteBricks: fast software router» Valiant Load-Balanced cluster of commodity servers Programmable with Click Performance:» Easily R = 1 Gbps, N = 100s» R = 10 Gbps with next-generation servers Programming model for more complex functionality?
