Infrastructure overview
Marlon Dutra, Production Engineer, Traffic
October 2013
Physical infra

Data centers

Edge locations

Prineville, OR
Organization
- Suites
- Clusters
- Services: front end, back end, etc.
Triplet racks

Thousands of them...
Clusters
- Just a big group of servers in a network topology
- No special software coordination
- We call logical clusters "tiers" (to avoid miscommunication)
Servers
- Very efficient servers, designed in house (opencompute.org)
- Vanity free: open cabinets, no paint, no fancy boxes, manuals, CDs, etc.
- 10G network card
- Few hardware variants: CPU, memory, storage, IOPS...
opencompute.org

Logical infra
Cloud management
- We don't use virtual machines
- We don't care about servers or OSes; we do care about services
- VMs are meant to share resources; we want the opposite of that
- Every 1-2% matters, a lot
Cloud management [2]
- Remote hardware control: console, restart, power on/off, etc.
- Same base OS everywhere
- Chef for host setup
- Automatic provisioning, via PXE
- We provision thousands of servers in a few hours. All plug and play.
Cloud management [3]
- We buy fully assembled triplet racks
- Connect the rack switch to cluster switches
- Connect main and backup power
- Walk away
- In 1-2 hours, we can SSH into the hosts
Service management
- Services packaged with all dependencies, so they can run anywhere
- Everything built to scale: services must run on multiple machines, data centers, etc.
- Binaries deployed with BitTorrent, so there are no bottlenecks in the distribution (see the sketch below)
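As a rough illustration of the BitTorrent-style pull (not our actual deployment tool), here is a minimal sketch using the python-libtorrent bindings; the torrent file name and save path are placeholders. The point is that every host downloading a binary also seeds it, so distribution capacity grows with fleet size instead of bottlenecking on one origin:

```python
import time
import libtorrent as lt

# "service-a.torrent" and the save path are hypothetical placeholders.
ses = lt.session()
handle = ses.add_torrent({
    "ti": lt.torrent_info("service-a.torrent"),
    "save_path": "/packages/service-a",
})

# Download from whichever peers already have pieces; once complete,
# this host keeps seeding the binary to the rest of the fleet.
while handle.status().progress < 1.0:
    time.sleep(1)
print("binary fetched, now seeding to peers")
```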
Service management [2]
- Services run with LXC (Linux containers)
- chroot for filesystem isolation
- Process namespace isolation
- Routing isolation
- Similar to FreeBSD jails
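A minimal sketch of launching one service instance this way, assuming the stock LXC tooling; the container name and binary path are made up:

```python
import subprocess

# lxc-execute starts a single command inside a fresh container,
# giving it the chroot'd filesystem, process-namespace, and network
# isolation described above. "service-a" and its path are hypothetical.
subprocess.run(
    ["lxc-execute", "-n", "service-a",
     "--", "/packages/service-a/bin/service-a"],
    check=True,
)
```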
Shared pool of servers
Utilization example:
- 250 instances of service A (not shared; multiple racks and clusters)
- 100 instances of service B (can be shared; needs 1 CPU, 4 GB memory)
- 700 instances of service C (can be shared; needs 2 CPU, 16 GB memory)
The automatic scheduler takes care of the allocation. Not everything can use a shared pool, of course (e.g. databases).
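To make the allocation problem concrete, here is a toy first-fit packer for the shareable services B and C above. It is only a sketch: the 16-core / 144 GB host shape and the pool size are made up, and the real scheduler also handles rack/cluster spread, non-shareable services, and failures:

```python
def schedule(instances, hosts):
    """instances: list of (name, cpu, mem_gb); hosts: list of [cpu, mem_gb] free capacity."""
    placement = {}
    for name, cpu, mem in instances:
        for i, free in enumerate(hosts):
            if free[0] >= cpu and free[1] >= mem:
                free[0] -= cpu          # reserve the resources on that host
                free[1] -= mem
                placement[name] = i
                break
        else:
            raise RuntimeError(f"no host fits {name}")
    return placement

# 100 instances of B (1 CPU, 4 GB) and 700 of C (2 CPU, 16 GB), as above.
demand = [(f"B{i}", 1, 4) for i in range(100)] + \
         [(f"C{i}", 2, 16) for i in range(700)]
pool = [[16, 144] for _ in range(150)]   # hypothetical host shape and count
print(len(schedule(demand, pool)))       # 800 instances placed
```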
Service management [3]
- A broken server is not a big deal: the scheduler moves the services somewhere else
- Auto remediation system for common issues
- Canary ability for services and configs
Inter-service communication
- Apache Thrift: http://thrift.apache.org/
- Tip: always avoid XML
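A minimal Thrift client sketch in Python. The transport and protocol classes are the standard Thrift library; the `service_a` module, its `ServiceA` service, and the `ping()` method are hypothetical stand-ins for code generated from a .thrift IDL file (`thrift --gen py service_a.thrift`):

```python
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from service_a import ServiceA  # hypothetical generated module

# Binary protocol over a buffered socket: compact, typed RPC
# instead of XML over the wire.
transport = TTransport.TBufferedTransport(
    TSocket.TSocket("service-a.example", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ServiceA.Client(protocol)

transport.open()
print(client.ping())  # hypothetical method defined in the IDL
transport.close()
```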
Storage management
- Large objects (photos, videos...): BLOB store, computing nodes with lots of disks
- Small objects (text, numbers...): databases (MySQL, HBase, Hive, etc.)
- Huge cache infra between apps and dbs (sketched below)
- All highly distributed and replicated
- Tip: never use disk arrays for big loads
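The cache tier works read-through: the databases only ever see misses. A minimal sketch of the idea, where a dict stands in for the cache cluster and `fetch_from_db` is a placeholder for a real MySQL/HBase query:

```python
cache = {}

def fetch_from_db(key):
    # Placeholder for a real database query.
    return f"value-for-{key}"

def get(key):
    if key not in cache:              # miss: go to the database once...
        cache[key] = fetch_from_db(key)
    return cache[key]                 # ...then serve every repeat from cache

get("user:42")   # miss, hits the database
get("user:42")   # hit, served entirely from cache
```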
Network management
- L3 everywhere; each rack has a /24 (IPv4) and a /64 (IPv6)
- Rack switches talk BGP-ECMP to CSWs; CSWs talk BGP-ECMP to big routers...
- All the routing is BGP based
- 10G fiber links to each server
- Most services are behind load balancers
- Tip: say goodbye to L2/VLANs
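The /24-per-rack scheme means rack subnets can be carved mechanically out of a larger block. A quick illustration with Python's standard ipaddress module; the 10.0.0.0/16 parent block is a made-up example, the /24-per-rack sizing is from the slide:

```python
import ipaddress

block = ipaddress.ip_network("10.0.0.0/16")
racks = list(block.subnets(new_prefix=24))  # 256 rack-sized /24s
print(racks[0], racks[1])                   # 10.0.0.0/24 10.0.1.0/24
```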
Traffic
Weekly cycle
[chart: egress and ingress traffic over 7 days, Monday through Sunday]
Daily cycle (global)
[chart: 24 hours of traffic, peaking between 11 AM and 3 PM Pacific time (UTC-8)]
Daily cycle (global), mapped
Daily cycle (Brazil)
[chart: 24 hours of traffic, peaking between 1 PM and 10 PM Brasilia time (UTC-3)]
Some numbers
- Peak HTTP/SPDY rps: ~12.5M
- Peak TCP conns: ~260M
- MAU global: 1.15 billion
- MAU Brazil: 73 million (March 2013)
Cluster network/LB topology
[diagram: Internet -> datacenter routers (DR) -> cluster switches (CSW) -> rack switches (RSW) -> L4LBs -> L7LBs -> web servers; BGP/ECMP announces IPv4 /32s and IPv6 /64s; L4LBs spread requests across the web tier with DSR/WRR]
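The WRR half of that DSR/WRR arrow is just weighted round robin. A toy sketch in Python; the backend names and weights are hypothetical, and real L4LBs additionally do direct server return (responses bypass the balancer on the way out):

```python
import itertools
import random

backends = {"l7lb-1": 3, "l7lb-2": 2, "l7lb-3": 1}  # weight = capacity share

def wrr_cycle(weighted):
    # Expand each backend by its weight, shuffle to avoid bursts,
    # then cycle forever.
    expanded = [b for b, w in weighted.items() for _ in range(w)]
    random.shuffle(expanded)
    return itertools.cycle(expanded)

picker = wrr_cycle(backends)
print([next(picker) for _ in range(6)])  # each backend appears per its weight
```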
Proportion
- Singles to tens (L4LBs) -> tens to hundreds (L7LBs) -> thousands (web servers)
- Each step up is x10 or more
Porto Alegre <-> Forest City, NC: 75 ms one-way
- -> SYN, <- SYN+ACK: TCP conn established at 150 ms; -> ACK
- -> ClientHello, <- ServerHello, -> ChangeCipherSpec, <- ChangeCipherSpec: SSL session established at 450 ms
- -> GET, <- HTTP 1.1 200: response received at 600 ms
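Those milestones fall straight out of the 75 ms one-way delay: TCP setup costs 1 round trip, the full SSL handshake shown here costs 2 more, and the HTTP request/response costs 1 more. A worked version of the arithmetic:

```python
one_way = 75             # ms, Porto Alegre <-> Forest City
rtt = 2 * one_way        # 150 ms per round trip

tcp_done = 1 * rtt               # SYN / SYN+ACK            -> 150 ms
ssl_done = tcp_done + 2 * rtt    # hellos + cipher specs    -> 450 ms
http_done = ssl_done + 1 * rtt   # GET / HTTP 1.1 200       -> 600 ms
print(tcp_done, ssl_done, http_done)   # 150 450 600
```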
Edge rack
- 1x L4LB
- 2x L7LB
- 20x PHP
POA -> GRU -> Forest City, NC: 15 ms to the edge, 60 ms edge to origin
- Sessions established at 90 ms (vs 450 ms)
- -> GET (forwarded edge to origin, request received), <- HTTP 1.1 200: response received at 240 ms
POA -> GRU -> Forest City, NC: before and after
- TCP connect: 150 ms -> 30 ms
- SSL session: 450 ms -> 90 ms
- HTTP response: 600 ms -> 240 ms
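Same arithmetic as before, but with the GRU edge POP terminating TCP and SSL: the handshakes ride the short 15 ms leg, and only the request itself crosses the 60 ms leg (assuming a warm, persistent edge-to-origin connection):

```python
edge_rtt = 2 * 15        # client <-> GRU edge, ms
origin_rtt = 2 * 60      # edge <-> Forest City origin, ms

sessions = 3 * edge_rtt                       # TCP + SSL ->  90 ms (vs 450)
response = sessions + edge_rtt + origin_rtt   # GET / 200 -> 240 ms (vs 600)
print(sessions, response)                     # 90 240
```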
Intl RTT, before and after
[charts: international RTT before and after edge termination]
Conclusion
Tips
- Never have single points of failure
- Don't protect only against equipment failure; human failures are the worst ones
- Make data-driven decisions: invest in analytics and instrumentation
- More data, better decisions. Don't fly blind.
Tips [2]
- There's no right or wrong here
- This is just the way we solve our problem today
- This will probably be different next year or so, maybe tomorrow
- Your problem might need a different solution
You can push the buttons too
http://www.facebook.com/careers
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.