D.R. Network Design: The Small College Version

Disaster Recovery
- Complex, far-reaching I.T. topic
- Our focus: improve network design to
  - Enhance the ability to recover following a disaster
  - Eliminate or limit the effects of some disasters
- Accomplish improvements
  - Without LOTS of money/people/rocket science
  - By making the most of our inherently distributed, but very useful, geography!

Common Roots
- Distributed geography
  - Expensive network build-out
  - A financial enemy
- A typical history
  - Star topology network
  - Collapsed core
- Evolutionary development
  - Limit large one-time expenses
  - Maintain simplicity

Potential Pitfalls
- Centralized resources
  - Single point of failure
  - Arduous recovery
- Risk = T x V x I
  - Threat
  - Vulnerability
  - Impact (time + money + reputation)
- Impact increasing dramatically
  - Greater complexity
  - More mission-critical services

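The risk formula above can be turned into a rough ranking aid. The sketch below is not from the presentation: the 1-to-5 scoring scale, the function name `risk_score`, and both example scenarios are illustrative assumptions, chosen only to show the arithmetic.

```python
def risk_score(threat, vulnerability, time_cost, money_cost, reputation_cost):
    """Risk = T x V x I, with Impact taken as time + money + reputation.

    All inputs are hypothetical scores on a 1-5 scale (an assumption,
    not something the slide specifies).
    """
    impact = time_cost + money_cost + reputation_cost
    return threat * vulnerability * impact


# Hypothetical scenarios, scored only to illustrate the calculation.
scenarios = {
    "single collapsed core fails": risk_score(2, 5, 5, 4, 4),
    "one of four core nodes fails": risk_score(2, 2, 2, 2, 1),
}

# Print the scenarios from highest to lowest score.
for name, score in sorted(scenarios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:4d}  {name}")
```

Because the three factors multiply, lowering vulnerability or impact (for example by removing a single point of failure) shrinks the score even when the threat itself is unchanged.
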
What did we do?
Over a period of three years:
- Built consensus around the importance of BC/DR, resulting in the development of a strategy
- Deployed a backup data center to house test/recovery services
- Distributed the network core to four buildings, using a partial mesh design with OSPF routing
- Replicated mission-critical data between the primary and backup data centers

Building consensus
- Involve senior management.
- Educate! Knowledge is power.
- Ask for help
  - You do not know all the answers!
  - Enables collaboration and encourages broader ownership

Backup Data Center
- Build-out location for disaster recovery
  - Enables faster operational setup
- Location
  - Collectively identify a site
  - Physically separate from primary data center
  - Well-connected
- Devise a fiscal plan
  - Ours was a two-year plan: total cost ~$50k
- Devise a construction plan
- Distributed geography is now your friend!

Containing Data Center Costs
- Use internal resources when possible
- Install used components when reasonable
  - Raised floor
  - Cabinets
  - UPS, but buy new batteries
- Start small, but with room to grow quickly!

Distributed Network Core
Advantages:
- More is better -- eliminate the SPoF!
- Increased resiliency
  - A layer 2 or 3 protocol can bypass failures
Disadvantages:
- Increased complexity
Geography is now your BEST friend!

How to distribute a network core
- Identify 3+ locations
  - Should facilitate aggregation of fiber links.
  - Available pathways to interconnect all locations.
  - Adequate power and environmental controls.
- Choose a multi-path network design/protocol
  - Ring
  - Mesh
- Install links/equipment/protection protocol

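Before committing to a multi-path design, it can help to check on paper that a candidate topology really has no single point of failure. The following is a minimal sketch, not a tool used in the presentation; the node numbers, the `STAR` and `RING` link lists, and the function names are hypothetical.

```python
from collections import deque


def is_connected(nodes, links):
    """Return True if every node can reach every other node over the links."""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        # Links that touch a removed (failed) node are ignored.
        if a in nodes and b in nodes:
            neighbors[a].add(b)
            neighbors[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary node
        current = queue.popleft()
        for nxt in neighbors[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes


def survives_any_single_link_failure(nodes, links):
    """Remove each link in turn and check the rest stays connected."""
    return all(
        is_connected(nodes, links[:i] + links[i + 1:])
        for i in range(len(links))
    )


def survives_any_single_node_failure(nodes, links):
    """Remove each node in turn and check the survivors stay connected."""
    return all(
        is_connected([n for n in nodes if n != dead], links)
        for dead in nodes
    )


NODES = [1, 2, 3, 4]
STAR = [(1, 2), (1, 3), (1, 4)]          # collapsed core at node 1
RING = [(1, 2), (2, 3), (3, 4), (4, 1)]  # four-node ring

print("star survives node failure:", survives_any_single_node_failure(NODES, STAR))
print("ring survives node failure:", survives_any_single_node_failure(NODES, RING))
print("ring survives link failure:", survives_any_single_link_failure(NODES, RING))
```

The check only covers reachability. Whether traffic actually reroutes still depends on the protection protocol chosen in the last step.
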
Ring or mesh?
- Ring
  - Fewer interconnections / lower cost
  - Possibly increased hop count (L2/3)
- Mesh
  - Full mesh = direct link between any two nodes
  - Full mesh requires r!/(n!(r-n)!) links, where r = number of locations and n = 2; this simplifies to r(r-1)/2
  - Higher cost

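The link-count trade-off is easy to tabulate. This sketch uses the binomial coefficient from the slide for the full mesh; the function names are made up for illustration, and the ring figures assume a simple single ring of at least three locations.

```python
from math import comb


def full_mesh_links(r):
    """Links for a full mesh of r locations: r! / (2! (r-2)!) = r(r-1)/2."""
    return comb(r, 2)


def ring_links(r):
    """Links for a single ring of r locations (assumes r >= 3)."""
    if r < 3:
        raise ValueError("a ring needs at least three locations")
    return r


def ring_worst_case_hops(r):
    """Longest shortest path between two nodes on an intact ring."""
    return r // 2


for r in range(3, 9):
    print(f"{r} locations: ring {ring_links(r)} links "
          f"(up to {ring_worst_case_hops(r)} hops), "
          f"full mesh {full_mesh_links(r)} links (1 hop)")
```

With three locations the two designs are the same triangle. The costs diverge from four locations onward, which is where a partial mesh becomes a reasonable middle ground.
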
L2 or L3 Protection Scheme?
Layer 2
- Resilient Packet Ring (RPR), IEEE 802.17
  - SONET-like protection/recovery times
  - Designed for metro networks
  - Expensive; YMMV
- Proprietary Ethernet ring protection schemes
  - E.g. Ethernet Automatic Protection Switching, RFC 3619
  - SONET-like protection/recovery times
  - No standardized implementations with interoperability tests; YMMV
- HSRP, Spanning Tree Protocol

L2 or L3 Protection Scheme?
Layer 3
- Routing protocols
  - Well understood
  - Reliable and mature implementations
  - Converge in seconds
    - Faster is not always better
  - Interoperability tested for many kinds of equipment(*)
- Requires IP address space provisioned by location rather than function!

We chose OSPF!
- Standard, non-proprietary
- Converges more slowly than the fastest L2 options!
  - May be less prone to flapping when faced with transient events.
- Allows injection of addresses into the L3 cloud
  - Default gateway
  - IP AnyCast addresses for query/response protocols, e.g. DNS, RADIUS

Gotcha! Re-numbering was required
- Our IP address spaces were coalesced by function and not by geographic location.
- Addresses MUST be unique by geography, but CAN be coalesced by:
  - Geography first, then by function (small routing table)
  - Function first, then by geography (shorter ACLs)
- Done over a weekend after weeks of planning
  - Perl-generated automated scripts for network gear and services (DNS, DHCP)
  - DHCP helped for many devices
  - SneakerNet for other devices, plus testing

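The core of any scripted re-numbering is a deterministic old-to-new address mapping that can be applied to DNS, DHCP, and device configurations alike. The sketch below is not the authors' Perl; it is a hypothetical Python illustration in which the `renumber` function and both example prefixes are assumptions, and it keeps each host's offset within its subnet.

```python
import ipaddress


def renumber(address, old_prefix, new_prefix):
    """Map an address from old_prefix to the same host offset in new_prefix."""
    old_net = ipaddress.ip_network(old_prefix)
    new_net = ipaddress.ip_network(new_prefix)
    addr = ipaddress.ip_address(address)
    if old_net.prefixlen != new_net.prefixlen:
        raise ValueError("old and new prefixes must be the same size")
    if addr not in old_net:
        raise ValueError(f"{addr} is not inside {old_net}")
    # Keep the host part, swap the network part.
    offset = int(addr) - int(old_net.network_address)
    return str(new_net.network_address + offset)


# Hypothetical example: a function-based subnet moves into a node's block.
print(renumber("192.168.10.37", "192.168.10.0/24", "10.0.64.0/24"))
```

Keeping the host offset stable makes spot checks after the cutover simpler, since the last octets of statically addressed devices do not change.
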
Address Space by Geography

Node  Prefix          Functions
1     10.0.0.0/18     Acad, Stu, Admin, Labs
2     10.0.64.0/18    Acad, Stu, Admin, Labs
3     10.0.128.0/18   Acad, Stu, Admin, Labs
4     10.0.192.0/18   Acad, Stu, Admin, Labs

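With a geography-first plan, the location of any host follows directly from its address. A minimal sketch using Python's standard `ipaddress` module; the enclosing 10.0.0.0/16 is inferred from the four /18 blocks listed here, and the names `NODE_BLOCKS` and `node_for` are hypothetical.

```python
import ipaddress

# Assumption: the four /18 blocks together make up 10.0.0.0/16.
CAMPUS = ipaddress.ip_network("10.0.0.0/16")

# Node number -> /18 block, in ascending address order (node 1 first).
NODE_BLOCKS = dict(enumerate(CAMPUS.subnets(new_prefix=18), start=1))


def node_for(address):
    """Return the core node whose block contains the address, or None."""
    addr = ipaddress.ip_address(address)
    for node, block in NODE_BLOCKS.items():
        if addr in block:
            return node
    return None


for node, block in NODE_BLOCKS.items():
    print(node, block)
print(node_for("10.0.70.9"))
```
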
Address Space by Function

Function  Node 1          Node 2          Node 3          Node 4
Acad      10.0.0.0/20     10.0.16.0/20    10.0.32.0/20    10.0.48.0/20
Stu       10.0.64.0/20    10.0.80.0/20    10.0.96.0/20    10.0.112.0/20
Admin     10.0.128.0/20   10.0.144.0/20   10.0.160.0/20   10.0.176.0/20
Labs      10.0.192.0/20   10.0.208.0/20   10.0.224.0/20   10.0.240.0/20

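The routing-table versus ACL trade-off between the two orderings can be checked with prefix aggregation. The sketch below rebuilds the function-first plan from this slide and compares it with a geography-first plan. That each node's /18 is split into one /20 per function, in the order Acad, Stu, Admin, Labs, is an assumption; the slides do not show that subdivision. The names `summarize`, `function_first`, and `geography_first` are likewise illustrative.

```python
import ipaddress

FUNCTIONS = ["Acad", "Stu", "Admin", "Labs"]
NODES = [1, 2, 3, 4]

# The sixteen /20 slots inside 10.0.0.0/16, in ascending order.
SLOTS = list(ipaddress.ip_network("10.0.0.0/16").subnets(new_prefix=20))

# Function first, then geography (matches the table on this slide).
function_first = {
    (f, n): SLOTS[4 * fi + ni]
    for fi, f in enumerate(FUNCTIONS)
    for ni, n in enumerate(NODES)
}

# Geography first, then function (the /20-per-function split is assumed).
geography_first = {
    (f, n): SLOTS[4 * ni + fi]
    for fi, f in enumerate(FUNCTIONS)
    for ni, n in enumerate(NODES)
}


def summarize(plan, function=None, node=None):
    """Aggregate the prefixes matching a function and/or node filter."""
    nets = [
        net for (f, n), net in plan.items()
        if (function is None or f == function) and (node is None or n == node)
    ]
    return [str(s) for s in ipaddress.collapse_addresses(nets)]


print("function-first, ACL for Acad:   ", summarize(function_first, function="Acad"))
print("function-first, routes, node 1: ", summarize(function_first, node=1))
print("geography-first, ACL for Acad:  ", summarize(geography_first, function="Acad"))
print("geography-first, routes, node 1:", summarize(geography_first, node=1))
```

Under these assumptions, function-first needs one ACL prefix per function but four routes per node, while geography-first needs one route per node but four ACL prefixes per function.
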
Why was re-numbering good?
- Eliminated inaccuracies in how network policy was implemented in real configurations.
  - Completely re-worked network security policies, e.g. firewall rules, core ACLs, etc.
  - Cleaned up DHCP and DNS
- Improved VLAN tag numbering scheme
  - Numbered each network core location.
  - Used these numbers as the prefix for VLANs at those nodes.
- Router/L3 switch interface address schemes

The results
- OSPF area 0 used for the core, with a single /24 divided into /29s per link
- Separate OSPF area for each core node
- Firewall runs OSPF and injects the default route
- Hosts use their upstream core node as default gateway

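Dividing one /24 into /29 link subnets leaves plenty of headroom for a four-node core. The sketch below shows the arithmetic with Python's `ipaddress` module. The block 10.255.0.0/24 is a made-up placeholder, and allocating a subnet to every node pair (a full mesh) is an assumption for illustration; the partial mesh described in the presentation would need fewer.

```python
import ipaddress
from itertools import combinations

CORE_BLOCK = ipaddress.ip_network("10.255.0.0/24")  # hypothetical area 0 block
CORE_NODES = [1, 2, 3, 4]

# Every /29 available inside the /24.
all_link_subnets = list(CORE_BLOCK.subnets(new_prefix=29))

# Hand out one /29 per node pair, in ascending order.
links = dict(zip(combinations(CORE_NODES, 2), all_link_subnets))

print(f"{len(all_link_subnets)} /29 subnets available, {len(links)} used")
for (a, b), net in links.items():
    hosts = list(net.hosts())
    print(f"node {a} <-> node {b}: {net} ({len(hosts)} usable addresses)")
```

A /29 offers six usable addresses per link, more than the two a point-to-point link strictly needs, which leaves room for extra interfaces on the same segment.
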
Benefits
- Increased flexibility
  - Route around link failures
  - Easily reconfigured in disaster recovery mode
  - Shortened recovery time
- Increased reliability
  - Windows DCs, DNS, and some ERP supporting hardware distributed between data centers
  - IP AnyCast can provide diverse routes
  - Key data replicated from the iSCSI SAN to the backup data center

Next steps
- [Re-]terminate more fiber in new core sites
  - Better load-balancing of traffic
  - Reduced impact of failure at any single core node

Points to remember
- Improving disaster preparedness does not necessarily have to involve huge investments
- Evolutionary vs. revolutionary
- A staged approach is
  - More easily assimilated into I.T. culture
  - Fiscally achievable

Points to remember
- D.R. network design does not need to be complex
- Can be done without big investments in:
  - Infrastructure
  - Human resources
  - Professional development (no rocket science)
- Take advantage of your geography!!!

Thank you! Questions? Discussion?