Failover and Global Server Load Balancing for Better Network Availability. Jeremy Hitchcock, CEO, Dynamic Network Services
Overview. Problem space: Keeping services up; About Failover and GSLB; Case Study: Roll your own CDN in...quick; Case Study: Speed and Stability; Case Study: DR You can Sleep On; General lessons for network availability
You are probably: Software service provider; Completely online; Uptime and revenue directly related; Audience is international (non-geographical); So is everyone (lot more of us)!
Mean Time Between Failures (MTBF) (Local)
Fiber Cuts (Network/global)
Failures Are a Way of Life. Affects bottom line; Gets people paged; Brands lose value
A Better Way? Current tools: in-house scripts, appliances, CDN networks; Either high opex or capex; New options in infrastructure; Example: 5-10 person [bootstrapped] companies rolling self-healing, auto-provisioning networks
Optimizing The Wrong Part. Hardware redundancy is expensive; Single points of failure are bad; Infrastructure is not a core function; Things break, everything auto; Easier (cheaper) than you think
Realizations. Things break, route around outages; Infrastructure providers aplenty today; Users more sensitive to outages; Internet users are around the world; Speed of light is still c; RTT of 100 ms with 50 objects adds up; Traffic management is critical
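To make the "adds up" point concrete, here is a rough back-of-the-envelope sketch. It reads the slide's figure as a 100 ms round-trip time and assumes one round trip per object; the function name, the six-connection figure, and the 20 ms comparison value are illustrative assumptions, not numbers from the talk.

```python
import math


def network_wait_ms(rtt_ms: float, objects: int, parallel_connections: int = 1) -> float:
    """Rough time spent waiting on round trips alone.

    Assumes one round trip per object and ignores bandwidth, DNS lookups,
    connection setup, and server time, so treat it as an estimate only.
    """
    rounds = math.ceil(objects / parallel_connections)
    return rounds * rtt_ms


if __name__ == "__main__":
    # 100 ms is the slide's RTT; 20 ms stands in for a hypothetical nearby POP.
    for rtt in (100, 20):
        serial = network_wait_ms(rtt, 50)
        parallel = network_wait_ms(rtt, 50, parallel_connections=6)
        print(f"RTT {rtt} ms: {serial} ms serial, {parallel} ms over 6 connections")
```

Under these assumptions, 50 objects fetched one at a time cost about 5 seconds of pure waiting at 100 ms RTT, which is why moving content closer to users matters.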
Different Architectures, Different Results. Old: Use hardware redundancy, local; Super-site build out; Page on failure, fix based on page; Planned deployments; Single master datacenter; DR is a passive, manual failover. New: Use software redundancy; Regionalize, all over-provisioned; Email report in morning; Automatic load handling; Many POPs, all closer to users; DR and failover blended together
New Tools (new to some). Automatic failover; Global server load balancing; CDN balancing/managing; Opex relative to actual usage; Avoid capex step functions
Failover. Two active components, traffic switch; Implies external monitoring; Standard operation; Hide outages; On Failover
Failover Use Cases. Two servers for www.domain.com; On failure, redirect from one to the other; Works via DNS; Redirect to a static page. Requirements: External monitoring point; External DNS; Low DNS caching TTL values
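The failover decision described on this slide can be sketched as a small function that an external monitor might call before updating the DNS record. This is a minimal illustration, not any provider's implementation: the `Endpoint` class, the function name, and the addresses (taken from the reserved documentation ranges) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str          # IP address the DNS A record would point at
    healthy: bool = True  # result of the latest external health check


def failover_answer(primary: Endpoint, backup: Endpoint, static_page: str) -> str:
    """Pick the address to publish for the hostname.

    Prefer the primary, fall back to the backup, and as a last resort
    point at a static "we'll be right back" page.
    """
    if primary.healthy:
        return primary.address
    if backup.healthy:
        return backup.address
    return static_page


if __name__ == "__main__":
    primary = Endpoint("192.0.2.10")
    backup = Endpoint("198.51.100.10")
    static_page = "203.0.113.10"  # hypothetical static-page host

    print(failover_answer(primary, backup, static_page))
    primary.healthy = False       # the monitor sees the primary fail
    print(failover_answer(primary, backup, static_page))
```

The health flags have to come from a monitoring point outside the facility being watched, which is why the slide lists external monitoring and external DNS as requirements.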
Global Server Load Balancing (GSLB). More than two active components; Traffic management; Targeting (geo, network); Weighting (percent); Failover plus optimize RTT; Hostname to A record mapping
Global Server Load Balancing Use Cases. Regionalize eyeballs/end-users; Internet outages/subpar speeds avoided; Weight based on load, percentages. Requirements: Same as failover; Bit of math/algorithms to balance traffic; Many to many mappings
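The "bit of math" can be quite small. The sketch below combines the three ideas from the GSLB slides (targeting by region, weighting by percent, and failover) in one selection function. The `Site` class, region labels, weights, and addresses are assumptions made up for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    address: str
    region: str
    weight: int           # relative share of traffic within its pool
    healthy: bool = True


def pick_site(sites: List[Site], client_region: str,
              rng: Optional[random.Random] = None) -> Optional[Site]:
    """Choose the site whose address should answer this client's query."""
    rng = rng or random.Random()
    healthy = [s for s in sites if s.healthy and s.weight > 0]
    if not healthy:
        return None  # nothing left to send traffic to
    # Targeting: prefer sites in the client's region; otherwise fail over
    # to any healthy site anywhere.
    local = [s for s in healthy if s.region == client_region]
    candidates = local or healthy
    # Weighting: weighted random choice among the candidates.
    return rng.choices(candidates, weights=[s.weight for s in candidates], k=1)[0]


if __name__ == "__main__":
    sites = [
        Site("192.0.2.10", "us", weight=3),
        Site("192.0.2.20", "us", weight=1),
        Site("198.51.100.10", "eu", weight=1),
    ]
    print(pick_site(sites, "eu").address)
    print(pick_site(sites, "us").address)
```

A real service would resolve the client's region from the resolver's address or network; here it is simply passed in.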
CDN Management. Two complete systems; Balance between CDNs; Bandwidth commits; Regional advantages; Works on CNAMEs
CDN Manager. Try out a mix of networks; CDNs, infrastructure providers; Better manage traffic; Cost/performance reasons. Requirements: Same as GSLB but with DNS alias CNAMEs
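One way to picture balancing between CDNs with CNAMEs is a function that maps each client to a CNAME target according to a percentage split. This is only a sketch under stated assumptions: the hash-based bucketing is a technique chosen here to keep a client on the same CDN between queries (the talk does not say how the split is made), and the hostnames are hypothetical.

```python
import hashlib
from typing import Dict


def cname_for(client_id: str, split: Dict[str, int]) -> str:
    """Return the CNAME target for a client, given target -> share weights.

    The client identifier (for example a resolver address) is hashed into a
    bucket so the same client consistently lands on the same CDN.
    """
    total = sum(split.values())
    if total <= 0:
        raise ValueError("split must contain at least one positive share")
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    for target, share in split.items():
        if bucket < share:
            return target
        bucket -= share
    raise AssertionError("unreachable: bucket is always below the total")


if __name__ == "__main__":
    # Hypothetical split: 70% of clients to one CDN, 30% to another.
    split = {"customer.cdn-a.example.": 70, "customer.cdn-b.example.": 30}
    for client in ("192.0.2.53", "198.51.100.53", "203.0.113.53"):
        print(client, "->", cname_for(client, split))
```

Shifting the numbers in `split` moves traffic between providers without touching the application, which is what makes bandwidth commits and regional advantages manageable.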
Traffic Cop: DNS. Internet doesn't care about domain.com; twitter.com 128.121.146.228; Lot of tricks you can do here
Lenses and Options. Evaluation Criteria: Soft/hard costs, capital/operating costs; Outcome based; Determine your metrics, test those. Potential Outcomes: Roll it in house; CDN Network; Hardware appliances; SaaS-based
Which one is better? Roll it in house: Mid-high capex, higher than you think opex; Lots of soft costs, application specific though. CDN Network: Little capex, high opex; Some have more knobs than others. Hardware appliances: High capex, low opex; Need to make full investment into architecture. SaaS-based: Little capex, low-mid opex; Let others worry about this for you
Case Study 1: Roll your own CDN in...quick. Wikia and regionalizing CDNs for better delivery
CDN Choice and Transparency. Lots of CDNs; Two great public ones; 30 (more?) private providers; Telco/ISP options; Currently give customer hostname (customer.cdn.com); Only test with live traffic
CDN Manager: Enabling Testing. Segment traffic and test; Try 2 or 10 CDNs; Low risk method to collect data; Data collection has to be from end points; Your office computer is not the Internet; Can better rate cost/performance
CDN Manager: Wikia. Wikia runs several niche wikis (audience); Optimize traffic delivery for those niches; Wanted to determine the best CDN based on actual data
CDN Manager: Wikia. In America, use CDN; In Europe, use their own; Why? Who knows, but it's the best for their traffic
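Choosing a delivery option per region from measured data can be sketched as a simple aggregation. Everything below is illustrative: the sample numbers are invented for the example and are not Wikia's measurements, and using the median latency as the score is an assumption rather than the method described in the talk.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, str, float]  # (region, provider, latency in ms)


def best_provider_by_region(samples: Iterable[Sample]) -> Dict[str, str]:
    """For each region, return the provider with the lowest median latency."""
    grouped = defaultdict(list)
    for region, provider, latency_ms in samples:
        grouped[(region, provider)].append(latency_ms)

    best = {}  # region -> (provider, median latency)
    for (region, provider), values in grouped.items():
        score = median(values)
        if region not in best or score < best[region][1]:
            best[region] = (provider, score)
    return {region: provider for region, (provider, _) in best.items()}


if __name__ == "__main__":
    # Made-up end-user measurements, purely to exercise the function.
    samples = [
        ("america", "cdn", 40.0), ("america", "cdn", 55.0),
        ("america", "own", 90.0), ("america", "own", 85.0),
        ("europe", "cdn", 120.0), ("europe", "cdn", 110.0),
        ("europe", "own", 60.0), ("europe", "own", 70.0),
    ]
    print(best_provider_by_region(samples))
```

The important part is where the samples come from: they need to be collected at end users' machines, not from the office network.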
Discussion. Not all CDNs are the same; Multiple relationships to manage; Cost control/performance of CDNs; Audience and economies drive decisions
Case Study 2: Speed and Stability. Twitter and keeping up
Speed and Stability. All Internet sites have DNS; Range from good, bad, ugly; Online services must be fast and accurate; Latency and uptime are what matters; Things fail all the time, send users to what works
Speed and Stability: Twitter. Spiky and growing traffic (like a lot); Things change too fast to keep up; Load balance a lot; Easier to scale core competencies; One less thing to worry about
Speed and Stability: Twitter. DNS part of system to make site work; Desire not to be an expert in it; Huge, widespread audience; Online-only service
Discussion. When infrastructure changes rapidly, external monitoring good; Failover message is better than timeouts; Keep traffic regionalized through targeting; Outsource non-core competencies; Latency affects page views or ad revenue
Case Study 3: Disaster Recovery You Can Sleep With. 37 Signals and doing what needs to get done
Disaster Recovery Implementation. Requirements: One good facility (A); One backup facility (B); Ability to recognize facility A is out; Ability to direct traffic from A to B
Authorize.net Interlude. DR implementation timeline: Late-July: move to new DR facility and plan; July 2: fire at Fisher Plaza (unplanned); July 3: Only missing a traffic engineering switch. TTLs (DNS record caching) a big difference; Still a problem today: secure.authorize.net. 86400 IN A 64.94.118.32; Full discussion: http://bit.ly/23mayf
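To see why the TTL makes "a big difference", the record quoted on this slide can be parsed and its caching window computed. This is a minimal sketch: the helper names are hypothetical, the parser only handles the simple five-field form shown on the slide, and the 60-second TTL and 30-second detection time in the comparison are assumed values.

```python
def parse_a_record(line: str) -> dict:
    """Parse a zone-file style line: name, TTL, class, type, address."""
    name, ttl, rclass, rtype, address = line.split()
    if rtype != "A":
        raise ValueError(f"expected an A record, got {rtype}")
    return {"name": name, "ttl": int(ttl), "class": rclass,
            "type": rtype, "address": address}


def max_stale_seconds(ttl_s: int, detection_s: int = 0) -> int:
    """Upper bound on how long a well-behaved resolver may keep sending
    users to the old address: time to notice the outage plus one full TTL."""
    return detection_s + ttl_s


if __name__ == "__main__":
    record = parse_a_record("secure.authorize.net. 86400 IN A 64.94.118.32")
    hours = max_stale_seconds(record["ttl"]) / 3600
    print(f"{record['name']} may stay cached for up to {hours:.0f} hours")

    # Hypothetical comparison: a 60 s TTL with 30 s to detect the failure.
    print(f"low-TTL setup: up to {max_stale_seconds(60, detection_s=30)} seconds")
```

With a TTL of 86400 seconds, changing the record after an outage can take up to a day to reach every user, no matter how quickly the backup facility is ready.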
DR: 37 Signals. Cloud based SaaS tools, have to be up; External DNS important for controlling traffic; What if facility A is down and DNS is only at A?; External DNS means failover/DR possible
Discussion. Ensuring full replication is usually easy; Traffic management is usually the problem; Confuse cold assets/warm spare/hot active; People wait until they have an outage to implement DR
Overall Notes. Networked services need to be rock solid; Failover, GSLB, and CDNM are within reach; Wikia, Twitter, and 37 Signals using external traffic management for their application; Audience matters, so does testing and benchmarking
DynTini. twitter.com/dyntini
Copy of presentation? Leave a business card in back (or talk to me afterwards) and I'll send it to you
Contact Us. Uptime Is the Bottom Line. Dynamic Network Services, Inc., 1230 Elm St., Fifth Floor, Manchester, NH 03101; +1 888.840.3258; jeremy@dyn.com; dyn.com; Join us for drinks: dyntini.com; Follow us on Twitter: @DynInc