May 2006 Affordable Enabling Technology 2 Giles Gamon of High-Availability.Com Practical Approaches 1
Defining High-Availability Clustering is commonplace but what does High-Availability clustering achieve? High-Availability IS The absence of interruptions to an end-to-end service More than making sure the db is running High-Availability IS NOT High-performance computing / clustering Scientific number crunching 2
Achieving High-Availability Identification of threats to service Systems failures, human errors, sabotage, software bugs, acts of God etc Management of risk Building in redundancy, taking backups, training staff, testing systems, active management solutions 3
Causes of Down Time Source - IEEE 4
Causes - Disaster Planning to cope with disasters is an important component of a High-Availability strategy Flood, fire, power grid failure, terrorism etc Most disasters are classified as environmental causes of downtime Collectively, environmental causes account for approximately 5% of downtime 5
Causes - Environmental Power cuts and brownouts UPS & Generator What do they power? Communication blackouts WiFi saturation Cooling system errors Humidification regulation errors can cause hardware failures 6
Southampton University 2005 A small but very real threat Photo by Adrian Pickering 7
Causes - Hardware Failure Probably the most recognised cause of downtime Server failures Disk, CPU, internal cooling fans, memory faults Network failures DNS, DHCP, router, ISP, switches, cables cut Other Tape backup corruption, client hardware 8
Causes - Planned Hardware upgrades OS version upgrades Software version upgrades Data migration / transformation Backups Batch processing Preventative maintenance Testing 9
Causes - Human Factor Failure to maintain File systems full Database tables full Patches for known bugs not applied Accidents root# rm -rf / tmp/tempstuff Network mis-configuration Incorrect cable removed Inexperience root# reboot Cleaner knocks cables out Malice root# uadmin 1 5 or halt Physical sabotage 10
Causes - Software Error Code crashes Application suddenly stops with a core dump Memory leaks Slowly consumes all memory until system crash Runaway code Taking all CPU time in a loop Hanging code Code pauses waiting for reply that never comes Resource shortfalls Overflowing logs, failure to allocate memory or process Buffer overflows Possibly exploited or just bad code 11
Managing Risks Identify critical services 12
Identify Critical Services How long can the web server be down? Think - internal, public, distance learning How about Email? Can some Emails be lost? How about SITS / Bb / SCT? How much downtime is acceptable? Who will be affected? Admin, students, lecturers What is the impact on the business? Reputation, income, disruption 13
Managing Risks Identify critical services Describe service level targets 14
Service Level Targets Email, Web (external): Downtime < 2 hours per month, 8 a.m. to 2 a.m.; Collaboration Server: Downtime < 30 mins per month, 24x7; Distance Learning: Downtime < 5 mins per year, 24x7; Statistical Server: Fix when you can, not really required 15
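To make the downtime budgets above comparable, they can be turned into availability percentages. The following is a minimal sketch, assuming a 30-day month, a 365-day year, and that the "8a.m. 2a.m." window means an 18-hour service day; the function name `availability_pct` is hypothetical and not from the slides.

```python
# Convert a downtime budget into an availability percentage.
# Period lengths and the 18-hour window are illustrative assumptions.

def availability_pct(allowed_downtime_min: float, service_minutes: float) -> float:
    """Percentage of the service window that must be up to meet the target."""
    return 100.0 * (1.0 - allowed_downtime_min / service_minutes)

MONTH_24X7 = 30 * 24 * 60      # minutes in an assumed 30-day month, 24x7
MONTH_8AM_2AM = 30 * 18 * 60   # assumed 18-hour daily window (8 a.m. to 2 a.m.)
YEAR_24X7 = 365 * 24 * 60      # minutes in an assumed 365-day year, 24x7

targets = {
    "Email, Web (external)": availability_pct(120, MONTH_8AM_2AM),
    "Collaboration Server": availability_pct(30, MONTH_24X7),
    "Distance Learning": availability_pct(5, YEAR_24X7),
}

for service, pct in targets.items():
    print(f"{service}: {pct:.4f}%")
```

Under these assumptions the three targets work out to roughly 99.63%, 99.93% and 99.999% availability, which shows how much stricter the distance learning target is than the others.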
Managing Risks Identify critical services Describe service level targets Map risks to services Quantify the level of threat Design and cost solutions Compromise in a rational way 16
Balancing Risk and Reward Unless you have an infinite budget you will have to make trade-offs Identify and remove SPoFs for critical services SPoF = Single Points of Failure Identify the least reliable MTBFs Moving parts typically have the lowest MTBF Identify the most difficult components to repair/rebuild e.g.:- Security server, database Identify what will have biggest impact on failure Usually a core server Database, Email, Web, authentication server etc 17
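The trade-off between MTBF, repair difficulty and redundancy can be put into rough numbers. This is a sketch only, using the common steady-state approximation availability = MTBF / (MTBF + MTTR); MTTR (mean time to repair) does not appear on the slide and is introduced here as an assumption, as are all the figures and function names.

```python
# Rough availability arithmetic for SPoF analysis.
# All MTBF/MTTR figures below are made-up illustrations, not vendor data.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def in_series(*parts: float) -> float:
    """Service needs every part: availabilities multiply (each part is a SPoF)."""
    result = 1.0
    for a in parts:
        result *= a
    return result

def redundant(a: float, copies: int = 2) -> float:
    """Service survives while at least one independent copy is up."""
    return 1.0 - (1.0 - a) ** copies

disk = availability(mtbf_hours=50_000, mttr_hours=8)      # hypothetical
server = availability(mtbf_hours=20_000, mttr_hours=4)    # hypothetical

single = in_series(disk, server)
mirrored = in_series(redundant(disk), redundant(server))

print(f"single path:   {single:.6f}")
print(f"redundant path: {mirrored:.6f}")
```

The redundancy formula assumes failures are independent, which is rarely fully true (shared power, shared site), so the result is best read as an upper bound when comparing design options.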
Technical Approaches Clustering Replication Transaction / block level Emerging technologies iSCSI Multi-domain clusters Oracle RAC 18
Typical Multi-Tier Architecture View the service in a vertical fashion List all SPoFs Network Load balancers Switches Application server Database server Data disks Etc Design in redundancy where possible 19
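Listing SPoFs tier by tier can be done with a very small inventory. The sketch below assumes each tier is recorded with the number of independent instances serving it; the tier names follow the slide, while the counts and the function name `single_points_of_failure` are hypothetical.

```python
# Walk the service stack top to bottom and flag tiers with no redundancy.
# Instance counts are invented for illustration.

def single_points_of_failure(tiers: dict) -> list:
    """Return the names of tiers served by fewer than two instances."""
    return [name for name, count in tiers.items() if count < 2]

stack = {
    "Network": 1,
    "Load balancers": 2,
    "Switches": 1,
    "Application server": 2,
    "Database server": 1,
    "Data disks": 2,
}

for tier in single_points_of_failure(stack):
    print(f"SPoF: {tier}")
```

A real inventory would also need to record shared dependencies such as power feeds or a common site, which a simple per-tier count cannot capture.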
Resilient Architecture Multi-site solution Replication to remote site Load balancers shown actually provide each other with redundant functionality Multiple switches used but not shown SPoFs reduced to near zero Multiple active blade centres Multiple active application servers Clustered database servers This architecture is resilient to almost every conceivable fault 20
Resilient Architecture 21
Resilient Architecture 22
High-Availability Clustering Intelligent management solution Software only Deployed on critical servers Can be active-active or active-passive Constant monitoring Application availability Server health Network availability Other defined components Automated restart / move in the event of a fault Notifications to administrative staff GUI, Email, SMS 23
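The monitor, restart / move and notify cycle described above can be sketched as a small decision function. This is an illustration under stated assumptions, not how RSF-1 or any particular cluster product is implemented: the check names, the policy of one local restart before moving the service, and the names `Decision` and `evaluate` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Decision:
    action: str          # "none", "restart" or "failover"
    failed: List[str]    # names of the checks that failed

def evaluate(checks: Dict[str, Callable[[], bool]],
             restart_attempts: int,
             max_restarts: int = 1) -> Decision:
    """Run every health check once and decide what to do.

    Assumed policy: if only the application check fails and restarts
    remain, restart it locally; any other fault moves the service.
    """
    failed = [name for name, check in checks.items() if not check()]
    if not failed:
        return Decision("none", [])
    if failed == ["application"] and restart_attempts < max_restarts:
        return Decision("restart", failed)
    return Decision("failover", failed)

def notify(decision: Decision) -> str:
    """Build the message that would go to the GUI, Email or SMS channel."""
    return f"action={decision.action} failed={','.join(decision.failed) or '-'}"

# Example run with stubbed checks (a real monitor would probe live systems).
checks = {
    "application": lambda: False,
    "server": lambda: True,
    "network": lambda: True,
}
print(notify(evaluate(checks, restart_attempts=0)))
```

In practice a function like this would run on a timer on each node, and the peer node would need its own heartbeat check to avoid both nodes claiming the service at once.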
High-Availability Clustering Active-Passive Simple setup Externalise shared data Use RAID and/or mirroring Low cost, fast and simple Very reliable 24
High-Availability Replication Traditional cluster locally Replicate to remote node Replication at transaction level Remote node probably included in cluster Automatic locally Manual remotely 25
High-Availability Replication Typically replication does a log scrape Although newer versions have closer integration Takes committed transactions and copies them across to the other node(s) Other nodes apply (roll forward) the transactions to a read-only copy of the database 26
High-Availability Replication Block level replication Suitable for user files Not ideal for databases Many better approaches that understand db data Available in different guises - like Sun's SNDR (remote mirror) In kernel Sync / async Streams type module rsync User space Periodic checking and copy 27
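Transaction-level replication can be pictured as shipping a log of committed changes and replaying it on the other node. The toy model below is a sketch under heavy simplification: the classes `Primary` and `Standby`, the key/value data and the sequence numbering are all hypothetical and do not represent any database vendor's mechanism.

```python
# Toy log shipping: the primary records each committed change in a log,
# and the standby replays log entries it has not yet applied.

class Primary:
    def __init__(self):
        self.data = {}
        self.log = []   # list of (sequence, key, value), committed changes only

    def commit(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log) + 1, key, value))

class Standby:
    def __init__(self):
        self.data = {}      # treated as a read-only copy by clients
        self.applied = 0    # highest sequence number replayed so far

    def apply(self, log):
        """Replay unseen entries in order; already applied entries are skipped."""
        for seq, key, value in log:
            if seq > self.applied:
                self.data[key] = value
                self.applied = seq

primary = Primary()
standby = Standby()
primary.commit("student:1", "enrolled")
primary.commit("student:1", "graduated")
standby.apply(primary.log)
print(standby.data)
```

Tracking the last applied sequence number is what lets the same log be shipped more than once without corrupting the copy, which matters when the link between sites is unreliable.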
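The "periodic checking and copy" style of block-level replication can be sketched as comparing checksums per block and copying only the blocks that differ. This is a simplified illustration and not rsync's actual rolling-checksum algorithm; the tiny block size and the function names `split_blocks` and `sync_blocks` are assumptions made for readability.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow

def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def sync_blocks(source: bytes, target: bytes, size: int = BLOCK_SIZE):
    """Bring target in line with source, copying only blocks that differ.

    Returns (new_target, indexes_of_copied_blocks).
    """
    src = split_blocks(source, size)
    dst = split_blocks(target, size)
    copied = []
    for i, block in enumerate(src):
        same = (i < len(dst) and
                hashlib.sha256(dst[i]).digest() == hashlib.sha256(block).digest())
        if not same:
            copied.append(i)
            if i < len(dst):
                dst[i] = block
            else:
                dst.append(block)
    del dst[len(src):]   # drop blocks the source no longer has
    return b"".join(dst), copied

new_target, copied = sync_blocks(b"aaaabbbbcccc", b"aaaaXXXXcccc")
print(new_target, copied)
```

Because this approach knows nothing about transactions, a copy taken while a database is mid-write may be inconsistent, which is one reason the slide recommends db-aware replication for databases.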
High-Availability Replication Use db replication for db when possible Use block level for other file types and legacy applications that have no replication option available 28
iSCSI Block Level Replication Presented as standard disk Over LAN instead of Fibre / SCSI Very clever but still emerging Can be combined with local attach 29
Multi-Domain Clusters Resilient hardware Good I/O architecture Probably not cheapest solution Cheap 2nd hand 30
Practical Examples Tokyo Stock Exchange Dealer connections Surrey Ambulance 999 call handling centre North Yorkshire Police Tasking & operational management Steria SWIFT bureau service InSerTo Telco real time services 31
Tokyo Stock Exchange Trading connections over public telecoms network Requiring FireWall-1 Exchange secure but exposed Network faults Firewall system crashes The exchange needed to eliminate identified exposures Low tolerance to downtime 32
Tokyo Stock Exchange Multiple network connections Multiple firewalls installed at every location Clustering used to provide automated failover Transparent failover 33
Surrey Ambulance Service 999 call centre 24x7 live operations environment Handling calls from the public Live feeds from ambulance GPS devices Automatic escalation and logging 34
North Yorkshire Police 24x7 live CAD system Command and control Custody management Crime management Duty rostering Imaging and biometrics Oracle backend to STORM application Highly integrated systems Mapping systems PNC links DVLA links Firearms database Neighbouring force systems 35
North Yorkshire Police 36
North Yorkshire Police 37
Bristol University Number of Oracle databases & other apps Desire for HA across campus Extensive pre-purchase consultancy with Sun Oracle Elected not to use Oracle RAC Not suited to multiple smaller databases Didn't suit their consolidation desires More cost effective to build clusters of individual Oracle instances Expensive compared with standard clustering & RAC requires clustering regardless Applications not built for RAC extended features Elected not to use block copy replication Despite having hardware in place capable of this Data Guard 38
Bristol University 39
Nottingham University Distributed cluster cross-campus Oracle, Bb, SITS, SCT, NFS, Web 40
Example Clusters in UK Education Salford SCT UWE Bb, Library, SunOne Sunderland SITS Newcastle 3 x SAP & Oracle Largest European SAP site in HE/FE Leeds SAP & Oracle Manchester WebCT Edinburgh Firewall Sheffield Hallam SITS & Bb 41
RSF-1 Environments Solaris HP-UX Linux AIX SCO Mac OS X SPARC Intel Opteron / AMD IBM x, p, i & Z-Series PA-RISC 42
Typical Further Questions Will users notice a fail-over? How long will it take to get installed? Is it complicated? Can it work on Oracle 10i? Is Oracle RAC a good option? Can I use a WAN connection? What about SVM, VxFS, EMC? Can I use Solaris 10? What about Linux? 43
Contacting Us Giles Gamon High-Availability.Com sales@high-availability.com support@high-availability.com giles@high-availability.com 01565 754 459 44