Ingres High Availability Cluster
The HA Challenge
True HA means zero total outages. Businesses require flexibility, scalability and manageability in their database architecture, yet traditional designs often contain a Single Point of Failure. A multi-node replicated architecture can deliver:
- Capacity to absorb partial failure
- Flexibility to allow zero-downtime maintenance/upgrades
- Low cost
- Scalability to cope with growth
- Geographical distribution of resources
- High performance
2009 Ingres Corporation
Ingres High Availability Cluster
[Diagram: active production nodes, a passive Business Intelligence/reporting node, and an offsite Disaster Recovery node, linked by HVR Realtime Data Integration]
- One or more active nodes
- One passive node
- One Disaster Recovery node
- Each node has its own disks
This combination delivers a mixture of business benefits: High Availability, Load Balancing, Business Intelligence... but many other variations are possible.
Ingres HA Cluster
High Availability (unplanned downtime)
- Shared nothing: protects from hardware, OS and power interruption
- Offsite Disaster Recovery
Planned downtime
- Zero-downtime maintenance: abolish nightly or weekly batch windows
- Zero-downtime upgrades: use when upgrading hardware, OS or DBMS; avoids the risk of failed upgrades
Load Balancing (active/active)
- Application must be tolerant to latency
- Very scalable
Business Intelligence (active/passive)
- Live reporting: offload reports to the passive node
- Real-time Business Intelligence: feed to an operational data warehouse or SAP business warehouse
Ingres HA Cluster Architecture
HA Cluster: High Availability
- More than 2 servers; multiple servers, multiple databases
- No wasted resources: connections distributed evenly by Ingres-Net/JDBC
- Databases synchronised by peer-to-peer replication using Ingres HVR
- Server failure has minor impact on a subset of users
  - Applications auto-reconnect without re-authorising
  - Ingres-Net/JDBC distributes connections evenly across the remaining servers
  - Cluster remains intact even with one failed server
HA Cluster: 4-Node Example
- 4000 users; connections distributed by Ingres-Net/JDBC (1000 per node)
- Ingres HVR replication between all databases
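The even spread of connections can be sketched as a simple round-robin assignment. This is purely illustrative: Ingres-Net/JDBC performs the distribution transparently, and the function below is a hypothetical model, not part of any Ingres API.

```python
# Illustrative sketch: evenly distributing client connections across
# cluster nodes, as Ingres-Net/JDBC does transparently. The names here
# are hypothetical, not part of the Ingres API.
def distribute(num_clients, nodes):
    """Assign each client to a node round-robin; return counts per node."""
    counts = {node: 0 for node in nodes}
    for client in range(num_clients):
        counts[nodes[client % len(nodes)]] += 1
    return counts

# 4000 users across a 4-node cluster -> 1000 connections per node
print(distribute(4000, ["node1", "node2", "node3", "node4"]))
```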
HA Cluster: Flexible, Scalable Solution
- Servers can be added to accommodate growth
- Additional node for a read-only reporting database
- Replication to testing platforms for patch and upgrade testing
- Nodes can be withdrawn for zero-downtime maintenance & upgrades
- Mixed Ingres/OS/hardware versions allowed
- Migration can be achieved with near-zero downtime
HA Cluster: Node Failure
- 25% of clients affected; automatically reconnected
- ~333 reconnections per surviving node; no re-authorisation required
- 1333 connections per surviving node
- Ingres HVR replication continues between the remaining databases
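The reconnection arithmetic above can be modelled as spreading the failed node's clients evenly over the survivors. Again a hypothetical sketch of what Ingres-Net/JDBC does automatically, not a real API:

```python
# Sketch of reconnection after a node failure: the failed node's clients
# are spread evenly across the surviving nodes (hypothetical helper,
# mirroring what Ingres-Net/JDBC does automatically).
def redistribute(counts, failed):
    """Return new per-node connection counts after `failed` goes down."""
    survivors = [n for n in counts if n != failed]
    new_counts = {n: counts[n] for n in survivors}
    for i in range(counts[failed]):
        new_counts[survivors[i % len(survivors)]] += 1
    return new_counts

before = {"node1": 1000, "node2": 1000, "node3": 1000, "node4": 1000}
# 1000 displaced clients -> ~333 reconnections per surviving node,
# leaving roughly 1333 connections on each of the 3 remaining nodes
print(redistribute(before, "node4"))
```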
HA Cluster: Node Restoration
- Restored node brought up to date by replication
- Surviving nodes continue serving 1333 connections each
- Out-of-sync database restored and synchronised by Ingres HVR
- Ingres HVR replication between all databases
HA Cluster: Full Service Restored
- Node & database restored
- Connections migrate back through natural reconnection (0->1000 on the restored node, 1333->1000 on each survivor)
- Peer-to-peer replication between all databases
Application Reconnection
If a node fails, the connection is broken. Three approaches:
1. Session ends and the end-user has to restart the application
   - No change needed to the application
   - Similar to node failure with an Operating System cluster
   - Node failures should not be frequent
2. Query gives an error but the application keeps running
   - Application error handler must attempt reconnection
3. Application reconnects and then automatically retries the transaction
   - Similar to a deadlock retry loop, which may already exist
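Approach 3 can be sketched as a retry loop. This is a minimal illustration using a hypothetical `connect`/`transaction` interface and a made-up `MAX_RETRIES` limit, not the actual Ingres driver API:

```python
# Minimal sketch of approach 3: reconnect and retry the transaction.
# `connect`, ConnectionError handling and MAX_RETRIES are illustrative
# assumptions, not part of any Ingres API.
MAX_RETRIES = 3

def run_with_retry(connect, transaction):
    """Run transaction(conn); on a broken connection, reconnect and retry."""
    conn = connect()
    for attempt in range(MAX_RETRIES):
        try:
            return transaction(conn)
        except ConnectionError:
            conn = connect()   # node failed: obtain a fresh connection
    raise RuntimeError("transaction failed after %d retries" % MAX_RETRIES)
```

The shape is deliberately the same as a deadlock retry loop, so an application that already retries deadlocks needs only to add the broken-connection case to its error handler.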
Active/active: Application Latency Tolerance
Application must tolerate latency in active/active replication. Examples:
- Simultaneous update to the same row
  - HVR can automatically detect and resolve this
- Simultaneous generation of the same key
  - Allocate each node its own key range (e.g. even & odd)
- Business logic problems: a change is not immediately visible to all nodes
  - Problem avoided by distributing users geographically or by role
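The even/odd key-range idea can be sketched as a stride allocator: each node generates keys from its own residue class, so two nodes can never mint the same key. A hypothetical helper, not an Ingres feature:

```python
# Sketch of per-node key allocation to avoid duplicate-key conflicts in an
# active/active cluster: with 2 nodes, node 0 generates even keys and
# node 1 generates odd keys. Hypothetical helper, not an Ingres feature.
def key_generator(node_id, num_nodes):
    """Yield keys from the range reserved for this node (stride allocation)."""
    key = node_id
    while True:
        yield key
        key += num_nodes

gen_even = key_generator(0, 2)   # node 0: 0, 2, 4, ...
gen_odd = key_generator(1, 2)    # node 1: 1, 3, 5, ...
print([next(gen_even) for _ in range(3)])
print([next(gen_odd) for _ in range(3)])
```

The same stride scheme extends to any number of nodes: node i of n generates keys i, i+n, i+2n, and so on.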
Comparing Ingres HA Cluster with alternatives
- Shared-disk Cluster
- Log Shipping DR
- Mirroring
Comparison with Shared-disk Cluster

                                          Shared-Disk Cluster   HVR HA Cluster
Protection from emergency affecting
building (fire or power outage)           no                    yes
Single point of failure                   yes                   no
Multi-active nodes                        yes                   yes (with latency)
Live reporting                            no                    yes
Zero-downtime maintenance and upgrades    no                    yes
Log Shipping Disaster Recovery
- Single production server; DR copy refreshed from logs/journals
- Refresh intervals vary: daily, hourly
- Total outage for ALL users on DB failure
- Probable transaction/data loss when switching to DR
- High contention during restart
- High cost of maintaining an idle DR server
- Can be used for reporting/read-only (not during refresh)
- No DR capability after a failover
Log Shipping Disaster Recovery (Active/Passive)
- All clients connected to one active server via Ingres-Net
- Log shipping from the active server to the passive DR copy
Log Shipping Disaster Recovery (Active/Passive): Failover
- Active server fails; possible final DB refresh on DR (minutes, hours?)
- No DR capability temporarily
- Clients reconnect and re-authorise; high contention
Comparison with Log Shipping DR

                                          Log Shipping        HVR HA Cluster
Data loss after disaster                  hours or minutes    max 1 second
Multi-active nodes                        no                  yes
Live reporting                            no                  yes
Zero-downtime maintenance and upgrades    possibly            yes
Warm/Hot Standby Mirroring (Active/Passive HA Failover)
- All clients connected to one active server via Ingres-Net
- Heartbeat between active and passive servers
- Mirroring from the active server to the passive copy
Warm/Hot Standby Mirroring (Active/Passive HA Failover): Failover
- Active server fails; ALL clients reconnect to the standby
- No failover capability temporarily
- Clients reconnect and re-authorise; high contention increases downtime
- Failover takes a few minutes
Warm/Hot Standby Mirroring (Active/Passive HA Failover)
- Single production server; mirrored file system (OS-level or proprietary!)
- Total outage for ALL users on DB failure
- After failover there is NO standby system until the failed server is restored
- Slow recovery due to high levels of reconnection and re-authorisation
- High cost of maintaining an idle standby server
- Standby copy cannot be used for reporting
Comparison with Mirroring

                                          Mirroring    HVR HA Cluster
WAN requirements                          high cost    low cost
Multi-active nodes                        no           yes (with latency)
Live reporting                            no           yes
Zero-downtime maintenance and upgrades    no           yes
Migration to Ingres HA Cluster
Migration to HA Cluster
Migration will be gradual and phased:
1. Auto-reconnection with no re-authorisation introduced on the existing architecture
2. Deal with latency issues
3. 1st replicated server introduced (no clients); proves replication in production
4. Gradual ramp-up of connections to the new server
5. Add a 2nd replicated server with connections
6. Repeat until all replicated nodes are added
7. Decommission the old servers
Migration to HA Cluster: 1st Node Added
- All 4000 connections remain on the existing server via Ingres-Net
- New node is passive (read-only) while data is copied across
- Ingres HVR master/slave replication to the new node
Migration to HA Cluster: Connections Enabled to the Replicated Server
- 1000 connections on the new server; 3000 remain on the original
- Further migration targets stay passive while data is copied
- Ingres HVR peer-to-peer replication between the databases
Migration to HA Cluster: 2nd Node Added
- 1000 connections on each new node; 2000 remain on the original server
- Further migration targets stay passive while data is copied
- Ingres HVR peer-to-peer replication between the databases
Migration to HA Cluster: Final Configuration
- Gradually add all replicated nodes and decommission the old servers
- Connections distributed by Ingres-Net/JDBC (1000 per node)
- Ingres HVR peer-to-peer replication between all databases
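The phased migration can be summarised numerically: connections move from the old server onto the new replicated nodes in steps until the old server is empty and can be decommissioned. The step size and node count below are illustrative values taken from the slides; the helper itself is hypothetical:

```python
# Sketch of the phased migration: clients move from the old server onto
# new replicated nodes in steps until the old server can be
# decommissioned. Purely illustrative numbers; hypothetical helper.
def migration_phases(total_clients, new_nodes, step):
    """Return (old_server_count, per-new-node counts) after each phase."""
    old = total_clients
    counts = [0] * new_nodes
    phases = [(old, list(counts))]
    node = 0
    while old > 0:
        moved = min(step, old)
        counts[node % new_nodes] += moved
        old -= moved
        node += 1
        phases.append((old, list(counts)))
    return phases

for old, new in migration_phases(4000, 4, 1000):
    print("old server:", old, "new nodes:", new)
```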
Proof of Concept
HA Cluster Proof of Concept
- Data & application analysis
  - Compatibility with a multi-database architecture
  - Auto-reconnection challenges
- Replication latency & performance
- Reconnection performance
- Ability to replay the workload is essential
  - Query recording and playback (Ingres 2.6 and above)
  - Utilise customer test harness or test tools
- Approximately 20 man-days of effort
More Information
- Ingres Web Site: http://www.ingres.com
- Success Stories: http://www.ingres.com/customers/case-studies.php
- Community Wiki: http://community.ingres.com/wiki
Questions?