Managing planned downtime with RAC
Björn Rost
[Title slide; background word-art of RAC-related acronyms]
Björn Rost
- founder, manager and DBA
- RAC SIG European Chair
- ACE Director
about us
- software production company, founded 2001
- mostly J2EE
- logistics, telecommunication, media and publishing
- customers demand full lifecycle support: hardware resale, datacenter operations, 3rd party software
project lifecycle
[Diagram: planning, design, integration and operation, covering consulting, feasibility studies, specification, benchmarking, J2EE, php, hardware, SW-Licenses, database installation, hosting, monitoring, documentation, patching, backups, tuning]
TAF
[Comic-style dialogue, speakers inferred from the slide build-up]
Manager: Minimize downtime! Go implement this TAF thing. Just turn it on, it is completely transparent!
DBA: let me check the docs and get right back
DBA: we'd have to use the OCI driver
Manager: can do
DBA: there is some delay or overhead
Manager: was expecting some cost
DBA: and no DML!
Manager: seriously?
DBA: yup, only SELECT will fail over...
expectation: a clustered HA system should always be UP
the reality: even with RAC implemented, there are still many (if not more) outages :(
limits
- a session can never move between nodes
- session creation (load balancing) is decided on connection
- HA needs to be supported in apps
- some of this stuff can be confusing
12c app continuity
Agenda
- introduction
- walkthrough
- load balancing
- connection pools
- srvctl
- app continuity
reasons to use RAC
You probably don't need RAC!
http://www.my-idconcept.de/downloads/you_probably_dont_need_rac.pdf
reasons to use RAC
- scalability & performance
- high availability
  - unplanned
  - planned
RAC One Node
- RAC without scaling across multiple nodes
- online migration to full RAC is possible
- seamless crash failover
unplanned downtime
- hardware fault
  - servers come with redundant components: disks, power supplies, fans
  - components are getting better, too
unplanned downtime
- hardware fault
- software crash or hang
- DOS attacks / security issues
- human error
planned downtime
- hardware upgrade (RAM, CPU, ...)
- firmware upgrades
- OS updates
- Oracle software patches
- network re-patching
- SAN reconfiguration
downtime
failure types
- app not connected (only connects on demand)
- session open but idle / no tx: app needs to reconnect
- tx in progress, SELECT only: start over or display error
- tx in progress, DML: rollback/replay/handle error; important: don't commit twice
- (re)join of cluster node
maintenance rqmts
- remove nodes from cluster without user interruption
- don't break running sessions
- ok to kill idle sessions, let them reconnect
- don't lose data/transactions/new orders
- stay up or available
load balancing
load balancing
- client side: tnsnames.ora and/or SCAN
- server side: on connection
  - LONG goal: # of connections
  - SHORT goal: system load avg
- runtime: advisory events sent to conn. pools
SCAN
RAC_OLTP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = scan.db.portrix.net)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = OLTP)
  ))

oracle@rac1:~$ host scan.db.portrix.net
scan.db.portrix.net has address 46.30.26.101
scan.db.portrix.net has address 46.30.26.102
scan.db.portrix.net has address 46.30.26.103
[Diagram animation: services OLTP (on RAC1, RAC2) and batch (on RAC1). Sessions 1-5 connect and are balanced across the nodes. For maintenance, both services are relocated so that OLTP: RAC2 and batch: RAC2; existing sessions drain from RAC1, new sessions land on RAC2, RAC1 is taken down and patched, then the services are moved back to OLTP: RAC1, RAC2 and batch: RAC1.]
app requirements
- reconnect regularly
- handle connection failures
- set max_sessions to the right value
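Handling connection failures in the app usually means catching the error, throwing away the dead connection and retrying on a fresh one. A minimal plain-JDBC sketch; the connect string, credentials and retry limit are made-up example values, not from the slides:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLRecoverableException;

public class RetryDemo {
    // hypothetical URL, reusing the SCAN name from the SCAN slide
    static final String URL = "jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP";

    static void runWithRetry(int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Connection con = DriverManager.getConnection(URL, "scott", "tiger")) {
                doWork(con);  // application logic
                return;       // success, stop retrying
            } catch (SQLRecoverableException e) {
                // node went away: this connection is dead, the loop retries on a new one
                if (attempt == maxAttempts) throw e;
            }
        }
    }

    static void doWork(Connection con) throws Exception { /* ... */ }
}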
connection pools
- pool will open and hold connections
- app loans a session for a tx as needed
- when the tx is done, app returns the session
- pool can decide which connection to lend to the app
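With Oracle's Universal Connection Pool this borrow/return cycle looks like the following sketch; pool sizes, connect string and credentials are example values:

import java.sql.Connection;
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class UcpDemo {
    public static void main(String[] args) throws Exception {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP"); // hypothetical
        pds.setUser("scott");
        pds.setPassword("tiger");
        pds.setInitialPoolSize(5);   // example sizing
        pds.setMaxPoolSize(20);

        Connection con = pds.getConnection(); // borrow from the pool
        try {
            // ... one unit of work / transaction ...
        } finally {
            con.close(); // returns the connection to the pool, does not tear it down
        }
    }
}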
connection pools
- save resources: memory, connection time, reconnects
- help load balancing
- abstraction layer for errors
UCP and FAN
- Fast Connection Failover
  - crash
  - planned outage
  - (re)join
- run-time load balancing
- session affinity
- transaction affinity
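Enabling FCF on a UCP data source is mostly configuration. A sketch, assuming ONS is reachable on port 6200 on both nodes; host names and ports are examples, not from the slides:

import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class FcfDemo {
    public static void main(String[] args) throws Exception {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP"); // hypothetical
        pds.setUser("scott");
        pds.setPassword("tiger");

        // subscribe to FAN events so dead connections are purged immediately
        pds.setFastConnectionFailoverEnabled(true);
        // where to receive ONS/FAN events from (example hosts/ports)
        pds.setONSConfiguration("nodes=rac1:6200,rac2:6200");
    }
}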
services
- a service is an entity to which users connect
- configured with connection settings on client
- registered through clusterware
- each service has:
  - a list of preferred and available instances
  - load-balancing goal
  - TAF and other parameters
- 12c multitenant: each PDB has its own service
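For reference, creating a service like the OLTP one used in this demo is a single srvctl call. A sketch in 11.2 syntax; the flag values shown here are examples, verify against your release:

grid@rac1:~$ srvctl add service -d PTXRAC -s OLTP -r PTXRAC1,PTXRAC2 -P NONE -j LONG -B SERVICE_TIME
grid@rac1:~$ srvctl start service -d PTXRAC -s OLTP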
services
- default service is always active on all nodes
  - ORA-01033: ORACLE initialization or shutdown in progress
- separation might improve performance
- helpful in other areas of administration: resource management, EM monitoring, grouping
srvctl
grid@rac1:~$ srvctl config service -d PTXRAC -s OLTP
Service name: OLTP
Service is enabled
Server pool: PTXRAC_OLTP
Cardinality: 2
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: SHORT
TAF policy specification: NONE
Edition:
Preferred instances: PTXRAC1,PTXRAC2
Available instances:
verify service cfg
grid@rac1:~$ lsnrctl status listener_scan1
LSNRCTL for Solaris: Version 11.2.0.2.0 - Production on 29-SEP-2011 11:35:58
Copyright (c) 1991, 2010, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_SCAN1
Version                   TNSLSNR for Solaris: Version 11.2.0.2.0 - Production
Start Date                30-APR-2011 23:09:28
Uptime                    151 days 12 hr. 26 min. 30 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/11.2.0/grid/log/diag/tnslsnr/sun1os/listener_scan1/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN1)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.42.155)(PORT=1521)))
Services Summary...
Service "BATCH.DB.PORTRIX.NET" has 1 instance(s).
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
Service "OLTP.DB.PORTRIX.NET" has 2 instance(s).
  Instance "PTXRAC1", status READY, has 1 handler(s) for this service...
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
Service "PTXRAC.DB.PORTRIX.NET" has 2 instance(s).
  Instance "PTXRAC1", status READY, has 1 handler(s) for this service...
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
Service "PTXRACXDB.DB.PORTRIX.NET" has 2 instance(s).
  Instance "PTXRAC1", status READY, has 1 handler(s) for this service...
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
The command completed successfully
srvctl
srvctl modify service
Moves a service member from one instance to another. Additionally, this command changes which instances are to be the preferred and the available instances for a service. This command supports some online modifications to the service, such as:
- When there are available instances for the service, and the service configuration is modified so that a preferred or available instance is removed, the running state of the service may change unpredictably:
  - The service is stopped and then removed on some instances according to the new service configuration.
  - The service may be running on some instances that are being removed from the service configuration. These services will be relocated to the next free instance in the new service configuration.

srvctl relocate service -d db_unique_name -s service_name {-c source_node -n target_node | -i old_instance_name -t new_instance_name} [-f]
srvctl
- if service is only up on one node: relocate
- up on multiple nodes: modify
(concrete examples below)
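In concrete terms, with the demo services; these are hedged examples using the instance and node names from the earlier slides:

grid@rac1:~$ # BATCH runs only on PTXRAC1: relocate it
grid@rac1:~$ srvctl relocate service -d PTXRAC -s BATCH -i PTXRAC1 -t PTXRAC2
grid@rac1:~$ # OLTP runs on both instances: shrink it to PTXRAC2 via modify
grid@rac1:~$ srvctl modify service -d PTXRAC -s OLTP -n -i PTXRAC2 -f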
shutdown
srvctl stop instance -d db_unique_name {[-n node_name] [-i "instance_name_list"]} [-o stop_options] [-f]
- stops all services on the node (with -f)
- better relocate services yourself!
shutdown
srvctl stop instance -d db_unique_name {[-n node_name] [-i "instance_name_list"]} -o transactional
- refuses new connections
- disconnects sessions after commit/rollback
steps (again), see the example sequence below:
1. relocate services away (relocate/modify)
2. wait until sessions are done with work
3. shutdown (transactional)
4. perform maintenance
5. restart services
6. relocate services back
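Put together for node rac1 of the demo cluster, the whole sequence might look like this; all instance and service names are carried over from the earlier slides and the exact flags are examples:

grid@rac1:~$ srvctl relocate service -d PTXRAC -s BATCH -i PTXRAC1 -t PTXRAC2
grid@rac1:~$ srvctl modify service -d PTXRAC -s OLTP -n -i PTXRAC2
grid@rac1:~$ # wait for remaining sessions on PTXRAC1 to finish, then:
grid@rac1:~$ srvctl stop instance -d PTXRAC -i PTXRAC1 -o transactional
...perform maintenance...
grid@rac1:~$ srvctl start instance -d PTXRAC -i PTXRAC1
grid@rac1:~$ srvctl modify service -d PTXRAC -s OLTP -n -i PTXRAC1,PTXRAC2
grid@rac1:~$ srvctl relocate service -d PTXRAC -s BATCH -i PTXRAC2 -t PTXRAC1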
rolling upgrades
- available in a lot of patches
- two RDBMS versions running simultaneously
- built-in support in OPatch
rolling upgrades
[oracle@rac1 tmp]$ opatch query -is_rolling_patch 10352368
Invoking OPatch 11.1.0.6.6
Oracle Interim Patch Installer version 11.1.0.6.6
Copyright (c) 2009, Oracle Corporation. All rights reserved.
Oracle Home       : /u01/app/oracle/product/11.2.0/db_1
Central Inventory : /u01/app/orainventory
   from           : /etc/orainst.loc
OPatch version    : 11.1.0.6.6
OUI version       : 11.2.0.1.0
OUI location      : /u01/app/oracle/product/11.2.0/db_1/oui
Log file location : /u01/app/oracle/product/11.2.0/db_1/cfgtoollogs/opatch/opatch2011-09-15_11-28-05am.log
Patch history file: /u01/app/oracle/11.2.0/db_1/cfgtoollogs/opatch/opatch_history.txt
--------------------------------------------------------
Patch is a rolling patch: true
12c app continuity
- 2-part system
- transaction guard
  - reliably determine the state of commits
- app continuity (replay driver)
  - driver records and caches requests and validation information
  - reconnects and verifies commit state
  - replays and validates requests
activate app continuity
- driver needs replay boundaries
  - UCP and WebLogic add these automatically
  - beginRequest/endRequest for 3rd party apps
- jdbc-thin only
- mutable calls (seq.nextval, sysdate)
- does not work with default service
- consider memory & CPU overhead
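With UCP, switching to the replay driver is essentially a one-line change plus a service configured for Application Continuity. A minimal sketch; the connect string and the AC-enabled service (OLTP_AC) are assumptions, not shown on the slides:

import java.sql.Connection;
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class AcDemo {
    public static void main(String[] args) throws Exception {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        // replay data source instead of the plain one: this is what records requests
        pds.setConnectionFactoryClassName("oracle.jdbc.replay.OracleDataSourceImpl");
        // the service must be created for Application Continuity (not the default service)
        pds.setURL("jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP_AC"); // hypothetical
        pds.setUser("scott");
        pds.setPassword("tiger");

        // UCP marks the replay boundaries (beginRequest/endRequest) on borrow/return
        try (Connection con = pds.getConnection()) {
            // ... work done here can be replayed transparently after an outage ...
        }
    }
}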
review
- TAF
- load balancing
- services
- UCP
- FAN and FCF
- App Continuity
summary
- set up at least one extra service, possibly more
- make sure the application reconnects regularly
- use UCP if possible
- try and use app continuity, make it part of app requirements
- patch regularly
what's next?
RAC SIG elections are running right now!
RAC SIG - www.oracleracsig.org
DOAG 2013 unconference: DEMO
12c RAC on a laptop, UCP and app continuity with a Java app
Thank you
RAC Attack - www.racattack.org
RAC SIG - www.oracleracsig.org
b.rost@portrix.net
http://portrix-systems.de/blog/
@brost