Managing planned downtime with RAC
Björn Rost
[Title slide; background word-art of RAC-related acronyms]
Björn Rost
- founder, manager and DBA
- RAC SIG European Chair
- ACE Director
about us
- software production company, founded 2001
- mostly J2EE
- logistics, telecommunication, media and publishing
- customers demand full lifecycle support: hardware resale, datacenter operations, 3rd party software
project lifecycle
[Diagram: planning, design, integration and operation, covering consulting, feasibility studies, specification, benchmarking, J2EE, php, hardware, SW-Licenses, database installation, hosting, monitoring, documentation, patching, backups, tuning]
TAF
[Comic-style dialogue, speakers inferred from the slide build-up]
Manager: Minimize downtime! Go implement this TAF thing. Just turn it on, it is completely transparent!
DBA: let me check the docs and get right back
DBA: we'd have to use the OCI driver
Manager: can do
DBA: there is some delay or overhead
Manager: was expecting some cost
DBA: and no DML!
Manager: seriously?
DBA: yup, only SELECT will fail over...
expectation: a clustered HA system should always be UP
the reality: even with RAC implemented, there are still many (if not more) outages :(
limits
- a session can never move between nodes
- session creation (load balancing) is decided on connection
- HA needs to be supported in apps
- some of this stuff can be confusing
12c app continuity
Agenda
- introduction
- walkthrough
- load balancing
- connection pools
- srvctl
- app continuity
reasons to use RAC
You probably don't need RAC!
http://www.my-idconcept.de/downloads/you_probably_dont_need_rac.pdf
reasons to use RAC
- scalability & performance
- high availability
  - unplanned
  - planned
RAC One Node
- RAC without scaling across multiple nodes
- online migration to full RAC is possible
- seamless crash failover
unplanned downtime
- hardware fault
  - servers come with redundant components: disks, power supplies, fans
  - components are getting better, too
unplanned downtime
- hardware fault
- software crash or hang
- DOS attacks / security issues
- human error
planned downtime
- hardware upgrade (RAM, CPU, ...)
- firmware upgrades
- OS updates
- Oracle software patches
- network re-patching
- SAN reconfiguration
downtime
failure types
- app not connected (only connects on demand)
- session open but idle / no tx: app needs to reconnect
- tx in progress, SELECT only: start over or display error
- tx in progress, DML: rollback/replay/handle error; important: don't commit twice
- (re)join of cluster node
maintenance rqmts
- remove nodes from cluster without user interruption
- don't break running sessions
- ok to kill idle sessions, let them reconnect
- don't lose data/transactions/new orders
- stay up or available
load balancing
load balancing
- client side: tnsnames.ora and/or SCAN
- server side: on connection
  - LONG goal: # of connections
  - SHORT goal: system load avg
- runtime: advisory events sent to conn. pools
SCAN
RAC_OLTP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = scan.db.portrix.net)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = OLTP)
  ))

oracle@rac1:~$ host scan.db.portrix.net
scan.db.portrix.net has address 46.30.26.101
scan.db.portrix.net has address 46.30.26.102
scan.db.portrix.net has address 46.30.26.103
[Diagram animation: services OLTP (on RAC1, RAC2) and batch (on RAC1). Sessions 1-5 connect and are balanced across the nodes. For maintenance, both services are relocated so that OLTP: RAC2 and batch: RAC2; existing sessions drain from RAC1, new sessions land on RAC2, RAC1 is taken down and patched, then the services are moved back to OLTP: RAC1, RAC2 and batch: RAC1.]
app requirements
- reconnect regularly
- handle connection failures
- set max_sessions to the right value
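Handling connection failures in the app usually means catching the error, throwing away the dead connection and retrying on a fresh one. A minimal plain-JDBC sketch; the connect string, credentials and retry limit are made-up example values, not from the slides:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLRecoverableException;

public class RetryDemo {
    // hypothetical URL, reusing the SCAN name from the SCAN slide
    static final String URL = "jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP";

    static void runWithRetry(int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Connection con = DriverManager.getConnection(URL, "scott", "tiger")) {
                doWork(con);  // application logic
                return;       // success, stop retrying
            } catch (SQLRecoverableException e) {
                // node went away: this connection is dead, the loop retries on a new one
                if (attempt == maxAttempts) throw e;
            }
        }
    }

    static void doWork(Connection con) throws Exception { /* ... */ }
}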
connection pools
- pool will open and hold connections
- app loans a session for a tx as needed
- when the tx is done, app returns the session
- pool can decide which connection to lend to the app
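With Oracle's Universal Connection Pool this borrow/return cycle looks like the following sketch; pool sizes, connect string and credentials are example values:

import java.sql.Connection;
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class UcpDemo {
    public static void main(String[] args) throws Exception {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP"); // hypothetical
        pds.setUser("scott");
        pds.setPassword("tiger");
        pds.setInitialPoolSize(5);   // example sizing
        pds.setMaxPoolSize(20);

        Connection con = pds.getConnection(); // borrow from the pool
        try {
            // ... one unit of work / transaction ...
        } finally {
            con.close(); // returns the connection to the pool, does not tear it down
        }
    }
}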
connection pools
- save resources: memory, connection time, reconnects
- help load balancing
- abstraction layer for errors
UCP and FAN
- Fast Connection Failover
  - crash
  - planned outage
  - (re)join
- run-time load balancing
- session affinity
- transaction affinity
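Enabling FCF on a UCP data source is mostly configuration. A sketch, assuming ONS is reachable on port 6200 on both nodes; host names and ports are examples, not from the slides:

import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class FcfDemo {
    public static void main(String[] args) throws Exception {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP"); // hypothetical
        pds.setUser("scott");
        pds.setPassword("tiger");

        // subscribe to FAN events so dead connections are purged immediately
        pds.setFastConnectionFailoverEnabled(true);
        // where to receive ONS/FAN events from (example hosts/ports)
        pds.setONSConfiguration("nodes=rac1:6200,rac2:6200");
    }
}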
services
- a service is an entity to which users connect
- configured with connection settings on client
- registered through clusterware
- each service has:
  - a list of preferred and available instances
  - load-balancing goal
  - TAF and other parameters
- 12c multitenant: each PDB has its own service
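For reference, creating a service like the OLTP one used in this demo is a single srvctl call. A sketch in 11.2 syntax; the flag values shown here are examples, verify against your release:

grid@rac1:~$ srvctl add service -d PTXRAC -s OLTP -r PTXRAC1,PTXRAC2 -P NONE -j LONG -B SERVICE_TIME
grid@rac1:~$ srvctl start service -d PTXRAC -s OLTP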
services
- default service is always active on all nodes
  - ORA-01033: ORACLE initialization or shutdown in progress
- separation might improve performance
- helpful in other areas of administration: resource management, EM monitoring, grouping
srvctl
grid@rac1:~$ srvctl config service -d PTXRAC -s OLTP
Service name: OLTP
Service is enabled
Server pool: PTXRAC_OLTP
Cardinality: 2
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: SHORT
TAF policy specification: NONE
Edition:
Preferred instances: PTXRAC1,PTXRAC2
Available instances:
verify service cfg
grid@rac1:~$ lsnrctl status listener_scan1
LSNRCTL for Solaris: Version 11.2.0.2.0 - Production on 29-SEP-2011 11:35:58
Copyright (c) 1991, 2010, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_SCAN1
Version                   TNSLSNR for Solaris: Version 11.2.0.2.0 - Production
Start Date                30-APR-2011 23:09:28
Uptime                    151 days 12 hr. 26 min. 30 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/11.2.0/grid/log/diag/tnslsnr/sun1os/listener_scan1/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN1)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.42.155)(PORT=1521)))
Services Summary...
Service "BATCH.DB.PORTRIX.NET" has 1 instance(s).
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
Service "OLTP.DB.PORTRIX.NET" has 2 instance(s).
  Instance "PTXRAC1", status READY, has 1 handler(s) for this service...
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
Service "PTXRAC.DB.PORTRIX.NET" has 2 instance(s).
  Instance "PTXRAC1", status READY, has 1 handler(s) for this service...
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
Service "PTXRACXDB.DB.PORTRIX.NET" has 2 instance(s).
  Instance "PTXRAC1", status READY, has 1 handler(s) for this service...
  Instance "PTXRAC2", status READY, has 1 handler(s) for this service...
The command completed successfully
srvctl
srvctl modify service
Moves a service member from one instance to another. Additionally, this command changes which instances are to be the preferred and the available instances for a service. This command supports some online modifications to the service, such as:
- When there are available instances for the service, and the service configuration is modified so that a preferred or available instance is removed, the running state of the service may change unpredictably:
  - The service is stopped and then removed on some instances according to the new service configuration.
  - The service may be running on some instances that are being removed from the service configuration. These services will be relocated to the next free instance in the new service configuration.

srvctl relocate service -d db_unique_name -s service_name {-c source_node -n target_node | -i old_instance_name -t new_instance_name} [-f]
srvctl
- if service is only up on one node: relocate
- up on multiple nodes: modify
(concrete examples below)
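In concrete terms, with the demo services; these are hedged examples using the instance and node names from the earlier slides:

grid@rac1:~$ # BATCH runs only on PTXRAC1: relocate it
grid@rac1:~$ srvctl relocate service -d PTXRAC -s BATCH -i PTXRAC1 -t PTXRAC2
grid@rac1:~$ # OLTP runs on both instances: shrink it to PTXRAC2 via modify
grid@rac1:~$ srvctl modify service -d PTXRAC -s OLTP -n -i PTXRAC2 -f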
shutdown
srvctl stop instance -d db_unique_name {[-n node_name] [-i "instance_name_list"]} [-o stop_options] [-f]
- stops all services on the node (with -f)
- better relocate services yourself!
shutdown
srvctl stop instance -d db_unique_name {[-n node_name] [-i "instance_name_list"]} -o transactional
- refuses new connections
- disconnects sessions after commit/rollback
steps (again), see the example sequence below:
1. relocate services away (relocate/modify)
2. wait until sessions are done with work
3. shutdown (transactional)
4. perform maintenance
5. restart services
6. relocate services back
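Put together for node rac1 of the demo cluster, the whole sequence might look like this; all instance and service names are carried over from the earlier slides and the exact flags are examples:

grid@rac1:~$ srvctl relocate service -d PTXRAC -s BATCH -i PTXRAC1 -t PTXRAC2
grid@rac1:~$ srvctl modify service -d PTXRAC -s OLTP -n -i PTXRAC2
grid@rac1:~$ # wait for remaining sessions on PTXRAC1 to finish, then:
grid@rac1:~$ srvctl stop instance -d PTXRAC -i PTXRAC1 -o transactional
...perform maintenance...
grid@rac1:~$ srvctl start instance -d PTXRAC -i PTXRAC1
grid@rac1:~$ srvctl modify service -d PTXRAC -s OLTP -n -i PTXRAC1,PTXRAC2
grid@rac1:~$ srvctl relocate service -d PTXRAC -s BATCH -i PTXRAC2 -t PTXRAC1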
rolling upgrades
- available in a lot of patches
- two RDBMS versions running simultaneously
- built-in support in OPatch
rolling upgrades
[oracle@rac1 tmp]$ opatch query -is_rolling_patch 10352368
Invoking OPatch 11.1.0.6.6
Oracle Interim Patch Installer version 11.1.0.6.6
Copyright (c) 2009, Oracle Corporation. All rights reserved.
Oracle Home       : /u01/app/oracle/product/11.2.0/db_1
Central Inventory : /u01/app/orainventory
   from           : /etc/orainst.loc
OPatch version    : 11.1.0.6.6
OUI version       : 11.2.0.1.0
OUI location      : /u01/app/oracle/product/11.2.0/db_1/oui
Log file location : /u01/app/oracle/product/11.2.0/db_1/cfgtoollogs/opatch/opatch2011-09-15_11-28-05am.log
Patch history file: /u01/app/oracle/11.2.0/db_1/cfgtoollogs/opatch/opatch_history.txt
--------------------------------------------------------
Patch is a rolling patch: true
12c app continuity
- 2-part system
- transaction guard
  - reliably determine the state of commits
- app continuity (replay driver)
  - driver records and caches requests and validation information
  - reconnects and verifies commit state
  - replays and validates requests
activate app continuity
- driver needs replay boundaries
  - UCP and WebLogic add these automatically
  - beginRequest/endRequest for 3rd party apps
- jdbc-thin only
- mutable calls (seq.nextval, sysdate)
- does not work with default service
- consider memory & CPU overhead
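With UCP, switching to the replay driver is essentially a one-line change plus a service configured for Application Continuity. A minimal sketch; the connect string and the AC-enabled service (OLTP_AC) are assumptions, not shown on the slides:

import java.sql.Connection;
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class AcDemo {
    public static void main(String[] args) throws Exception {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        // replay data source instead of the plain one: this is what records requests
        pds.setConnectionFactoryClassName("oracle.jdbc.replay.OracleDataSourceImpl");
        // the service must be created for Application Continuity (not the default service)
        pds.setURL("jdbc:oracle:thin:@//scan.db.portrix.net:1521/OLTP_AC"); // hypothetical
        pds.setUser("scott");
        pds.setPassword("tiger");

        // UCP marks the replay boundaries (beginRequest/endRequest) on borrow/return
        try (Connection con = pds.getConnection()) {
            // ... work done here can be replayed transparently after an outage ...
        }
    }
}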
review
- TAF
- load balancing
- services
- UCP
- FAN and FCF
- App Continuity
summary
- set up at least one extra service, possibly more
- make sure the application reconnects regularly
- use UCP if possible
- try and use app continuity, make it part of app requirements
- patch regularly
what's next?
RAC SIG elections are running right now!
RAC SIG - www.oracleracsig.org
DOAG 2013 unconference: DEMO
12c RAC on a laptop, UCP and app continuity with a Java app
Thank you
RAC Attack - www.racattack.org
RAC SIG - www.oracleracsig.org
b.rost@portrix.net
http://portrix-systems.de/blog/
@brost