Maximizing Oracle RAC Uptime
Ian Cookson, Markus Michalewicz
Oracle Real Application Clusters (RAC) Product Management / Development
September 29, 2014
Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
The System Lifecycle: Installation, Implementation, Operation, Monitoring, Diagnosis
Installation

System assumed for this presentation:
- germany: Oracle RAC, Oracle GI (HUB)
- argentina: Oracle RAC, Oracle GI (HUB)
- brazil: Oracle RAC, Oracle GI (HUB)
- italy: Oracle GI (Leaf)
- spain: Oracle GI (Leaf)

Server OS: OL 6.5 UEK (other kernels are supported)

HUBs:
- 4GB+ memory recommended; the installer enforces the HUB minimum memory requirement
- One HUB at a time will host the GIMR database
- Only HUBs will host (Flex) ASM instances

Leafs can have less memory, depending on the use case.
Installation [root@germany ~]# uname a 3.8.13-16.2.1.el6uek.x86_64 #1 SMP Thu Nov 7 17:01:44 PST 2013 x86_64 x86_64 x86_64 GNU/Linux #Get the pre-install package [root@germany Desktop]# yum list oracle-* oracle-rdbms-server-11gr2-preinstall.x86_64 1.0-7.el6 oracle-rdbms-server-12cr1-preinstall.x86_64 1.0-8.el6 ol6_latest ol6_latest Installation is an infrequent task It should be standardized Follow: http://www.slideshare.net/markusmichalewicz/oracle-rac- 12c-collaborate-best-practices-ioug-2014-version and come to the Oracle RAC demo booth (3787) Tools to use: 1. Linux: pre-install package 2. Cluster Verification Utility (CVU) 3. Oracle Universal Installer (OUI) 7
Oracle Universal Installer (OUI)

OUI provides a simple GUI for:
- Installation and configuration
- Upgrades

OUI calls cluvfy for:
- Verification checks
- Generating fixup scripts
Implementation

Implementation is a recurring task:
- Initial implementation
- Change implementation(s) as required

Implementation tasks are system-specific.

Tools to use:
1. CVU
2. OraChk
Cluster Verification Utility (CVU): Introduction

Purpose: verification of pre-install & post-install cluster setup
- Run manually (command: cluvfy) or as part of the OUI
- Available from OTN and included in Oracle Grid Infrastructure
- Supports the Oracle RAC stack since version 10g Release 1

What does it do?
- Runs specified verification tests and optionally generates a fixup script (run as root)
- Utilizes a stage concept, enabling users to run the necessary tests before (pre) or after (post) an installation
What does CVU Check?

- System requirements: are the installation requirements met for Clusterware or RAC?
- Network and connectivity
- Cluster Time Synchronization (CTSS or NTP)
- Existence of required OS users and permissions
- Prerequisites for adding nodes
- etc.
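The stage concept lets these checks run before an installation as well. A sketch using the node names from this presentation (assumes the grid user and a staged Grid Infrastructure home):

```shell
# Pre-installation checks for a Grid Infrastructure (CRS) install across
# two nodes; -fixup generates a root script for correctable findings.
cluvfy stage -pre crsinst -n germany,argentina -fixup -verbose
```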
CVU for Pre-Implementation Checks

Purpose: verification of the configuration after installation, prior to implementation (is the system ready?)

What checks to be made?
- Use post checks to verify that the system is indeed ready, and
- Confirm that post-installation changes made to the system will not cause problems

Example: cluvfy comp healthcheck -collect cluster -mandatory -deviations -save
CVU for Pre-Implementation Checks: Example

$ cluvfy stage -post hwos -n germany,argentina -verbose

Performing post-checks for hardware and operating system setup

Checking node reachability...
Check: Node reachability from node "germany"
  Destination Node                     Reachable?
  ------------------------------------ ------------------------
  germany                              yes
  argentina                            yes
Result: Node reachability check passed from node "germany"

Checking user equivalence...
Check: User equivalence for user "grid"
  Node Name                            Status
  ------------------------------------ ------------------------
  argentina                            passed
  germany                              passed
Result: User equivalence check passed for user "grid"
OraChk

- Formerly RACcheck (the RAC Configuration Audit Tool); known as ExaChk on Engineered Systems
- Engineered Systems require less initial testing
- For details see MOS note ID 1268927.1
- Checks the Oracle stack:
  - Standalone Database
  - Grid Infrastructure & RAC
  - Maximum Availability Architecture (MAA) validation
  - Oracle hardware
OraChk Installation and Configuration

Installation:
- Download the latest version of orachk (90 day reminder)
- Unzip it in a local directory under the oracle user
- Check that permissions are 755 on orachk

Configuration:
- Run manually or in silent mode (via daemon)
- For implementation, run singly (manually) to validate system setup etc. prior to going live
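The installation steps above can be sketched as shell commands (the download location and zip file name are placeholders):

```shell
# As the oracle user: unzip into a local directory and verify permissions.
unzip orachk.zip -d /home/oracle/orachk
cd /home/oracle/orachk
chmod 755 orachk   # permissions should be 755 on orachk
./orachk -v        # print the tool version to confirm the setup
```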
OraChk Usage

Usage: ./orachk [-a|-b|-p|-f|-u|-S|-s|-c|-o|-m|-v]

-a        all checks
-b        best practices only
-p        patch recommendations only
-f        offline (reports from existing data only)
-u        pre-upgrade checks
-S or -s  silent installs, with or without SUDO capabilities
-c        check individual components (e.g. orachk -a -c ASM)
-o        invoke optional functionality (e.g. display only non-passing audit checks, verbose format, etc.)
-m        exclude MAA checks
-v        print the tool version
OraChk Example: Oracle orachk Assessment Report

The OraChk report is in HTML format, with a summary containing links to content. System Health Score is 75 out of 100 (detail).

Database Server:

Check Id                         | Status  | Type      | Message                                                                 | Status On            | Details
E960DB20CA5A634FE04312C0E50A62E0 | FAIL    | SQL Check | Table containing SecureFiles LOB storage belongs to a tablespace with extent allocation type that is not SYSTEM managed (not AUTOALLOCATE) | All Databases | View
6580DCAAE8A28F5BE0401490CACF6186 | WARNING | OS Check  | The number of async IO descriptors is too low (/proc/sys/fs/aio-max-nr) | All Database Servers | View
5ADD88EC8E0AFF2EE0401490CACF0C10 | WARNING | OS Check  | net.core.wmem_max is NOT configured according to recommendation         | All Database Servers | View
84BE4DE1F00AD833E040E50A1EC07771 | INFO    | OS Check  | Kernel parameter fs.file-max is lower than the recommended value        | All Database Servers | View
66E70B43167837ABE040E50A1EC02FEA | INFO    | OS Check  | ORA-00600 errors found in alert log                                     | All Database Servers | View
OraChk Example: Oracle orachk Assessment Report

OraChk highlights failures; here, Data Guard is not set up. System Health Score is 75 out of 100 (detail).

MAA Scorecard:
FAIL | OS Check | Active Data Guard is not configured | All Database Servers | View

DATA CORRUPTION PREVENTION BEST PRACTICES:
FAIL | SQL Parameter Check | Database parameter DB_BLOCK_CHECKSUM is NOT set to recommended value | All Instances | View
Operation

Operation is an ongoing task. Oracle Grid Infrastructure provides all necessary tools for normal operation; operation should not create extra tasks, and automation is the key.

Tools to use:
1. CVU (periodic runs)
2. OraChk (interval runs via daemon)
3. Cluster Health Monitor (CHM/OS)
Operations: Periodic CVU Checks are the Default

[GRID]> crsctl status res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.asmnet1lsnr_asm.lsnr
               ONLINE  ONLINE       argentina                STABLE
               ONLINE  ONLINE       brazil                   STABLE
               ONLINE  ONLINE       germany                  STABLE
...
ora.cvu
      1        ONLINE  ONLINE       brazil                   STABLE
ora.germany.vip
      1        ONLINE  ONLINE       germany                  ...

[GRID]> crsctl status res ora.cvu -p
NAME=ora.cvu
TYPE=ora.cvu.type
ACL=owner:grid:rwx,pgrp:oinstall:rwx,other::r--
ACTIONS=
ACTION_SCRIPT=
ACTION_TIMEOUT=60
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%
AUTO_START=restore
CARDINALITY=1
CHECK_INTERVAL=60
CHECK_RESULTS=PRVF-4090 : Node connectivity failed for interface "*",PRVF-4090 : Node connectivity failed for interface "*",PRVF-4090 : Node connectivity failed for interface "*",PRVF-4090 : Node connectivity failed for interface "*",PRVG-1101 : SCAN name "cupscan.cupgnsdom.localdomain" failed to resolve,PRVF-4657 : Name resolution setup check for "cupscan.cupgnsdom.localdomain" (IP address: 10.1.1.55) failed,PRVF-4090 : Node connectivity failed for interface "*",PRVG-11050 : No matching interfaces "*" for subnet "172.149.0.0" on nodes "argentina,brazil,germany",PRVG-11050 : No matching interfaces "*" for subnet "172.149.0.0" on nodes "argentina,brazil,germany",PRVF-7530 : Sufficient physical memory is not available on node "germany" [Required physical memory = 4GB (4194304.0KB)],PRVF-4354 : Proper hard limit for resource "maximum open file descriptors" not found on node "germany" [Expected = "65536" ; Found = "4096"]
Operations: Setup Periodic OraChk System Checks

Configure & start the orachk daemon for scheduled interval runs:

$ ./orachk -id DBA -set \
> "NOTIFICATION_EMAIL=your.email@company.com;\
> AUTORUN_SCHEDULE=4,8,12,16,20 * * *;\
> AUTORUN_FLAGS=-profile dba;COLLECTION_RETENTION=30"

$ ./orachk -d start
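Once started, the daemon can be checked from the same directory; a sketch (option names per MOS note ID 1268927.1; exact output varies by orachk version):

```shell
./orachk -d status        # is the orachk daemon running?
./orachk -d nextautorun   # when is the next scheduled interval run?
```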
Cluster Health Monitor (CHM/OS)

- Service integrated with the Oracle Clusterware stack
- Introduced in 11.2.0.2 (Linux, Solaris, Windows) and 11.2.0.3 (AIX)
- Gathers OS-level metrics to monitor resource degradation and failure
- Stores data in a central repository (GIMR)
- Runs in real time with locked-down memory for "last gasp" analysis
- Integrates with QoS (Memory Guard) and CRS (server pool categorization)
- Integrated into EM Cloud Control

Architecture: osysmond runs on every node (germany, argentina, brazil, italy); ologgerd runs on one node.
Cluster Health Monitor Daemons / Processes

osysmond
- Function: collects OS metrics; processes raw data for a subset of processes; compresses and sends data to ologgerd; stores and forwards in case of network failures; managed by ohasd
- Instances and location: every node of the cluster (including leaf nodes)

ologgerd
- Function: consumes data from all active osysmonds; stores data in the repository; services requests from clients
- Instances and location: one per cluster (replica for 11.2.x)

oclumon
- Function: command line utility; displays OS-level metrics in historic/real-time mode; performs repository management operations
- Instances and location: can be invoked from any hub node in the cluster
Cluster Health Monitor in EM Cloud Control
Cluster Health Monitor: Command Line Reporting

Command line reporting of current and historic OS metrics (oclumon) from any hub node in the cluster.

Example:
[germany]: > oclumon dumpnodeview -process
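Because dumpnodeview emits plain text, its reports can be post-processed with standard tools. A minimal sketch that flags nodes above a CPU threshold; the here-document input is fabricated to mimic the node-view layout and is not real oclumon output:

```shell
# Flag nodes whose sampled CPU utilization exceeds 80% in a captured
# dumpnodeview report (the sample input below is invented for illustration).
awk '/^Node:/ {node=$2} /^ *cpu:/ {if ($2+0 > 80) print node, $2}' <<'EOF'
Node: germany
 cpu: 91.2
Node: argentina
 cpu: 12.4
EOF
# prints: germany 91.2
```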
Monitoring

Monitoring is an ongoing, proactive task. Optional monitoring is available for an Oracle RAC cluster via QoS and Oracle EM; Quality of Service Management (QoS) comes with a monitoring-only feature.

Tools to use:
1. Oracle Enterprise Manager 12c Cloud Control
2. Oracle Quality of Service Management (Memory Guard)
Monitoring the RAC Cluster with EM Cloud Control
Quality of Service Management: Memory Guard

- QoS feature externalized for general use
- Memory Guard protects resources:
  - Receives a stream of OS memory metrics from CHM/OS
  - Issues an alert should any server be at risk
  - Protects existing work and applications by automatically closing the server to new connections (i.e. stops services on the at-risk node)
  - Automatically re-opens the server to connections once the memory pressure has subsided
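When Memory Guard closes a server to new connections, the visible effect is that services stop on the at-risk node, which can be observed with standard srvctl commands. A sketch (the database and service names are hypothetical):

```shell
# Check where the service currently runs; during memory pressure, Memory
# Guard will have stopped it on the affected node (names are placeholders).
srvctl status service -d orcl -s payroll
# No manual restart is needed: Memory Guard re-enables the service once
# the memory pressure subsides.
```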
Autonomous Computing

Diagram: Self-Optimizing, Self-Protecting, Self-Healing and Self-Configuring properties, supported by QoS, CHM, CHA, HngMgr and policy.
Enabling Autonomous Computing

- Cluster Health Monitor (CHM/OS) & QoS: 11.2+
- Further QoS & CHM enhancements in 12.1.0.2:
  - QoS support for "measure only" with performance objectives and alerts
  - QoS support for measuring and monitoring admin-managed databases
- Cluster Health Advisor: coming soon
Diagnosis

Diagnosis is a recurring task. Ideally, there will be no incidents on the system; realistically, there will be more than one. Diagnosis is a reactive task, and it should be performed as efficiently as possible.

Tools to use:
1. Trace File Analyzer (TFA)
Trace File Analyzer (TFA): log collection in action

- Improved, comprehensive first-failure diagnostics collection
- Efficient collection, packaging and transfer of data
- Collects for all relevant components (OS, Grid Infrastructure, ASM, RDBMS), including Exadata cell nodes
- One command to collect all information from all nodes (or single-instance, single-node)
- More information: MOS note ID 1513912.1
Trace File Analyzer (TFA) intelligent log collection $./tfactl diagcollect One simple command Sending diagcollect request to host : argentina Getting list of files satisfying time range [Tue Sep 03 14:17:43 PDT 2014, Tue Sep 03 18:17:43 PDT 2014] germany: Zipping File: /opt/oracle/oak/oswbb/archive/oswiostat/germany_iostat_14.09.03.1500.dat.gz germany: Zipping File: /u01/app/oracle/diag/rdbms/bill/bill1/trace/alert_bill1.log Trimming file : /u01/app/oracle/diag/rdbms/bill/bill1/trace/alert_bill1.log with original file size : 109kB germany: Zipping File: /opt/oracle/oak/oswbb/archive/oswtop/germany_top_14.09.03.1500.dat.gz germany: Zipping File: /opt/oracle/oak/log/germany/oak/oakd.log Trimming file : /opt/oracle/oak/log/germany/oak/oakd.log with original file size : 9.2MB germany: Zipping File: /u01/app/12.1.0.2/grid/log/germany/gipcd/gipcd.log germany: Zipping File: /u01/app/12.1.0.2/grid/log/germany/agent/ohasd/oraagent_grid/oraagent_grid.log Trimming file : /u01/app/12.1.0.2/grid/log/germany/agent/ohasd/oraagent_grid/oraagent_grid.log with original filesize 4.3MB germany: Zipping File: /var/log/messages germany: Zipping File: /opt/oracle/oak/oswbb/archive/oswslabinfo/germany_slabinfo_14.09.03.1800.dat Collecting ADR incident files... Total Number of Files checked : 10543 Total Size of all Files Checked : 3.9GB Number of files containing required range : 68 Total Size of Files containing required range : 129MB Number of files trimmed : 10 Total Size of data prior to zip : 144MB Saved 63MB by trimming files Zip file size : 8.6MB Total time taken : 47s. ADR Incident files 144MB pruned and compressed down to 8.6MB Logs are collected to: /opt/oracle/tfa/tfa_home/repository/collection_tue_sep_3_18_17_24_pdt_2014_node_all/germany.tfa_tue_sep_3_18_17_24_pdt_2014.zip /opt/oracle/tfa/tfa_home/repository/collection_tue_sep_3_18_17_24_pdt_2014_node_all/argentina.tfa_tue_sep_3_18_17_24_pdt_2014.zip Pruning Relevant files only OS Watcher files 47 seconds! 
1 command, 2 nodes, 4 databases, ASM, Clusterware, OS
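For incidents outside the default window, the collection can be scoped in time; a sketch (the flag syntax varies across TFA versions, so treat this as an assumption to verify against MOS note ID 1513912.1):

```shell
# Collect diagnostics covering the last 8 hours instead of the default range.
./tfactl diagcollect -since 8h
```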
Trace File Analyzer (TFA): Efficiency from A to Z

Diagram: logs from all cluster nodes (e.g. germany and brazil, each running Oracle RAC and Oracle GI as HUBs) are collected centrally.
Utility Cluster

Centralize and standardize storage, deployment, management and diagnostics.

Architecture:
- An Oracle Grid Infrastructure based cluster: Oracle Clusterware with (Flex) ASM and IOsrv on each node (Node1, Node2)
- Enterprise Management (EM) server
- "+1" Grid Home Server (Rapid Home Provisioning)
- Flex ASM storage

Solution-in-a-Box approach on ODA: application domains and database domains on Node 1 and Node 2, backed by a storage server.