Monitoring best practices & tools for running highly available databases
Miguel Anjo & Dawid Wojcik
DM meeting, 20.May.2008
Oracle Real Application Clusters
Architecture
[Diagram: RAC1, RAC2, RAC3, RAC4, RAC5, RAC6]
Highly Available databases
- Oracle services
  - Resources distributed among Oracle services
  - Applications assigned to dedicated service
  - On node failure, resources re-distributed

Service placement, four nodes:
CMS_CONDCOND      Preferred  A1         A2         A3
CMS_C2K           Preferred  A3         A1         A2
CMS_DBS           A2         A3         A1         Preferred
CMS_DBS_W         A3         A1         A2         Preferred
CMS_SSTRACKER     Preferred  Preferred  Preferred  Preferred
CMS_TRANSFERMGMT  A2         Preferred  Preferred  A1

Service placement, three nodes:
CMS_CONDCOND      Preferred  A1         A2
CMS_C2K           A2         Preferred  A1
CMS_DBS           A2         A1         Preferred
CMS_DBS_W         A1         A2         Preferred
CMS_SSTRACKER     Preferred  Preferred  Preferred
CMS_TRANSFERMGMT  Preferred  Preferred  A1
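The re-distribution on node failure can be illustrated with a minimal sketch. It assumes that a service runs on its Preferred nodes and that each failed Preferred node is replaced by the next available node in the order A1, A2, A3. The function name and the node names are hypothetical, and this is not the Oracle Clusterware logic itself.

```python
from typing import List, Set


def running_nodes(preferred: List[str], available: List[str], alive: Set[str]) -> List[str]:
    """Return the nodes a service runs on, given the set of nodes that are up.

    Assumption (not from the slides): every failed preferred node is replaced
    by the first available node that is up, following the A1, A2, ... order.
    """
    # Preferred nodes that are still up keep running the service.
    running = [n for n in preferred if n in alive]
    # One replacement is needed for every preferred node that is down.
    lost = len(preferred) - len(running)
    # Candidates are taken from the available list in its stated order.
    spares = [n for n in available if n in alive and n not in running]
    return running + spares[:lost]
```

With `preferred=["node1"]` and `available=["node2", "node3", "node4"]`, losing `node1` moves the service to `node2`, the first available node.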
Highly Available databases
- Apps and DB release cycle
  - Applications release cycle
    - Development service
    - Validation service
    - Production service
  - Database software release cycle
    - Production service version 10.2.0.n
    - Validation service version 10.2.0.(n+1)
    - Production service version 10.2.0.(n+1)
Why monitor?
Monitor (n.), Computer Science: a program that observes, supervises, or controls the activities of other programs.
- Diagnostics
- Performance
- Reporting
- Need to keep all components in healthy state
- We are prepared for single failures, some double failures
- Commitment to give 24/7 best effort service
- SW misbehavior affecting performance
- Trends might indicate need to grow system
- Security breaches
Monitoring participants
What we monitor
- 25 database clusters
- 124 servers, 450 cores, 150 disk-arrays, 2000 disks at Tier0
- 10 Tier1 sites for Streams replication
- 150+ Oracle services / applications
- 2000+ user schemas
- 1M+ connections/day
PDB-Backup
- 2 node cluster
- Using Oracle Clusterware
- Running:
  - RACMon (monitoring agents)
  - StreamMon (monitoring agents)
  - Backups
  - Scripts repository
- Monitored by Lemon
- Set as Critical in Operator procedures
Monitored components
- Servers
  - Accessibility
  - CDB state
  - Tools: Lemon + RACMon + OEM
- Disk arrays
  - Accessibility
  - State given by controller
  - Firmware, disk state, disk size, disk speed
  - Tools: Lemon + RACMon
- Database SW
  - Clusterware state
  - Service accessibility
  - Space available
  - Oracle Streams
  - Tools: RACMon + OEM + StreamMon
- Database usage
  - OS: CPU, I/O
  - User: sessions, CPU, I/O
  - User quotas, tablespace usage
  - Bad usage (short connections, bind variables)
  - Table fragmentation
  - Tools: RACMon, Reports
Best practices (I)
- No overhead to DB (monitored object)
- Monitor as much as possible
- Presentation layer simple & compact
- Possibility to drill down
Best practices (II)
- Hierarchy of alarms and notifications
- Simplicity for reliability
- Centralized version vs. deployed everywhere
- Independent blocks (monitoring, dashboard, reporting) for HA
Monitoring tools
- Monitoring tools
  - Lemon, SLS
  - Basic Monitoring (in house development)
  - SQL scripts (reactive monitoring)
  - RACMon (in house development, openlab)
  - StreamMon (in house development, openlab)
  - OEM: Oracle Enterprise Manager (Grid Control), openlab
- Service oriented monitoring tools
  - Experiment reports
  - DB Availability & Performance Pages
Basic monitoring
- SSH, SQL*Plus, Select * from dual;
- Checking every 5 minutes
- Each failure: e-mail with error
- 3 consecutive failures: SMS
- Almost perfect for single instance databases
- Limitations
  - On RAC, the system survives single HW failures
  - Users connect to a service, not a database instance
  - No monitoring of other components (storage, clusterware)
  - Missing dashboard view
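The notification rule above (an e-mail on each failure, an SMS after 3 consecutive failures) can be sketched as a small state machine. This is only an illustration: `ProbeState` and `notifications_for` are hypothetical names, and sending the SMS once, exactly on the third consecutive failure, is an assumption about how the rule was applied.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeState:
    """Tracks how many probes in a row have failed for one database."""
    consecutive_failures: int = 0


def notifications_for(state: ProbeState, probe_ok: bool, sms_threshold: int = 3) -> List[str]:
    """Update the state with one probe result and return the notifications to send."""
    if probe_ok:
        # A successful probe resets the failure streak.
        state.consecutive_failures = 0
        return []
    state.consecutive_failures += 1
    # Every failure produces an e-mail.
    actions = ["email"]
    # Assumption: the SMS is sent once, when the streak reaches the threshold.
    if state.consecutive_failures == sms_threshold:
        actions.append("sms")
    return actions
```

A scheduler would call `notifications_for` once per 5-minute check with the outcome of the `Select * from dual;` probe.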
DBA monitoring
- SQL scripts: reactive monitoring (ad-hoc monitoring)
- Pros:
  - Easy to use
  - Fast, real time information
- Cons:
  - No global overview
  - Diagnosing single problem
  - Requires expert knowledge
RACMon requirements
- Reliable (24/7)
- Easy to use and configure
- Provides up to date information (frequent runs)
- Centralized: no configuration or deployment on RAC side
- Web interface (RAC monitoring dashboard): one common place for RACs status
- Monitoring of Oracle services (DB and user level) and Oracle clusterware
- Monitoring of ASM instances (diskgroups and failgroups)
- Monitoring other parts of the infrastructure: backups, storage (easy extensibility)
- Notifications sent via emails & SMSs to DBAs
- Availability numbers (over extended periods of time)
- Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)
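As an illustration of how the availability numbers and the intervention logbook could fit together, the sketch below computes availability from periodic check results while ignoring checks that fall inside logged interventions. The data layout and the function name are assumptions made for this example and do not describe RACMon's actual implementation.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, bool]          # (timestamp, check succeeded)
Intervention = Tuple[float, float]   # (start, end), end exclusive


def availability(samples: List[Sample], interventions: List[Intervention]) -> Optional[float]:
    """Fraction of successful checks, excluding checks made during interventions.

    Returns None when no check remains after the exclusion.
    """
    counted = [
        ok
        for ts, ok in samples
        # Skip checks that fall inside any logged intervention window.
        if not any(start <= ts < end for start, end in interventions)
    ]
    if not counted:
        return None
    return sum(counted) / len(counted)
```

Excluding intervention windows keeps planned downtime from lowering the reported figure, at the cost of relying on the logbook being filled in correctly.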
RACMon Architecture
RACMon - examples
RACMon
- Pros/Features:
  - Customized for our environment
  - Gives an overview of all our HW and RACs
  - Configurable alerts (via email and SMS) and alert levels (production or non-production systems)
  - Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon)
- Cons:
  - Requires manpower for development
Oracle Streams
Oracle Streams enables the propagation and management of data, transactions and events in a data stream, either within a database or from one database to another.
StreamMon
StreamMon
- Streams availability and usage monitoring
- Built-in alerting in case of any error in the Streams stack
- Pros:
  - Monitoring of all T1 sites in one place (Streams monitoring not available in any other tool, including OEM)
  - Convenient and easy to use web interface
  - Advanced plotting utilities
- Cons:
  - Required manpower for development (currently in maintenance only)
  - Uses non-standard libraries, requires customized server
Oracle Enterprise Manager
- Architecture:
  - Agent running on each server uploads information to the central repository; if the repository is not available, it caches data
  - Management Service provides insight into any monitored target's details
  - Management Service sends e-mails (SMSes) based on the configured metrics and policies
  - Proactive monitoring possible (actions based on problem diagnostics)
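The agent behaviour described above (upload to a central repository, cache when it is unreachable) is a store-and-forward pattern. The sketch below shows the general idea only. `BufferingAgent`, its `upload` callback and the cache limit are hypothetical and are not taken from the OEM agent.

```python
from collections import deque
from typing import Any, Callable


class BufferingAgent:
    """Store-and-forward sketch: cache metrics while the repository is unreachable."""

    def __init__(self, upload: Callable[[Any], None], max_cached: int = 1000) -> None:
        # `upload` is assumed to raise ConnectionError when the repository is down.
        self._upload = upload
        # Bounded cache: when it is full, the oldest metrics are dropped first.
        self._cache: deque = deque(maxlen=max_cached)

    def report(self, metric: Any) -> None:
        """Queue a metric and try to deliver everything that is pending."""
        self._cache.append(metric)
        self.flush()

    def flush(self) -> None:
        """Upload cached metrics in order; stop at the first failure."""
        while self._cache:
            try:
                self._upload(self._cache[0])
            except ConnectionError:
                return  # Repository unavailable: keep the data for the next attempt.
            self._cache.popleft()

    @property
    def pending(self) -> int:
        """Number of metrics still waiting to be uploaded."""
        return len(self._cache)
```

A metric is removed from the cache only after its upload succeeds, so an outage delays data instead of losing it, up to the cache limit.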
Oracle Enterprise Manager
Oracle Enterprise Manager Grid Control features
Oracle Enterprise Manager
- Pros:
  - Highly configurable alerts, metrics and notification policies
  - Advanced and easy to use web interface
  - Easy drill down
  - External product, fully supported
- Cons:
  - Universal: requires more navigation
  - No global overview (per target oriented)
  - Customization for many targets requires much work
  - Bugs may be intrusive (e.g. affecting streams, excessive memory/cpu consumption, storage, DB instances)
  - Manpower required for maintenance and configuration
  - Not reliable enough for 24/7 monitoring
Weekly reports
- Targeted to experiment DBAs and Coordinators
- Information about:
  - Bookkeeping: application names, contacts
  - Resource usage: sessions, CPU, logical and physical I/O
  - Security: connection errors, expiring passwords, unused schemas
  - Space: consumed, fragmentation, recycle bin
  - Bad usage: short connections, queries missing bind variables
Weekly reports
- PHP scripts
- Generate report over last 7 days
- Specific to one RAC cluster
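One of the "bad usage" items in the weekly reports, short connections, can be illustrated with a small aggregation over the last 7 days. This is a sketch in Python rather than the PHP used for the actual reports. The session record layout, the one-second threshold and the function name are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Session = Tuple[str, datetime, datetime]  # (schema, logon time, logoff time)


def short_connections(
    sessions: Iterable[Session],
    now: datetime,
    window: timedelta = timedelta(days=7),
    max_duration: timedelta = timedelta(seconds=1),  # hypothetical threshold
) -> Counter:
    """Count, per schema, the sessions in the window that lasted less than max_duration."""
    since = now - window
    counts: Counter = Counter()
    for schema, logon, logoff in sessions:
        in_window = since <= logon <= now
        if in_window and (logoff - logon) < max_duration:
            counts[schema] += 1
    return counts
```

Sorting the result with `counts.most_common()` gives the schemas that open the most short-lived connections.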
Weekly reports
Weekly reports
- Current functionality
  - Simple way to visualize whole DB usage
  - Concentrates on main users (dynamic)
  - Easy to spot problems (color coded)
- Very good feedback from our users
- Now working on user configurable reports
DB availability and performance page
- PHP, aggregation of other tools
- Requested by experiments
- Dashboard of current DB activity
- Almost real time monitoring (up to last hour)
- Application resource usage
- No extra load: uses SLS, RACMon, StreamMon, weekly reports
- Possibility to drill down
DB availability and performance page
Summary
- Many monitoring components developed for our environment
- Out of the box tools not sufficient
- Open frameworks: new features easily added
- Feedback given to Oracle Enterprise Manager development (openlab)
- Very good feedback from T1s and experiments
- Components included in experiment dashboards, WLCG ServiceMaps, SLS