Oracle Clusterware and Private Network Considerations

Much of this presentation is attributed to Michael Zoll and work done by the RAC Performance Development group.
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
- Architectural Overview
- RAC and Cache Fusion Performance
- Infrastructure
- Common Problems and Resolution
- Aggregation and VLANs
Oracle Clusterware

[Diagram: each cluster node runs the Clusterware stack (CSSD, OPROCD, ONS, CRSD, EVMD) on its OS, with CSSD running at real-time priority. Nodes are joined by a cluster private high-speed network through an L2/L3 switch; VIPs (VIP1..VIPn) sit on the public network. Shared storage holds the OCR and voting disks on raw devices.]
Under the Covers

[Diagram: instances 1..n, one per node, each with an SGA containing the Global Resource Directory, dictionary cache, library cache, log buffer and buffer cache. Background processes per instance include LMON, LMD0, DIAG, LMS0 (running at real-time priority), VKTM, LGWR, DBW0, SMON and PMON. Instances communicate over the cluster private high-speed network through an L2/L3 switch; each instance has its own redo log files, and all share the data files and control files.]
Global Cache Service (GCS)
- Manages coherent access to data in the buffer caches of all instances in the cluster
- Minimizes access time to data that is not in the local cache: access to data in the global cache is faster than disk access
- Implements fast direct memory access over high-speed interconnects for all data blocks and types
- Uses an efficient and scalable messaging protocol: never more than 3 hops
- New optimizations for read-mostly applications
Cache Hierarchy: Data in Remote Cache

[Diagram: local cache miss -> data block requested from the holding instance -> remote cache hit -> data block returned over the interconnect.]
Cache Hierarchy: Data on Disk

[Diagram: local cache miss -> data block requested -> remote cache miss -> grant returned -> requesting instance reads the block from disk.]
Cache Hierarchy: Read Mostly

[Diagram: local cache miss -> no message required -> instance reads the block directly from disk.]
11.1 CPU Optimizations for Read-Intensive Operations
- Read-only access: no messages, direct reads
- Read-mostly access: message reductions, latency improvements
- Significant gains: 50-70% reductions measured
Performance of Cache Fusion

[Timing diagram of a block round trip: the requester initiates a send and waits; the request message (~200 bytes, i.e. 200 bytes/(1 Gb/sec) wire time) travels to the LMS process, which receives it, processes the block and sends it back; the block (e.g. 8K, i.e. 8192 bytes/(1 Gb/sec) wire time) is then received.]

- Total access time: e.g. ~360 microseconds (UDP over GbE)
- Network propagation delay ("wire time") is a minor factor in the roundtrip time (approx. 6%, vs. 52% in the OS and network stack)
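To make the wire-time figures concrete, the serialization delay of the two messages at 1 Gb/s can be checked with a quick calculation (a sketch using bc; 1 bit at 1 Gb/s takes 1 ns, so dividing bits by 1000 yields microseconds):

  $ echo "200*8/1000" | bc -l     # ~1.6 us to serialize the 200-byte request at 1 Gb/s
  $ echo "8192*8/1000" | bc -l    # ~65.5 us to serialize an 8K block at 1 Gb/s

The remainder of the ~360 us roundtrip is dominated by processing in the OS and network stack, which is why the wire itself is a minor factor.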
Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

Roundtrip (ms)   2K     4K     8K     16K
UDP/GbE          0.30   0.31   0.36   0.46
RDS/IB           0.12   0.13   0.16   0.20

(*) Roundtrip, blocks not busy, i.e. no log flush and no serialization ("buffer busy").
- AWR and Statspack reports show averages as if latencies were normally distributed; the session wait history, included in Statspack in 10.2 and AWR in 11g, shows the actual quantiles.
- The minimum values in this table are the optimal values for 2-way and 3-way block transfers, and can be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).
Infrastructure: Network Packet Processing

[Diagram: the TX and RX paths through the stack on each side of an L2/L3 switch: process (FG/LMS) -> socket layer (tx/rx buffers) -> UDP/TCP and IP layers -> interface layer.]
Infrastructure: Interconnect Bandwidth
- Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
- Typical utilization is approx. 10-30% in OLTP
- 10000-12000 8K blocks per second saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth); see the check below
- Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP; DSS/DW systems should be designed with > 1 Gb/sec capacity
- A sizing approach with rules of thumb is described in "Project MegaGrid: Capacity Planning for Large Commodity Clusters" (http://otn.oracle.com/rac)
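A quick sanity check of the saturation figure (a sketch; 1 GbE's theoretical payload rate is 125 MB/sec):

  $ echo "12000*8192" | bc          # 98,304,000 bytes/sec, i.e. ~98 MB/sec
  $ echo "scale=2; 98.3/125" | bc   # ~0.79, i.e. ~79% of the theoretical 125 MB/sec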
Infrastructure: Private Interconnect
- The network between the nodes of a RAC cluster MUST be private
- Supported links: GbE, IB (IPoIB: 10.2)
- Supported transport protocols: UDP, RDS (10.2.0.3)
- Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
- Large ("Jumbo") Frames for GbE are recommended if the global cache workload requires it, i.e. mostly global cache block shipping rather than small lock-message passing
Network Packet Processing: Layers, Queues and Buffers

[Diagram: the same TX/RX stack annotated with where data queues or drops: socket buffers and socket queues at the socket layer (user/kernel boundary, recv()), software interrupts with TX IP queues and the RX IP input queue at the IP layer, hardware interrupts with RX queues and RX buffers at the interface layer, backplane pressure on the CPU, and ingress/egress buffers in the L2/L3 switch.]
Infrastructure: IPC Configuration
Important settings:
- Negotiated top bit rate and full duplex mode
- NIC ring buffers
- Ethernet flow control settings
- CPU(s) receiving network interrupts
Verify your setup (see the commands below):
- CVU does checking
- Load testing eliminates potential for problems
- AWR and ADDM give estimations of link utilization
Buffer overflows, congested links and flow control can have severe consequences for performance.
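On Linux these settings can be inspected with standard tools (a sketch; eth1 stands in for whatever interface carries the private interconnect):

  $ ethtool eth1                # negotiated speed and duplex (expect 1000 / full)
  $ ethtool -g eth1             # NIC ring buffer sizes, current vs. maximum
  $ ethtool -a eth1             # pause-frame (flow control) settings
  $ grep eth1 /proc/interrupts  # which CPU(s) service the NIC's interrupts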
Infrastructure: Operating System
- Block access latencies increase when CPU(s) are busy and run queues are long
- Immediate LMS scheduling is critical for predictable block access latencies when CPUs are > 80% busy
- Fewer and busier LMS processes may be more efficient; monitor their CPU utilization
  - Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time
  - The default is good for most requirements
- Higher priority for LMS is the default; the implementation is platform-specific
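One way to confirm on Linux that LMS really runs in a real-time scheduling class (a sketch; process names of the form ora_lms* are assumed):

  $ ps -eo pid,cls,rtprio,pcpu,comm | grep -i lms   # cls RR and a non-empty rtprio indicate real-time scheduling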
Common Problems and Symptoms
- "Lost Blocks": interconnect or switch problems
- System load and scheduling
- Contention
- Unexpectedly high global cache latencies
Misconfigured or Faulty Interconnect Can Cause:
- Dropped packets/fragments
- Buffer overflows
- Packet reassembly failures or timeouts
- Ethernet flow control kicking in
- TX/RX errors
- "Lost blocks" at the RDBMS level, responsible for 64% of escalations
"Lost Blocks": NIC Receive Errors

db_block_size = 8K

$ ifconfig -a
eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
      inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
      TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
"Lost Blocks": IP Packet Reassembly Failures

$ netstat -s
Ip:
    84884742 total packets received
    1201 fragments dropped after timeout
    3384 packet reassembles failed
Finding a Problem with the Interconnect or IPC

Top 5 Timed Events
Event              Waits     Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
------------------ --------- -------- -------------- ----------------- ----------
log file sync      286,038   49,872   174            41.7              Commit
gc buffer busy     177,315   29,021   164            24.3              Cluster
gc cr block busy   110,348   5,703    52             4.8               Cluster
gc cr block lost   4,272     4,953    1159           4.1               Cluster
cr request retry   6,316     4,668    739            3.9               Other

gc cr block lost and cr request retry should never be here.
Global Cache Lost Block Handling
- Detection time in 11g reduced to 500 ms (around 5 secs in 10g)
  - Can be lowered if necessary
  - Robust (no false positives), no extra overhead
- The cr request retry event is related to lost blocks: it is highly likely to appear when gc cr blocks lost shows up
Interconnect Statistics: Automatic Workload Repository (AWR)

Target     Avg Latency  Stddev    Avg Latency  Stddev
Instance   500B msg     500B msg  8K msg       8K msg
---------- ------------ --------- ------------ --------
1          .79          .65       1.04         1.06
2          .75          .57       .95          .78
3          .55          .59       .53          .59
4          1.59         3.16      1.46         1.82

- Latency probes for different message sizes
- Exact throughput measurements (not shown)
- Send and receive errors, dropped packets (not shown)
Lost Blocks: Solution
- Fix interconnect NICs and switches
- Tune IPC buffer sizes
CPU Saturation or Long Run Queues

Top 5 Timed Events
Event                       Waits      Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
--------------------------- ---------- -------- -------------- ----------------- ----------
db file sequential read     1,312,840  21,590   16             21.8              User I/O
gc current block congested  275,004    21,054   77             21.3              Cluster
gc cr grant congested       177,044    13,495   76             13.6              Cluster
gc current block 2-way      1,192,113  9,931    8              10.0              Cluster
gc cr block congested       85,975     8,917    104            9.0               Cluster

"Congested": LMS could not dequeue messages fast enough.
Cause: long run queue, CPU starvation.
High CPU Load: Solution
- Run LMS at higher priority (the default)
- Start more LMS processes (see the sketch below), but never use more LMS processes than CPUs
- Reduce the number of user processes
- Find the cause of the high CPU consumption
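The LMS count is controlled by the GCS_SERVER_PROCESSES init parameter; a sketch of checking and raising it (the value 4 is only an example, and the parameter is static, so it takes effect after an instance restart):

  $ sqlplus -s / as sysdba <<'EOF'
  SHOW PARAMETER gcs_server_processes
  ALTER SYSTEM SET gcs_server_processes=4 SCOPE=SPFILE SID='*';
  EOF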
Contention

Event                   Waits    Time(s)  Avg (ms)  % Call Time
----------------------- -------- -------- --------- -----------
gc cr block 2-way       317,062  5,767    18        19.0
gc current block 2-way  201,663  4,063    20        13.4
gc buffer busy          111,372  3,970    36        13.1
CPU time                         2,938              9.7
gc cr block busy        40,688   1,670    41        5.5

Global contention on data: serialization. It is very likely that gc cr block busy and gc buffer busy are related.
Contention: Solution
- Identify hot blocks in the application (see the query below)
- Reduce concurrency on hot blocks
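One common way to find the hot segments is via the segment-level statistics (a sketch for 10g, where the statistic is named 'gc buffer busy'; 11g splits it into acquire/release variants):

  $ sqlplus -s / as sysdba <<'EOF'
  SELECT owner, object_name, value
  FROM   v$segment_statistics
  WHERE  statistic_name = 'gc buffer busy'
  ORDER  BY value DESC;
  EOF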
High Latencies

Event                   Waits    Time(s)  Avg (ms)  % Call Time
----------------------- -------- -------- --------- -----------
gc cr block 2-way       317,062  5,767    18        19.0
gc current block 2-way  201,663  4,063    20        13.4
gc buffer busy          111,372  3,970    36        13.1
CPU time                         2,938              9.7
gc cr block busy        40,688   1,670    41        5.5

Expected: to see 2-way and 3-way events.
Unexpected: averages > 1 ms (avg ms should be around 1 ms).
Tackle latency first, then tackle busy events.
High Latencies: Solution
- Check the network configuration
  - Private
  - Running at the expected bit rate
- Find the cause of high CPU consumption
  - Runaway or spinning processes
Health Check

Look for:
- Unexpected events: gc cr block lost (e.g. 1159 ms avg)
- Contention and serialization: gc cr/current block busy (e.g. 52 ms avg)
- Load and scheduling: gc current block congested (e.g. 36 ms avg)
- Unexpectedly high averages: gc cr/current block 2-way (e.g. 14 ms avg)
Gigabit Ethernet
- Definition: max bandwidth 1000 Mbit/sec = 125 MB/sec; excluding headers and pause frames, ~118 MB/sec
- Equates to ~85000 Clusterware/RAC messages or ~14000 8K blocks per second
- A RAC workload has a mix of short messages (~256 bytes) and long messages (db_block_size)
- For a real-life workload, only 60-70% of the bandwidth can be sustained
- For a RAC-type workload, ~40 MB/sec per interface is the optimal load
- For additional bandwidth, more interfaces can be aggregated
Aggregation: Active/Standby (Single Switch)

[Diagram: NICs ce2 and ce4 on one node cabled to a single switch; ce4:1 is the logical data address.]

$ ifconfig -a
ce2: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation: Active/Active (Single Switch)

[Diagram: NICs ce2 and ce4 on one node cabled to a single switch, both active; ce4:1 is the logical data address.]

$ ifconfig -a
ce2: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation: Active/Standby (Switch Redundancy)

[Diagram: each node's NICs are split across two switches (ce2/ce4 on one, ce8/ce10 on the other) so a switch failure does not take down the interconnect; ce4:1 is the logical data address.]

$ ifconfig -a
ce10: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation Solutions
- Cisco EtherChannel, based on 802.3ad
- AIX EtherChannel
- HP-UX Auto Port Aggregation
- Sun Trunking, IPMP, GLD
- Linux bonding (only certain modes)
- Windows NIC teaming
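For illustration, a minimal Linux active-backup bonding setup of the kind the slide alludes to (a sketch for a RHEL 4/5-era system; device names, addresses and the choice of active-backup mode are assumptions, not recommendations):

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=active-backup miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=192.168.83.35
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth1 (repeat for eth2)
  DEVICE=eth1
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

Active-backup is one of the "certain modes" that behaves predictably for the interconnect, since it avoids spreading a single conversation across links.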
Aggregation Methods
- Load balance / failover / load spreading
  - Some methods spread on sends but serialize on receives
- Active/Standby
Oracle interconnect requirements:
- Both send- and receive-side load balancing
- NIC and switch port failure detection
General Interconnect Recommendations
- For OLTP workloads: normally 1 Gbit Ethernet with redundancy (active/standby or load-balance) is sufficient
- For DW workloads: multiple GbE aggregated, 10 GbE, or InfiniBand
Oracle RAC Cluster Interconnect Network Selection
- Oracle Clusterware: the IP address associated with the private hostname (provided during the install interview)
- Oracle RAC database: the private network specified during the install interview, or the IP address provided via the CLUSTER_INTERCONNECTS parameter
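Two checks that may help confirm which network was actually selected (a sketch; GV$CLUSTER_INTERCONNECTS is available from 10.2, and oifcfg ships with Oracle Clusterware):

  $ oifcfg getif            # interfaces registered as public / cluster_interconnect
  $ sqlplus -s / as sysdba <<'EOF'
  SELECT * FROM gv$cluster_interconnects;
  EOF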
Jumbo Frames
- Not an IEEE standard
- Useful for NAS/iSCSI storage
- Network device interoperability issues
- Configure with care and test rigorously (see the check below)

Excerpt from alert.log:
  Maximum Tranmission Unit (mtu) of the ether adapter is different on the node running instance 4, and this node. Ether adapters connecting the cluster nodes must be configured with identical mtu on all the nodes, for Oracle. Please ensure the mtu attribute of the ether adapter on all nodes [and switch ports] are identical, before running Oracle.
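A quick way on Linux to verify that a 9000-byte MTU really works end to end before handing the network to Oracle (a sketch; eth1 and node2-priv are placeholder names):

  $ ifconfig eth1 mtu 9000               # must match on every node's private NIC and the switch ports
  $ ping -M do -s 8972 -c 2 node2-priv   # 8972 bytes payload + 28 bytes ICMP/IP headers = 9000; must not fragment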
UDP Socket Buffer (rx)
- Default settings are adequate for the majority of customers
- May need to increase the allocated buffer size when:
  - the MTU size increases
  - netstat reports fragmentation and/or reassembly errors
  - ifconfig reports dropped packets or overflows
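On Linux the receive-buffer limits live in sysctl (a sketch; the values shown are illustrative, not a recommendation):

  $ sysctl -w net.core.rmem_max=4194304        # maximum receive buffer a socket may request
  $ sysctl -w net.core.rmem_default=1048576    # default receive buffer size
  $ netstat -s | egrep -i 'fragment|reassembl' # recheck for fragmentation/reassembly errors afterwards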
Cluster Interconnect NIC Settings
- NIC-driver dependent; DEFAULTS ARE GENERALLY SATISFACTORY
- Changes can occur between OS versions
  - Linux 2.4 => 2.6 kernels: flow control on e1000 drivers, NAPI interrupt coalescence in 2.6
- Confirm flow control: rx=on, tx=off (see below)
- Confirm the full bit rate (1000) for the NICs
- Confirm full duplex, auto-negotiate
- Ensure NIC names/slots are identical on all nodes
- Configure interconnect NICs on the fastest PCI bus
- Ensure compatible switch settings
  - 802.3ad on NICs = 802.3ad on switch ports
  - MTU=9000 on NICs = MTU=9000 on switch ports
FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING.
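The flow-control recommendation can be applied and re-checked with ethtool (a sketch; eth1 is a placeholder for the private NIC):

  $ ethtool -A eth1 rx on tx off   # flow control: accept pause frames, do not send them
  $ ethtool -a eth1                # confirm the pause settings took effect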
The Interconnect and VLANs
- The interconnect should be a dedicated, non-routable subnet mapped to a single dedicated, non-shared VLAN
- If VLANs are trunked, the interconnect VLAN traffic should not extend beyond the access switch layer
- Minimize the impact of Spanning Tree events
- Monitor the switch(es) for congestion
- Avoid QoS definitions that may negatively impact interconnect performance
Q & A