How Oracle Exadata Delivers Extreme Performance
John Clarke, Sr. Oracle Architect
Dallas, Detroit, Los Angeles, Singapore, India
www.centroid.com
Page 1
Agenda
Ø Oracle Exadata: Hardware and Software Optimized
Ø Exadata Hardware Components
Ø How Oracle Works on Exadata
Ø Exadata Software Components
Ø Smart Scan and Cell Offload Processing
Ø Storage Indexes
Ø Smart Flash Cache
Ø Hybrid Columnar Compression
Ø IO Resource Management
Ø How to Get There from Here
Ø A New Way to Look at Optimization
Page 2
Exadata: Hardware and Software Optimized
The Exadata Database Machine is designed for extreme performance, manageability, consolidation, and high availability. Oracle achieves this by:
Ø Utilizing an optimally-balanced hardware infrastructure
Ø Using fast hardware components designed to eliminate IO, CPU, and network bottlenecks
Ø Delivering significant Oracle software enhancements that leverage the balanced hardware configuration
The Exadata Database Machine hardware is just part of the solution; the combination of a balanced hardware configuration with Exadata's storage software is what delivers extreme performance.
Page 3
Exadata Hardware Database servers, or Compute nodes InfiniBand Interconnect Exadata Storage Servers, or Cells Page 4
Exadata Storage Servers
Ø Exadata storage servers are self-contained storage platforms that house disk storage and run the Exadata Storage Server software provided by Oracle
Ø A single Exadata storage server is also known as a cell
Ø A cell is the building block for the Exadata storage grid
Ø More cells provide greater capacity and IO bandwidth
Ø Databases are typically deployed across multiple cells
Ø The storage cells communicate with the Oracle database and ASM instances over InfiniBand
Ø Each cell is wholly dedicated to Oracle database files
Page 5
Exadata Storage Servers
Ø Exadata storage servers provide high-performance storage for Oracle databases:
Ø Up to 1.8 GB/sec raw data bandwidth per cell
Ø Up to 75,000 IOPS using Flash per cell
Ø Based on 64-bit Sun Fire servers
Ø Come installed with Exadata storage server software, Oracle Linux x86_64, drivers, and utilities
Ø Storage servers are only available with the Exadata Database Machine
Page 6
Exadata Storage Servers (Sun Fire X4270 M2)
Ø Processors: 2 six-core Intel Xeon L5640 processors (2.26 GHz)
Ø Memory: 24 GB (6 x 4 GB)
Ø Local Disks: 12 x 600 GB 15K RPM High Performance SAS, or 12 x 2 TB 7.2K RPM High Capacity SAS
Ø Flash: 4 x 96 GB Sun Flash Accelerator F20 PCIe cards
Ø Disk Controller: HBA with 512 MB battery-backed cache
Ø Network: Two InfiniBand 4X QDR (40 Gb/s) ports (1 dual-port PCIe HCA); four embedded GbE ports
Ø Remote Management: 1 Ethernet port for ILOM
Ø Power Supplies: 2 redundant hot-swappable power supplies
Page 7
Exadata Database Servers
Ø The compute node grid consists of multiple database servers
Ø Oracle 11gR2 and ASM run on the database servers
Ø Companies typically run RAC databases on the database servers to achieve high availability and maximize the aggregate CPU and memory horsepower of the compute node grid
Page 8
Exadata Database Servers (64-bit Oracle Enterprise Linux 5.5; Solaris available on the X2-8)
Ø Processors: 2 six-core Intel Xeon X5670 processors (2.93 GHz)
Ø Memory: 96 GB (12 x 8 GB)
Ø Local Disks: 4 x 300 GB 10K RPM SAS disks
Ø Disk Controller: HBA with 512 MB battery-backed cache
Ø Network: Two InfiniBand 4X QDR (40 Gb/s) ports; four 1 GbE Ethernet ports; two 10 GbE Ethernet ports
Ø Remote Management: 1 Ethernet port for ILOM
Ø Power Supplies: 2 redundant hot-swappable power supplies
Page 9
InfiniBand Network
Ø InfiniBand is the Exadata storage network
Ø Looks like normal Ethernet to hosts, with the efficiency of a SAN
Ø Used for both storage and the RAC interconnect
Ø Uses the high-performance ZDP InfiniBand protocol (RDS v3)
Page 10
InfiniBand Network
Ø Oracle uses InfiniBand because of its proven track record with high-performance computing; it provides 40 Gb/sec in each direction
Ø Looks like Ethernet but is much faster:
Ø Uses zero copy, which means data is transferred across the network without intermediate buffer copies in the various network layers
Ø Uses buffer reservation, which means the hardware knows exactly where to place buffers ahead of time
Ø A unified network fabric for both Exadata storage and the RAC interconnect simplifies cabling and networking
Ø On top of InfiniBand, Exadata uses the Zero-loss Zero-copy Datagram Protocol (ZDP), built on RDS v3. ZDP has very low CPU overhead, with tests showing only 2 percent CPU utilization while transferring 1 GB/sec of data.
Ø Each Exadata server is configured with one dual-port InfiniBand card connected to two separate InfiniBand switches for high availability. Each InfiniBand link can carry the full data bandwidth of the entire cell, which means an entire switch can be lost without losing any performance.
Page 11
Other Hardware Components
Ø Embedded Cisco switch for data center network uplink
Ø KVM switch for management of storage servers and compute nodes
Ø Multiple PDUs with management interfaces
Ø ILOM (Integrated Lights Out Management) capability for each component in the rack
Page 12
Exadata Configuration Options
Oracle offers four configuration options:
Ø Exadata X2-2 Quarter Rack
Ø Exadata X2-2 Half Rack
Ø Exadata X2-2 Full Rack
Ø Exadata X2-8 Full Rack
Ø Each configuration is available with either High Performance or High Capacity disks
Ø The difference between the configurations is the number of storage servers and compute nodes
Ø Oracle keeps the list of hardware configurations short because Exadata requires a balanced hardware configuration
Ø A half rack can be upgraded to a full rack, and multiple racks can be interconnected
Ø Quarter racks have 2 InfiniBand switches; half and full racks have 3
Page 13
Exadata Configuration Options: Compute Nodes

Configuration       # Compute Nodes  Cores/Node  Total Cores  Memory/Node  Total Memory
X2-2 Quarter Rack          2             12           24         96 GB        192 GB
X2-2 Half Rack             4             12           48         96 GB        384 GB
X2-2 Full Rack             8             12           96         96 GB        768 GB
X2-8 Full Rack             2             64          128          1 TB          2 TB

Page 14
Exadata Configuration Options: Exadata Storage Servers

Configuration       # Storage Cells  Disks/Cell  Total Disks   Flash    Raw Storage (HP)  Raw Storage (HC)
X2-2 Quarter Rack          3             12           36      1.15 TB        21.6 TB            72 TB
X2-2 Half Rack             7             12           84      2.7 TB         50.4 TB           168 TB
X2-2 Full Rack            14             12          168      5.4 TB        100.8 TB           336 TB
X2-8 Full Rack            14             12          168      5.4 TB        100.8 TB           336 TB

Page 15
Exadata Configuration Options: Exadata Storage Servers

Metric (all figures "up to")              X2-2 Quarter Rack  X2-2 Half Rack  X2-2 Full Rack  X2-8 Full Rack
Raw disk throughput, High Performance       5,400 MBPS        12,600 MBPS     25,200 MBPS     25,200 MBPS
Raw disk throughput, High Capacity          3,000 MBPS         7,000 MBPS     14,000 MBPS     14,000 MBPS
Disk IOPS, High Performance disks          10,800             25,200          50,400          50,400
Disk IOPS, High Capacity disks              4,320             10,080          20,160          20,160
Flash IOPS                                225,000            525,000       1,050,000       1,050,000

Page 16
How Oracle Works on Exadata Oracle 11gR2 Oracle RAC Database storage uses ASM Page 17
How Oracle Works on Exadata Each storage cell contains: Exadata Storage Software Disks (LUNs, cell disks, and Grid Disks) ASM disk groups built on Exadata Grid Disks Exadata features implemented on storage servers and exploited by databases on compute nodes Page 18
How Oracle Works on Exadata
Oracle communicates with the Exadata cells using the iDB (Intelligent Database) protocol
iDB is implemented using LIBCELL; the Oracle binaries are linked with LIBCELL to facilitate cell communication
Page 19
How Oracle Works on Exadata
CELLSRV is the primary component of the cell software and provides the majority of its services
Oracle database and ASM processes use LIBCELL to communicate with CELLSRV processes
CELLSRV is what delivers the unique Exadata features
Page 20
Oracle on Exadata
Each Exadata cell has 12 physical disks
Oracle reserves 29 GB on each of the first two disks in each cell for the system area
Page 21
Oracle on Exadata One LUN is created on each physical disk A LUN is the unit on which a Cell Disk can be created Page 22
Oracle on Exadata Cell Disks are created on LUNs Cell Disks represent the storage area for each LUN Page 23
Oracle on Exadata Grid Disks are built on Cell Disks A Grid Disk can consume all space on a cell disk or an administrator-specified chunk of the Cell Disk The outermost tracks on a Grid Disk have the highest performance characteristics Building multiple Grid Disks per Cell Disk allows the administrator to segregate storage by usage and performance demands Interleaved Grid Disks can increase probability of important extents being on higher-performing disk tracks Page 24
Oracle on Exadata
Grid Disks are directly exposed to ASM
ASM disk groups are built on Grid Disks
Page 25
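The LUN, cell disk, grid disk, ASM chain above can be sketched end to end. The commands below are illustrative only; the DATA prefix, redundancy level, and compatibility attributes are assumptions, not values from the slides:

```sql
-- On each cell, from CellCLI (shown as comments, since the rest is SQL):
--   CellCLI> CREATE CELLDISK ALL HARDDISK
--   CellCLI> CREATE GRIDDISK ALL HARDDISK PREFIX=DATA
-- Then, from an ASM instance, build a disk group over the grid disks,
-- addressing them with the o/<cell-ip>/<griddisk> naming convention:
CREATE DISKGROUP data NORMAL REDUNDANCY
  DISK 'o/*/DATA*'
  ATTRIBUTE 'compatible.asm'          = '11.2.0.0.0',
            'compatible.rdbms'        = '11.2.0.0.0',
            'cell.smart_scan_capable' = 'TRUE',
            'au_size'                 = '4M';
```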
Exadata Software Features at a Glance Exadata Software Goals Fully and evenly utilize all Computing resources in Exadata Database Machine to eliminate bottlenecks and deliver consistent high performance Reduce demand for resources by eliminating any IO that can be discarded without impacting result Page 26
Exadata Smart Scan
Ø One of the most important Exadata software features, and typically where discussions of Exadata software features start
Ø One of several cell offload features in Exadata; cell offload is Exadata's shifting of database work from the database servers to the Exadata storage servers
Ø The primary goal of Smart Scan is to perform the majority of IO processing on the storage servers and return smaller amounts of data from the storage infrastructure to the database servers
Ø Provides dramatic performance improvements for eligible SQL operations (full table scans, fast full index scans, joins, etc.)
Ø Smart Scan is invoked automatically on Exadata for eligible SQL operations
Ø Smart Scan only works on Exadata
Page 27
Traditional SQL Processing
1. User submits a query: select state, count(*) from census group by state;
2. The query is parsed and an execution path is determined; extents are identified and IO is issued
3. Oracle retrieves blocks based on the extents identified. A block is the smallest unit of transfer from the storage system and contains rows and columns
4. The database instance processes the blocks: all required blocks are read from the storage system into the buffer cache
5. Once the blocks are loaded into the buffer cache, Oracle filters rows and columns and returns the result to the user
Page 28
Smart Scan Processing
1. User submits a query: select state, count(*) from census group by state;
2. An iDB command is constructed and sent to the Exadata cells
3. Each Exadata cell scans data blocks and extracts the relevant rows and columns that satisfy the SQL query
4. Each Exadata cell returns to the database instance an iDB message containing the requested rows and columns. These are not block images, and they are returned to the PGA
5. The database consolidates the results from across the Exadata cells and returns the rows to the client
Page 29
Smart Scan and Cell Offload Processing
Ø Filtering operations are offloaded to the Exadata storage cells:
Ø Column filtering
Ø Predicate filtering (i.e., row filtering)
Ø Join filtering
Ø Only requested rows and columns are returned to the database server
Ø Significantly less IO transfer over the storage network; less memory and CPU required on the database tier nodes
Ø Rows and columns are retrieved into the user's PGA via the direct path read mechanism, not through the buffer cache, so large IO requests don't saturate the buffer cache
Page 30
Smart Scan and Cell Offload Processing
What's eligible for Smart Scan on Exadata?
Ø Single-table full table scans (not IOTs, not BLOBs, no hash clusters)
- Oracle is a C program; kcfis_read is the function that does the smart scan
- kcfis_read is called by the direct path read function, kcbldrget
- kcbldrget is called from the full scan functions
Ø Serial direct reads enabled
Ø Join filtering using Bloom filters
- Bloom filtering determines which rows are required to satisfy a join
- A Bloom filter is created with the values of the join column for the smaller table of a join, and this filter is used by the Exadata storage server to eliminate row candidates from the larger table
Page 31
Smart Scan in Action
Our table is 4.53 GB in size and has almost 180 million rows
All of its indexes are invisible
A full scan returned in 5.27 seconds!
Page 32
Measuring Smart Scan: cell_offload_plan_display

Setting   Meaning
AUTO      Shows offload-related information in the plan display if an Exadata cell is present and the objects are on the cell
ALWAYS    Shows offload-related information if the SQL statement is offloadable, whether or not it is running on an Exadata cell. Useful for simulating plans on 11gR2 databases not running on Exadata
NEVER     Never shows offload-related information in plans

TABLE ACCESS STORAGE FULL indicates that the query is cell-offloadable, i.e., eligible for Smart Scan
Page 33
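As a sketch, the setting and the plan check look like this (the census table is from the earlier example; exact plan output varies by version):

```sql
ALTER SESSION SET cell_offload_plan_display = ALWAYS;

EXPLAIN PLAN FOR
  SELECT state, COUNT(*) FROM census GROUP BY state;

SELECT * FROM TABLE(dbms_xplan.display);
-- Look for "TABLE ACCESS STORAGE FULL" on the CENSUS line and a
-- storage(...) entry in the predicate section of the plan.
```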
Measuring Smart Scan with Statistics

Statistic                                                    Meaning
cell physical IO bytes eligible for predicate offload        Number of bytes eligible for cell offload
cell physical IO interconnect bytes                          Bytes returned from the storage cells to the database server
cell physical IO interconnect bytes returned by smart scan   Bytes returned from the storage cells via Smart Scan operations

Page 34
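A minimal sketch of reading these statistics for your own session (statistic names as in 11gR2; verify the exact spellings against v$statname on your system):

```sql
SELECT n.name, s.value
FROM   v$mystat s
JOIN   v$statname n ON n.statistic# = s.statistic#
WHERE  n.name IN ('cell physical IO bytes eligible for predicate offload',
                  'cell physical IO interconnect bytes',
                  'cell physical IO interconnect bytes returned by smart scan');
-- Smart Scan efficiency can then be estimated as:
--   1 - (bytes returned by smart scan / bytes eligible for predicate offload)
```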
Measuring Smart Scan with Statistics
4.8 GB eligible for Smart Scan and 45 MB returned from the storage grid: a Smart Scan efficiency of 99.05%
Page 35
Measuring Smart Scan with Statistics
PGA max before the Smart Scan is about 4 MB; PGA max after the Smart Scan is over 10 MB
Page 36
Controlling Smart Scan Behavior
SQL> alter system set cell_offload_processing=<TRUE|FALSE>;    -- controls cell offload systemwide
SQL> alter session set cell_offload_processing=<TRUE|FALSE>;   -- controls cell offload for the current session
SQL> select /*+ opt_param('cell_offload_processing','<true|false>') */ ...   -- controls cell offload for a SQL statement
Page 37
Controlling Smart Scan Behavior
Full scan in 31+ seconds; all 4.8 GB returned from the cells
Page 38
Controlling Smart Scan Behavior Page 39
Smart Scan Predicate Filtering Ran in 6.64 seconds 2.1GB returned via Smart Scan, savings of 56.11% Page 40
Smart Scan Predicate Filtering 410K rows returned in 5.48 seconds 45MB returned via Smart Scan, savings of 99.03% Page 41
Smart Scan Predicate Filtering 2 rows returned in 5.39 seconds 8MB returned via Smart Scan, savings of 99.82% Page 42
Smart Scan Column Projection 1.17 MB of data through interconnect Page 43
Smart Scan Column Projection 651,744 bytes through interconnect Page 44
Smart Scan Column Projection
1,176,032 bytes through the interconnect when selecting BIGCOL; 651,744 bytes when not selecting BIGCOL; 2,048 rows returned
(1,176,032 - 651,744) = 524,288 bytes difference measured
Predicted delta = 522,240 bytes, which is pretty close
Page 45
Smart Scan Column Projection Another Example Page 46
Smart Scan Column Projection Another Example Query ran in 5.47 seconds Query ran in 13.04 seconds Page 47
Smart Scan: Join Filtering
Ø In some cases, Oracle will use Bloom filters to optimize join filtering; on Exadata, Bloom filters are implemented on the storage servers
Ø Bloom filters work by:
- Examining the columns in a join
- For the smaller table in a join, Oracle determines the space required to store the result set and, if it is small compared to the size of the joined (larger) table, may decide to use a Bloom filter
- The Bloom filter is built from the join-column values of the smaller table as an in-memory set of values to compare against the larger table's rows
Ø Bloom filters are typically built to reduce data communication between slaves in parallel query joins, and they work well with PQO and partitioned tables
Page 48
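A hedged sketch of coaxing and spotting an offloaded Bloom filter. The table and column names are illustrative, and the PX_JOIN_FILTER hint merely requests a join filter; it does not guarantee one:

```sql
SELECT /*+ parallel(4) px_join_filter(o) */
       c.cust_name, SUM(o.order_total)
FROM   soe.customers c
JOIN   soe.orders    o ON o.cust_id = c.cust_id
WHERE  c.state = 'TX'
GROUP  BY c.cust_name;

-- In the execution plan, SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUST_ID")
-- appearing in the storage(...) predicate line indicates the Bloom
-- filter was offloaded to the cells.
```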
Smart Scan: Join Filtering with Offloaded Bloom Filters
Result returned in 7.45 seconds
SYS_OP_BLOOM_FILTER in the plan indicates a Bloom filter; when it appears on the storage() predicate line, the filter is offloaded
Page 49
Smart Scan: Join Filtering with Offloaded Bloom Filters
With _bloom_filter_predicate_pushdown_to_storage = false
SYS_OP_BLOOM_FILTER indicates a Bloom filter; here it does NOT appear on the storage() line, hence the filter is not offloaded
Page 50
Smart Scan: Table Scans or Index Scans?
Do we just drop all indexes and let Smart Scan do its magic?
Completed in a little over 3 minutes
Over 6.7 GB returned from the cells; note the CPU and LIO values
Page 51
Smart Scan: Table Scans or Index Scans?
Do we just drop all indexes and let Smart Scan do its magic?
Completed in less than a second with indexes, with far less IO and less CPU/LIO
Indexes still work better for this type of query! TEST, TEST, TEST
Page 52
Smart Scan: Disabling the Serial Direct Read Mechanism
Query completed in 3.81 seconds
15 GB eligible for cell offload; 100 MB transferred over from the storage cells
Smart Scan efficiency of 99.34%
Page 53
Smart Scan: Disabling the Serial Direct Read Mechanism
With serial direct reads disabled, query time increased from 3.8 seconds to 101 seconds
Nothing was eligible for cell offload: no Smart Scan
Page 54
Querying V$SQL to Determine Cell Off-loadable Queries How can you determine which SQL statements are Exadata cell offload-able? Page 55
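One way to answer this is with V$SQL's IO columns (available in 11gR2); a sketch:

```sql
SELECT sql_id,
       io_cell_offload_eligible_bytes AS eligible_bytes,
       io_interconnect_bytes          AS interconnect_bytes,
       io_cell_offload_returned_bytes AS returned_bytes
FROM   v$sql
WHERE  io_cell_offload_eligible_bytes > 0
ORDER  BY io_cell_offload_eligible_bytes DESC;
-- Statements with eligible bytes > 0 were candidates for cell offload;
-- comparing eligible vs. returned bytes shows how much IO Smart Scan saved.
```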
Smart Scan Summary
Ø Smart Scan is software engineered for Exadata that can provide significant IO savings and dramatically improved performance
Ø Smart Scan works by offloading IO operations to the Exadata storage cells and returning only the rows and columns requested by the SQL statement to the database instance
Ø Row filtering provides significant cell scan efficiency gains; column filtering provides IO savings
Ø Join filtering using Bloom filters, performed on the Exadata cells, can also offer huge performance gains
Ø Don't drop all your OLTP indexes without testing: single-row index access will often still outpace a full scan with cell offload
Ø One of the great things about Exadata and Smart Scan is that nothing needs to be explicitly done by a developer or DBA to take advantage of it; it just works on Exadata
Page 58
Exadata Storage Indexes
Ø The goal of storage indexes is to eliminate IO requests to Exadata disks
Ø Storage indexes are NOT at all like traditional Oracle indexes; they are like anti-indexes: they help Oracle determine which blocks the requested data is NOT in
Ø Knowing which storage areas will NOT contain the requested data allows Exadata to skip IO requests to those areas
Ø Bypassing IO to disk reduces the work required on the storage grid and improves performance
Page 59
Exadata Storage Indexes: Architecture
Ø Each disk in an Exadata storage cell is divided into 1 MB pieces called storage regions
Ø For each 1 MB storage region, data distribution statistics are held in a memory structure called a region index
Ø Region indexes maintain min and max values for up to 8 columns per table as data is read from the disk storage region, and are maintained over time as data is accessed
Ø Storage indexes are maintained automatically by Exadata; there is no direct way to influence their behavior
Ø Storage server reboots will erase storage index data, since region indexes are stored in memory
Ø The performance impact can be dramatic, but it is not deterministic, as it is subject to Exadata's automatic management of region indexes
Page 60
Exadata Storage Indexes: Architecture
(Diagram: for "select * from soe.orders where order_total > 60", each 1 MB storage region within an ASM allocation unit carries a region index with min/max values for ORDER_TOTAL, e.g., min=1/max=60 and min=62/max=100. Regions whose max falls below 60 can be skipped entirely.)
Page 61
Exadata Storage Indexes: Scenarios
Ø Data needs to be well-ordered with respect to the query predicates: if all regions contain randomly ordered column values, region index min and max values will be too widespread to eliminate any region from an IO request
Ø Since Exadata only maintains region indexes on 8 columns per table, your range of distinct predicate columns should be relatively small and static
Ø Storage indexes can return false positives, but will never skip a qualifying region and jeopardize query integrity
(Diagram: for "select * from soe.orders where order_total between 10 and 15", only regions whose min/max range overlaps 10-15 are read; regions such as min=62/max=100 are skipped.)
Page 62
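When a table is filtered mostly on one column, a common (hypothetical) way to get well-ordered regions is to rebuild it sorted on that column via a direct path load:

```sql
-- Illustrative only: rebuild ORDERS clustered by ORDER_TOTAL so that each
-- 1 MB storage region covers a narrow min/max range of that column.
CREATE TABLE soe.orders_sorted
  PARALLEL 8 NOLOGGING
AS SELECT * FROM soe.orders ORDER BY order_total;
```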
Exadata Storage Indexes in Action This is the key statistic: cell physical IO bytes saved by storage index Page 63
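A sketch of checking that statistic for the current session:

```sql
SELECT n.name, s.value
FROM   v$mystat s
JOIN   v$statname n ON n.statistic# = s.statistic#
WHERE  n.name = 'cell physical IO bytes saved by storage index';
-- A nonzero, growing value after a smart scan means region indexes
-- eliminated that much disk IO on the cells.
```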
Exadata Storage Indexes in Action
0 rows returned in 4.85 seconds
~17 MB of IO saved by the storage index; Smart Scan efficiency of 99.98%
Page 64
Exadata Storage Indexes in Action
0 rows returned in 2.26 seconds
~9 GB of IO saved by the storage index; Smart Scan efficiency of 99.98%
Page 65
Exadata Storage Indexes in Action
0 rows returned in 0.09 seconds!
~15 GB of IO saved by the storage index
Page 66
Exadata Storage Indexes in Action 7.8 million rows returned in 4.23 seconds No storage index savings Page 67
Exadata Storage Indexes in Action
7.8 million rows returned in 1.37 seconds
Over 12 GB of IO savings with storage indexes
Page 68
Exadata Storage Indexes in Action No storage index savings Page 69
Exadata Storage Indexes in Action No storage index savings Page 70
Exadata Storage Indexes and Bind Variables Storage indexes used Page 71
Exadata Storage Indexes and NULL values Storage index used Page 72
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards No storage index savings Page 73
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards 14GB of IO saved by storage index Page 74
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards No storage index savings LIKE is fine, but Wildcards prevent storage index IO pruning Page 75
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards Storage index used Page 76
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards Storage index used but less IO saved with wildcards Page 77
Exadata Storage Indexes with LIKE, BETWEEN, Wildcards No storage index used Page 78
Exadata Storage Indexes with OLTP Tables Page 79
Exadata Storage Indexes with OLTP Tables Eliminated over 4Gb of IO via storage index Page 80
Exadata Storage Indexes with OLTP Tables
Ran in 0.26 seconds
Eliminated over 4 GB of IO via the storage index
Page 81
Exadata Storage Indexes with OLTP Tables Page 82
Exadata Storage Indexes with OLTP Tables
No storage index savings when the upper bound of the range predicate is higher than the column's max value
Page 83
Exadata Storage Indexes with Normal Index Access Page 84
Exadata Storage Indexes with Normal Index Access No storage index Page 85
Exadata Storage Indexes with Normal Index Access Storage indexes are only used for cell offload functions Page 86
Disabling Storage Indexes
Storage indexes can be disabled by setting _kcfis_storageidx_disabled = TRUE
With storage indexes enabled, the query ran in < 1 second with ~12 GB saved via the storage index
Page 87
Disabling Storage Indexes
With _kcfis_storageidx_disabled = TRUE, the query ran in 4.71 seconds with 0 bytes saved by the storage index
Page 88
Tracing Storage Indexes
Storage index operations can be traced by setting _kcfis_storageidx_diag_mode = 2
The trace output indicates that storage indexes were in use
Page 89
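For reference, the two underscore parameters used in this section look like this. Both are unsupported, undocumented parameters and belong only on test systems:

```sql
-- Disable storage indexes for the session (testing only):
ALTER SESSION SET "_kcfis_storageidx_disabled" = TRUE;

-- Emit storage index diagnostic tracing on the cells (testing only):
ALTER SESSION SET "_kcfis_storageidx_diag_mode" = 2;
```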
Tracing Storage Indexes Page 90
Tracing Storage Indexes SQL_ID for given transaction DATA_OBJECT_ID Page 91
Tracing Storage Indexes
(Trace detail: "strt = 0, end = 2048" shows the region boundaries; Memory = 2K; storage region size = 1 MB; "Column # 11" identifies the column; the low and high values of the actual data are also recorded.)
Page 92
Tracing Storage Indexes No storage index savings Page 93
Summary
Ø Storage indexes are designed to eliminate IO requests on the storage servers
Ø Storage indexes are automatically created and maintained in memory areas on the storage cells
Ø Exadata tracks minimum and maximum values in each storage region while processing IO requests and stores these region-level boundaries in region indexes
Ø Exadata examines region indexes to determine whether a storage region can be excluded from access based on a query predicate
Ø Storage indexes work for cell offload functions (Smart Scan)
Ø Storage indexes can help performance dramatically and, at worst, will not hurt performance
Ø Storage indexes work best on tables that are well-ordered with respect to the query predicates issued against them
Page 94
Exadata Smart Flash Cache
Ø Smart Flash Cache goal: intelligently cache frequently used data
Ø Exadata caches data in PCI flash cards on the storage cells
Ø Data is cached to improve IO response time and deliver better database performance
Ø Flash cards can deliver tens of thousands of IOs per second; SAS or SATA drives can deliver a couple hundred IOs per second
Ø On Exadata, Oracle only caches data likely to be requested again
Page 95
Exadata Smart Flash Cache: Hardware
Ø Four 96 GB PCI flash cards per cell = 384 GB of PCI flash per cell
Ø 5.4 TB of flash in a full rack, 2.7 TB in a half rack, 1.15 TB in a quarter rack
Ø One cell can support 75,000 IOPS; a full rack, over 1,000,000 IOPS
Page 96
Exadata Smart Flash Cache: Software
Ø Smart Flash Cache provides a storage cell caching algorithm to cache appropriate data in the storage cell PCI flash cards
Ø Each database IO is tagged with metadata:
- The CELL_FLASH_CACHE setting for the object(s): DEFAULT means Smart Flash Cache is used normally, KEEP means Smart Flash Cache is used more aggressively, and NONE means Smart Flash Cache is disabled for the object
- A cache hint: CACHE means the IO should be cached, NOCACHE means it shouldn't be, and EVICT indicates that the cached data should be removed from Smart Flash Cache
Ø Smart Flash Cache takes the following into consideration when caching data:
- IO size: large IOs against objects with CELL_FLASH_CACHE=DEFAULT are not cached
- Current cache load: smart table scans are usually directed to disk, but if CELL_FLASH_CACHE=KEEP and the cache load is low, they may be satisfied from Smart Flash Cache
- Specific operations (backups, Data Pump export/import, etc.) are not cached
Page 97
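The CELL_FLASH_CACHE setting is part of the segment's storage clause; a minimal sketch (table name illustrative):

```sql
-- Cache this segment more aggressively in Smart Flash Cache:
ALTER TABLE soe.orders STORAGE (CELL_FLASH_CACHE KEEP);

-- Revert to normal caching behavior:
ALTER TABLE soe.orders STORAGE (CELL_FLASH_CACHE DEFAULT);

-- Or exclude the segment from Smart Flash Cache entirely:
ALTER TABLE soe.orders STORAGE (CELL_FLASH_CACHE NONE);
```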
Exadata Smart Flash Cache: Software
Ø Smart Flash Cache is a write-through cache: after a write is acknowledged, it is written to Smart Flash Cache if it is suitable for caching
Ø Write performance is neither improved nor diminished by Smart Flash Cache. Because it is not a write-back cache, writes are not cushioned by the PCI flash cards; the writes must happen to disk. The small battery-backed disk controller cache on each cell performs write-back caching.
Ø Technology comparisons:
- EMC FAST Cache: EMC EFDs act as an extension of the storage processor cache, with write-back caching to cushion write IO; very good write performance
- FusionIO: data is placed directly on PCI flash cards; very high write performance
Ø Exadata can match the write IO performance of EMC FAST Cache or FusionIO by using flash for Flash Grid Disk storage, but be wary of depleting capacity for normal Smart Flash Cache!
Page 98
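Carving flash into grid disks rather than cache can be sketched as follows. Sizes and names are assumptions, and note that this permanently reduces Smart Flash Cache capacity:

```sql
-- On each cell, from CellCLI (comments, since the rest is SQL):
--   CellCLI> DROP FLASHCACHE
--   CellCLI> CREATE FLASHCACHE ALL SIZE=300g
--   CellCLI> CREATE GRIDDISK ALL FLASHDISK PREFIX=FLASH
-- Then build an ASM disk group on the flash grid disks:
CREATE DISKGROUP flash NORMAL REDUNDANCY
  DISK 'o/*/FLASH*'
  ATTRIBUTE 'compatible.asm'          = '11.2.0.0.0',
            'cell.smart_scan_capable' = 'TRUE';
```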
Smart Flash Cache: Write Operations
1. Database issues a write operation
2. CELLSRV inspects the IO metadata
3. Data is written to disk
4. The IO is acknowledged and the database process continues
5. If the IO is suitable for Smart Flash Cache, it is written to the Flash Cache
6. CELLSRV uses an LRU algorithm to determine which data to replace
Page 99
Smart Flash Cache: Read from Previously Cached Data
1. Database issues a read request
2. CELLSRV inspects the IO metadata
3. If the data exists in the Flash Cache, it is read from cache with no disk access
4. The read is satisfied from cache
Page 100
Smart Flash Cache: Read of Un-cached Data
1. Database issues a read operation
2. CELLSRV inspects the IO metadata
3. Data is read from disk
4. The read is acknowledged and the database process continues
5. If the IO is suitable for Smart Flash Cache, it is written to the Flash Cache
6. CELLSRV uses an LRU algorithm to determine which data to replace
Page 101
Smart Flash Cache in Action: Writes exa_fc_mystat.sql Page 102
Smart Flash Cache in Action: Writes Page 103
Smart Flash Cache in Action: Writes Before updating, 97 flash cache hits 1000 rows updated Performed index range scan Page 104
Smart Flash Cache in Action: Writes
Query statistics: the update yielded (368 - 97) = 271 flash cache read hits
Page 105
Smart Flash Cache in Action: Writes No physical reads No flash cache reads Page 106
Smart Flash Cache in Action: Writes
180 physical reads; 52 - 1 = 51 cell flash cache read hits
Page 107
Smart Flash Cache in Action: Reads of Un-cached Data
1,072 flash cache read hits for dbm1 before the query; ran in 1:21.34
4,452 flash cache read hits for dbm1 afterward, a delta of 3,380 flash cache read hits
Page 108
Smart Flash Cache in Action: Reads of Cached Data
Ran in 16.50 seconds with 27,762 flash cache read hits
Page 109
Smart Flash Cache in Action: Dropping Flash Cache Page 110
Smart Flash Cache in Action: Dropping Flash Cache Ran in 1:24.36 No flash cache read hits Page 111
Smart Flash Cache in Action: Dropping Flash Cache Page 112
Smart Flash Cache in Action: Smart Scan Data
~12 MB cached per cell for the index; ~18 MB cached for one of the table partitions
Page 113
Smart Flash Cache in Action: Smart Scan Data
We're using about 350 MB per cell currently
Page 114
Smart Flash Cache in Action: Smart Scan Data Page 115
Smart Flash Cache in Action: Smart Scan Data Very small delta in flash cache read hits Page 116
Smart Flash Cache in Action: Smart Scan Data Page 117
CELL_FLASH_CACHE KEEP Page 118
CELL_FLASH_CACHE KEEP
Query time dropped to 28 seconds; cell flash cache read hits jumped from 136,709 to 252,095
Page 119
CELL_FLASH_CACHE KEEP
About 26 GB per cell per partition
Page 120
CELL_FLASH_CACHE KEEP Page 121
Summary
Ø Exadata Smart Flash Cache is a caching mechanism delivered by Exadata
Ø Smart Flash Cache uses PCI flash cards on each storage cell to cache appropriate data
Ø Each storage cell has 384 GB of PCI flash available for Smart Flash Cache; a full rack delivers 5.4 TB of flash storage
Ø Smart Flash Cache intelligently caches appropriate (OLTP) data
Ø Smart Flash Cache is a write-through cache: writes must be performed to disk before being acknowledged to the foreground process. After writes and initial reads, Exadata determines whether the data is eligible for caching and writes it to the Flash Cache; subsequent IO benefits from Smart Flash Cache
Ø Segments can be tagged to be kept in Smart Flash Cache, similar to the buffer cache KEEP pool
Ø Grid Disks (and ASM disk groups) can be created on Flash Disks, improving performance in some cases, but we recommend using all flash storage for Smart Flash Cache
Page 122
Oracle 11g OLTP Compression

VENDOR_ID  VEND_NAME  STATE  VNDR_RATING  VENDOR_TYPE
=========  =========  =====  ===========  ===========
      100  ACME ONE   MI             100  DIRECT
      101  ACME ONE   CA              90  DIRECT
      102  NORTON     IA              95  INDIRECT
      103  WINGDINGS  MI              96  INDIRECT
      104  WINGDINGS  GA              96  INDIRECT

(Diagram: in an uncompressed block, rows are inserted as-is, with no compression, and free space is consumed quickly. In a compressed block, duplicate values such as "ACME ONE", "DIRECT", "WINGDINGS", "96", and "INDIRECT" are moved into a symbol table in the block header, and the row data stores short references (*) to the symbol table instead. As rows are inserted, only unique values land in the block, which results in more free space per block.)
Page 123
Oracle 11g OLTP Compression
Ø When the first row is inserted into a block, it is not compressed
Ø Subsequent inserts check for duplicates, and duplicate values are moved to the symbol table
Ø After the free space in the block is exhausted, the block is fully compressed
Ø There is a slight CPU overhead in compressing data, since Oracle has to build and maintain the symbol table
Ø The unit of compression for OLTP (Advanced) Compression is the block
Page 124
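OLTP compression is declared per segment; a minimal sketch using the VENDORS example from the slide (COMPRESS FOR OLTP is the 11gR2 syntax and requires the Advanced Compression option):

```sql
CREATE TABLE vendors (
  vendor_id    NUMBER,
  vend_name    VARCHAR2(30),
  state        VARCHAR2(2),
  vndr_rating  NUMBER,
  vendor_type  VARCHAR2(10)
) COMPRESS FOR OLTP;
```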
Exadata Hybrid Columnar Compression
Ø Data is stored by column in a compression unit; a compression unit (CU) is a set of blocks, not a single block
Ø Basic compression and Advanced Compression load data into blocks row by row: as rows are added, compression algorithms are invoked to reference or update symbol tables, and duplicate values are not inserted
Ø HCC examines data before it is placed in a block and divides the data into arrays of columns: it takes a stream of input values and places all values for column 1 in array 1, all values for column 2 in array 2, and so on
Ø HCC takes four 8K blocks of data from the stream by default
Ø HCC then performs its de-duplication process on the entire 4-block stream, so the odds of finding duplicate data are greatly increased over standard or Advanced Compression, because a much larger subset of data is considered
Ø HCC then does leading-edge compression on the array values and inserts the columns into a compression unit (a set of blocks), not row by row
Page 125
Exadata Hybrid Columnar Compression

VENDOR_ID  VEND_NAME  STATE  VNDR_RATING  VENDOR_TYPE
=========  =========  =====  ===========  ===========
      100  ACME ONE   MI             100  DIRECT
      101  ACME ONE   CA              90  DIRECT
      102  NORTON     IA              95  INDIRECT
      103  WINGDINGS  MI              96  INDIRECT
      104  WINGDINGS  GA              96  INDIRECT

Uncompressed block: a header followed by complete rows. Hybrid Columnar Compression: a logical compression unit with a CU header followed by column-organized data, with all VENDOR_ID values stored together, then all VEND_NAME values, then STATE, VNDR_RATING, and VENDOR_TYPE. Page 126
Exadata Hybrid Columnar Compression

Compression Unit #1: COL1 through COL10 stored column by column across 4 blocks, holding the first 50,000 rows
Compression Unit #2: a new set of 4 blocks, holding the next 48,500 rows
Compression Unit #3: a new set of 4 blocks, holding the next 51,000 rows

Page 127
Exadata Hybrid Columnar Compression Recall, data is stored by columns inside a compression unit. The greater the degree of duplicate column values, the less space required per compression unit and the fewer compression units required to store the data. If queries select a single column or a subset of columns, Oracle only needs to read the blocks within the compression units on which those columns exist. This is different from other types of compression and from uncompressed tables. Not only are we saving space, we're saving IO, and saving IO means better performance! Page 128
Exadata Hybrid Columnar Compression and DML EHCC compresses data for bulk direct path loads only. Subsequent bulk loads append new compression units. Single-row inserts store rows OLTP-compressed. DELETEs against HCC tables lock the entire CU. When updating EHCC tables: the updated row is moved (i.e., deleted and re-inserted, i.e., migrated); the new row is OLTP-compressed; and locks impact the entire CU, not just the row! DML on EHCC tables is very expensive! Page 129
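The load-path behavior above can be sketched in SQL. A minimal, hypothetical example (CUST_QHIGH stands in for any HCC-compressed copy of SOE.CUSTOMERS):

```sql
-- Bulk direct-path load: new rows are HCC-compressed into fresh compression units
INSERT /*+ APPEND */ INTO cust_qhigh SELECT * FROM soe.customers;
COMMIT;

-- Conventional (non-direct-path) insert: rows bypass HCC and are stored
-- OLTP-compressed instead
INSERT INTO cust_qhigh SELECT * FROM soe.customers WHERE ROWNUM = 1;
COMMIT;
```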
Exadata Hybrid Columnar Compression Types QUERY LOW compression is recommended for data warehouse tables in which data load times are important. QUERY HIGH is recommended for data warehouse data in which space savings are important; it offers a better compression ratio than QUERY LOW but is more expensive. ARCHIVE LOW is intended for archive data in which load times are important; it gives better compression than QUERY compression in most cases, but more overhead for DML. ARCHIVE HIGH is designed for archival data in which space savings are the most important factor: the best compression ratio, but the most expensive algorithm. Page 130
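Each HCC level maps directly to a table DDL clause. A minimal sketch (table names are illustrative; HCC requires Exadata storage):

```sql
CREATE TABLE cust_qlow  COMPRESS FOR QUERY LOW    AS SELECT * FROM soe.customers;
CREATE TABLE cust_qhigh COMPRESS FOR QUERY HIGH   AS SELECT * FROM soe.customers;
CREATE TABLE cust_alow  COMPRESS FOR ARCHIVE LOW  AS SELECT * FROM soe.customers;
CREATE TABLE cust_ahigh COMPRESS FOR ARCHIVE HIGH AS SELECT * FROM soe.customers;
```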
Using the Compression Advisor DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle's compression advisor. The compression type passed in indicates what to test; in this case, EHCC archive low compression: comp_for_archive_low = EHCC Archive Low; comp_for_archive_high = EHCC Archive High; comp_for_query_low = EHCC Query Low; comp_for_query_high = EHCC Query High; comp_for_oltp = OLTP Advanced Compression Page 131
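A sketch of invoking the advisor, using the 11gR2 signature (the scratch tablespace and schema/table names are illustrative):

```sql
SET SERVEROUTPUT ON
DECLARE
  l_blkcnt_cmp    PLS_INTEGER;
  l_blkcnt_uncmp  PLS_INTEGER;
  l_row_cmp       PLS_INTEGER;
  l_row_uncmp     PLS_INTEGER;
  l_cmp_ratio     NUMBER;
  l_comptype_str  VARCHAR2(100);
BEGIN
  DBMS_COMPRESSION.GET_COMPRESSION_RATIO(
    scratchtbsname => 'USERS',
    ownname        => 'SOE',
    tabname        => 'CUSTOMERS',
    partname       => NULL,
    comptype       => DBMS_COMPRESSION.COMP_FOR_QUERY_HIGH,
    blkcnt_cmp     => l_blkcnt_cmp,
    blkcnt_uncmp   => l_blkcnt_uncmp,
    row_cmp        => l_row_cmp,
    row_uncmp      => l_row_uncmp,
    cmp_ratio      => l_cmp_ratio,
    comptype_str   => l_comptype_str);
  DBMS_OUTPUT.PUT_LINE(l_comptype_str || ' ratio: ' || ROUND(l_cmp_ratio, 1));
END;
/
```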
Using the Compression Advisor DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle's compression advisor OLTP Compression gives a low 1.1 compression ratio Page 132
Using the Compression Advisor DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle's compression advisor EHCC Query Low Compression gives a 2.7 compression ratio Page 133
Using the Compression Advisor DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle's compression advisor EHCC Query High Compression gives a 5.0 compression ratio Page 134
Using the Compression Advisor DBMS_COMPRESSION.GET_COMPRESSION_RATIO is Oracle's compression advisor EHCC Archive High Compression gives a 9.2 compression ratio Page 135
Hybrid Columnar Compression in Action Uncompressed table: SOE.CUSTOMERS. Use parallel CTAS (PCTAS) to create 5 new compressed tables: one compressed for OLTP, one EHCC compressed for query low, one for query high, one for archive low, and one for archive high. Measure the time it takes to create each compressed table, to run a sample SELECT against it, to bulk insert rows into it, and to update rows on it.

Estimated blocks: CUSTOMERS 19,670; CUST_OLTP 17,881; CUST_QLOW 7,285; CUST_QHIGH 3,934; CUST_ALOW 3,391; CUST_AHIGH 2,138

Page 136
Storage Impact: Different Compression Scenarios Page 137
Storage Impact: Different Compression Scenarios In our example, archive compression is not as efficient as HCC query high compression Page 138
Storage Impact: Different Compression Scenarios Good estimates from DBMS_COMPRESSION

Estimated blocks: CUSTOMERS 19,670; CUST_OLTP 17,881; CUST_QLOW 7,285; CUST_QHIGH 3,934; CUST_ALOW 3,391; CUST_AHIGH 2,138

Page 139
Performance Impact: Creating Compressed Tables Page 140
Performance Impact: Creating Compressed Tables Page 141
Performance Impact: Creating Compressed Tables Chart: Create Table Time (CUSTOMERS), y-axis 0-50, for OLTP, QUERY LOW, QUERY HIGH, ARCHIVE LOW, and ARCHIVE HIGH Page 142
Performance Impact: Creating Compressed Tables Chart: Create Table Time (MYOBJ), y-axis 0-2500, for OLTP, QUERY LOW, QUERY HIGH, ARCHIVE LOW, and ARCHIVE HIGH Page 143
Performance Impact: Querying Compressed Tables We expect least IO and best full-scan time against this table Page 144
Performance Impact: Querying Compressed Tables Query ran in 4.63 seconds; ~600 MB of IO; 96% smart scan efficiency Page 145
Performance Impact: Querying Compressed Tables Query ran in 4.91 seconds; ~700 MB of IO; 17% less LIO compared to uncompressed; 94.46% smart scan efficiency Page 146
Performance Impact: Querying Compressed Tables Query ran in 3.03 seconds; 97 MB over the interconnect; less CPU and LIO; 99.07% smart scan efficiency Page 147
Performance Impact: Querying Compressed Tables Query ran in 2.07 seconds; 35 MB over the interconnect; less CPU and LIO; 99.52% smart scan efficiency Page 148
Performance Impact: Querying Compressed Tables Query ran in 2.08 seconds; 19 MB over the interconnect; even less CPU and LIO; 99.74% smart scan efficiency Page 149
Performance Impact: Querying Compressed Tables Query ran in 2.06 seconds; 17 MB over the interconnect; 99.76% smart scan efficiency Page 150
Performance Impact: Querying Compressed Tables Page 151
Performance Impact: Updating Compressed Tables Page 152
Performance Impact: Updating Compressed Tables The update took the longest amount of time; more CPU is required to update HCC compressed tables Page 153
Proving Oracle migrated HCC Updated Rows Page 154
Proving Oracle migrated HCC Updated Rows Row location: File 11 Block 1569727 Slot 165 Page 155
Proving Oracle migrated HCC Updated Rows Page 156
Proving Oracle migrated HCC Updated Rows Page 157
Proving Oracle migrated HCC Updated Rows Value=1 means fetch was done via migrated row Page 158
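The check above can be reproduced with the session statistic that counts migrated-row fetches; a sketch using standard dynamic performance views:

```sql
-- 'table fetch continued row' counts fetches that had to follow a
-- migrated/chained row pointer; sample it before and after fetching
-- the updated row by ROWID and compare the two values.
SELECT s.value
  FROM v$mystat s, v$statname n
 WHERE s.statistic# = n.statistic#
   AND n.name = 'table fetch continued row';
```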
Examining the Compression Unit Block dump of an archive high table: blocks per CU = 32; the dump shows the columns/rows per CU and the CU length Page 159
Examining the Compression Unit Page 160
Examining the Compression Unit Archive high CU size is 7 blocks, smaller than in the previous block dump; 1 deleted row in slot 165 Page 161
HCC Decompression HCC compresses data in-flight as it's inserted via direct path. Where and when is the data decompressed when queried? Definitions: Upper Half = Compute Node; Lower Half = Storage Server. HCC data is decompressed on the lower half when queried via Smart Scan (i.e., full scan, i.e., serial direct read): uncompressed data is transmitted over the interconnect, and storage server CPUs are used for decompression. HCC data is decompressed on the upper half when queried without Smart Scan (i.e., single-block reads): compressed data is transmitted over the interconnect, and compute node CPUs are used for decompression. Plan for the CPU impact and application design! Page 162
HCC Decompression 42 cs of CPU time on compute node Page 163
HCC Decompression Index range scan Page 164
HCC Decompression 382 cs of CPU time Page 165
Hybrid Columnar Compression Summary Exadata Hybrid Columnar Compression is a unique compression feature available only on Exadata. Hybrid Columnar Compression compresses tables by column for a set of rows and stores the result inside a logical compression unit. The unit of compression for basic and OLTP Advanced Compression is a database block; the unit of compression for Hybrid Columnar Compression is a compression unit. Oracle provides 4 different flavors of Hybrid Columnar Compression: compress for query low, compress for query high, compress for archive low, and compress for archive high. IO against Hybrid Columnar Compressed tables is reduced compared to uncompressed tables, so SELECTs against them can run much faster: less IO! Rows need to be loaded into HCC tables using direct path loads. DML against HCC tables can be expensive: single-row INSERT operations use OLTP compression, direct path inserts perform HCC compression, and UPDATEs migrate rows (and can be expensive!) Page 166
Exadata IO Resource Management IO Resource Management (IORM) provides a means to govern and meter IO from different workloads in the Exadata Storage Server Database consolidation is a key driver to customer adoption of Exadata Consolidation means that multiple databases and applications could share Exadata storage Different databases in a shared Exadata storage grid could have different IO performance requirements One of the common challenges with shared storage infrastructure is that of competing IO workloads Batch vs. OLTP Warehouse vs. OLTP Production vs. Test and Development You can mitigate competing priorities by over-provisioning storage, but this becomes expensive Exadata addresses this challenge with IO Resource Management. Page 167
Database Resource Management A single database may have many types of workloads with different performance requirements. Resource consumer groups allow you to group sessions by workload. After creating resource consumer groups, you specify how resources are used within each resource consumer group. Once resource consumer groups are established, you must map sessions to a consumer group based on distinguishing characteristics. The combination of resource consumer groups and session mappings comprises a resource plan. Only one resource plan can be active in a database at a time. A database resource plan is also called an intradatabase resource plan. Let's show an example Page 168
Database Resource Management - Example Database DBM OM OLTP Consumer group Other OLTP Consumer group Database XBM Online query Consumer group Reporting Consumer group Batch query Consumer group Page 169
Database Resource Management - Example Database DBM OM OLTP Consumer group Other OLTP Consumer group Database XBM Online query Consumer group Interactive category Batch category Reporting Consumer group Batch query Consumer group Page 170
IO Resource Management Plans IORM provides different approaches for managing resource allocations. If you have multiple workloads within a database whose database resource usage you wish to control, you configure an intradatabase resource plan. If you have only one database in your Exadata Database Machine, the intradatabase resource plan is all you need: IO resource management is handled automatically inside the storage servers based on this intradatabase resource plan. If you have multiple databases in your Exadata Database Machine among which you wish to govern IO resources, you create an interdatabase resource plan; rules in an interdatabase resource plan specify allocations to databases, not consumer groups. Category resource management is used when you want to control resources primarily by the category of work being done: it allows for allocation of resources amongst categories spanning multiple databases. An IORM plan is the combination of an interdatabase plan and a category plan Page 171
IORM Architecture Diagram: CELLSRV maintains separate IO queues per workload (CELLSRV queues 1 through 4) Page 172
IORM Architecture Resource Plans IORM schedules IO onto disk queues according to resource plans Page 173
IORM Architecture: Rules IORM is only engaged when needed; IORM does not intervene if there is only one active consumer group on one database. Any disk allocation that is not fully utilized is made available to other workloads in relation to the configured resource plans. Background IO is scheduled based on its priority relative to user IO: redo and control file writes always take precedence, and DBWR writes are scheduled at the same priority as user IO. For each cell disk, each database accessing the cell has one IO queue per consumer group and three background queues. Background IO queues are mapped to high, medium, and low priority requests, with different IO types mapped to each queue. If no intradatabase plan is set, all non-background IO requests are grouped into a single consumer group called OTHER_GROUPS Page 174
IORM in Action: Planning DBM has three consumer groups, OM OLTP, OTHER OLTP, and REPORTING XBM will have two consumer groups, ONLINE QUERY and BATCH QUERY DBM Intradatabase Resource Plan 50% of resources allocated to OM OLTP 30% of resources allocated to OTHER OLTP 20% of resources allocated to REPORTING XBM Intradatabase Resource Plan 70% of resources allocated to ONLINE QUERY 30% of resources allocated to BATCH QUERY Page 175
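The DBM intradatabase plan above can be sketched with the standard DBMS_RESOURCE_MANAGER API (plan and group names are illustrative; a validated plan also needs an OTHER_GROUPS directive):

```sql
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('OM_OLTP',    'Order mgmt OLTP');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('OTHER_OLTP', 'Other OLTP');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('REPORTING',  'Reporting');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN('DBM_PLAN', 'DBM intradatabase plan');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DBM_PLAN', group_or_subplan => 'OM_OLTP',
      mgmt_p1 => 50, comment => '50% at level 1');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DBM_PLAN', group_or_subplan => 'OTHER_OLTP',
      mgmt_p1 => 30, comment => '30% at level 1');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DBM_PLAN', group_or_subplan => 'REPORTING',
      mgmt_p1 => 20, comment => '20% at level 1');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DBM_PLAN', group_or_subplan => 'OTHER_GROUPS',
      mgmt_p2 => 100, comment => 'everything else at level 2');
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
ALTER SYSTEM SET resource_manager_plan = 'DBM_PLAN';
```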
IORM in Action: Planning Category Plan 70% of resources allocated to INTERACTIVE category OM OLTP and OTHER OLTP in INTERACTIVE category for DBM ONLINE QUERY in INTERACTIVE category for XBM 30% of resources allocated to BATCH category REPORTING in BATCH category for DBM BATCH QUERY in BATCH category for XBM Interdatabase Plan 60% of resources allocated to database DBM 40% of resources allocated to database XBM Page 176
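The category and interdatabase plans are set on the storage servers. A CellCLI sketch, run on each cell, with the allocations from the example (the trailing `-` is CellCLI's line-continuation character; IORM requires catch-all OTHER entries):

```text
CellCLI> ALTER IORMPLAN                                          -
         catPlan=((name=INTERACTIVE, level=1, allocation=70),    -
                  (name=BATCH,       level=1, allocation=30),    -
                  (name=OTHER,       level=2, allocation=100)),  -
         dbPlan=((name=DBM,   level=1, allocation=60),           -
                 (name=XBM,   level=1, allocation=40),           -
                 (name=other, level=2, allocation=100))
CellCLI> ALTER IORMPLAN active
```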
IORM in Action: Understanding the Math

All user IO = 100%
Category plan: 70% INTERACTIVE, 30% BATCH
Interdatabase plan (applied within each category): 60% DBM, 40% XBM
Intradatabase plans: DBM 50% / 30% / 20%, XBM 70% / 30%

Resulting IORM allocations:
DBM OM OLTP: 26.25%
DBM OTHER OLTP: 15.75%
XBM ONLINE QUERY: 28.00%
DBM REPORTING: 18.00%
XBM BATCH QUERY: 12.00%

Page 177
IORM in Action: Understanding the Math CG% = (Intra CG% / X) * db% * cat%, where: CG% = the IORM-determined resource allocation for the consumer group's sessions; Intra CG% = the resource allocation for the consumer group within its intradatabase plan; X = the sum of the intradatabase allocations for all of the database's consumer groups in the same category; db% = the database's allocation in the interdatabase plan; cat% = the resource allocation for the category to which the consumer group belongs Page 178
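The math above can be checked mechanically. A small Python sketch that recomputes the page 177 allocations from the example plans (numbers only; this is an illustration of the formula, not an Oracle tool):

```python
# CG% = (intra CG% / sum of same-category intra allocations in that DB) * db% * cat%
cat = {"INTERACTIVE": 0.70, "BATCH": 0.30}        # category plan
db = {"DBM": 0.60, "XBM": 0.40}                   # interdatabase plan
# intradatabase plans: (database, consumer group) -> (allocation, category)
intra = {
    ("DBM", "OM OLTP"):      (0.50, "INTERACTIVE"),
    ("DBM", "OTHER OLTP"):   (0.30, "INTERACTIVE"),
    ("DBM", "REPORTING"):    (0.20, "BATCH"),
    ("XBM", "ONLINE QUERY"): (0.70, "INTERACTIVE"),
    ("XBM", "BATCH QUERY"):  (0.30, "BATCH"),
}

def iorm_allocation():
    alloc = {}
    for (d, cg), (pct, c) in intra.items():
        # X: sum of this database's intradatabase allocations in the same category
        x = sum(p for (d2, _), (p, c2) in intra.items() if d2 == d and c2 == c)
        alloc[(d, cg)] = (pct / x) * db[d] * cat[c]
    return alloc

allocations = iorm_allocation()
for (d, cg), v in sorted(allocations.items(), key=lambda kv: -kv[1]):
    print(f"{d} {cg}: {v:.2%}")
```

Running this reproduces the slide's numbers (28.00%, 26.25%, 18.00%, 15.75%, 12.00%), and the five allocations sum to 100%.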
IORM in Action Page 179
IORM in Action: Configuration on DBM Database Page 180
IORM in Action: Configuration on XBM Database Page 181
IORM in Action: On the Exadata Cells Page 182
IORM in Action: On the Exadata Cells Page 183
Mapping Sessions to Consumer Groups in DBM Page 184
Validate Consumer Group Mapping in DBM Page 185
Mapping Sessions to Consumer Groups in XBM Page 186
Validate Consumer Group Mapping in XBM Page 187
Using Cell database IO metrics

Metric Name       Meaning
DB_IO_RQ_SM       Total number of small IO requests issued by the database since any resource plan was set (use to measure DB load)
DB_IO_RQ_LG       Total number of large IO requests issued by the database since any resource plan was set (use to measure DB load)
DB_IO_RQ_SM_SEC   Small IO requests per second issued by the database in the last minute
DB_IO_RQ_LG_SEC   Large IO requests per second issued by the database in the last minute
DB_IO_WT_SM       Total number of seconds that small IO requests issued by the database waited to be scheduled by IORM
DB_IO_WT_LG       Total number of seconds that large IO requests issued by the database waited to be scheduled by IORM

Page 188
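A CellCLI sketch for pulling these per-database metrics on a cell (attribute names follow the storage server metric conventions):

```text
CellCLI> LIST METRICCURRENT WHERE objectType = 'IORM_DATABASE' -
         AND name LIKE 'DB_IO_.*'
```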
Using Cell database IO metrics Small IO load per database Large IO load per database Page 189
Using Cell database IO metrics Small IO load per database/ sec, last minute Large IO load per database/ sec, last minute Page 190
Using Cell database IO metrics Small IO Waits/DB/ Min Large IO Waits/DB/ Min Page 191
Testing our IORM Plan Page 192
Imposing Database Limits with IORM Page 193
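Database limits are imposed with the `limit` attribute in the dbPlan; unlike an allocation, a limit is a hard cap that applies even when other databases are idle. A CellCLI sketch (database names from the earlier example):

```text
CellCLI> ALTER IORMPLAN                                              -
         dbPlan=((name=DBM,   level=1, allocation=60),               -
                 (name=XBM,   level=1, allocation=40, limit=40),     -
                 (name=other, level=2, allocation=100))
```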
Imposing Database Limits with IORM Page 194
Imposing Database Limits with IORM Page 195
Imposing Database Limits with IORM Page 196
Summary IO Resource Management is an Exadata feature designed to govern IO to the Exadata Storage Cells. IORM is used in conjunction with DBRM. Consumer groups, resource plans, resource group mappings, and directives are part of an intradatabase resource plan, i.e., resource allocation management within a database. Interdatabase plans are used to control IO resource allocation between multiple databases sharing an Exadata storage server. Category plans are a way to group resource consumer groups from an intradatabase plan into an IORM plan and further control IO resource allocation in the Exadata storage servers. If you're consolidating on Exadata, you should be using IORM. Instance caging is a way to control CPU utilization for a database in an Exadata compute grid Page 197
Top Reasons to Migrate to Exadata Your database requires the extreme performance that Exadata can deliver. You have business goals that can only be met by Exadata's extreme performance and consolidation capabilities. You need to consolidate your database tier platform onto a single high-performance, high-capacity, fault-tolerant platform. You want to reduce the number of people and amount of time required to build and maintain Oracle database tier infrastructures. Your capacity or performance requirements have placed you in a position to evaluate re-architecting your enterprise database platform. You are looking to simplify your database tier infrastructure into a common technology stack with a single point of contact for support and management. Exadata is a business enabler Page 198