Oracle Big Data, In-Memory, and Exadata - One Database Engine to Rule Them All. Dr.-Ing. Holger Friedrich
Agenda: Introduction, Old Times, Exadata, Big Data, Oracle In-Memory, Headquarters, Conclusions
sumit AG: Consulting and implementation services in Switzerland. Experts for Data Warehousing, Business Intelligence, and Big Data solutions. Focused on Oracle technology. BI Foundation specialized partner. Data Warehousing specialized partner. Our motto: Get Value From Data. Visit our website: www.sumit.ch (in German)
Holger Friedrich: Computer Science diploma from the Karlsruhe Institute of Technology (KIT). Ph.D. in Robotics and Machine Learning. More than 16 years of experience with Oracle technology. Expert for Data Integration, Data Warehousing, Data Mining, and Business Intelligence. Technical Director of sumit AG. First Oracle ACE for DWH/BI in Switzerland.
Agenda: Introduction, Old Times, Exadata, Big Data, Oracle In-Memory, Headquarters, Conclusions
DB Architecture - Old Times. Old times = 1977-2008. SGA - System Global Area: - Shared Pool (Library Cache, etc.) - Redo Log Buffer - Buffer Cache. Persistent storage - disk & tape - serves database blocks. PGA - Program Global Area - query-specific processing and storage. Query processing is done in the PGA by query-specific server processes.
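The split between shared and private memory can be inspected in any such database; a minimal sketch using the standard dynamic performance views (nothing here is specific to this deck):

  -- Shared memory: sizes of the SGA components (buffer cache, shared pool, redo log buffer)
  SELECT name, bytes FROM v$sgainfo ORDER BY bytes DESC;

  -- Private memory: PGA consumed by the query-specific server processes
  SELECT name, value, unit FROM v$pgastat
  WHERE  name IN ('total PGA allocated', 'aggregate PGA target parameter');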
Query Processing - Old Times (diagram: a server process reads database blocks from disk into the block buffer and processes them)
Agenda: Introduction, Old Times, Exadata, Big Data, Oracle In-Memory, Headquarters, Conclusions
2008 - Times Are a-Changing
Exadata - Architecture. Databases and applications are deployed and configured without any adaptations. Fast network via InfiniBand. Regular compute servers. Dedicated storage servers - organised in cells - disks & flash attached - run the Exadata Storage Software.
Exadata - The Secret Sauce. Three reasons for outstanding Exadata performance: hardware engineering; local query processing functionality in the storage layer; a database engine aware of the intelligent storage layer - extended optimizer costing model and transformations - extended software to use the Exadata Storage APIs. Divide and conquer for query processing: not just with slave processes (PARALLEL), not just between compute nodes (RAC), but between compute and storage nodes - see the sketch below.
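Whether a concrete statement actually profited from this compute/storage split can be checked in the session statistics; a minimal sketch, assuming an Exadata system where the usual cell offload counters are exposed:

  -- Bytes that qualified for offload to the storage cells vs. bytes
  -- actually returned over the interconnect after smart scan filtering
  SELECT n.name, s.value
  FROM   v$mystat s
  JOIN   v$statname n ON n.statistic# = s.statistic#
  WHERE  n.name IN ('cell physical IO bytes eligible for predicate offload',
                    'cell physical IO interconnect bytes returned by smart scan');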
Exadata - Storage Software Evolution. Smart Scanning - execute sub-queries in the storage cells - apply column projection already in storage. Keep hot data in the Flash Cache. Storage Indexes - collect min/max column values - reduce disk access. Smart scanning directly on HCC data - no decompression required. Offload mining tasks such as scoring. Additional data caching in columnar format in the Flash Cache.
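Hybrid Columnar Compression itself is a simple declaration; a minimal sketch (table name and cut-off date are invented for illustration):

  -- HCC warehouse compression; smart scans can filter and project on the
  -- compressed units in the storage cells without decompressing them first
  CREATE TABLE sales_archive
    COMPRESS FOR QUERY HIGH
    AS SELECT * FROM sales WHERE order_date < DATE '2013-01-01';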
Agenda: Introduction, Old Times, Exadata, Big Data, Oracle In-Memory, Headquarters, Conclusions
Information Management Reference Architecture - Big Data (diagram)
The Hadoop Zoo (diagram: the Hadoop ecosystem of tools and engines)
Information Management Data Flow (diagram)
Big Data - Challenges. Dynamic ecosystem - pre-packaged distributions - Oracle Big Data Appliance. Analytics - tools of the Hadoop ecosystem - Oracle Big Data Analytics. Data Integration - ever-changing Hadoop tool set - Oracle Data Integrator - Big Data SQL.
Big Data Appliance - The Secret Sauce. Three reasons for outstanding BDA performance: hardware engineering; local query processing functionality in the storage layer - Big Data SQL = Exadata Storage Software on Hadoop - added as a processing engine to the Hadoop process layer - BDS agents run independently on the Hadoop nodes; a database engine aware of the intelligent big data layer - extended and enhanced External Table API - extended optimizer costing model and transformations. Exadata success and performance on Big Data. Big Data transparently available for DB queries.
Big Data SQL - Smart Scan. 1. Read data from the HDFS data node - direct-path reads - C-based readers when possible - native Hadoop classes otherwise. 2. Translate bytes to Oracle format. 3. Smart scan on the Oracle format - apply storage indexes (BDS 2.0) - filtering - column projection - parsing JSON/XML - scoring models. High compression benefits (except for columns with mostly distinct values).
Big Data SQL 2.0 - Storage Indexes. New feature of Big Data SQL 2.0. Avoids unnecessary disk access on the Hadoop nodes. The index is built during the first full scan. Granularity is in HDFS blocks (256 MB). Index application - receive the filter predicate - check the storage index for blocks whose min/max range matches the predicate - smart scan only the matching blocks (see the sketch below).
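The saving can be verified much like on Exadata; a sketch, assuming the Big Data SQL statistic name below (as published for BDS 2.0) and the order external table declared on the next slides:

  -- Run a selective query, then see how much Hadoop-side I/O the storage index skipped
  SELECT COUNT(*) FROM order WHERE order_total > 10000;

  SELECT n.name, s.value
  FROM   v$mystat s
  JOIN   v$statname n ON n.statistic# = s.statistic#
  WHERE  n.name = 'cell XT granule IO bytes saved by storage index';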
Big Data SQL - Query Execution (diagram)
Extended External Tables - HIVE

  CREATE TABLE order (
    cust_num    VARCHAR2(10),
    order_num   VARCHAR2(20),
    order_date  DATE,
    item_cnt    NUMBER,
    description VARCHAR2(100),
    order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL
    (TYPE oracle_hive                    -- new type ORACLE_HIVE
     ACCESS PARAMETERS (                 -- optional settings
       com.oracle.bigdata.tablename: order_db.order_summary
       com.oracle.bigdata.colmap:    {"col":"item_cnt", "field":"order_line_item_count"}
       com.oracle.bigdata.overflow:  {"action":"truncate", "col":"description"}
       com.oracle.bigdata.erroropt:  [{"action":"replace", "value":"invalid_num",
                                       "col":["cust_num","order_num"]},
                                      {"action":"reject", "col":"order_total"}]
    ))
  PARALLEL 4;
Extended External Tables - HDFS

  CREATE TABLE order (
    cust_num    VARCHAR2(10),
    order_num   VARCHAR2(20),
    order_date  DATE,
    item_cnt    NUMBER,
    description VARCHAR2(100),
    order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL
    (TYPE oracle_hdfs                    -- new type ORACLE_HDFS
     ACCESS PARAMETERS (                 -- optional settings
       com.oracle.bigdata.rowformat:
         SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
       com.oracle.bigdata.fileformat:
         INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
         OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
       com.oracle.bigdata.colmap:    {"col":"item_cnt", "field":"order_line_item_count"}
       com.oracle.bigdata.overflow:  {"action":"truncate", "col":"description"}
     )
     LOCATION ('hdfs:/usr/cust/summary/*'));   -- location on HDFS
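Once declared, both variants are queried like ordinary tables; a minimal sketch (the date filter and aggregation are invented; note also that order is an Oracle reserved word, so a real deployment would pick a different name or quote it):

  -- Filter and projection are candidates for offload to the BDS agents on the
  -- data nodes; only matching rows, already in Oracle format, cross the network
  SELECT o.cust_num, SUM(o.order_total) AS total
  FROM   order o
  WHERE  o.order_date >= DATE '2015-01-01'
  GROUP  BY o.cust_num;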
Agenda: Introduction, Old Times, Exadata, Big Data, Oracle In-Memory, Headquarters, Conclusions
Columnar Stores - Oracle's Flavour. A transparent column store, managed next to the row store - not either/or. Persistent storage remains row-based as before; the column store is DML-synched in real time. The entire Oracle DB ecosystem remains unchanged - security - backup - disaster recovery - RAC - NO application changes required!
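Enabling it is a two-step declaration; a minimal sketch (size, table name, and options are illustrative):

  -- One-off: carve a column store out of the SGA (takes effect after restart)
  ALTER SYSTEM SET inmemory_size = 4G SCOPE = SPFILE;

  -- Per object: mark the table for population; DML keeps row and column store in sync
  ALTER TABLE sales INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY HIGH;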
Advantages. Best for queries that - scan large quantities of data - on a rather small set of columns - compute aggregates on the results. High compression benefits on most columns (except ones containing mostly distinct values). Well suited for OLAP/BI.
Technology Gems. 1. In-memory storage index. 2. Filtering on binary compressed data. 3. Columnar storage of selected columns. 4. Transparent querying across the storage hierarchy. 5. Real-time background refresh of the columnar store. 6. Parallel query execution on the columnar store. 7. SIMD vector processing. 8. In-memory fault tolerance on RAC. 9. In-memory aggregation.
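Population status of gem no. 3 (columnar storage of selected objects and columns) can be observed in the dictionary; a minimal sketch using the standard V$IM_SEGMENTS view:

  -- Which segments are populated into the column store, and how completely
  SELECT segment_name, populate_status, inmemory_size, bytes_not_populated
  FROM   v$im_segments;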
Example - In-Memory Aggregation. New optimizer transformation: Vector Group By. Resembles the well-known star transformation. A two-phase, six-step process. Phase 1 - preparation: 1. Scan the dimensions. 2. Build key vectors. 3. Prepare the accumulator. 4. Build temporary tables for the selected dimension attributes. Phase 2 - computation: 5. Scan the facts, filtering with the key vectors. 6. Join the filtered facts with the temporary tables. A query of this shape is sketched below.
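A star query like the following is a candidate for the transformation; the hint forces it for demonstration (a sketch with invented fact and dimension names; the VECTOR_TRANSFORM hint exists since 12.1.0.2):

  SELECT /*+ VECTOR_TRANSFORM */
         d.region, t.year, SUM(f.amount) AS revenue
  FROM   sales_fact f
  JOIN   dim_customer d ON d.cust_id = f.cust_id
  JOIN   dim_time     t ON t.time_id = f.time_id
  WHERE  d.region = 'EMEA'
  GROUP  BY d.region, t.year;
  -- The plan then typically shows KEY VECTOR CREATE/USE and VECTOR GROUP BY operations.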
In-Memory - The Secret Sauce. Many reasons for outstanding In-Memory performance: the conceptual advantage of the columnar format; the speed of processing in DRAM; the sum of the technology gems (see earlier); a database engine aware of the columnar store's capabilities - extended optimizer costing model and transformations - extended software to use the columnar store's APIs. Unprecedented performance for analytics, transparently available for DB queries.
Agenda: Introduction, Old Times, Exadata, Big Data, Oracle In-Memory, Headquarters, Conclusions
Headquarters. Wikipedia: "Headquarters (HQ) denotes the location where most, if not all, of the important functions of an organization are coordinated." (diagram: query processing in the DB as HQ, coordinating Big Data storage, Exadata storage, the columnar store, the block buffer, and disks)
The Database Kernel Rules Them All. Query franchising in action: the optimizer generates the execution plan; partial queries are sent out to the other engines - Big Data (SQL) - the columnar in-memory store - Exadata storage; partial results are received and further processed; security policies are applied; final results are delivered. Divide and conquer between data management technologies.
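Where the optimizer sent the partial work can be read off the execution plan; a minimal sketch reusing the invented names from the earlier star-query example:

  EXPLAIN PLAN FOR
    SELECT d.region, SUM(f.amount)
    FROM   sales_fact f JOIN dim_customer d ON d.cust_id = f.cust_id
    GROUP  BY d.region;

  -- Operations such as TABLE ACCESS STORAGE FULL (Exadata offload),
  -- TABLE ACCESS INMEMORY FULL (column store), or EXTERNAL TABLE ACCESS
  -- (Big Data SQL) indicate which engine performs each piece of the work
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);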
The Key Lies in the Kernel. The database optimizer and execution engine make it happen. Transformer: - new transformations. Estimator: - new cost estimation models. Execution engine: - extended calls and APIs. Only possible because Oracle owns all the implementations and APIs involved.
Crucial Part - The Dictionary. The optimizer's estimates rely on - the data dictionary - statistics. The Data Dictionary knows all objects - Exadata: regular DB objects - In-Memory: regular DB objects - Big Data: defined through the External Table declaration. Estimating statistics for Big Data objects is challenging (see the sketch below).
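Gathering those statistics uses the same package as for regular tables; a minimal sketch against the external table from the earlier slides (the schema name is invented):

  -- External tables get no statistics automatically; gather them explicitly
  -- so the optimizer can cost Big Data access paths
  BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'BDUSER',
      tabname          => 'ORDER',
      estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
  END;
  /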
Agenda: Introduction, Old Times, Exadata, Big Data, Oracle In-Memory, Headquarters, Conclusions
Conclusions. Exadata - boosts execution for traditional applications and analytics. Big Data - provides affordable data management for lots of unstructured data. In-Memory - serves mighty fast scans, joins, and aggregations for analytics. With other vendors these technologies are either - not available in the desired quality - or not tightly integrated, if at all. Data silos & isolated solutions are being built again. But: Oracle provides top solutions for each. In fact, Oracle provides the only portfolio with - all three technologies tightly integrated - and central data management through the Oracle Database.