Oracle Database 12c and the Future of Data Warehousing in the Era of Big Data
George Lumpkin, Data Warehousing
Neil Mendelson, Big Data & Advanced Analytics
Vice Presidents, Server Technologies
September 29, 2014

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Big Data Opportunity
Typical use cases in today's world of fast exploration of big data
Industries: Financial Services, Utilities, Manufacturing, Retail, Telcos
Use cases: Portfolio Analysis, Fraud, Sessionization, Call Quality, Tracking Stock Market, Money Laundering, Network Analysis, Quality Assessment, Supply Planning, Buying Patterns, Returns Fraud, SIM Card Fraud, Money Laundering

Extending Data Management
Big Data = Hadoop + NoSQL + Relational
Hadoop: Change the Business
- Disrupt competitors
- Disintermediate supply chains
- Leverage new paradigms
- Exploit new analyses
NoSQL: Scale the Business
- Meet mobile challenges
- Accelerate developer agility
- Scale out economically
- Serve data faster
Relational: Run the Business
- Integrate existing systems
- Support mission-critical tasks
- Protect existing expenditures
- Ensure skills relevance
Oracle Confidential Internal/Restricted/Highly Restricted

But fundamental architectures remain
Oracle Information Management Reference Architecture
Flow: Data Ingestion to Information Interpretation
- Access & Performance Layer: Past, current and future interpretation of enterprise data. Structured to support agile access & navigation.
- Foundation Data Layer: Immutable modelled data. Business Process Neutral form. Abstracted from business process changes.
- Raw Data Reservoir: Immutable raw data reservoir. Raw data at rest is not interpreted.

New features for Data Warehousing and Big Data
Oracle Database 12c Release 1 (12.1.0.2)
Oracle Database In-Memory
- SIMD Vector Processing
- Column-Store Storage Indexes
- In-Memory Aggregation
New SQL Capabilities
- Attribute Clustering
- Zone Maps for Exadata
- JSON SQL Functions
- Approximate Count Distinct
Big Data SQL
Public

Oracle In-Memory Columnar Technology
Pure In-Memory Columnar (example table: SALES)
- Pure in-memory column format
- Not persistent, and no logging
- 2x to 20x compression
- Enabled at table or partition level

Scans Billions of Rows per Second per CPU Core
Example: Find all sales in region of CA
[Diagram: multiple REGION values are loaded from memory into a CPU vector register (CA, CA, CA, CA); a vector compare checks all values in 1 cycle; > 100x faster]
- Each CPU core scans local in-memory columns
- Scans use super fast SIMD (Single Instruction, Multiple Data values) vector instructions
- Originally designed for graphics & science
- Billions of rows/sec scan rate per CPU core

In-Memory Column Store Storage Index
SALES.ORDER_DATE, min/max per in-memory unit:
  IMCU 1: Min 1992-01-01, Max 1996-01-01
  IMCU 2: Min 2004-01-01, Max 2007-01-01
  IMCU 3: Min 2009-06-01, Max 2013-07-01
  IMCU 4: Min 1999-01-01, Max 2014-03-01
- Data stored in In-Memory Compression Units (IMCUs)
- A storage index records min/max values for each column unit
- Storage indexes allow IMCU pruning
  select * from SALES where ORDER_DATE between '2013-01-01' and '2014-01-01'

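To make the pruning rule concrete, here is a minimal Python sketch that applies the slide's query range to the four min/max pairs shown above. The function name `imcus_to_scan` is hypothetical, and the code only illustrates the overlap test; it is not Oracle's implementation.

```python
from datetime import date

# Min/max ORDER_DATE per IMCU, copied from the slide.
imcu_ranges = [
    (date(1992, 1, 1), date(1996, 1, 1)),
    (date(2004, 1, 1), date(2007, 1, 1)),
    (date(2009, 6, 1), date(2013, 7, 1)),
    (date(1999, 1, 1), date(2014, 3, 1)),
]

def imcus_to_scan(ranges, lo, hi):
    """Return indexes of units whose [min, max] overlaps the predicate range [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(ranges) if mx >= lo and mn <= hi]

# Predicate from the slide: ORDER_DATE between 2013-01-01 and 2014-01-01.
survivors = imcus_to_scan(imcu_ranges, date(2013, 1, 1), date(2014, 1, 1))
print(survivors)  # [2, 3]
```

The first two units end before 2013, so they can be skipped without reading any of their rows; only the last two must be scanned.
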
In-Memory Aggregation
New optimized algorithm for star query processing
Processing steps
- Transform joins into scan of the fact table
- Fast in-memory scan with array lookups
- Optimize aggregation using in-memory arrays
Optimized in-memory data structures
- Late joins to dimension data
- Minimizes data movement in the execution plan
Example: Report sales by Quarter and Region
[Diagram: Time maps to Quarters and Customers maps to Regions; an in-memory report outline of Regions by Quarters is filled with Sales totals]

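The processing steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Oracle's algorithm: the toy tables, the helper `dense_ids`, and all values are made up. Each dimension is reduced to a lookup from join key to a small group number, the fact table is scanned once while accumulating into an array, and the dimension labels are attached only at the end.

```python
# Hypothetical toy star schema; all contents are made up for illustration.
time_dim = {1: "Q1", 2: "Q1", 3: "Q2"}            # time_id -> quarter
cust_dim = {10: "West", 11: "East", 12: "West"}   # cust_id -> region
sales = [(1, 10, 100.0), (2, 11, 50.0), (3, 10, 70.0), (1, 12, 30.0), (3, 11, 20.0)]

def dense_ids(dim):
    """Map each join key to a small integer id for its group, plus the id -> label list."""
    labels = sorted(set(dim.values()))
    pos = {label: i for i, label in enumerate(labels)}
    return {key: pos[label] for key, label in dim.items()}, labels

quarter_of, quarters = dense_ids(time_dim)
region_of, regions = dense_ids(cust_dim)

# One pass over the fact table, accumulating into a quarter x region array.
totals = [[0.0] * len(regions) for _ in quarters]
for time_id, cust_id, amount in sales:
    totals[quarter_of[time_id]][region_of[cust_id]] += amount

# "Late join": attach dimension labels only to the aggregated cells.
report = {(quarters[q], regions[r]): totals[q][r]
          for q in range(len(quarters)) for r in range(len(regions))}
print(report)
```

The point of the design is that the per-row work is two array-style lookups and one addition; no join result is materialized before aggregation.
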
Zone Maps and Attribute Clustering
Attribute Clustering
- Orders data so that column values are stored together on disk
Zone maps
- Stores min/max of specified columns per zone
- Used to filter un-needed data during query execution
Combined Benefits
- Improved query performance and concurrency
  - Reduced physical data access
  - Significant IO reduction for highly selective operations
- Optimized space utilization
  - Less need for indexes
  - Improved compression ratios through data clustering
- Full application transparency
  - Any application will benefit

Attribute Clustering
Concept and Benefits
- Orders data so that it is in close proximity based on selected column values: attributes
- Attributes can be from a single table or multiple tables, e.g. from fact and dimension tables
- Able to cluster data during MOVE PARTITION
Benefits
- Significant IO pruning when used with zone maps
- Also, reduced block IO for table lookups in index range scans
- Improved performance for queries that sort and aggregate pre-ordered data
- Improved compression ratios: ordered data is likely to compress more than unordered data

Zone Maps
Persisted storage index
- Stores minimum and maximum of specified columns
  - Analogous to a coarse index structure
  - Much more compact than an index
  - Zone maps filter out what you don't need, indexes find what you do need
- Significant performance benefits with complete application transparency
  - IO reduction for table scans with predicates on the table itself or even a joined table using join zone maps (a.k.a. "hierarchical zone map")
  - Partitioning pruning for every column of a partitioned table, not only the partition key columns
- Benefits are most significant with ordered data
  - Used in combination with attribute clustering or data that is naturally ordered

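A small Python simulation can show why zone maps pay off mostly on ordered data. It is a sketch under assumptions: the zone size of 100 rows, the random column, and the helper names are all hypothetical, and sorting stands in for attribute clustering.

```python
import random

def build_zone_map(values, zone_size):
    """Record (min, max) of the column for each fixed-size zone of rows."""
    return [(min(values[i:i + zone_size]), max(values[i:i + zone_size]))
            for i in range(0, len(values), zone_size)]

def zones_scanned(zone_map, target):
    """Count zones that cannot be skipped for the predicate: column = target."""
    return sum(1 for mn, mx in zone_map if mn <= target <= mx)

rng = random.Random(0)
unordered = [rng.randrange(1000) for _ in range(10_000)]
clustered = sorted(unordered)   # stands in for attribute clustering

for name, col in (("unordered", unordered), ("clustered", clustered)):
    zm = build_zone_map(col, zone_size=100)
    print(name, zones_scanned(zm, 500), "of", len(zm), "zones scanned")
```

With unordered rows nearly every zone spans the whole value range, so almost nothing is skipped; once the rows are ordered, the matching value falls into only a handful of zones.
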
Attribute Clustering With Zone Maps
Example
CLUSTERING BY INTERLEAVED ORDER (category, country)
- Zone map benefits are most significant with ordered data
INTERLEAVED ORDER allows pruning with:
  SELECT .. FROM table WHERE category = 'BOYS';
  SELECT .. FROM table WHERE country = 'US';
  SELECT .. FROM table WHERE category = 'BOYS' AND country = 'US';

Zone Maps with Attribute Clustering
Star Schema Benchmark
- Overall, 2.6X elapsed time improvement over baseline
- Comparing with and without zone map and attribute clustering
[Chart: Query Elapsed Time Improvements; y-axis: Improvement (X), 0.0 to 9.0; x-axis: Query Step, 1 to 13]

New Performance Features Multiply the Benefits
- 100 TB of User Data
- 10 TB of User Data, with 10x compression
- 2 TB of User Data, with partition pruning (1 TB on disk, 1 TB in-memory)
- 100 GB of User Data, with storage indexes and zone maps
- 30 GB of User Data, with Smart Scan
Sub-second scan, no indexes

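The arithmetic behind "multiply the benefits" can be checked with a short Python calculation. The volumes are the ones on the slide; the only assumption added here is the unit conversion 1 TB = 1000 GB.

```python
# Data volumes from the slide, in GB (assuming 1 TB = 1000 GB).
stages = [
    ("raw user data", 100_000),
    ("10x compression", 10_000),
    ("partition pruning", 2_000),
    ("storage indexes and zone maps", 100),
    ("Smart Scan", 30),
]

# Reduction factor contributed by each technique, and the combined factor.
factors = [prev / cur for (_, prev), (_, cur) in zip(stages, stages[1:])]
overall = stages[0][1] / stages[-1][1]

for (name, size), factor in zip(stages[1:], factors):
    print(f"{name}: {size} GB ({factor:.1f}x less than the previous stage)")
print(f"overall: about {overall:.0f}x less data to scan")
```

Because the factors multiply rather than add (10 x 5 x 20 x 3.3), the final scan touches roughly one part in 3,300 of the original data.
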
Evolution of Analytical SQL
Release timeline: 8i, 9i, 10g, 11g, 12c
Features shown above the timeline, in order:
- Introduction of window functions
- Statistical functions
- SQL model clause
- Partition Outer Join
- In-database Data Mining
- Pattern matching
- Top N clause
- Approx Count distinct
- JSON support
Features shown below the timeline, in order:
- Enhanced window functions (percentile, etc)
- Rollup, grouping sets, cube
- SQL Pivot
- Recursive WITH
- ListAgg, Nth value window

Why SQL?
1. Enhanced Productivity
   - Using SQL, users simply describe the results they want
   - They do not have to describe how to get those results
   - Widespread availability of SQL skills and tools
2. Increased Performance
   - The SQL engine, not the user, determines how to optimize each query
   - Mature SQL engines have a broad arsenal of performance techniques
3. Adaptability
   - SQL has proven extensible to new data types and analytics

Approximate Count Distinct
- Not every query requires a completely accurate result
  - "How many distinct individuals visited our website last week?"
- New SQL function for approximate results for COUNT DISTINCT aggregates
  - APPROX_COUNT_DISTINCT()
- Approximate results can be significantly faster and use less resources than exact calculations
  - 5x to 50x++ times faster (depending upon number of distinct values and complexity of SQL)
  - Accuracy > 97% (with 95% confidence)

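The slide does not say which algorithm sits behind APPROX_COUNT_DISTINCT(), so the Python sketch below uses a different, well-known stand-in: a k-minimum-values estimator. It is meant only to show why an approximate distinct count can run in small, fixed memory; the function name, the choice of SHA-1 as the hash, and `k=1024` are all assumptions of this example.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64

def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values by tracking only the k smallest hashes."""
    heap = []        # max-heap via negation: holds the k smallest hashes seen so far
    members = set()  # same hashes, used to ignore duplicates
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest of the k smallest hashes.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)   # fewer than k distinct values seen: the count is exact
    kth_smallest = -heap[0]
    # If n hashes are spread uniformly, the k-th smallest sits near k/n of the range.
    return int((k - 1) * HASH_SPACE / kth_smallest)

visits = [i % 20_000 for i in range(100_000)]   # 20,000 distinct visitors, many repeats
print(approx_count_distinct(visits))
```

Memory stays bounded by `k` no matter how many rows are scanned, and the typical relative error shrinks roughly with the square root of `k`; that trade of a small error for far less work is the idea the slide describes.
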
Full power of SQL over JSON documents
Sample customers document:
  {
    "firstname": "John",
    "lastname": "Smith",
    "custid": 55241,
    "age": 25,
    "address": {
      "streetaddress": "21 2nd Street",
      "city": "New York",
      "state": "NY",
      "postalcode": "10021",
      "isbusiness": false
    },
    "phonenumbers": [
      {"type": "home", "number": "212 555-1234"},
      {"type": "fax", "number": "646 555-4567"}
    ]
  }
Count customers per postal code:
  select J.CUSTOMER_DOC.postalCode, count(*)
  from JSON_CUSTOMERS J
  group by J.CUSTOMER_DOC.postalCode;
Join JSON documents to relational sales data:
  select J.CUSTOMER_DOC.postalCode, sum(S.sales_revenue)
  from JSON_CUSTOMERS J, SALES S
  where J.CUSTOMER_DOC.custid = S.custid
  group by J.CUSTOMER_DOC.postalCode;

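For readers who want to see what the two queries compute, here is a rough Python equivalent over a handful of made-up documents. The documents and sales rows are hypothetical, shaped like the sample on the slide, and the postal code is read from the nested `address` object as it appears in that sample.

```python
import json
from collections import Counter, defaultdict

# Hypothetical documents and sales rows, shaped like the sample on the slide.
docs = [
    '{"custid": 55241, "lastname": "Smith", "address": {"postalcode": "10021"}}',
    '{"custid": 55242, "lastname": "Jones", "address": {"postalcode": "10021"}}',
    '{"custid": 55243, "lastname": "Lee", "address": {"postalcode": "94065"}}',
]
sales = [(55241, 120.0), (55241, 30.0), (55243, 75.0)]   # (custid, sales_revenue)

customers = [json.loads(d) for d in docs]

# Rough equivalent of: select postal code, count(*) ... group by postal code
by_postcode = Counter(c["address"]["postalcode"] for c in customers)

# Rough equivalent of the join: sum sales revenue per customer postal code.
postcode_of = {c["custid"]: c["address"]["postalcode"] for c in customers}
revenue = defaultdict(float)
for custid, amount in sales:
    revenue[postcode_of[custid]] += amount

print(dict(by_postcode))
print(dict(revenue))
```

The SQL on the slide expresses the same grouping and join declaratively, without the hand-written parsing and lookup tables.
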
Future of Data Warehousing in the Age of Big Data
Barriers to Big Data Adoption
Complexity
Skills
- Lack tools and training to exploit Big Data
- IT Operations ability to administer and manage Big Data
Integration
- Adding Big Data to existing architecture is complex
- Too much effort required in data preparation
Security
- No clear route to governance or enforcement

Big Data Management
Hadoop + NoSQL + Relational
The Power of Oracle SQL
Wide variety of Big Data types
- Structured data: Numeric, string, date, ...
- Unstructured data: LOBs, Text, XML, JSON, Spatial, Graph, Multimedia
Rich SQL Analytic Functions
- Ranking, Windowing, LAG/LEAD, Aggregate, Statistical, Linear Regression, Correlations, Cross Tabs, Hypothesis Testing, Distribution Fitting, ...

What gives Exadata extreme performance?
Exadata: Applies SmartScan Close to the Data
Query Data in RDBMS
[Diagram: Oracle SQL runs on Exadata, which pushes scans down to the Oracle Exadata Storage Servers]

Oracle Big Data SQL
Exadata & Big Data SQL: Applies SmartScan Close to All Data
Query Data in RDBMS and Hadoop
- Fast
- Massive Parallelism
- Filtered Locally
- Minimized Data Movement
[Diagram: Oracle SQL runs on Exadata and reaches both the Oracle Exadata Storage Servers and the Big Data Appliance, where each HDFS Data Node is paired with a BDS Server]

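The "filtered locally, minimized data movement" idea can be illustrated with a toy Python model. Everything in it is hypothetical (the node contents, the function names `scan_node` and `query`); it only demonstrates pushing the filter and column projection to where the data lives, and says nothing about how Big Data SQL is implemented internally.

```python
# Hypothetical data nodes, each holding a slice of a call-detail table.
nodes = [
    [{"caller": "A", "region": "CA", "secs": 30}, {"caller": "B", "region": "NY", "secs": 12}],
    [{"caller": "C", "region": "CA", "secs": 45}, {"caller": "D", "region": "TX", "secs": 8}],
    [{"caller": "E", "region": "NY", "secs": 60}, {"caller": "F", "region": "CA", "secs": 5}],
]

def scan_node(rows, predicate, columns):
    """Storage side: filter rows and keep only the needed columns before returning."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]

def query(nodes, predicate, columns):
    """Database side: gather the already-filtered results from every node."""
    result = []
    for rows in nodes:   # in a real system the nodes would work in parallel
        result.extend(scan_node(rows, predicate, columns))
    return result

rows_stored = sum(len(rows) for rows in nodes)
result = query(nodes, lambda r: r["region"] == "CA", ["caller", "secs"])
print(f"{len(result)} of {rows_stored} rows crossed the network")
```

Only the matching rows, trimmed to the requested columns, leave each node; the alternative of shipping every row to the database and filtering there moves all six.
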
Oracle Big Data SQL: A New Hadoop Processing Engine
Processing Layer
- MapReduce and Hive
- Spark
- Impala
- Search
- Big Data SQL
Resource Management (YARN, cgroups)
Storage Layer
- Filesystem (HDFS)
- NoSQL Databases (Oracle NoSQL DB, HBase)

Apply Advanced Security on Hadoop & NoSQL
Same security policies apply to Hadoop & Relational
[Diagram: JSON data stays unconverted in Hadoop and customer data stays in Oracle; SQL applies Redaction, Virtual Private Database and Fine-grain Access Control; Hadoop returns a redacted data subset to Oracle Database 12c, so a small data subset is quickly returned]
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'txadp_hive_01',
    object_name   => 'customer_address_ext',
    column_name   => 'ca_street_name',
    policy_name   => 'customer_address_redaction',
    function_type => DBMS_REDACT.RANDOM,
    expression    => 'SYS_CONTEXT(''SYS_SESSION_ROLES'', ''REDACTION_TESTER'')=''TRUE'''
  );

Govern and Secure Your Data With Oracle
Hadoop, NoSQL & Relational
BDA Capability
- Authentication through Kerberos
- Authorization through Apache Sentry
- Auditing through Oracle Audit Vault
- Encryption for Data-at-Rest
- Network Encryption
Big Data SQL adds Advanced Security on Hadoop & NoSQL
- Redaction
- Virtual Private Database
- Fine-grain Access Control

"When eating an elephant take one bite at a time."
General Creighton Abrams

Experiment
Big Data Appliance X4-2, 6 Node Starter Rack
- 2 * 8 Core Intel Xeon E5 Processors/Node
- 384 GB / 3 TB (64 GB Memory / expandable to 512 GB/Node)
- 288 TB (48 TB Disk space/node)
Integrated Software
- Oracle Linux, Oracle Java VM
- Oracle Big Data SQL*, Oracle Big Data Connectors*
- Cloudera Distribution of Apache Hadoop EDH Edition
- Cloudera Manager
- Oracle R Distribution
- Oracle NoSQL Database
40% Cost Savings
33% Faster Time to Value
* Licensed separately

Data Lifecycle Management & Query Offload
More data on-line and available at a lower cost
[Diagram: the rolling 13 months stay in Oracle Exadata (tiers shown: DRAM, PCI Flash; data labelled Hottest Data, Active Data, Warm Data); months 14 to n sit in Hadoop on the Big Data Appliance (Deep Data) and are reached through Oracle Big Data SQL]
Move Partition to BDA: Big Data Rolling Windows Process
- Copy older partition to BDA
- Update views
- Drop older Exadata partition
Offloaded data can be accessed via Oracle & Hadoop
No application changes required

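The three-step rolling window process (copy, update views, drop) can be sketched as a small Python function. This is a hypothetical model, not a tool or API from the slide: partitions are represented as "YYYY-MM" strings, and `roll_window` and its return values are names invented for the example.

```python
def roll_window(exadata_partitions, bda_partitions, keep_months=13):
    """Keep the newest `keep_months` monthly partitions in Exadata; offload the rest."""
    ordered = sorted(exadata_partitions)            # "YYYY-MM" strings sort by date
    cut = max(0, len(ordered) - keep_months)
    moved = ordered[:cut]
    offloaded = sorted(bda_partitions + moved)      # 1. copy older partitions to BDA
    remaining = ordered[cut:]                       # 3. drop them from Exadata
    view = offloaded + remaining                    # 2. view still spans every month
    return remaining, offloaded, moved, view

# Fifteen monthly partitions: 2013-01 through 2014-03.
exadata = [f"2013-{m:02d}" for m in range(1, 13)] + ["2014-01", "2014-02", "2014-03"]
remaining, offloaded, moved, view = roll_window(exadata, bda_partitions=[])
print("moved to BDA:", moved)
print("months visible through the view:", len(view))
```

Because queries go through the view, which still covers every month, applications keep working unchanged while the older partitions live on cheaper storage.
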
Offload ETL from Data Warehouse
- Offload long running ETL jobs to Hadoop
- New Sources via Hadoop
- Leave existing ETL in place
[Diagram: Sources feed Staging; in Hadoop, Files pass through MR to Detail and through MR to Temp, then Fast load into the Data Warehouse; other paths shown: Files, SQL, Oracle GoldenGate, Oracle Data Integrator, ETL Tool]

Statistical & Predictive Analytics
Bring the Analytics to the Data
Hadoop / Big Data Appliance
- Oracle R Distribution (1)
- Oracle R Advanced Analytics for Hadoop (2)
- SAS High Performance Analytics
Oracle Database / Exadata
- Oracle Advanced Analytics Option
- SAS High Performance Analytics
(1) Included with BDA
(2) Included w/Oracle Big Data Connectors

Oracle Big Data Discovery + Advanced Analytics
Changing the Game for Agile Business Innovation on Big Data
- Profile: Easily add data and see it automatically and continuously cataloged, enriched and related
- Find: Use familiar guided search across massive amounts of diverse data
- Understand: Know what's important from diagnostic analysis of millions of data characteristics
- Transform: Powerful tools to quickly clean up and wrangle dirty data so it's ready to go
- Discover: Uncover valuable new insights
- Predict: Use new insights to define and refine predictive models
- Collaborate: Publish, share and evolve as you learn more

Cloud Platform: Big Data Analytics
Big Data Service
- Integrated with DBaaS
- SQL on Hadoop
- Hadoop 2.0 Cluster
- NoSQL Service for key value data
- Persistent Data Reservoir in Storage Service
- Single tenant or multitenant IaaS offerings for performance/QoS: commodity with NAS, Big Data Appliance
Big Data Discovery: The Visual Face of Big Data
- Business user and data scientist collaboration
- Self-service data discovery and exploration to separate signal from noise
- Fully managed infrastructure by Oracle Cloud operations
- Hadoop scalability and cost economies

Summary
Oracle Data Warehousing in the era of Big Data
Innovating and preserving customer investments
Leverage 12c innovations
- Real time analytics with Oracle Database In-Memory
- Performance of Exadata
- Power of SQL
Extend your Data Warehouse with Big Data
- Oracle SQL across Oracle, Hadoop & NoSQL
- Fast, massively parallel and interactive data access
- Reduced data movement throughout the enterprise
- Securing access to Big Data analytics
Deploy on choice of private and public Clouds