The Cubetree Storage Organization




The Cubetree Storage Organization

Nick Roussopoulos & Yannis Kotidis
Advanced Communication Technology, Inc.
Silver Spring, MD 20905
Tel: 301-384-3759  Fax: 301-384-3679
{nick,kotidis}@act-us.com

1. Introduction

Relational On-Line Analytical Processing (ROLAP) is emerging as the dominant approach in data warehousing. To enhance query performance, the ROLAP approach relies on selecting and materializing in summary tables appropriate subsets of aggregate views, which are then used to speed up OLAP queries. However, a straightforward relational storage implementation of materialized ROLAP views is immensely wasteful on storage and inadequate on query performance and incremental update speed.

ACT Inc. has developed a unique storage organization for on-line analytical processing (OLAP) on existing relational database systems (RDBMS). The Cubetree Storage Organization [1] (CSO) logically and physically clusters the data of materialized views, single- and multi-dimensional indices on these data, and computed aggregate values, all in one compact and tight storage structure that uses a fraction of the conventional table-based space. This is a breakthrough technology for storing and accessing multi-dimensional data in terms of storage reduction, query performance, and incremental bulk update speed. A cubetree is not attached to a specific materialized view or index; it acts as a placeholder that stores multiple, possibly unrelated, views and multi-dimensional indices without having to analyze and identify the dense and sparse areas of the underlying multi-dimensional data. In the CSO, tuples are clustered regardless of the skewness of the underlying data.

Given a set of views that need to be supported, the CSO optimizer finds the best placement on one or more cubetrees for achieving maximum clustering. The underlying cost model is tailored to the CSO and uses standard greedy algorithms for selecting the optimal configuration.
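The paper does not publish the optimizer's algorithm; the following is a minimal sketch of the kind of standard greedy selection it alludes to, with a hypothetical benefit function and storage budget standing in for the actual CSO cost model.

```python
# Greedy view placement sketch: repeatedly pick the view with the best
# benefit-per-unit-storage that still fits the budget. The benefit and
# size functions are placeholders, not the real CSO cost model.

def greedy_placement(views, capacity, benefit, size):
    chosen, used = [], 0
    remaining = set(views)
    while remaining:
        # Best remaining view by benefit density (benefit per GB stored).
        best = max(remaining, key=lambda v: benefit(v, chosen) / size(v))
        if benefit(best, chosen) <= 0 or used + size(best) > capacity:
            break  # no further view improves things within the budget
        chosen.append(best)
        used += size(best)
        remaining.remove(best)
    return chosen, used

# Toy example with made-up sizes (GB) and benefits:
sizes = {"v1": 1, "v2": 4, "v3": 10}
gains = {"v1": 5, "v2": 12, "v3": 15}
picked, used = greedy_placement(
    sizes, capacity=12,
    benefit=lambda v, chosen: gains[v],  # ignore view interactions here
    size=lambda v: sizes[v])
# picks v1 and v2 (densities 5.0 and 3.0); v3 no longer fits
```

The same shape accommodates the DBA knobs described above: capping `capacity` bounds storage, while a benefit function penalizing update cost bounds the maintenance window.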
The DBA can tune the optimization to upper-bound either the duration of the incremental update window or the maximum storage used. The CSO offers a scalable yet inexpensive alternative to existing pricey data warehousing and indexing tools. It is designed to be used as a stand-alone system, as a co-resident of an RDBMS, or tightly coupled with the extensible facilities of an Object-Relational DBMS.

2. OLAP and the Data Cube

On-Line Analytical Processing (OLAP) is critical for decision support in today's competitive, low-margin era. The right information at the right time is a necessity for survival, and the data warehousing industry shows explosive market growth compared to a stagnant one for database engines. The data cube is a multi-dimensional aggregate operator and is very powerful for modeling the multiple projections of the data in a data warehouse. It is, however, very expensive to compute, and accessing some or all of these projections efficiently requires extensive multi-dimensional indexing, which adds an equal or even higher cost to data warehouse maintenance. Computing the data cube requires enormous hardware resources and takes a lot of time. Indexing can be done using either a multi-dimensional (MOLAP) approach or relational (ROLAP) modeling, exemplified by star schemas and bitmap join indexes. Both approaches speed up queries but suffer enormously in index maintenance, which essentially takes the data warehouse offline for an extensive period. This downtime window is critical in applications that must be globally available for most of the day.

[1] US Patent Pending

3. Performance: the ultimate reality check

The CSO avoids the high cost of creation and maintenance by sorting and bulk incremental merge-packing of cubetrees. Sorting is the dominant factor in this incremental bulk update mode of operation, while the merge-packing portion achieves rates of 5-7 GB per hour on a single disk I/O. No other storage organization offers this scalable bulk merge-packing technology. Experiments with various data sets showed that multi-dimensional range queries perform very fast; this performance is the result of the packing and compression of the CSO and the achieved reduction of disk storage over conventional relational storage and indexing. In a recent test using the TPC-D benchmark data, the CSO implementation achieved at least a 2:1 storage reduction, 10:1 better OLAP query performance, and 100:1 faster updates over the conventional (relational) storage organization of materialized OLAP views indexed in the best possible way.

The CSO optimizer selects the best set of views to materialize and the best placement on one or more cubetrees for achieving maximum clustering. The underlying cost model is tailored to the CSO and uses standard greedy algorithms for selecting the optimal configuration. The DBA can tune the optimization to upper-bound either the duration of the incremental update window or the maximum storage used. Another feature of the CSO software is that incremental maintenance can be done on-line while the data warehouse is operational, using the previous version of the cubetrees while creating the new ones. Although such background continuous maintenance steals I/O from query processing, it achieves no down-time for the warehouse and can take advantage of low-peak periods.

4. The Informix CSO Datablade

The CSO has also been developed as a datablade for the Informix Universal Server (IUS) environment.
IUS and the CSO inter-operate harmoniously in a tightly coupled configuration with no overhead. This architecture takes advantage of the extensibility of the Object-Relational DBMS (O-R DBMS) and allows the inclusion of user-defined aggregate and statistical functions not normally offered by RDBMSs. The CSO datablade offers all the advantages of the stand-alone CSO software, namely efficient OLAP aggregate queries and bulk-incremental updates on them, and at the same time transparent calls to the CSO software from within the IUS SQL environment. Results from the CSO search are merged with results from the IUS database.

5. Extensibility

The CSO has been developed with extensibility in mind. New data types for the dimension and measure attributes can be accommodated. Similarly, the aggregate functions on the measure attributes can be user-defined for specialized computation. Such functions can return either a scalar or a vector value. Additional indexing on the aggregate values themselves can also be stored in the same cubetree storage. Other functions that extend SQL have been developed and incorporated into the Cubetree SQL, some of which are illustrated in the examples given in the appendix.
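The CSO's registration interface for user-defined aggregates is not documented here; the sketch below only illustrates the usual init/step/finish shape such functions take, including one scalar-valued and one vector-valued aggregate as the text describes.

```python
# Hypothetical user-defined aggregate interface: an aggregate object is
# initialized, fed one measure value at a time, and asked for a result.

class SumAgg:
    """Scalar aggregate: running sum of a measure attribute."""
    def __init__(self):
        self.total = 0.0
    def step(self, value):
        self.total += value
    def finish(self):
        return self.total

class MinMaxAgg:
    """Vector aggregate: (min, max) of a measure in a single pass."""
    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")
    def step(self, value):
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)
    def finish(self):
        return (self.lo, self.hi)

def aggregate(values, agg_cls):
    """Drive any aggregate class over a stream of measure values."""
    agg = agg_cls()
    for v in values:
        agg.step(v)
    return agg.finish()

scalar_result = aggregate([1.5, 2.5], SumAgg)      # 4.0
vector_result = aggregate([3, 1, 2], MinMaxAgg)    # (1, 3)
```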

6. A Grocery Benchmark

This benchmark uses a synthetically generated grocery demo dataset that models supermarket transactions. The data warehouse is organized as a star schema. The sales fact table includes 12 dimension attributes and two measure attributes, namely revenue and cost. The hardware platform used for this benchmark is a 180MHz UltraSPARC I with a single SCSI 9GB hard drive.

We precomputed seven aggregate tables as materialized views and stored them using the Cubetree Storage Organization. These views aggregate data over some attributes chosen from the fact and dimension tables and compute the sum function for both measures. The following table summarizes the characteristics of the stored data:

View name  Gross revenue & cost by:                                  Dimensions  Rows
view 1     state                                                     1           49
view 2     store, state                                              2           2,000
view 3     product, store                                            2           17,158,551
view 4     customer, product, store                                  3           39,199,723
view 5     pay method, time, store                                   4           29,382,902
view 6     store, product                                            2           10,000
view 7     store, product, customer, pay method,                     6           40,067,142
           employee, category
Total                                                                12          125,820,367

Loading the view data into the cubetrees requires first sorting it in the CSO order and then bulk-packing it into the cubetrees. The following table shows the time for the initial load of the cubetrees and the time for each bulk-incremental update of approximately 0.6 GB of data. In the initial load of the views the dominant cost is the sort time, as it was done on a single disk with no parallel I/O. Bulk-packing and incremental bulk merge-packing of the CSO utilize almost the full disk bandwidth, roughly 6 GB per hour on a single disk.

Increment        Total rows  Sort time  Pack time  Cubetree   Packing rate
(million rows)   (million)   (h:m:s)    (h:m:s)    size (GB)  (GB/hour)
88.2             88.2        2:07:04    0:22:21    2.40       6.44
18.2             106.4       0:25:18    0:33:01    2.93       5.33
19.4             125.8       0:27:20    0:40:22    3.50       5.20

To emphasize the storage advantage of the CSO, we loaded the summary tables as materialized views on a commercial RDBMS, using plain unindexed relational tables to store the data. The space requirement of the views stored as tables was 5.15 GB, compared to a total of 3.5 GB for the CSO, i.e. 47% more. At the same time the CSO provides full indexing capabilities for the stored OLAP data, whereas to give the summary tables acceptable query performance they would require indexing that adds more than three times the raw data.

[Graph omitted: storage required in each of the three storage configurations.]
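The packing rates in the load table follow directly from cubetree size divided by pack time; a quick arithmetic check (differences from the table are rounding in the reported sizes):

```python
# Sanity-check the reported packing rates:
# rate (GB/hour) = cubetree size (GB) / pack time (hours).

def rate_gb_per_hour(size_gb, pack_time):
    h, m, s = (int(x) for x in pack_time.split(":"))
    return size_gb / (h + m / 60 + s / 3600)

r1 = rate_gb_per_hour(2.40, "0:22:21")  # ≈ 6.44, initial load
r2 = rate_gb_per_hour(2.93, "0:33:01")  # ≈ 5.33, first increment
r3 = rate_gb_per_hour(3.50, "0:40:22")  # ≈ 5.20, second increment
```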

OLAP queries using the CSO are very fast because they take advantage of the unique clustering of the data. Retrieval has an almost zero noise-to-data ratio: after the first record satisfying the query is found, the remaining records are very close, with little noise data interleaved with the useful data. The CSO exploits both the high sequential transfer rate of the storage and the clustering of the data. Below is a small sample of queries, their result sizes, and their real execution times for the grocery benchmark.

Query 1 (view3, 17,158,551 rows; 17,143 rows returned in 0.8 s):

select product_id, store_id, sum(revenue), sum(cost)
from view3
where (store_id = 155) or (store_id = 156)
group by product_id, store_id;

Query 2 (view5, 29,382,902 rows; 234,699 rows returned in 1.87 s):

select store_id, basket_start_tm, basket_end_tm, sum(revenue-cost) as profit
from view5
where (store_id >= 50) and (store_id <= 1500)
  and (pay_method = 'ATM')
  and (between(basket_start_tm, '12:00', '12:40'))
  and (between(basket_end_tm, '12:00', '12:40'))
group by store_id, basket_start_tm, basket_end_tm;

Query 3 (view7, 40,067,142 rows; 7 rows returned in 0.12 s):

select customer_id, product_id, sum(revenue)
from view7
where (store_id = 901) and (pay_method = 'Cash')
  and (category = 'FOOD') and (employee = 'Jim_G')
group by customer_id, product_id;

Query 4 (view4, 39,199,723 rows; 260 rows returned in 0.06 s):

select store_id, product_id, sum(revenue), sum(cost)
from view4
where (customer_id = 5561)
group by store_id, product_id;

Query 5 (view5, 29,382,902 rows; 7,344,419 rows returned in 50.21 s):

select store_id, basket_start_tm, basket_end_tm, sum(revenue), sum(cost)
from view5
where (pay_method = 'Check')
group by store_id, basket_start_tm, basket_end_tm;

Query 6 (view2, 2,000 rows; 41 rows returned in 0.01 s):

select store_id, sum(revenue-cost) as profit
from view2
where (state = 'MD')
group by store_id;
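The near-zero noise-to-data ratio can be illustrated with a toy model (not CSO internals): once tuples are clustered in sort order on the query dimensions, a range query jumps to the first match and then reads one contiguous run, whereas an unclustered layout must read every row.

```python
# Toy illustration of clustering and the noise-to-data ratio.
import bisect

def unclustered_scan(keys, lo, hi):
    """Unclustered: every row is read to find the matches."""
    matched = sum(lo <= k <= hi for k in keys)
    return len(keys), matched          # (rows read, rows matched)

def clustered_scan(sorted_keys, lo, hi):
    """Clustered: binary-search to the first match, read a contiguous run."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return end - start, end - start    # rows read == rows matched: zero noise

keys = [17, 3, 42, 8, 25, 11, 30, 5]
read_u, hit_u = unclustered_scan(keys, 5, 20)          # reads 8 rows for 4 hits
read_c, hit_c = clustered_scan(sorted(keys), 5, 20)    # reads 4 rows for 4 hits
```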

Appendix: The Cubetree SQL

The Cubetree SQL is easy to use and in many cases linguistically simpler than standard SQL. To create a new cubetree, the user can execute the following procedure from an SQL interface:

execute procedure create_cube(<cube name>, <dimension list>, <measure list>,
                              <view list>, <function list>, <target directory>);

where:

<cube name>:        any legal relational table name; it is used for accessing the data later on.
<dimension list>:   a list of all dimensions, along with their data types, present in the cube.
<measure list>:     defines the measure attributes. There can be multiple measure values per record.
<view list>:        a selection of views over the defined dimensions that the user wants materialized. If omitted, the full data cube is created.
<function list>:    defines the set of functions that will be computed for the given measures.
<target directory>: specifies where the cubetree is stored.

For example, to create a three-dimensional partial data cube over the store_id, product_id and customer_id dimensions, with two measure attributes revenue and cost, for the following views:

View1:
select store_id, sum(revenue) as gross_revenue, sum(cost) as total_cost
from fact_table
group by store_id

View2:
select store_id, customer_id, product_id,
       sum(revenue) as gross_revenue, sum(cost) as total_cost
from fact_table
group by store_id, customer_id, product_id

one can use the following procedure call:

execute procedure create_cube('cubed',
    'store_id int, product_id int, customer_id int',
    'revenue float, cost float',
    '((store_id), (store_id, customer_id, product_id))',
    'sum(revenue) as gross_revenue, sum(cost) as total_cost',
    '/data/cubes');

Updating the cubetree is extremely simple.
Suppose a temporary table delta holds a new set of updates; the following call refreshes all views stored in cubed:

execute procedure update_cube(cubed, delta);

For greater flexibility, delta can also be a predefined view that filters in data for the cubetree, or even a source external to the RDBMS that provides the system with batch updates.

Querying the cubetree is simpler than straight SQL. For example, to find the profit for a given store, the Cubetree SQL query is:

select (gross_revenue - total_cost) as profit
from cubed
where store_id = 419;

which is equivalent to the straight SQL query:

select sum(revenue - cost) as profit
from fact_table
where store_id = 419
group by store_id;

Similarly, to find all sales to a customer, for a range of products at a given store, that generated revenue above a certain value, one writes the Cubetree SQL query:

select product_id, gross_revenue
from cubed
where product_id >= 10000 and product_id < 12000
  and customer_id = 711 and store_id = 26
  and gross_revenue > 5000;

which is simpler than the straight SQL query:

select product_id, sum(revenue)
from fact_table
where product_id >= 10000 and product_id < 12000
  and customer_id = 711 and store_id = 26
group by product_id
having sum(revenue) > 5000;
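The equivalence between the precomputed-view lookup and the straight SQL aggregate can be checked with a tiny simulation. This uses standard SQLite, not the Cubetree engine, with a plain table named cubed standing in for the materialized view:

```python
# Demonstrate that a lookup against a precomputed aggregate ('cubed')
# equals the straight SQL aggregate over the fact table. Plain SQLite;
# the table names mirror the appendix examples.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_table (store_id INT, revenue REAL, cost REAL);
    INSERT INTO fact_table VALUES (419, 100, 60), (419, 50, 20), (7, 10, 5);
    -- 'cubed' stands in for the materialized view a cubetree would hold
    CREATE TABLE cubed AS
        SELECT store_id,
               SUM(revenue) AS gross_revenue,
               SUM(cost)    AS total_cost
        FROM fact_table GROUP BY store_id;
""")

# Cubetree-style query against the precomputed aggregate ...
via_view = con.execute(
    "SELECT gross_revenue - total_cost AS profit "
    "FROM cubed WHERE store_id = 419").fetchone()[0]

# ... equals the straight SQL aggregate over the fact table.
via_fact = con.execute(
    "SELECT SUM(revenue - cost) FROM fact_table "
    "WHERE store_id = 419 GROUP BY store_id").fetchone()[0]
# both yield 70.0 for this toy data
```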
