Database Applications
Advanced Querying
  Transaction processing
  - Online setting
  - Supports day-to-day operation of business
  OLAP / Data Warehousing
  - Decision support
  - Offline setting
  - Strategic planning (statistics)

Transaction Processing
  Transaction processing
  - Operational setting
  - Up-to-date = critical
  - Simple data
  - Simple queries
  Example: flight reservations
  - ticket sales: do not sell a seat twice
  - simple data: reservation, date, name
  - simple queries: "Give flight details of X", "List flights to Y"
  Database must support
  - simple data: tables
  - simple queries: select ... from ... where
  - consistency & integrity: CRITICAL
  - concurrency
  Relational, object-oriented, and object-relational databases

Decision Support
  Decision support
  - Off-line setting
  - «Historical» data
  - Summarized data
  - Different databases
  - Statistical queries
  Example: flight company
  - Evaluate ROI of flights
  - Flights of last year
  - # passengers on line L
  - Passengers, fuel costs, maintenance info
  - Average % of seats sold per month per destination

Data Warehouse
  A decision support DB that is maintained separately from the organization's operational databases.

Why Separate Data Warehouse?
  High performance for both systems
  - DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
  Different functions and different data
  - Missing data: decision support requires historical data, which operational DBs do not typically maintain
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
Three-Tier Architecture
  [Figure: three-tier data warehouse architecture]
  - Data Sources: operational DBs, other sources
  - Extract, Transform, Load, Refresh (Monitor & Integrator, Metadata)
  - Data Storage: Data Warehouse, Data Marts; serves the OLAP Engine
  - OLAP Engine: OLAP Server, ROLAP Server
  - Front-End Tools: Analysis, Query/Reporting, Data Mining

OLAP
  - OLAP = OnLine Analytical Processing
  - Online = no waiting for answers
  - OLAP system = system that supports analytical queries that are dimensional in nature

This Lecture
  - Examples of decision support queries
  - Data Cubes
    - Conceptual data model
    - Typical operations
  - Implementation
    - ROLAP vs MOLAP
    - Indexing structures
  - SQL:1999 support for OLAP

Examples of Queries
  Flight company: evaluate ticket sales
  - give total, average, minimal, maximal amount
    - per date: week, month, year
    - by destination/source: port, country, continent
    - by ticket type
    - by # of connections

Characteristics
  - One special attribute: amount -> measure
  - Other attributes: select relevant regions -> dimensions
  - Different levels of generality (month, year, ...) -> hierarchies
  - Measure data is summarized: sum, min, max, average -> aggregations

Supermarket example
  Evaluate the sales of products
  - Product cost in $
  - Customer: ID, city, state, country, ...
  - Store: chain, size, location, ...
  - Product: brand, type, ...
  What are the measure and dimensional attributes, and where are the hierarchies?
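The vocabulary on the "Characteristics" slide (measure, dimensions, hierarchies, aggregations) can be made concrete with a small sketch. The records and the names `TICKETS` and `summarize` below are hypothetical and only illustrate the idea: the amount is the measure, the remaining attributes are dimensions, and month/year and country/continent are two levels of the same hierarchy.

```python
from collections import defaultdict

# Hypothetical ticket sales: (month, year, country, continent, amount).
# "amount" is the measure; the other attributes are dimensions.
TICKETS = [
    ("2006-06", 2006, "Netherlands", "Europe", 120.0),
    ("2006-06", 2006, "France", "Europe", 80.0),
    ("2006-07", 2006, "Netherlands", "Europe", 100.0),
    ("2007-01", 2007, "Japan", "Asia", 500.0),
]

MONTH, YEAR, COUNTRY, CONTINENT, AMOUNT = range(5)


def summarize(records, dims):
    """Group records on the given dimension columns and aggregate the measure."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[d] for d in dims)].append(rec[AMOUNT])
    return {
        key: {
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


# Fine level of the hierarchies: per month and country.
per_month_country = summarize(TICKETS, (MONTH, COUNTRY))
# Coarser level of the same hierarchies: per year and continent.
per_year_continent = summarize(TICKETS, (YEAR, CONTINENT))
```

Choosing different column pairs for `dims` corresponds to choosing which dimensions, and which hierarchy level, a decision support query groups on.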
Why dimensions?
  Multidimensional view on the data
  [Figure: axes labeled store, customer, product]

Cross Tabulation
  Cross-tabulations are highly useful.
  Sales of clothes, June-August 2006 (cost in $), by product color and month:

            Blue    Red   Orange   Total
  June        51     25      158     234
  July        58     20      120     198
  August      65     22       51     138
  Total      174     67      329     570

Data Cubes
  - Extension of cross-tables to multiple dimensions
  - Conceptual notion
  In the cross-table above:
  - the nine inner cells are the data points (1st level of aggregation)
  - month and color are the dimensions
  - the totals 234, 198, 138 are aggregated w.r.t. the X-dim
  - the totals 174, 67, 329 are aggregated w.r.t. the Y-dim
  - the total 570 is aggregated w.r.t. X and Y
  [Figure: three-dimensional data cube; Product: TV, PC, VCR, sum; quarters: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum; Country: Ireland, France, Germany, sum]

Data Cubes
  - Base cuboid = the n-dimensional cuboid, where n is the number of dimensions
  - The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
  - The lattice of cuboids forms a data cube

Lattice of Cuboids
  0-D (apex):  all
  1-D:         product | date | country
  2-D:         product, date | product, country | date, country
  3-D (base):  product, date, country
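The cross-table totals and the lattice of cuboids can both be reproduced in a few lines. The sketch below takes the nine data points from the clothes example and recomputes the three kinds of totals, then enumerates the cuboids for the dimensions product, date, and country. The names `CELLS`, `totals`, and `cuboids` are illustrative choices, not from the slides.

```python
from itertools import combinations

# The nine data points of the cross-table, keyed by (month, color).
CELLS = {
    ("June", "Blue"): 51, ("June", "Red"): 25, ("June", "Orange"): 158,
    ("July", "Blue"): 58, ("July", "Red"): 20, ("July", "Orange"): 120,
    ("August", "Blue"): 65, ("August", "Red"): 22, ("August", "Orange"): 51,
}


def totals(cells, axis):
    """Sum the cells that share the same value on the given axis of the key."""
    out = {}
    for key, value in cells.items():
        out[key[axis]] = out.get(key[axis], 0) + value
    return out


month_totals = totals(CELLS, 0)     # aggregated over colors
color_totals = totals(CELLS, 1)     # aggregated over months
grand_total = sum(CELLS.values())   # aggregated over both dimensions


def cuboids(dimensions):
    """List every subset of the dimensions, from the apex () to the base cuboid."""
    return [
        combo
        for k in range(len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]


lattice = cuboids(("product", "date", "country"))
```

With n dimensions the lattice has 2^n cuboids, here 8: the empty tuple is the apex cuboid ("all") and the full tuple is the base cuboid.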
Operations with Data Cubes
  Scenario: before starting the analysis task, decide what data to use
  - select a few relevant dimensions
  - define hierarchy
  - aggregation functions of interest
  Pre-materialize
  - load data
  - compute counts, max, min, avg, ... beforehand

Operations with Data Cubes
  What operations can you think of that an analyst might find useful? (e.g., store)
  - only look at stores in the Netherlands
  - look at cities instead of individual stores
  - look at the cross-table for product-date
  - restrict analysis to 2006, product O1
  - go back to a finer granularity at the store level

Roll-Up
  Move in one dimension from a lower granularity to a higher one
  - store -> city
  - cities -> country
  - product -> product type

Drill-down
  Move in one dimension from a higher granularity to a lower one
  - city -> store
  - country -> cities
  - product type -> product

Pivoting
  Change the dimensions that are displayed; select a cross-tab.
  - look at the cross-table for product-date
  - display cross-table for date-customer

Drill-through
  Go back to the original, individual data records
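As a rough illustration of the operations listed above, the following sketch treats a cube as a dictionary from coordinate tuples to a summed measure. The store-to-city mapping, the cube contents, and the function names are all hypothetical. Roll-up replaces a coordinate by its parent in the hierarchy and re-aggregates, restricting a dimension keeps only the selected values, and pivoting reorders the dimensions.

```python
# Hypothetical hierarchy: store -> city.
STORE_TO_CITY = {"s1": "Eindhoven", "s2": "Eindhoven", "s3": "Amsterdam"}

# Hypothetical cube at the finest granularity: (store, product) -> sales.
cube = {("s1", "p1"): 5, ("s2", "p1"): 3, ("s3", "p1"): 7, ("s1", "p2"): 2}


def roll_up(cube, axis, parent_of):
    """Replace the coordinate on `axis` by its parent and sum the merged cells."""
    out = {}
    for key, value in cube.items():
        new_key = key[:axis] + (parent_of[key[axis]],) + key[axis + 1:]
        out[new_key] = out.get(new_key, 0) + value
    return out


def restrict(cube, axis, allowed):
    """Keep only the cells whose coordinate on `axis` is in `allowed`."""
    return {key: value for key, value in cube.items() if key[axis] in allowed}


def pivot(cube, order):
    """Reorder the dimensions of every key, e.g. order=(1, 0) swaps two axes."""
    return {tuple(key[i] for i in order): value for key, value in cube.items()}


by_city = roll_up(cube, 0, STORE_TO_CITY)       # store -> city
only_p1 = restrict(by_city, 1, {"p1"})          # restrict to product p1
product_first = pivot(by_city, (1, 0))          # show product before city
```

Drill-down is not a function of the rolled-up cube alone: going from `by_city` back to stores requires that the finer-grained `cube` is still available.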
Slice & dice
  Select a part of the cube by restricting one or more dimensions
  - restrict analysis to city = Eindhoven

Summary of Concepts
  Cube: multidimensional view on data
  - dimensional attributes
  - measure attribute
  Operations
  - roll-up / drill-down
  - pivoting
  - slice and dice

Implementation
  To make query answering more efficient: consolidate (materialize) aggregations.
  Obvious implementation: multidimensional array.
  Fast lookup of cell(prod. p, date d, prom. pr): look up the index of p, the index of d, and the index of pr, then compute
      index = (p x D x PR) + (d x PR) + pr
  where D is the number of dates and PR the number of promotions.

Implementation
  Multidimensional array: obvious problem is sparse data.
  This can easily be solved, though. Examples:
  - binary search tree, keyed on the index
  - hash table

Implementation
  However, people were very quickly confronted with the Data Explosion Problem:
  consolidating the summaries blows up the data enormously!
  The reasons are often misunderstood.

Why?
  Suppose n dimensions, and every dimension has d values:
  - d^n possible tuples
  - number of cells in the cube: (d+1)^n
  So, this is not the problem.
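The array lookup formula and the hash-table remedy for sparse data from the Implementation slides can be sketched as follows. This is a minimal illustration, assuming that p, d, and pr are already zero-based integer indexes; the class name `SparseCube` and its methods are hypothetical.

```python
def linear_index(p, d, pr, D, PR):
    """Position of cell (p, d, pr) in a flattened array with D dates, PR promotions."""
    return (p * D * PR) + (d * PR) + pr


class SparseCube:
    """Stores only non-empty cells in a hash table keyed on the linear index."""

    def __init__(self, P, D, PR):
        self.shape = (P, D, PR)
        self.cells = {}

    def _index(self, p, d, pr):
        P, D, PR = self.shape
        if not (0 <= p < P and 0 <= d < D and 0 <= pr < PR):
            raise IndexError("cell coordinates out of range")
        return linear_index(p, d, pr, D, PR)

    def add(self, p, d, pr, amount):
        i = self._index(p, d, pr)
        self.cells[i] = self.cells.get(i, 0) + amount

    def get(self, p, d, pr):
        # Empty cells are not stored and read as 0.
        return self.cells.get(self._index(p, d, pr), 0)


sparse = SparseCube(P=1000, D=365, PR=10)
sparse.add(1, 2, 3, 9.5)
sparse.add(1, 2, 3, 0.5)
```

The dense array for this shape would need 3,650,000 slots, whereas the dictionary holds one entry per cell that actually occurs.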
Why?
  Suppose n dimensions, every dimension has d values, and every dimension has a hierarchy
  - most extreme case: binary tree, giving 2d possibilities per dimension
  - 2^n x d^n cells
  Only a partial explanation (the factor 2^n comes from an extremely pathological case)

Why?
  The problem is that most data is not dense, but sparse.
  Hence, not all d^n combinations occur.
  Example: 10 dimensions with 10 values each
  - 10 000 000 000 possibilities
  - suppose «only» 1 000 000 are present
  - every tuple increases the count of 2^10 cells!
  With hierarchies the effect is even worse!
  - if every hierarchy has 5 items: 5^10 = 9 765 625 cells!

View Selection Problem
  It suffices to precompute some aggregates and compute others on demand.
  - e.g., compute the aggregate on (item-name, color) from an aggregate on (item-name, color, size)
  - possible for all but a few non-decomposable aggregates such as median
  Several optimizations exist for computing multiple aggregates:
  - compute the aggregate on (item-name, color) from an aggregate on (item-name, color, size)
  - compute the aggregates on (item-name, color, size), (item-name, color), and (item-name) in a single DB sort

View Selection Problem
  Lattice of candidate views:
  0-D:  all
  1-D:  product | date | country
  2-D:  product, date | product, country | date, country
  3-D:  product, date, country
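Two points from the slides above lend themselves to a small experiment: each base tuple contributes to 2^n cube cells (without hierarchies), and a decomposable aggregate on fewer dimensions can be derived from a finer one. The sketch below is illustrative only; the tuples and the names `materialize_counts` and `drop_dimension` are hypothetical.

```python
from itertools import product

ALL = "all"


def generalizations(t):
    """Every cell a tuple contributes to: each coordinate kept or replaced by 'all'."""
    return product(*[(value, ALL) for value in t])


def materialize_counts(tuples):
    """Count, for every cube cell, how many base tuples fall into it."""
    cells = {}
    for t in tuples:
        for cell in generalizations(t):
            cells[cell] = cells.get(cell, 0) + 1
    return cells


# Two base tuples over n = 3 dimensions that share no coordinate value.
cells = materialize_counts([("p1", "jan", "nl"), ("p2", "feb", "be")])
# Each tuple touches 2**3 = 8 cells; only (all, all, all) is shared.


def drop_dimension(aggregate, axis):
    """Derive a coarser SUM/COUNT aggregate by summing out one dimension."""
    out = {}
    for key, total in aggregate.items():
        coarse = key[:axis] + key[axis + 1:]
        out[coarse] = out.get(coarse, 0) + total
    return out


# Aggregate on (item-name, color, size) ...
by_name_color_size = {
    ("shirt", "blue", "S"): 3,
    ("shirt", "blue", "M"): 4,
    ("shirt", "red", "S"): 1,
}
# ... yields the aggregate on (item-name, color) without rereading base data.
by_name_color = drop_dimension(by_name_color_size, 2)
```

The derivation in `drop_dimension` is valid for sums and counts because partial results can be added; it would not work for a median, which is the non-decomposable case the slide mentions.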
View Selection Problem
  0-D:  all
  1-D:  product | date | country
  2-D:  product, date | product, country | date, country
  3-D:  product, date, country
  Which views to select: hard research problem!

Implementation
  Nowadays systems can be divided into three categories:
  - ROLAP (Relational OLAP): OLAP supported on top of a relational database
  - MOLAP (Multi-Dimensional OLAP): use of special multi-dimensional data structures
  - HOLAP (Hybrid OLAP): combination of the previous two

ROLAP
  Cubes can easily be represented in relational tables, using the special value "all":

  Month   Prod.   Cust.       Price
  Jan     p1      c1             10
  Jan     p2      c1              8
  Jan     p1      c2             10
  Feb     p1      c1              9
  all     p1      c1            102
  Jan     all     c1             18
  Jan     p1      all         1 230
  all     all     c1          4 235
  all     all     all     1 253 458

ROLAP
  Typical database scheme: star schema
  - fact table is central
  - links to dimensional tables
  Extensions:
  - snowflake schema: dimensions have hierarchy/extra information attached
  - star constellation: multiple star schemas sharing dimensions

Example of a Star Schema
  [Figure: star schema with a central Fact Table (OrderNO, SalespersonID, CustomerNO, ProdNo, ..., Quantity, Total Price) linked to dimension tables, including
   Order; Customer (Customer No, Customer Name, Customer Address); Salesperson (SalespersonID, SalespersonName, Quota);
   Product (ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice)]

Example of a Snowflake Schema
  [Figure: the same fact table and dimension tables (Order, Customer, Salesperson, Product), with further tables attached to the dimensions:
   Category (CategoryName, CategoryDescr); Month (Month, Year); Year (Year); State (StateName, Country)]
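The ROLAP representation with the special value "all" can be generated from a base table by running one GROUP BY per cuboid. The sketch below does this with Python's built-in sqlite3 module, using only the four base rows shown on the ROLAP slide; the table name `sales` and the helper `cuboid_query` are assumptions made for illustration. It uses plain GROUP BY queries rather than any OLAP-specific SQL extension.

```python
import sqlite3
from itertools import combinations

DIMS = ("month", "prod", "cust")

# The four base rows shown on the ROLAP slide.
ROWS = [
    ("Jan", "p1", "c1", 10),
    ("Jan", "p2", "c1", 8),
    ("Jan", "p1", "c2", 10),
    ("Feb", "p1", "c1", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, prod TEXT, cust TEXT, price INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", ROWS)


def cuboid_query(kept):
    """Build the GROUP BY query for one cuboid; dropped dimensions become 'all'.

    Only the fixed column names in DIMS are interpolated into the SQL text.
    """
    select = ", ".join(d if d in kept else "'all'" for d in DIMS)
    group_by = f" GROUP BY {', '.join(kept)}" if kept else ""
    return f"SELECT {select}, SUM(price) FROM sales{group_by}"


# Materialize all 2**3 cuboids into one mapping: (month, prod, cust) -> total.
cube = {}
for k in range(len(DIMS) + 1):
    for kept in combinations(DIMS, k):
        for *dims, total in conn.execute(cuboid_query(kept)):
            cube[tuple(dims)] = total
```

From these four rows the cell (Jan, all, c1) sums to 18, matching the slide. The other "all" rows on the slide show larger totals, presumably because they aggregate rows that are not listed there.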
Example of Fact Constellation
  Multiple fact tables share dimension tables.
  Dimension tables
  - Time: time_key, day, day_of_the_week, month, quarter, year
  - Item: item_key, item_name, brand, type, supplier_key
  - Branch: branch_key, branch_name, branch_type
  - Location: location_key, street, city, Province/street, country
  - shipper: shipper_key, shipper_name, location_key, shipper_type
  Fact tables
  - Sales Fact Table: Time_key, Item_key, Branch_key, Location_key; measures Unit_sold, Euros_sold, Avg_sales
  - Shipping Fact Table: Time_key, Item_key, shipper_key, from_location, to_location, Euros_sold, unit_shipped

SQL:1999 support for OLAP
  See other set of slides.
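As a rough sketch of the fact constellation above, the following uses Python's built-in sqlite3 module with a heavily simplified schema: a single shared Item dimension and two fact tables that reference it. The sample rows and the helper `units_per_brand` are hypothetical, and most columns from the slide are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Shared dimension table.
    CREATE TABLE item (
        item_key  INTEGER PRIMARY KEY,
        item_name TEXT,
        brand     TEXT
    );
    -- Two fact tables referencing the same dimension.
    CREATE TABLE sales_fact (
        time_key   INTEGER,
        item_key   INTEGER REFERENCES item(item_key),
        unit_sold  INTEGER,
        euros_sold REAL
    );
    CREATE TABLE shipping_fact (
        time_key     INTEGER,
        item_key     INTEGER REFERENCES item(item_key),
        unit_shipped INTEGER
    );
    INSERT INTO item VALUES (1, 'shirt', 'A'), (2, 'jeans', 'B');
    INSERT INTO sales_fact VALUES (1, 1, 3, 30.0), (2, 1, 2, 20.0), (1, 2, 1, 50.0);
    INSERT INTO shipping_fact VALUES (1, 1, 10), (2, 2, 4);
""")

# Fixed whitelist of (fact table, measure) pairs that may appear in the SQL text.
MEASURES = {("sales_fact", "unit_sold"), ("shipping_fact", "unit_shipped")}


def units_per_brand(fact_table, measure):
    """Join one fact table with the shared item dimension and sum per brand."""
    if (fact_table, measure) not in MEASURES:
        raise ValueError("unknown fact table or measure")
    sql = (
        f"SELECT i.brand, SUM(f.{measure}) "
        f"FROM {fact_table} AS f JOIN item AS i ON i.item_key = f.item_key "
        "GROUP BY i.brand ORDER BY i.brand"
    )
    return dict(conn.execute(sql).fetchall())


sold = units_per_brand("sales_fact", "unit_sold")
shipped = units_per_brand("shipping_fact", "unit_shipped")
```

Because both fact tables join to the same `item` table, results such as units sold and units shipped per brand are directly comparable, which is the point of sharing dimensions.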