15-721 DB Sys. Design & Impl.
Data Cubes
Christos Faloutsos
www.cs.cmu.edu/~christos

Roadmap
1) Roots: System R and Ingres
2) Implementation: buffering, indexing, q-opt
3) Transactions: locking, recovery
4) Distributed DBMSs
5) Parallel DBMSs: Gamma, Alphasort
6) OO/OR DBMS
7) Data Analysis: data mining; data cubes; association rules
8) Benchmarks
9) Vision statements
Extras (streams/sensors, graphs, multimedia, web, fractals)

15-721 C. Faloutsos

Detailed Outline
- Problem
- Getting the data: Data Warehouses, OLAP
- Supervised learning: decision trees
- Unsupervised learning: association rules (clustering)

Citation
Gray et al.: "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals." Data Mining and Knowledge Discovery 1(1): 29-53 (1997)

Problem
Given: multiple data sources (PGH, NY, SF)
  sales(pid, cid, date, $price)
  customers(cid, age, income, ...)
Find: patterns (classifiers, rules, clusters, outliers, ...)

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
- How?
- How often?
- How about discrepancies / non-homogeneities?
Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
- How? A: triggers / materialized views
- How often? A: [art!]
- How about discrepancies / non-homogeneities? A: wrappers / mediators

Data Warehousing
Step 2: collect counts (OLAP). E.g.:

sales:
| cid | pid   | Size | Color | $  |
| C10 | Shirt | L    | Blue  | 30 |
| C10 | Pants | XL   | Red   | 50 |
| C20 | Shirt | XL   | White | 20 |
| ... |       |      |       |    |

OLAP
Problem: is it true that shirts in large sizes sell better in dark colors?

size, color: DIMENSIONS
count: MEASURE
(figure: the sales table pivoted into a size-by-color spreadsheet of counts)
DataCube
(figure: the size-by-color count spreadsheet, extended with row, column, and grand totals)

SQL query to generate the DataCube — naively (and painfully):

select size, color, count(*)
from sales
where pid = 'shirt'
group by size, color

select size, count(*)
from sales
where pid = 'shirt'
group by size
...

SQL query to generate the DataCube — with the 'cube by' keyword:

select size, color, count(*)
from sales
where pid = 'shirt'
cube by size, color

DataCube issues:
- Q1: How to store them (and/or materialize portions on demand)?
- Q2: How to index them?
- Q3: Which operations to allow?
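The 'cube by' query above amounts to running one group-by per subset of the dimensions (2^d group-bys in total). A minimal sketch in pure Python, with made-up shirt rows and an 'ALL' placeholder for aggregated-away dimensions:

```python
from collections import Counter
from itertools import combinations

# Toy sales rows for pid = 'shirt': (size, color) -- hypothetical data.
rows = [("L", "Blue"), ("XL", "Red"), ("XL", "Blue"), ("L", "Blue")]
dims = ("size", "color")

def datacube(rows, dims):
    """Count-measure datacube: one group-by per subset of the dimensions.

    Dimensions left out of a grouping are reported as 'ALL', mimicking
    what the CUBE operator adds on top of the plain group-by.
    """
    cube = Counter()
    d = len(dims)
    for r in rows:
        # Each row contributes to every combination of kept dimensions.
        for k in range(d + 1):
            for kept in combinations(range(d), k):
                key = tuple(r[i] if i in kept else "ALL" for i in range(d))
                cube[key] += 1
    return cube

cube = datacube(rows, dims)
print(cube[("ALL", "ALL")])   # grand total: 4
print(cube[("L", "ALL")])     # roll-up over color: 2
print(cube[("L", "Blue")])    # finest granularity: 2
```

The naive SQL above issues the 2^d queries one by one; a real CUBE operator shares work across them in a single pass.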
DataCube issues:
- Q1: How to store them (and/or materialize portions on demand)? A: ROLAP / MOLAP
- Q2: How to index them? A: bitmap indices
- Q3: Which operations to allow? A: roll-up, drill-down, slice, dice
[More details: book by Han & Kamber]

Q1: How to store a datacube?
A1: Relational (ROLAP), e.g.:
| Color | Size | count |
| all   | all  | 47    |
| Blue  | all  | 4     |
| Blue  | M    | 3     |
A2: Multidimensional (MOLAP)
A3: Hybrid (HOLAP)

Pros/Cons — ROLAP strong points (DSS, Metacube):
- uses existing RDBMS technology
- scales up better with dimensionality
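The ROLAP/MOLAP contrast above can be shown in a few lines: the same cube stored as a relation with an 'all' placeholder, versus as a dense multidimensional array (counts taken from the slide's toy table; everything else is illustrative):

```python
# ROLAP: the cube is a plain relation; 'all' is a reserved dimension value.
rolap = [
    ("all", "all", 47),
    ("Blue", "all", 4),
    ("Blue", "M", 3),
]

# MOLAP: the same cube as a dense array, with one extra slot per
# dimension holding that dimension's 'all' aggregate.
colors = ["Blue", "Red", "all"]
sizes = ["M", "L", "all"]
molap = [[0] * len(sizes) for _ in colors]
for color, size, count in rolap:
    molap[colors.index(color)][sizes.index(size)] = count

# Point lookup is an array access in MOLAP, a selection in ROLAP:
print(molap[colors.index("Blue")][sizes.index("all")])   # -> 4
```

The dense array is fast to probe but wastes space when the cube is sparse or high-dimensional — exactly the trade-off the slides flag for MOLAP.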
Pros/Cons — MOLAP strong points (EssBase / hyperion.com):
- faster indexing
(careful with: high dimensionality; sparseness)

HOLAP (MS SQL Server OLAP services):
- detail data in ROLAP; summaries in MOLAP

[Q1: how to store a datacube — done. Q3: how to index a datacube — coming up.]

Operations:
- Roll-up (figure)
- Drill-down (figure)
- Slice (figure)
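The cube operations — roll-up, drill-down, slice, dice — can be sketched over a dict-based cube in pure Python (toy counts; the function names are mine, not a standard API):

```python
from collections import Counter

# Cube at the finest granularity: (color, size) -> count. Made-up numbers.
base = {("Blue", "M"): 3, ("Blue", "L"): 1, ("Red", "M"): 2, ("Red", "L"): 4}

def rollup(cube, dim):
    """Aggregate away one dimension (0 = color, 1 = size)."""
    out = Counter()
    for key, cnt in cube.items():
        k = list(key)
        k[dim] = "all"
        out[tuple(k)] += cnt
    return dict(out)

def slice_(cube, dim, value):
    """Fix one dimension to a single value: a 2-d cube becomes a 1-d slice."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice(cube, dim, values):
    """Keep a sub-range of values along one dimension."""
    return {k: v for k, v in cube.items() if k[dim] in values}

by_color = rollup(base, dim=1)   # drill-down is the inverse: back to `base`
print(by_color[("Blue", "all")])             # -> 4
print(slice_(base, 0, "Red"))                # only the Red rows
print(dice(base, 1, {"M"}))                  # only the M sizes
```

Drill-down has no code of its own here: it just means navigating from the rolled-up view back to the finer-grained cube it was computed from.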
- Dice (figure)

(figures: roll-up, drill-down, slice, dice on the size-by-color cube)

Q3: How to index a datacube?
A1: bitmaps — one bit-vector per attribute value (e.g., S, M, L for Size; Red, Blue, Gray for Color)
A2: join indices (see [Han & Kamber])
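The bitmap idea above is small enough to run: one bit-vector per distinct value of a column, with queries turning into bitwise ANDs/ORs. A sketch with hypothetical columns:

```python
# Bitmap index: for each distinct value of a column, a bit-vector where
# bit i is set iff row i has that value. Python ints serve as bit-vectors.
sizes = ["L", "XL", "XL", "M", "L"]            # toy Size column
colors = ["Blue", "Red", "White", "Blue", "Red"]  # toy Color column

def bitmap(column):
    index = {}
    for i, v in enumerate(column):
        index[v] = index.get(v, 0) | (1 << i)   # set bit i for value v
    return index

size_idx = bitmap(sizes)
color_idx = bitmap(colors)

# "size = L AND color = Blue" is a single bitwise AND:
hits = size_idx["L"] & color_idx["Blue"]
print([i for i in range(len(sizes)) if hits >> i & 1])   # -> [0]
```

This is why bitmaps suit OLAP: low-cardinality dimension columns give short dictionaries of vectors, and multi-predicate slices reduce to cheap bit operations.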
D/W + OLAP Conclusions
- D/W: copy (summarized) data; analyze
- OLAP concepts:
  - DataCube
  - R/M/HOLAP servers
  - dimensions; measures

Outline
- Problem
- Getting the data: Data Warehouses, OLAP
- Supervised learning: decision trees
- Unsupervised learning: association rules (clustering)

Decision trees
Problem:
| Age | Chol-level | Gender | CLASS-ID |
| 30  | 50         | M      | ??       |

Decision trees
Pictorially, we have points in the plane of num. attr. #1 (e.g., age) vs num. attr. #2 (e.g., chol-level) — and we want to label the point marked '?'.
So we build a decision tree: (figure: the plane split at age = 50, then at chol-level = 40)
Decision trees
So we build a decision tree:
  age < 50?
    Y: chol < 40? (Y: ... / N: ...)
    N: ...

Outline
- Problem
- Getting the data: Data Warehouses, OLAP
- Supervised learning: decision trees (problem; approach; scalability enhancements)
- Unsupervised learning: association rules (clustering)

Decision trees
Typically, two steps:
- tree building
- tree pruning (against over-training / overfitting)

How?
A: Partition, recursively. Pseudocode:

Partition(Dataset S):
  if all points in S have the same label, then return
  evaluate splits along each attribute A
  pick the best split, to divide S into S1 and S2
  Partition(S1); Partition(S2)

Q1: how to introduce splits along attribute Ai?
Q2: how to evaluate a split?
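The Partition pseudocode above becomes runnable once a concrete split-evaluation rule is plugged in. A minimal sketch using the gini index on made-up (age, chol-level) points — assumes clean numeric data and at least one valid split at every impure node:

```python
def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 over the class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(points, labels):
    """Try every binary split 'attr < threshold'; keep the one that
    minimizes the size-weighted impurity of the two halves."""
    best = None
    for attr in range(len(points[0])):
        for t in sorted({p[attr] for p in points}):
            left = [l for p, l in zip(points, labels) if p[attr] < t]
            right = [l for p, l in zip(points, labels) if p[attr] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, attr, t)
    return best

def partition(points, labels):
    """Recursive tree building, following the Partition pseudocode."""
    if len(set(labels)) == 1:                  # pure node: stop
        return ("leaf", labels[0])
    _, attr, t = best_split(points, labels)    # pick best split
    left = [(p, l) for p, l in zip(points, labels) if p[attr] < t]
    right = [(p, l) for p, l in zip(points, labels) if p[attr] >= t]
    return ("split", attr, t,
            partition(*zip(*left)), partition(*zip(*right)))

# Toy data: (age, chol-level) -> class. Hypothetical values.
pts = [(30, 120), (35, 130), (60, 250), (65, 260)]
lbl = ["healthy", "healthy", "sick", "sick"]
tree = partition(pts, lbl)
print(tree)   # -> ('split', 0, 60, ('leaf', 'healthy'), ('leaf', 'sick'))
```

Real classifiers (SLIQ, SPRINT, discussed later) add pruning and out-of-core data layouts on top of this skeleton.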
Q1: how to introduce splits along attribute Ai?
A:
- for numerical attributes: binary split, or multiple split
- for categorical attributes: compute all subsets (expensive!), or use a greedy algorithm

Q2: how to evaluate a split?
A: by how close to uniform each subset is — i.e., we need a measure of uniformity:
- entropy: H(p1, p2) (zero for a pure subset; maximal at p1 = p2 = 0.5)
Any other measure?
- gini index: 1 - p1^2 - p2^2 (same shape as entropy)
(How about multiple labels?)
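Both measures above, including their multi-label generalization (sum over all classes), fit in a few lines:

```python
from math import log2

def entropy(ps):
    """H = -sum_i p_i log2(p_i): 0 for a pure subset, maximal when uniform."""
    return -sum(p * log2(p) for p in ps if 0 < p < 1)

def gini(ps):
    """Gini index = 1 - sum_i p_i^2: same shape as entropy, cheaper to compute."""
    return 1.0 - sum(p * p for p in ps)

print(entropy([0.5, 0.5]))    # -> 1.0  (most impure two-class case)
print(entropy([1.0, 0.0]))    # -> 0    (pure)
print(gini([0.5, 0.5]))       # -> 0.5
# Multiple labels: just pass more class proportions.
print(round(entropy([1/3, 1/3, 1/3]), 3))   # -> 1.585 (= log2 3)
```

Either measure works in the split-evaluation step; entropy counts bits, gini approximates the error of a random guess, as the next slide's intuition says.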
Intuition:
- entropy: #bits to encode the class label
- gini: classification error, if we randomly guess with prob. p1

Thus, we choose the split that reduces entropy / classification error the most. E.g.:
- before the split, we need (n1 + n2) * H(p1, p2) = (7 + 6) * H(7/13, 6/13) bits in total, to encode all the class labels
- after the split, we need 0 bits for the first half, and (2 + 6) * H(2/8, 6/8) bits for the second half

Tree pruning
What for? (figure: splits in the num. attr. #1 (age) vs num. attr. #2 (chol-level) plane)

Tree pruning
Shortcut for scalability — DYNAMIC pruning: stop expanding the tree if a node is reasonably homogeneous
- ad-hoc threshold [Agrawal, vldb92]
- Minimum Description Length (MDL) criterion (SLIQ) [Mehta, edbt96]

Tree pruning
Q: How to do it?
A1: use a training and a testing set — prune nodes that improve classification on the testing set. (Drawbacks?)
A2: or, rely on MDL (= Minimum Description Length) — in detail:
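The bit-counting example above (13 labels before the split; a pure 5-label half and a 2+6 half after it — my reconstruction of the slide's garbled counts) works out numerically as follows:

```python
from math import log2

def H(p1, p2):
    """Binary entropy, in bits per label."""
    return -sum(p * log2(p) for p in (p1, p2) if 0 < p < 1)

# Before the split: 13 labels, 7 of one class and 6 of the other.
before = 13 * H(7 / 13, 6 / 13)

# After the split: the first half (5 labels) is pure -> 0 bits;
# the second half has 2 + 6 = 8 labels.
after = 0 + 8 * H(2 / 8, 6 / 8)

print(round(before, 2), round(after, 2))   # -> 12.94 6.49
```

The split is worth taking because it cuts the encoding cost roughly in half; the best split is the one maximizing exactly this reduction.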
Tree pruning
Envision the problem as compression (of what?) — and try to minimize the number of bits to compress (a) the class labels AND (b) the representation of the decision tree.

(MDL) A brilliant idea. E.g., the best n-degree polynomial to compress these points: the one that minimizes (sum of errors + n).

Outline
- Problem
- Getting the data: Data Warehouses, OLAP
- Supervised learning: decision trees (problem; approach; scalability enhancements)
- Unsupervised learning: association rules (clustering)

Scalability enhancements
- Interval Classifier [Agrawal, vldb92]: dynamic pruning
- SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but the label column has to fit in core)
- SPRINT: even more clever partitioning

Conclusions for classifiers
- classification through trees
- building phase: splitting policies
- pruning phase (to avoid overfitting)
- for scalability: dynamic pruning; clever data partitioning
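The MDL trade-off above — pay bits for the errors plus bits for the model — can be illustrated with a toy model-selection loop (all numbers and the coding costs are made up; only the shape of the trade-off matters):

```python
from math import log2

def description_length(sum_sq_error, degree, bits_per_coeff=16):
    """Hypothetical MDL score: bits for residual errors + bits for the model."""
    error_bits = 8 * log2(1 + sum_sq_error)      # crude stand-in for error cost
    model_bits = (degree + 1) * bits_per_coeff   # one coefficient per term
    return error_bits + model_bits

# Residual error of candidate polynomial fits: higher degree fits the
# points more tightly but costs more bits to describe (invented numbers).
errors = {0: 900.0, 1: 40.0, 2: 2.0, 7: 0.0}
best = min(errors, key=lambda d: description_length(errors[d], d))
print(best)   # -> 2
```

Degree 7 reproduces the points exactly but its coefficients cost more than the few error bits they save; that is the same argument MDL pruning uses against an over-grown decision tree.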
Overall Conclusions
- Data mining: of high commercial interest; DM = DB + ML + Stat
- Data warehousing / OLAP: to get the data
- Tree classifiers (SLIQ, SPRINT)
- Association rules: the Apriori algorithm
- (Clustering: BIRCH, CURE, OPTICS)

Reading material
- Agrawal, R., T. Imielinski, A. Swami: "Mining Association Rules between Sets of Items in Large Databases," SIGMOD.
- M. Mehta, R. Agrawal and J. Rissanen: "SLIQ: A Fast Scalable Classifier for Data Mining," Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.

Additional references
- Agrawal, R., S. Ghosh, et al. (Aug. 23-27, 1992): "An Interval Classifier for Database Mining Applications," VLDB Conf. Proc., Vancouver, BC, Canada.
- Jiawei Han and Micheline Kamber: Data Mining, Morgan Kaufmann, 2001, chapters 2.2-2.3, 6.1-6.2, 7.3.5.