North Carolina State University
Graduate Program in Operations Research

Benchmark Databases for Testing Big-Data Analytics in Cloud Environments

Rong Huang, Rada Chirkova, Yahya Fathi
ICA CON 2012, April 20, 2012
Background
One major advantage of using computing clouds lies in their applicability to large-scale data warehousing and analytics.
Computing clouds can host very large amounts of data and provide efficient parallelized processing of complex analytics queries on the data.
Enterprise data-cloud solutions for large-scale data warehousing and analytics are highly desirable.
Our goal is to provide synthetically generated benchmark databases for testing the performance and other processing aspects of database systems in a computing-cloud environment.
Relational Storage of Data
Pos: itemid, storeid, date, amount
Items: itemid, name, category
Query Processing
Q: Give me recent total sales for all products in the Bay Area
Q: Give me total sales by store ID for all appliances
Query Processing
SELECT storeid, SUM(amount)
FROM pos P, items I
WHERE P.itemID = I.itemID AND category = 'appliances'
GROUP BY storeid;

storeid   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
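The query on this slide can be tried out end to end with a small script. The following is a minimal sketch using Python's built-in sqlite3 module. The column types and all inserted rows are assumptions invented for illustration (they are not the data behind the slide's result table), and an ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pos   (itemid INTEGER, storeid INTEGER, date TEXT, amount REAL);
    CREATE TABLE items (itemid INTEGER, name TEXT, category TEXT);
""")

# Hypothetical rows, invented for illustration only.
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    (1, "toaster", "appliances"),
    (2, "blender", "appliances"),
    (3, "jacket", "clothing"),
])
conn.executemany("INSERT INTO pos VALUES (?, ?, ?, ?)", [
    (1, 100, "2012-04-01", 20.0),
    (2, 100, "2012-04-02", 35.5),
    (3, 100, "2012-04-02", 80.0),   # clothing sale, filtered out by the query
    (1, 200, "2012-04-03", 20.0),
])

# Same join, filter, and grouping as on the slide, plus ORDER BY.
query = """
    SELECT storeid, SUM(amount)
    FROM pos P, items I
    WHERE P.itemid = I.itemid AND category = 'appliances'
    GROUP BY storeid
    ORDER BY storeid;
"""
totals = conn.execute(query).fetchall()
print(totals)  # [(100, 55.5), (200, 20.0)]
```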
Queries and Views
V: total sales by store ID and by item category

storeid   category      SUM(amount)
13357     appliances    $27,142.98
13357     clothing      $45,135.24
13357     electronics   $50,245.64
28690     appliances    $54,124.14
28690     clothing      $60,938.21
28690     electronics   $82,623.64
11561     appliances    $41,225.26

Q: Give me total sales by store ID for all appliances

storeid   SUM(amount)
13357     $27,142.98
28690     $54,124.14
11561     $41,225.26
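The two tables on this slide suggest that Q can be answered from the stored view V alone, without going back to the base tables. The sketch below illustrates that idea in plain Python. The rows of V are the ones visible on the slide; the function name answer_q_from_view is a hypothetical helper introduced here.

```python
# Rows of view V as shown on the slide: (storeid, category, total amount in dollars).
V = [
    (13357, "appliances", 27142.98),
    (13357, "clothing", 45135.24),
    (13357, "electronics", 50245.64),
    (28690, "appliances", 54124.14),
    (28690, "clothing", 60938.21),
    (28690, "electronics", 82623.64),
    (11561, "appliances", 41225.26),
]

def answer_q_from_view(view_rows, category):
    """Total sales by store ID for one category, read directly off V.

    V is already grouped by (storeid, category), so a filter is enough;
    no further aggregation is needed.
    """
    return [(storeid, total) for storeid, cat, total in view_rows if cat == category]

q_result = answer_q_from_view(V, "appliances")
for storeid, total in q_result:
    print(storeid, f"${total:,.2f}")
```

The filter scans the 7 rows of V instead of joining and aggregating the full Pos and Items tables, which is the reason for materializing views in the first place.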
View lattice
Views with grouping and aggregation on a given relation.
Given a $K$-attribute dataset, the number of views is $2^K$.
Measure the size of each view by its number of rows.

Example (part of the lattice over attributes a, b, c, d; each view is labeled with its number of rows):
{a,b,c,d}: 25
{a,b,c}: 13   {a,b,d}: 20   {a,c,d}: 15   {b,c,d}: 16
{a,b}: 7   {a,c}: 12   {b,c}: 10   {b,d}: 8   {c,d}: 9
{b}: 4   {c}: 5
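To make the lattice concrete, the sketch below enumerates every subset of grouping attributes and measures each view by its number of distinct rows. The function view_sizes and the five-row relation are hypothetical and invented for illustration; they do not reproduce the sizes shown on the slide.

```python
from itertools import combinations

def view_sizes(rows, attributes):
    """Map each subset of grouping attributes to its number of distinct rows."""
    sizes = {}
    positions = range(len(attributes))
    for k in range(len(attributes) + 1):
        for idx in combinations(positions, k):
            group = tuple(attributes[i] for i in idx)
            # A grouping view has one row per distinct combination of values.
            sizes[group] = len({tuple(row[i] for i in idx) for row in rows})
    return sizes

# Hypothetical 4-attribute relation, invented for illustration.
attributes = ("a", "b", "c", "d")
rows = [(1, 1, 1, 1), (1, 1, 2, 1), (1, 2, 1, 2), (2, 1, 1, 1), (2, 2, 2, 2)]

sizes = view_sizes(rows, attributes)
print(len(sizes))  # 16 views for 4 attributes
for group, size in sizes.items():
    print("{" + ",".join(group) + "}", size)
```

The empty grouping is included in the count; it corresponds to the single-row view holding the grand total.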
TPC-H Datasets
The TPC-H synthetic database generator is widely recognized as a standard benchmark database generator for data analytics.
In our work, we have found that the TPC-H benchmark has potential shortcomings when used to test the quality of algorithms developed for efficient processing of complex analytics queries.
The TPC-H dataset does not distinguish between view sizes.
The Potential Shortcomings of TPC-H Datasets
The size of a great number of views is close to that of the largest view.

Total number    Number of views within 0.1% size       Total number    Ratio
of attributes   difference from the largest view       of views
7               52                                      128             40.60%
13              6,192                                   8,192           75.60%
15              27,318                                  32,768          83.40%
17              115,162                                 131,072         87.90%

We would always prefer to store the largest view.
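The last two columns of the table follow from the first two: the total number of views is 2 raised to the number of attributes, and the ratio is the count of near-largest views divided by that total. A short check of this arithmetic, using only the counts reported in the table (the variable names are illustrative):

```python
# (number of attributes, views within 0.1% of the largest view), from the table.
reported = [(7, 52), (13, 6192), (15, 27318), (17, 115162)]

summary = []
for num_attrs, near_largest in reported:
    total_views = 2 ** num_attrs          # size of the view lattice
    summary.append((num_attrs, total_views, near_largest / total_views))

for num_attrs, total_views, ratio in summary:
    print(f"{num_attrs:>2} attributes: {total_views:>7,} views, ratio {ratio:.1%}")
```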
Our Contribution
We define three types of synthetic datasets, which do not have the shortcomings that we have observed in the TPC-H data.
We introduce algorithms for generating all three types of datasets in any range of data sizes, which allows one to use the datasets in a variety of configurations and scales of cloud environments.
Our datasets are complementary to the TPC-H datasets in testing the processing performance of complex analytics queries in cloud environments.
The Symmetric Synthetic Datasets
$K$: total number of attributes in the dataset
$n$: number of values for each attribute
Number of rows: $n^K$

Example 1: $K = 3$, $n = 2$

Attributes:
A  B  C
0  2  4
1  3  5

Dataset:
A  B  C
0  2  4
0  2  5
0  3  4
0  3  5
1  2  4
1  2  5
1  3  4
1  3  5

Number of rows: $2^3 = 8$
Views in the Symmetric Synthetic Datasets
The size of each $k$-attribute view over the dataset is $n^k$.
The size of an ancestor is at least $n$ times the size of its descendant.

Example 1 (cont'd): $K = 3$, $n = 2$
{A,B,C}: 8
{A,B}: 4   {A,C}: 4   {B,C}: 4
{A}: 2   {B}: 2   {C}: 2
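A symmetric dataset of this kind is the full cross product of the attribute domains, which makes the view sizes easy to verify by enumeration. The sketch below is one way to generate it; the function names are hypothetical, and the value encoding (attribute i takes the values i*n through i*n + n - 1) is an assumption chosen so that the case of 3 attributes with 2 values each reproduces the values of Example 1.

```python
from itertools import combinations, product

def symmetric_dataset(K, n):
    """All combinations of K attributes with n distinct values each."""
    # Assumed encoding: attribute i takes the values i*n, ..., i*n + n - 1.
    domains = [range(i * n, (i + 1) * n) for i in range(K)]
    return list(product(*domains))

def projection_size(rows, idx):
    """Number of distinct rows when grouping by the attribute positions in idx."""
    return len({tuple(row[i] for i in idx) for row in rows})

K, n = 3, 2
rows = symmetric_dataset(K, n)

# Every k-attribute view should have exactly n**k rows.
sizes_ok = all(
    projection_size(rows, idx) == n ** k
    for k in range(1, K + 1)
    for idx in combinations(range(K), k)
)
print(len(rows), sizes_ok)  # 8 True
```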
Symmetric Synthetic Datasets
Symmetric properties of the datasets:
Significant size difference between each pair of ancestor-descendant views.
The datasets do not distinguish between the sizes of views with the same number of attributes.
Type I Non-Symmetric Synthetic Datasets
The number of values of each attribute differs: $n_1, n_2, \dots, n_K$

Example 2: $K = 3$; $n_1, n_2, n_3 = 2, 3, 4$

Attributes:
A  B  C
0  2  5
1  3  6
   4  7
      8

Number of rows: $2 \times 3 \times 4 = 24$

Dataset:
A  B  C
0  2  5
0  2  6
0  2  7
0  2  8
...
1  4  7
1  4  8
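Views in the Type I Non-Symmetric Synthetic Datasets
A $k$-attribute view on attributes $A_{i_1}, \dots, A_{i_k}$ with $n_{i_1}, \dots, n_{i_k}$ values.
The size (number of rows): $\prod_{j=1}^{k} n_{i_j}$

Example 2 (cont'd): $K = 3$; $n_1, n_2, n_3 = 2, 3, 4$
{A,B,C}: 24
{A,B}: 6   {A,C}: 8   {B,C}: 12
{A}: 2   {B}: 3   {C}: 4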
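The product rule for view sizes can be checked against a generated dataset. The following sketch builds the cross product of attribute domains of differing sizes and compares each view's row count with the product of the value counts of its attributes. The function name type1_dataset and the consecutive-integer value encoding are assumptions; the encoding is chosen so that the counts 2, 3, 4 reproduce the values of Example 2.

```python
from itertools import combinations, product
from math import prod

def type1_dataset(counts):
    """Cross product of attribute domains, where attribute i has counts[i] values."""
    domains, start = [], 0
    for n_i in counts:
        # Assumed encoding: domains are consecutive, non-overlapping integer ranges.
        domains.append(range(start, start + n_i))
        start += n_i
    return list(product(*domains))

counts = [2, 3, 4]
names = "ABC"
rows = type1_dataset(counts)

view_sizes = {}
for k in range(1, len(counts) + 1):
    for idx in combinations(range(len(counts)), k):
        observed = len({tuple(row[i] for i in idx) for row in rows})
        expected = prod(counts[i] for i in idx)   # product rule from the slide
        assert observed == expected
        view_sizes["".join(names[i] for i in idx)] = observed

print(view_sizes)
# {'A': 2, 'B': 3, 'C': 4, 'AB': 6, 'AC': 8, 'BC': 12, 'ABC': 24}
```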
Type I Non-Symmetric Synthetic Datasets
The type I non-symmetric synthetic dataset distinguishes between any pair of view sizes.
Relatively large difference in size between each pair of ancestor-descendant views.
The size of each view is at least twice the size of its descendant.
We would always prefer to store the answer of each query.
Type II Non-Symmetric Synthetic Datasets
Objectives: break the symmetric properties and reduce the size difference between adjacent ancestor-descendant pairs of views.
Conduct an elimination procedure over the rows in a given type I non-symmetric synthetic dataset.
For each attribute, we conduct a two-step sub-elimination process:
Step 1: Eliminate each row with a given probability.
Step 2: For each row r that is eliminated in step 1, also eliminate the rows in the master table that have the same values as r on all attributes except the current one.
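Type II Non-Symmetric Synthetic Datasets
Input: $K$; $n_1, n_2, \dots, n_K$; and a lower bound on the expected number of rows.
Choose the probabilities used in the elimination procedure, one per attribute, such that this bound is met.
Output: a type II non-symmetric synthetic dataset whose expected number of rows is greater than or equal to the given bound.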
Type II Non-Symmetric Synthetic Datasets
Example 3: Input $K = 3$; $n_1, n_2, n_3 = 2, 3, 4$; and 10. Choose 0.9.

Type I dataset for $K = 3$; 2, 3, 4 (24 rows):
A  B  C      A  B  C
0  2  5      1  2  5
0  2  6      1  2  6
0  2  7      1  2  7
0  2  8      1  2  8
0  3  5      1  3  5
0  3  6      1  3  6
0  3  7      1  3  7
0  3  8      1  3  8
0  4  5      1  4  5
0  4  6      1  4  6
0  4  7      1  4  7
0  4  8      1  4  8

Resulting type II dataset (10 rows):
A  B  C
0  2  6
0  2  8
0  3  6
0  3  8
0  4  5
0  4  8
1  2  6
1  2  7
1  3  6
1  3  7
Views in the Type II Non-Symmetric Synthetic Datasets
Example 3 (cont'd):
{A,B,C}: 10
{A,B}: 5   {A,C}: 5   {B,C}: 8
{A}: 2   {B}: 3   {C}: 4
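The view sizes of Example 3 can be recomputed directly from the 10 rows of the resulting type II dataset. The sketch below does this and also prints the size ratio between the top view and each two-attribute view, as a rough indication of how much closer adjacent views are than in the type I case. The helper names are hypothetical.

```python
from itertools import combinations

# The 10 rows of the type II dataset from Example 3, columns (A, B, C).
rows = [
    (0, 2, 6), (0, 2, 8), (0, 3, 6), (0, 3, 8), (0, 4, 5),
    (0, 4, 8), (1, 2, 6), (1, 2, 7), (1, 3, 6), (1, 3, 7),
]
names = "ABC"

def projection_size(rows, idx):
    """Number of distinct rows when grouping by the attribute positions in idx."""
    return len({tuple(row[i] for i in idx) for row in rows})

view_sizes = {
    "".join(names[i] for i in idx): projection_size(rows, idx)
    for k in range(1, len(names) + 1)
    for idx in combinations(range(len(names)), k)
}
print(view_sizes)
# {'A': 2, 'B': 3, 'C': 4, 'AB': 5, 'AC': 5, 'BC': 8, 'ABC': 10}

# Ratio of the top view to each of its two-attribute descendants.
for child in ("AB", "AC", "BC"):
    print(child, view_sizes["ABC"] / view_sizes[child])
```

For comparison, the corresponding ratios in the type I dataset of Example 2 are 24/6 = 4, 24/8 = 3, and 24/12 = 2.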
Experimental Results
[Figure: a performance measure (vertical axis, 0% to 600%) plotted against β (horizontal axis, 0 to 1) for the Type I non-symmetric dataset and the TPC-H dataset.]
Conclusion
We defined a symmetric synthetic dataset and two types of non-symmetric synthetic datasets.
We studied shortcomings of the TPC-H datasets in testing algorithms devised for improving query-processing performance for complex queries posed on large-scale data.
We compared the TPC-H datasets experimentally with our proposed synthetic datasets in a setting for testing such algorithms.
All the synthetic datasets that we proposed are beneficial for testing algorithms devised for improving query-processing performance in cloud computing.
Thank You!