Benchmark Databases for Testing Big-Data Analytics In Cloud Environments




North Carolina State University, Graduate Program in Operations Research
Benchmark Databases for Testing Big-Data Analytics in Cloud Environments
Rong Huang, Rada Chirkova, Yahya Fathi
ICA CON 2012, April 20, 2012

Background
One major advantage of using computing clouds lies in their applicability to large-scale data warehousing and analytics. Computing clouds can host very large amounts of data and provide efficient parallelized processing of complex analytics queries on the data. Enterprise data-cloud solutions for large-scale data warehousing and analytics are highly desirable.
Our goal is to provide synthetically generated benchmark databases for testing the performance and other processing aspects of database systems in a computing-cloud environment.

Relational Storage of Data
Pos: itemid, storeid, date, amount
Items: itemid, name, category

Query Processing
Q: Give me total sales by store ID for all appliances
   Give me recent total sales for all products in the Bay Area

Query Processing
SELECT storeid, SUM(amount)
FROM pos P, items I
WHERE P.itemID = I.itemID AND category = 'appliances'
GROUP BY storeid;

storeid  SUM(amount)
13357    $27,142.98
28690    $54,124.14
11561    $41,225.26

Queries and Views
V: total sales by store ID and by item category
Q: Give me total sales by store ID for all appliances

Answer to Q:
storeid  SUM(amount)
13357    $27,142.98
28690    $54,124.14
11561    $41,225.26

View V:
storeid  category     SUM(amount)
13357    appliances   $27,142.98
13357    clothing     $45,135.24
13357    electronics  $50,245.64
28690    appliances   $54,124.14
28690    clothing     $60,938.21
28690    electronics  $82,623.64
11561    appliances   $41,225.26
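As a minimal sketch of how Q can be answered from the view V instead of from the base tables, the Python snippet below loads the rows of V listed on the slide into an in-memory SQLite table and filters on the category. The table name `v`, the column name `total_cents`, and the choice to store amounts as integer cents are assumptions made for illustration; they do not come from the presentation.

```python
import sqlite3

# Rows of the view V as listed on the slide, with amounts converted to cents
# so that sums stay exact. Table and column names are illustrative assumptions.
VIEW_ROWS = [
    (13357, "appliances", 2714298),
    (13357, "clothing", 4513524),
    (13357, "electronics", 5024564),
    (28690, "appliances", 5412414),
    (28690, "clothing", 6093821),
    (28690, "electronics", 8262364),
    (11561, "appliances", 4122526),
]


def answer_q_from_view(view_rows):
    """Answer Q (total sales by store ID for all appliances) using only V."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE v (storeid INTEGER, category TEXT, total_cents INTEGER)"
    )
    conn.executemany("INSERT INTO v VALUES (?, ?, ?)", view_rows)
    rows = conn.execute(
        "SELECT storeid, SUM(total_cents) FROM v "
        "WHERE category = 'appliances' GROUP BY storeid ORDER BY storeid"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for storeid, cents in answer_q_from_view(VIEW_ROWS):
        print(storeid, f"${cents / 100:,.2f}")
```

Because V is already grouped by store ID and category, the query touches only the rows of V and never needs the join between Pos and Items.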

View lattice
Views with grouping and aggregation on a given relation.
Given a $K$-attribute dataset, the number of views is $2^K$.
Measure the size of each view by its number of rows.

{a,b,c,d}: 25
{a,b,c}: 13    {a,b,d}: 20    {a,c,d}: 15    {b,c,d}: 16
{a,b}: 7    {a,c}: 12    {b,c}: 10    {b,d}: 8    {c,d}: 9
{b}: 4    {c}: 5

TPC-H Datasets
The TPC-H synthetic database generator is widely recognized as a standard benchmark database generator for data analytics.
In our work, we have found that the TPC-H benchmark has potential shortcomings when used to test the quality of algorithms developed for efficient processing of complex analytics queries.
The TPC-H dataset does not distinguish between view sizes.

The Potential Shortcomings of TPC-H Datasets
The size of a great number of views is close to that of the largest view.

Columns: total number of attributes; number of views within 0.1% size difference from the largest view; total number of views; ratio.

Attributes  Views within 0.1%  Total views  Ratio
7           52                 128          40.60%
13          6,192              8,192        75.60%
15          27,318             32,768       83.40%
17          115,162            131,072      87.90%

We would always prefer to store the largest view.
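The ratios in this table can be cross-checked with a few lines of Python, assuming that the total number of views for a dataset with K attributes is 2^K. The counts are copied from the table; the function and variable names are illustrative.

```python
# Number of attributes -> number of views within 0.1% of the largest view,
# copied from the table in the slide.
NEAR_LARGEST = {7: 52, 13: 6192, 15: 27318, 17: 115162}


def near_largest_ratio(num_attributes, near_count):
    """Percentage of the 2^K views whose size is close to that of the largest view."""
    total_views = 2 ** num_attributes
    return round(100 * near_count / total_views, 1)


for num_attributes, near_count in NEAR_LARGEST.items():
    print(num_attributes, 2 ** num_attributes,
          near_largest_ratio(num_attributes, near_count))
```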

Our Contribution
We define three types of synthetic datasets, which do not have the shortcomings that we have observed in the TPC-H data.
We introduce algorithms for generating all three types of datasets in any range of data sizes, which allows one to use the datasets in a variety of configurations and scales of cloud environments.
Our datasets are complementary to the TPC-H datasets in testing the processing performance of complex analytics queries in cloud environments.

The Symmetric Synthetic Datasets
$K$: total number of attributes in the dataset
$n$: number of values for each attribute
Number of rows: $n^K$

Example 1: $K = 3$, $n = 2$

Attributes:
A  B  C
0  2  4
1  3  5

Dataset:
A  B  C
0  2  4
0  2  5
0  3  4
0  3  5
1  2  4
1  2  5
1  3  4
1  3  5

Number of rows: $2^3 = 8$

Views in the Symmetric Synthetic Datasets
The size of each $k$-attribute view over a symmetric dataset with $n$ values per attribute is $n^k$.
The size of an ancestor is at least $n$ times the size of its descendant.

Example 1 (cont'd): $K = 3$, $n = 2$
{A,B,C}: 8
{A,B}: 4    {A,C}: 4    {B,C}: 4
{A}: 2    {B}: 2    {C}: 2
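The construction in Example 1 can be sketched in Python as the Cartesian product of the attribute domains, with the size of a view measured as the number of distinct rows after projection. This is an illustrative sketch, not the authors' generator; the function names and the way values are assigned to attributes (attribute i gets its own consecutive block of values, as in the example) are assumptions.

```python
from itertools import combinations, product


def symmetric_dataset(num_attributes, values_per_attribute):
    """Every combination of attribute values; attribute i uses its own block of values."""
    domains = [
        range(i * values_per_attribute, (i + 1) * values_per_attribute)
        for i in range(num_attributes)
    ]
    return list(product(*domains))


def view_size(rows, attribute_indexes):
    """Number of distinct rows after projecting onto the given attributes."""
    return len({tuple(row[i] for i in attribute_indexes) for row in rows})


rows = symmetric_dataset(3, 2)  # Example 1: three attributes, two values each
sizes = {
    attrs: view_size(rows, attrs)
    for k in range(1, 4)
    for attrs in combinations(range(3), k)
}
print(len(rows), sizes)
```

For this dataset every view on k attributes has 2^k rows, which matches the lattice shown in the example.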

Symmetric Synthetic Datasets
Symmetric properties of the datasets.
Significant size difference between each pair of ancestor-descendant views.
The datasets do not distinguish between the sizes of views with the same number of attributes.

Type I Non-Symmetric Synthetic Datasets
The number of values of each attribute differs: $n_1, n_2, \dots, n_K$.

Example 2: $K = 3$; $n_1 = 2$, $n_2 = 3$, $n_3 = 4$

Attributes:
A  B  C
0  2  5
1  3  6
   4  7
      8

Number of rows: $2 \times 3 \times 4 = 24$

Dataset:
A  B  C
0  2  5
0  2  6
0  2  7
0  2  8
...
1  4  7
1  4  8

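Views in the Type I Non-Symmetric Synthetic Datasets
A $k$-attribute view over attributes $A_{i_1}, A_{i_2}, \dots, A_{i_k}$, with $n_{i_1}, n_{i_2}, \dots, n_{i_k}$ values respectively.
The size (number of rows): $n_{i_1} \times n_{i_2} \times \dots \times n_{i_k}$

Example 2 (cont'd): $K = 3$; $n_1 = 2$, $n_2 = 3$, $n_3 = 4$
{A,B,C}: 24
{A,B}: 6    {A,C}: 8    {B,C}: 12
{A}: 2    {B}: 3    {C}: 4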

Type I Non-Symmetric Synthetic Datasets
A type I non-symmetric synthetic dataset distinguishes between any pair of view sizes.
Relatively large difference in size between each pair of ancestor-descendant views.
The size of each view is at least twice the size of its descendant.
We would always prefer to store the answer of each query.
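For Example 2, these properties can be checked directly from the closed-form view size, the product of the value counts of the attributes in the view. The sketch below assumes exactly that formula and only covers the example with 2, 3, and 4 values; the function names are illustrative.

```python
from itertools import combinations
from math import prod

VALUE_COUNTS = {"A": 2, "B": 3, "C": 4}  # Example 2


def lattice_sizes(value_counts):
    """Closed-form size of every non-empty view: product of its attributes' value counts."""
    names = sorted(value_counts)
    return {
        frozenset(attrs): prod(value_counts[a] for a in attrs)
        for k in range(1, len(names) + 1)
        for attrs in combinations(names, k)
    }


def min_parent_child_ratio(sizes):
    """Smallest size ratio between a view and a view with exactly one attribute fewer."""
    return min(
        sizes[parent] / sizes[child]
        for parent in sizes
        for child in sizes
        if child < parent and len(parent) - len(child) == 1
    )


sizes = lattice_sizes(VALUE_COUNTS)
print(sorted(sizes.values()), min_parent_child_ratio(sizes))
```

In this example all seven view sizes are distinct and every view is at least twice as large as each of its immediate descendants, because every attribute has at least two values.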

Type II Non-Symmetric Synthetic Datasets
Objectives: break the symmetric properties and reduce the size difference between adjacent ancestor-descendant pairs of views.
Conduct an elimination procedure over the rows in a given type I non-symmetric synthetic dataset.
For each attribute, we conduct a two-step sub-elimination process:
Step 1: Eliminate each row with a given probability.
Step 2: For each row r that is eliminated in step 1, we also eliminate the rows in the master table that have the same values as r on all attributes except the current one.

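Type II Non-Symmetric Synthetic Datasets
Input: the number of attributes, the number of values of each attribute, and a target number of rows.
Choose the probability parameters of the elimination procedure.
Output: a type II non-symmetric synthetic dataset such that the expected number of rows in it is greater than or equal to the target number of rows.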

Type II Non-Symmetric Synthetic Datasets
Example 3: Input: 3 attributes with 2, 3, and 4 values, and a target of 10 rows. The chosen parameter value is 0.9.

Type I dataset for 3 attributes with 2, 3, and 4 values (24 rows):
A  B  C      A  B  C
0  2  5      1  2  5
0  2  6      1  2  6
0  2  7      1  2  7
0  2  8      1  2  8
0  3  5      1  3  5
0  3  6      1  3  6
0  3  7      1  3  7
0  3  8      1  3  8
0  4  5      1  4  5
0  4  6      1  4  6
0  4  7      1  4  7
0  4  8      1  4  8

Resulting type II dataset (10 rows):
A  B  C
0  2  6
0  2  8
0  3  6
0  3  8
0  4  5
0  4  8
1  2  6
1  2  7
1  3  6
1  3  7

Views in the Type II Non-Symmetric Synthetic Datasets
Example 3 (cont'd):
{A,B,C}: 10
{A,B}: 5    {A,C}: 5    {B,C}: 8
{A}: 2    {B}: 3    {C}: 4
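The view sizes in Example 3 can be reproduced by projecting the ten remaining rows onto each attribute subset and counting distinct rows. The Python sketch below takes the rows from the example; the function name and the string labels used for the views are illustrative choices.

```python
from itertools import combinations

ATTRIBUTES = ("A", "B", "C")
# The ten rows of the type II dataset in Example 3.
ROWS = [
    (0, 2, 6), (0, 2, 8), (0, 3, 6), (0, 3, 8), (0, 4, 5),
    (0, 4, 8), (1, 2, 6), (1, 2, 7), (1, 3, 6), (1, 3, 7),
]


def view_sizes(rows, attributes):
    """Map each non-empty attribute subset to its number of distinct projected rows."""
    sizes = {}
    for k in range(1, len(attributes) + 1):
        for idx in combinations(range(len(attributes)), k):
            name = "".join(attributes[i] for i in idx)
            sizes[name] = len({tuple(row[i] for i in idx) for row in rows})
    return sizes


print(view_sizes(ROWS, ATTRIBUTES))
```

Unlike the type I case, the sizes of the two-attribute views (5, 5, and 8) are no longer products of the value counts, and the ancestor-descendant size gaps are smaller.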

Experimental Results
[Figure: a performance measure (vertical axis, 0% to 600%) plotted against β (horizontal axis, 0 to 1) for the type I non-symmetric dataset and the TPC-H dataset.]

Conclusion
We define a symmetric synthetic dataset and two types of non-symmetric synthetic datasets.
We studied shortcomings of the TPC-H datasets in testing algorithms devised for improving query-processing performance for complex queries posed on large-scale data.
We compare these datasets experimentally with our proposed synthetic datasets in a setting for testing such algorithms.
All the synthetic datasets that we proposed in this paper are beneficial for testing algorithms devised for improving query-processing performance in cloud computing.

Thank You!