How To Win At A Game Of Monopoly On The Moon

Transcription

1 Changing the Face of Database Cloud Services with Personalized Service Level Agreements Jennifer Ortiz, Victor Teixeira de Almeida, Magdalena Balazinska University of Washington, Computer Science and Engineering PETROBRAS S.A., Rio de Janerio, RJ, Brazil CIDR

2 2

3 Many Data Management & Analytics Systems Available 3

4 Many Systems are Available as Cloud Services 4

5 Which Hadoop Version? Cloud Services Today Amazon EMR Pig or Hive? How many instances of the service? 5

6 Which Hadoop Version? Cloud Services Today Amazon EMR Pig or Hive? How many instances of the service? 6

7 Cloud Services Today BigQuery How long will my query take? 7

8 Cloud Services Can Do Better! System Internals 8

9 Cloud Services Can Do Better! System Internals 9

10 Cloud Services Can Do Better! Query: Query Capabilities Time Money SELECT FROM WHERE 10

11 A new proposal Time to Re-think the interface Hide details of cluster deployment and resources Show users monetary costs and performance estimates on their data Let users pick the desired trade-off between options shown Personalized Service Level Agreements 11

12 Tier 1: $0.10/hour Within 20 seconds: SELECT <up to 10 attributes> FROM <Fact Dimension> WHERE <up to 100% of data> Within 1 minute: SELECT <up to 5 attributes> FROM <JOIN Fact + 4 Dimensions> WHERE <up to 10% of data> A PSLA Example Fixed, hourly price Expected performance Templates capture capabilities Within 10 minutes: SELECT <up to 10 attributes> FROM <JOIN Fact + 8 Dimensions> WHERE <up to 100% of data> 12

13 Tier 1: $0.10/hour Within 20 seconds: SELECT <up to 10 attributes> FROM <Fact Dimension> WHERE <up to 100% of data> Within 1 minute: SELECT <up to 5 attributes> FROM <JOIN Fact + 4 Dimensions> WHERE <up to 10% of data> A PSLA Example Fixed, hourly price Expected performance Templates capture capabilities Tier 2: $0.50/hour Within 1 second: SELECT <up to 10 attributes> FROM <Fact Dimension> WHERE <up to 100% of data> Different tiers of service Within 10 minutes: SELECT <up to 10 attributes> FROM <JOIN Fact + 8 Dimensions> WHERE <up to 100% of data> 13

14 Goals Cloud C Database D 14

15 Goals Cloud C PSLA P Database D Money, Time, Capabilities 15

16 Goals Cloud C PSLAManager PSLA P Database D Money, Time, Capabilities 16

17 Goals Cloud C PSLAManager PSLA P Database D Money, Time, Capabilities 17

18 Example of a Real PSLA 18

19 TPC-H Star Schema Benchmark Based on TPC-H 10GB 19

20 Myria is a data management service in the cloud that we built at UW. It has a parallel, shared-nothing back-end query execution engine called MyriaX 20

21 PSLA for Myria 21

26 PSLAManager Cloud C PSLAManager PSLA P Database D 26

27 PSLAManager Workflow Service Perf. Modeling Perform Offline Perform Online Data Workload Generation Runtime Prediction Tier Selection Workload Compression into PSLA (repeat for each tier) Query Clustering Template Generation Dropping Queries with Similar Times PSLA 27

34 Query Workload Generation Which queries to generate? Joins drive performance Think about possible combinations of joins Consider: All possible 2-way joins Tables in Order by Size: Lineitem, Part, Customer, Supplier, Date (Lineitem Part) (Customer Date), (Lineitem Supplier), etc. Only consider most expensive queries Build toward more complex queries, include selections and projections 34

38 Seconds Seconds Tier Selection Runtime Distributions of Query Workload Per Configuration in Myria EMD EMD 7.07 EMD Workers Workers (Configurations) 38

39 Tier Selection 39

41 Workload Compression STEP 1: Query Clustering Threshold-based Density-based Workers 41

42 Workload Compression STEP 1: Query Clustering th Workers 42

43 Workload Compression STEP 1: Query Clustering Workers 43

44 Workload Compression STEP 1: Query Clustering Tier 1: $0.XX/hour Within Cluster Max Threshold: Query Templates in Cluster Within Cluster Max Threshold: Query Templates in Cluster 44

45 Time(s) Workload Compression STEP 2: Template Generation Queries Query Query Templates Dominance SELECT (5 ATT) FROM (5 TABLES) WHERE 10% SELECT (4 ATT) FROM (4 TABLES) WHERE 1% Configuration 45

46 Time(s) Workload Compression STEP 2: Template Generation Attributes Projected Query Dominance Tables SELECT (5 ATT) FROM (5 TABLES) WHERE 10% SELECT (4 ATT) FROM (4 TABLES) WHERE 1% Selectivity Configuration 46

47 Time(s) Workload Compression STEP 2: Template Generation Attributes Projected Query Dominance Tables SELECT (5 ATT) FROM (5 TABLES) WHERE 10% SELECT (4 ATT) FROM (4 TABLES) WHERE 1% Selectivity Given: Configuration 47

48 Time(s) Workload Compression STEP 2: Template Generation Configuration 48

49 Time(s) Workload Compression STEP 2: Template Generation Root Query Template: We call a query template a root query template if no other query template in the same cluster dominates it. Configuration 49

50 Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Configuration 1 50

51 Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Root Query Templates Queries Query Templates Configuration 1 Configuration 2 51

52 Time(s) Workload Compression STEP 3: Dropping Queries with Similar Times Configuration 1 Configuration 2 52

55 PSLA Quality Assessment 55

56 PSLA Quality Metrics PSLA Query Capabilities PSLA Complexity PSLA Performance Error Metric 56

57 Time(s) Time(s) Quality Metric Trade-offs High Complexity Low Error Low Complexity High Error Found that a log interval is best without tuning 57

58 PSLA Evaluation on Predicted Runtimes 58

61 Performance Model (Offline) Train model offline on other data and queries Cloud Configuration Query Feature Vector Est. Rows Est. IO Avg. Row runtime Query workload Predict runtime from query features Query Feature Vector Cloud Configuration Est. Rows Est. IO Avg. Row runtime Based on Predicting Multiple Metrics for Queries: Better Decisions enabled by Machine Learning [Ganapathi et. al. 2009] 61

62 Training Dataset Synthetic Dataset 10GB 6 Tables 61 Attributes 62

64 Predicted Myria PSLA (Predicted Runtimes) 64

65 Looking Forward Direct extensions to the approach Add support for indexes Improve time predictions Longer-term future work Can we guarantee the runtimes? Can we update the PSLA as user queries data Goal is to show increasingly more complex queries Usability testing 65

66 Conclusion Many cloud DBMSs exist Require users to reason about resources We propose to re-think that interface Personalized Service Level Agreements Service Tiers Price/Capabilities/Performance Important direction for cloud DBMSs 66