Identifying Performance Bottlenecks in Hive: Use of Processor Counters

Alexander C Shulyak, Lizy K John
Presented by: Shuang Song

Problem
Businesses and online services increasingly rely on insights derived from data analytics applications:
  Targeted promotional advertising
  Personalized content and experiences
  Streamlining business operations
  Sales and market analysis
The amount of data being collected is increasing exponentially.
Distributed SQL Query Engines (DSQEs) are increasingly used as Decision Support Systems (DSS) to process large amounts of data at scale: Hive, Shark, Impala, etc.
What are the performance bottlenecks of a DSQE running DSS queries?
What are the performance trade-offs of a DSQE over a traditional Relational Database Management System?

Hive Overview
Data warehousing and query processing framework for large databases
Uses the Hadoop HDFS and MapReduce framework
Hive converts SQL-like queries to a set of MapReduce jobs for Hadoop, monitors progress, and returns the result
The YARN Resource Manager allows custom execution engines
The Tez execution engine improves performance by modeling a query as a DAG of MapReduce jobs and optimizing execution of the entire DAG [1]
[Figure: Hadoop architecture; from [1]]
[1] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and Carlo Curino. Apache Tez: A unifying framework for modeling and building data processing applications. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1357-1369, New York, NY, USA, 2015. ACM.

Previous Work
Panda et al.; SBAC-PAD 2015; Performance Characterization of Modern Databases on Out-of-Order CPUs
  Performance analysis of MySQL, Cassandra, MongoDB, and VoltDB
  Instruction/data caches and TLB stressed significantly
  MySQL: comparatively best throughput and latency for all workloads
Wouw et al.; ICPE 2015; An Empirical Performance Evaluation of Distributed SQL Query Engines
  Analyzed Shark, Hive with MapReduce, and Impala
  Propose a micro-benchmarking suite and an empirical method for evaluating the performance of Distributed SQL Query Engines (DSQEs)
  Hive with MapReduce is outperformed by all other database options; it experiences high network I/O and framework overhead

Previous Work (Cont.)
Floratou et al.; VLDB 2014; SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures
  Study included Hive with MapReduce, Hive with Tez, and Impala on a 21-node cluster running a 1TB TPC-H database
  Only Hive with MapReduce was impacted by startup and scheduling overheads
  Impala, a shared-nothing SQL-on-Hadoop database, vastly improved performance when workloads fit in memory
  Hive with MapReduce and Hive with Tez were CPU bound, especially on scan operations

Motivation: Hive struggles to beat MySQL despite 6X available cores
Hive is still needed due to its scale-out potential
Algorithm, code base, OS, and CPU resources could all cause computational performance differences between Hive and MySQL
Root cause analysis is needed
Time = (1/f * CPI * Total Ins) / TLP
  f: clock frequency; CPI: cycles per instruction; Total Ins: total instruction count; TLP: thread-level parallelism
[Figure: Execution time for Query1, Query3, Query6, Query14, Query19, and Average; MySQL 10GB vs. Hive Tez 10GB]
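The execution-time relation on this slide can be tried out numerically. The following is a minimal sketch, not the authors' tooling: the function name and every input value are hypothetical, chosen only to show how a higher instruction count and CPI can offset a 6X advantage in thread-level parallelism.

```python
def execution_time_seconds(total_instructions, cpi, frequency_hz, tlp):
    """Evaluate Time = (1/f * CPI * Total Ins) / TLP.

    All arguments are illustrative inputs, not measurements from the slides.
    """
    if frequency_hz <= 0 or tlp <= 0:
        raise ValueError("frequency_hz and tlp must be positive")
    total_cycles = total_instructions * cpi
    return total_cycles / frequency_hz / tlp


# Hypothetical single-threaded engine: 1e12 instructions at CPI 0.6 on a 2.5 GHz core.
single_threaded = execution_time_seconds(1.0e12, 0.6, 2.5e9, tlp=1)

# Hypothetical parallel engine: 2.7x the instructions, CPI 0.7, six threads.
parallel = execution_time_seconds(2.7e12, 0.7, 2.5e9, tlp=6)

print(f"single-threaded: {single_threaded:.1f} s, parallel: {parallel:.1f} s")
```

With these made-up inputs, six-way parallelism yields less than a 2X speedup, because the extra instructions and the higher CPI consume most of the gain.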

Experimental Setup
Database Details
  Hive 1.2.1
  Hadoop 2.7.0; Tez 0.7.0
  MySQL 14.14
Benchmark Details
  TPC-H 10GB database, queries 1, 3, 6, 14, 19
Server Setup
  Single-node, 6 (12) core Intel Xeon E5-2430 v2 @ 2.5 GHz
  32KB private L1i and L1d, 256KB combined L2, 15MB shared LLC
  64GB DDR3 @ 1600 MHz; 750GB HDD @ 7200 rpm
  OpenJDK 1.8.0

Methodology
Processor Performance Counters: Perf 3.19
  IPC (instructions per cycle), CPI (cycles per instruction), MPKI (misses per kilo-instruction), CS PKI (context switches per kilo-instruction)
  Huge number of counters, hundreds of factors: need for an empirical method to identify the bottleneck
Top-Down SW Performance Analysis
  1. Total Instruction Count and Average Thread Count
  2. Query Plan Analysis
  3. OS Events and Statistics
  4. Instruction-Stream Statistics (Ins Mix, Ins/Data Footprint, Ins/Data access patterns)
Top-Down HW Performance Analysis
  1. System-level utilization statistics (CPU, Mem, Network I/O, Disk I/O)
  2. Instructions per Cycle (IPC)
  3. Top-Down Microarchitectural Analysis Method (TMAM)
Aggregate statistics may hide phase behavior responsible for performance loss
  Counter statistics collected at 1-second intervals
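The derived metrics named on this slide follow directly from raw counter values. Below is a minimal sketch of that arithmetic, assuming the raw counts have already been collected; the `CounterSample` class, its field names, and the sample values are hypothetical and do not reflect Perf's output format.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counts for one measurement window (hypothetical field names)."""
    instructions: int
    cycles: int
    llc_misses: int
    context_switches: int


def derive_metrics(sample):
    """Return IPC, CPI, LLC MPKI, and CS PKI for one sample."""
    if sample.instructions <= 0 or sample.cycles <= 0:
        raise ValueError("instructions and cycles must be positive")
    kilo_instructions = sample.instructions / 1000
    return {
        "ipc": sample.instructions / sample.cycles,
        "cpi": sample.cycles / sample.instructions,
        "llc_mpki": sample.llc_misses / kilo_instructions,
        "cs_pki": sample.context_switches / kilo_instructions,
    }


# Made-up counts for a single 1-second window.
window = CounterSample(instructions=2_000_000, cycles=1_000_000,
                       llc_misses=3_000, context_switches=4)
for name, value in derive_metrics(window).items():
    print(f"{name}: {value:g}")
```

Normalizing misses and context switches per kilo-instruction makes runs with very different instruction counts comparable, which matters when one engine executes several times more instructions than the other.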

Insights
1. MySQL is a relatively efficient DSS query execution engine
2. Hive shows difficulty converting SQL queries into a set of MapReduce jobs
3. Hive's framework layers add code bloat and slow database traversal
4. Hive with Tez amortizes JVM startup cost effectively across MapReduce tasks; the initial start-up period for Hive is costly for queries under 300 seconds
5. Hive's highly parallel execution increases context switch rates, which stresses the memory hierarchy

Instruction Count
Hive must execute 1.7X more instructions on average
Hadoop, MapReduce, Distributed Execution
  Large overhead
  Abstract query from flow of execution
  Overhead slightly lower due to vector execution
Variation in overhead across queries is due to query plan inefficiencies
[Figure: Total Instructions, Normalized to MySQL; y-axis: Normalized Instruction Count; Query1, Query3, Query6, Query14, Query19, and Average; MySQL vs. Hive. Hive data labels: 2.9, 7.5, 1.9, 2.1, 1.4, and 2.7 (MySQL = 1 throughout)]

Query 1 SQL and Query 19 SQL
Query 1 reports the amount of business that was billed, shipped, and returned
Query 19 reports the gross discounted revenue attributed to the sale of selected parts handled in a particular manner

Startup Overhead
Start-up Period
  Static period of time observed in all Hive queries, marked by poor microarchitectural performance
  2 parts: initialization period, warm-up period
Initialization Period
  Hive query plan is generated and sent to Hadoop. Hadoop is initialized
  Low instruction count, low IPC
Warm-up Period
  Hadoop Tez containers are initialized, but JVM processes and CPU must warm up to reach peak IPC
  High instruction count, low IPC
Because the period is static, its effect on execution time and average IPC depends on total execution time
  Query 3's IPC increases by 9% with the start-up period ignored
  Query 6's IPC increases by 45% with the start-up period ignored
Improvement over Hive with built-in Hadoop MapReduce
Use of Tez's online or batch query processing mode can further amortize the start-up period costs
[Figure: Query 3 and Query 6; IPC and Ins Count over time (seconds); series: IPC, Ave. IPC, Ave. IPC post-startup, Total Instructions]
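The effect of ignoring the start-up period can be reproduced from per-interval counter samples. The sketch below assumes one (instructions, cycles) pair per 1-second interval; the trace values and the 10-second start-up length are invented for illustration and are not the measurements behind the 9% and 45% figures.

```python
def average_ipc(samples, skip_intervals=0):
    """Average IPC over (instructions, cycles) samples.

    The first `skip_intervals` samples are dropped, which models ignoring a
    fixed start-up period. IPC is total instructions over total cycles, so
    each interval is weighted by the cycles it actually consumed.
    """
    kept = samples[skip_intervals:]
    total_instructions = sum(ins for ins, _ in kept)
    total_cycles = sum(cyc for _, cyc in kept)
    if total_cycles == 0:
        raise ValueError("no cycles left after skipping the start-up period")
    return total_instructions / total_cycles


# Hypothetical trace: 10 s of start-up at IPC 0.5, then 30 s of steady state at IPC 2.0.
startup = [(1_000_000_000, 2_000_000_000)] * 10
steady = [(4_000_000_000, 2_000_000_000)] * 30
trace = startup + steady

overall = average_ipc(trace)
post_startup = average_ipc(trace, skip_intervals=len(startup))
gain_percent = (post_startup / overall - 1) * 100
print(f"overall IPC {overall:.3f}, post-startup IPC {post_startup:.3f}, "
      f"gain {gain_percent:.1f}%")
```

Because the start-up length is fixed, shortening the steady-state portion of the trace increases the gain, which matches the slide's point that short queries are hurt the most.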

IPC
Hive's instruction execution rate (IPC) improves with longer execution times.
  When the start-up period is ignored, Query 6 and Query 14 outperform their MySQL counterparts
Relative IPC across queries does not correlate between MySQL and Hive.
  Each database has different inherent microarchitectural bottlenecks
[Figure: IPC for Query1, Query3, Query6, Query14, Query19, and Average; series: MySQL, Hive 6 Thread, Hive 6 Thread Post-Startup]

LLC Misses: MySQL Performance Bottleneck
MySQL microarchitectural performance is correlated with LLC MPKI
  Data-driven performance
  Fewer instructions between unique entries in the database
Hive has little variation in LLC MPKI
  Data traversal is indicative of the underlying frameworks
  Little difference in LLC MPKI rates between 1-threaded and 6-threaded Hive setups (despite differences in CPI); therefore, context switches are affecting first-level caches far more than the LLC
[Figure: LLC MPKI vs. CPI; x-axis: LLC MPKI; y-axis: CPI; series: MySQL, Hive 1 Thread, Hive 6 Threads, Linear (MySQL)]
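A correlation such as the one between LLC MPKI and CPI can be quantified with a least-squares line and a Pearson coefficient, which is one plausible reading of the "Linear (MySQL)" series in the figure. The sketch below uses invented (LLC MPKI, CPI) points, not values read from the plot.

```python
import math


def linear_fit(xs, ys):
    """Return (slope, intercept, pearson_r) for a least-squares line y = slope*x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for a fit to be meaningful")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / math.sqrt(sxx * syy)


# Hypothetical per-query points: x = LLC MPKI, y = CPI.
llc_mpki = [0.4, 0.9, 1.3, 1.8, 2.2]
cpi = [0.44, 0.52, 0.61, 0.68, 0.77]

slope, intercept, r = linear_fit(llc_mpki, cpi)
print(f"CPI ~ {slope:.3f} * MPKI + {intercept:.3f}, r = {r:.3f}")
```

A coefficient near 1 would support the claim that LLC misses drive CPI for that engine; a series with almost no spread in LLC MPKI, as described for Hive, gives the fit little to work with.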

Context Switches: Hive Performance Bottleneck
CPI increases as the number of threads, and subsequently the number of context switches, increases.
  Indicates a bottleneck of the system
MySQL has far fewer context switches than Hive, even when Hive runs with 1 thread.
  MySQL is single-threaded, and there is no apparent correlation between context switches and CPI
[Figure: CS vs. CPI with different numbers of threads; x-axis: CS PKI; y-axis: CPI; series: Hive Query 1, Hive Query 3, Hive Query 6, Hive Query 14, Hive Query 19, MySQL Queries]

Conclusion
MySQL executes queries as efficiently as possible
  Low instruction count
  Microarchitectural performance differences depend on how the data is traversed and how the memory hierarchy is stressed by that algorithm
Hive's large code base and generic execution framework are the primary performance bottleneck
  Query plan inefficiency and increased instruction count
  Hive's startup period hurts the performance of short-running queries
  Hive's higher context switch rates directly impact microarchitectural performance
Improvements:
  Amortize startup costs over more queries with batch or online execution
  Decrease parallelization per node to improve microarchitectural performance
  Resort to a Distributed SQL Query Engine only if the database is too large for one node

Questions?

Overview
Objective: root cause analysis of performance discrepancies between MySQL and Hive
  MySQL: traditional Relational Database Management System (RDBMS), compared to a scale-out approach like Hive
  Hive: Hadoop-based DSQE
Use a Decision Support Benchmark (DSB), TPC-H, as the database benchmark
Identify computational overheads associated with distributed setups

Previous Work (Cont.)
Floratou et al.; VLDB 2014; SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures
  Hive without Tez was impacted by startup and scheduling overheads
  Impala, a shared-nothing SQL-on-Hadoop database, vastly improved performance when workloads fit in memory
  Hive on MapReduce and Hive on Tez were CPU bound, especially on scan operations
Jia et al.; IISWC 2013; Characterizing Data Analysis Workloads in Data Centers
  Data analysis workloads have higher IPC than data serving workloads, but lower IPC than computation-intensive HPCC workloads
  Both data analysis and data serving workloads suffer from noticeable front-end stalls, which the authors attribute to a larger code footprint causing inefficient L1I cache and ITLB performance

TPC-H Database

MySQL Query Plans
Query     Table      Type    Clauses
Query 1   lineitem   ALL     Using where; Using temporary; Using filesort
Query 3   orders     ALL     Using where; Using temporary; Using filesort
          customers  eq_ref  Using where
          lineitem   ref     Using where
Query 6   lineitem   ALL     Using where
Query 14  lineitem   ALL     Using where
          part       eq_ref
Query 19  lineitem   ALL     Using where
          part       eq_ref  Using where

Hive Query Plans
Query     Vertex    Source           Type                                  Partitions
Query 1   Map 1     lineitem         Filter, Select, Group By              12
          Reduce 2  Map 1            Group By                              11
          Reduce 3  Reduce 2         Select                                1
Query 3   Map 1     customers        Filter                                1
          Map 6     orders           Filter                                4
          Map 7     lineitem         Filter                                15
          Reduce 2  Map 1, Map 6     Merge/Join                            1
          Reduce 3  Map 7, Reduce 2  Merge/Join, Select, Group By          2
          Reduce 4  Reduce 3         Group By, Select                      2
          Reduce 5  Reduce 4         Select                                1
Query 6   Map 1     lineitem         Filter, Select, Group By              12
          Reduce 2  Map 1            Group By                              1
Query 14  Map 1     part             Filter                                1
          Map 4     lineitem         Filter                                15
          Reduce 2  Map 1, Map 4     Merge/Join, Group By                  1
          Reduce 3  Reduce 2         Group By, Select                      1
Query 19  Map 1     lineitem         Filter                                15
          Map 4     parts            Filter                                1
          Reduce 2  Map 1, Map 4     Merge/Join, Filter, Select, Group By  4
          Reduce 3  Reduce 2         Group By                              1