Architecture Support for Big Data Analytics. Ahsan Javed Awan, EMJD-DC (KTH-UPC) (http://uk.linkedin.com/in/ahsanjavedawan/). Supervisors: Mats Brorsson (KTH), Eduard Ayguade (UPC), Vladimir Vlassov (KTH)
Why should we care? Motivation *Source: Babak Falsafi slides
There is a mismatch between the characteristics of emerging workloads and the underlying hardware. M. Ferdman et al., Clearing the clouds: A study of emerging scale-out workloads on modern hardware, in ASPLOS, 2012. Z. Jia et al., Characterizing data analysis workloads in data centers, in IISWC, 2013. C. Zheng et al., Characterizing OS behavior of scale-out data center workloads, in WIVOSCA, 2013. M. Dimitrov et al., Memory system characterization of big data workloads, in BigData Conference, 2013. Z. Jia et al., Characterizing and subsetting big data workloads, in IISWC, 2014. A. Yasin et al., Deep-dive analysis of the data analytics workload in CloudSuite, in IISWC, 2014. L. Wang et al., BigDataBench: A big data benchmark suite from internet services, in HPCA, 2014. T. Jiang et al., Understanding the behavior of in-memory computing workloads, in IISWC, 2014.
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server *Source: SGI
Hadoop, Spark, Flink, etc. Phoenix++, Metis, Ostrich, etc. Our focus: improve the single-node performance in a scale-out configuration. *Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Which Scale-out Framework? Progress Meeting 12-12-14 [Picture Courtesy: Amir H. Payberah]
Methodology. Our Approach: a three-fold analysis method at the application, thread, and microarchitectural levels. Tuning of Spark internal parameters; tuning of JVM parameters (heap size, etc.); concurrency analysis; general architectural exploration.
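The tuning step above can be sketched as a small helper that builds one spark-submit command line per configuration. This is a minimal sketch, not the authors' scripts: it assumes Spark runs in local mode on a single node (so the thread count goes in `local[N]` and the heap size in `--driver-memory`), and the function name, application file, heap size, and thread counts are hypothetical.

```python
from typing import List


def spark_submit_args(app: str, threads: int, heap_gb: int) -> List[str]:
    """Build a spark-submit command line for one single-node run.

    Assumes local mode, where the worker threads live in the driver JVM,
    so --driver-memory sets the heap that the tasks actually use.
    """
    return [
        "spark-submit",
        "--master", f"local[{threads}]",   # number of worker threads
        "--driver-memory", f"{heap_gb}g",  # JVM heap size
        app,
    ]


if __name__ == "__main__":
    # Hypothetical sweep over thread counts with a fixed, hypothetical heap size.
    for n in (1, 6, 12, 24):
        print(" ".join(spark_submit_args("wordcount.py", n, heap_gb=50)))
```

Sweeping one parameter at a time while holding the others fixed keeps the application-level results comparable across runs.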
Benchmarks: 3 GB of Wikipedia raw datasets, Amazon Movies Reviews, and numerical records were used.
Machine Details: Our Hardware Configuration. Hyper-Threading and Turbo Boost are disabled.
System Configuration
Application Level Performance. Multicore Scalability of Spark: Spark scales poorly in a scale-up configuration.
Stage Level Performance: Shuffle map stages do not scale beyond 12 threads across different workloads. The number of files open concurrently in map-side shuffling is C*R, where C is the number of threads in the executor pool and R is the number of reduce tasks.
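The C*R relation above is easy to tabulate. The sketch below only evaluates the stated formula; the function name and the value R = 200 are hypothetical and are not taken from the experiments.

```python
def concurrent_shuffle_files(executor_threads: int, reduce_tasks: int) -> int:
    """Files open at once during map-side shuffling: C * R.

    C = threads in the executor pool, R = number of reduce tasks.
    """
    return executor_threads * reduce_tasks


if __name__ == "__main__":
    R = 200  # hypothetical number of reduce tasks
    for c in (1, 6, 12, 24):
        print(f"C={c:2d}  R={R}  open files={concurrent_shuffle_files(c, R)}")
```

Because the count grows linearly with the thread count, doubling the executor pool doubles the number of files the shuffle keeps open at the same time.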
Task Level Performance: percentage increase in the area under the curve compared to the 1-thread case.
Is there thread-level load imbalance?
CPU utilization is not scaling with performance
Is there any work time inflation?
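One way to put numbers on the two questions above is sketched below. The slides do not give formulas, so both definitions are assumptions: work time inflation is taken as the summed per-thread work time of a multi-threaded run divided by the work time of the 1-thread run, and load imbalance as the busiest thread's work time divided by the mean. The sample timings are hypothetical.

```python
from typing import Sequence


def work_time_inflation(serial_work_s: float, per_thread_work_s: Sequence[float]) -> float:
    """Ratio of total work time across threads to the 1-thread work time.

    A value of 1.0 means no inflation; 1.2 means the threads together
    spent 20% more time doing the same work (assumed definition).
    """
    return sum(per_thread_work_s) / serial_work_s


def load_imbalance(per_thread_work_s: Sequence[float]) -> float:
    """Busiest thread's work time divided by the mean (1.0 = balanced)."""
    mean = sum(per_thread_work_s) / len(per_thread_work_s)
    return max(per_thread_work_s) / mean


if __name__ == "__main__":
    serial = 100.0                       # hypothetical 1-thread work time, seconds
    threads = [35.0, 30.0, 30.0, 25.0]   # hypothetical 4-thread work times, seconds
    print("inflation:", work_time_inflation(serial, threads))
    print("imbalance:", load_imbalance(threads))
```

Separating the two metrics helps tell apart threads that wait for work (imbalance) from threads that take longer to do the same work (inflation).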
How does the micro-architecture contribute to work time inflation?
Is memory bandwidth a bottleneck?
Key Findings: Using more than 12 threads in an executor pool does not yield a significant performance improvement. The Spark runtime system needs to be improved to provide better load balancing and avoid work time inflation. Work time inflation and load imbalance on the threads are the scalability bottlenecks. Removing the bottlenecks in the front-end of the processor would not remove more than 20% of stalls. Effort should be focused on removing the memory-bound stalls, since they account for up to 72% of stalls in the pipeline slots. The memory bandwidth of current processors is sufficient for in-memory data analytics.
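The stall percentages above are expressed in pipeline slots. A minimal sketch of how such a slot breakdown can be computed is given below, assuming a Top-Down style level-1 accounting on a 4-wide core. The slides do not state the exact formulas or counters used, so the parameter names are generic placeholders for the corresponding hardware counter readings and the sample values are hypothetical.

```python
from typing import Dict


def top_level_breakdown(cycles: int,
                        uops_not_delivered: int,
                        uops_issued: int,
                        uops_retired_slots: int,
                        recovery_cycles: int,
                        width: int = 4) -> Dict[str, float]:
    """Split pipeline slots into four top-level categories (assumed method).

    slots = width * cycles; each category is a fraction of all slots,
    and the back-end share is whatever the other three leave over.
    """
    slots = width * cycles
    frontend = uops_not_delivered / slots
    retiring = uops_retired_slots / slots
    bad_spec = (uops_issued - uops_retired_slots + width * recovery_cycles) / slots
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


if __name__ == "__main__":
    # Hypothetical counter readings, for illustration only.
    result = top_level_breakdown(cycles=1000, uops_not_delivered=600,
                                 uops_issued=1300, uops_retired_slots=1200,
                                 recovery_cycles=25)
    for name, share in result.items():
        print(f"{name:16s} {share:.2%}")
```

In this kind of accounting the memory-bound share is a sub-category of the back-end share, so it needs further counters beyond the four used here.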
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
Do Spark based data analytics benefit from using larger scale-up servers?
Is GC detrimental to scalability of Spark applications?
How does performance scale with data volume?
Does GC time scale linearly with data volume?
How does CPU utilization scale with data volume?
Is file I/O detrimental to performance?
How does data size affect micro-architectural performance?
Key Findings: Spark workloads do not benefit significantly from executors with more than 12 cores. The performance of Spark workloads degrades with large volumes of data due to a substantial increase in garbage collection and file I/O time. Without any tuning, the Parallel Scavenge garbage collection scheme outperforms the Concurrent Mark Sweep and G1 garbage collectors for Spark workloads. Spark workloads exhibit improved instruction retirement due to lower L1 cache misses and better utilization of functional units inside cores at large volumes of data. Memory bandwidth utilization of Spark benchmarks decreases with large volumes of data and is 3x lower than the available off-chip bandwidth on our test machine.
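For readers who want to repeat the collector comparison, the sketch below maps the three garbage collectors named above to the HotSpot JVM flags that select them. The helper name and the choice to add `-verbose:gc` are assumptions made for illustration; the slides do not list the exact JVM options used, and the Concurrent Mark Sweep flag is only accepted by older JVM releases.

```python
# HotSpot flags that select each collector named in the findings.
GC_FLAGS = {
    "parallel_scavenge": "-XX:+UseParallelGC",
    "concurrent_mark_sweep": "-XX:+UseConcMarkSweepGC",
    "g1": "-XX:+UseG1GC",
}


def jvm_gc_options(collector: str, log_gc: bool = True) -> str:
    """Return a JVM option string for one collector (hypothetical helper).

    With log_gc=True, -verbose:gc is appended so GC activity is printed
    and GC time can be extracted from the run's output afterwards.
    """
    flags = [GC_FLAGS[collector]]
    if log_gc:
        flags.append("-verbose:gc")
    return " ".join(flags)


if __name__ == "__main__":
    for name in GC_FLAGS:
        print(f"{name:22s} {jvm_gc_options(name)}")
```

The resulting string can be passed to the JVM that runs the Spark tasks, for example through the Java options setting of the Spark launcher.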
What are the major bottlenecks? A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, Performance characterization of in-memory data analytics on a modern cloud server, in 5th International IEEE Conference on Big Data and Cloud Computing, 2015. A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, How Data Volume Affects Spark Based Data Analytics on a Scale-up Server, in 6th International Workshop on Big Data Benchmarking, Performance Optimization and Emerging Hardware (BPOE), held in conjunction with VLDB, 2015.
Future Directions: NUMA-aware task scheduling; cache-aware transformations; exploiting processing-in-memory architectures; HW/SW data prefetching; rethinking memory architectures.