Cache Replacement Policies for Embedded Mobile Systems: Performance and Power Consumption




First Author, Henan University of Technology, China, orangeding@163.com

Abstract

With fast advances in processor technology, the imbalance between the relative speeds of processors and main memory is growing rapidly. To mitigate it, an increasingly large portion of the chip transistor budget is dedicated to on-chip caches, leading to an ever-expanding memory hierarchy. As a key component of the cache system, the cache replacement policy has a great impact on performance and overall energy consumption. With the spread of embedded mobile systems, in this paper we examine how nine selected last-level cache replacement policies behave in terms of performance, energy consumption, and hardware cost. We find that for popular mobile applications, most of the policies behave better than the traditional LRU algorithm in both performance and energy cost. The performance speedup reaches up to 8%, and LLC cache misses can be reduced by 35%. This shows that it is promising to provide more replacement policy diversity for embedded mobile systems.

Keywords: Embedded System, Cache Replacement, Power Consumption

1. Introduction

With each access to main memory taking hundreds of processor cycles, more and more on-chip resources are taken up by the cache system, and its hierarchy has grown deeper and deeper to tolerate long-latency memory accesses. Many speculative approaches, such as prefetching and value prediction [1][8][10], have also been proposed to bridge this performance gap. However, prior studies have shown that only a small fraction of cache blocks actually hold live data that will be referenced again before being evicted from the cache (called live blocks [4]). Most cache blocks are wasted on dead data that is used only once. This inadequate use of the cache hierarchy leads to application performance loss and extra energy consumption. In the extreme case, a streaming application takes over all the cache space, leaving little room for locality-friendly data and producing a huge number of memory accesses. The traditional LRU policy inserts new data in the MRU position and evicts the block in the LRU position; therefore, it only fits applications whose working sets fit in the cache or that have good locality. With embedded mobile systems becoming popular, we should study how replacement policies work for them and whether any best candidate policy exists. To this end, we select nine replacement policies that improve on the traditional LRU algorithm, apply them to the last-level cache in a three-layer cache hierarchy, and study their performance and energy consumption behaviors.

The rest of the paper is organized as follows: in Section 2, we review the background of our study. In Section 3, we introduce our methodology. In Sections 4 and 5, we give the performance analysis and the energy consumption analysis, respectively. Finally, we conclude.
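
To make the baseline concrete, the following is a minimal sketch of the LRU behavior just described, for a single cache set; the class and function names are our own illustration, not part of any simulator used in this paper.

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <algorithm>

    // Minimal single-set sketch of traditional LRU (illustrative names).
    // front() is the MRU position, back() is the LRU position.
    class LruSet {
        std::list<std::uint64_t> stack_;
        const std::size_t ways_;
    public:
        explicit LruSet(std::size_t ways) : ways_(ways) {}

        // Returns true on a hit; on a miss, evicts the LRU block if the
        // set is full and inserts the new block at the MRU position.
        bool access(std::uint64_t tag) {
            auto it = std::find(stack_.begin(), stack_.end(), tag);
            if (it != stack_.end()) {                       // hit: promote to MRU
                stack_.splice(stack_.begin(), stack_, it);
                return true;
            }
            if (stack_.size() == ways_) stack_.pop_back();  // evict LRU victim
            stack_.push_front(tag);                         // insert at MRU
            return false;
        }
    };

Because every incoming block enters at MRU, a stream of one-time-used data steadily pushes locality-friendly blocks toward the LRU position and out of the cache, which is exactly the pathology described above.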
2. Background

In this section, we first describe existing cache replacement algorithms and then present the nine selected improved cache replacement algorithms. With them, we can study how different cache replacement policies impact the performance and energy consumption of the system.

Belady introduced an optimal offline replacement policy in the 1960s [2]. However, it is not practical, since it relies on knowledge of future address references. Therefore, several replacement policies are used in practical situations, such as the LRU policy. LRU can be as good as the optimal policy; yet its success depends heavily on the nature of the application, such as its locality and working set size. In some cases, its performance is far from what the optimal policy can provide. That is because LRU only inserts the incoming line into the MRU position of the stack and evicts the cache line in the LRU position. To overcome this drawback, researchers introduced three adaptive extensions of LRU to prevent cache thrashing for workloads with a large memory footprint. LIP (LRU Insertion Policy) [5] places all incoming lines in the LRU position and moves them to the MRU position only if they are referenced while still in the LRU position. BIP (Bimodal Insertion Policy) [3] improves on LIP by placing only some of the incoming lines directly in the MRU position; therefore, it can adapt to changes in the working set during execution. However, since neither LIP nor BIP beats LRU on all benchmarks, a dynamic policy called DIP (Dynamic Insertion Policy) [7] was proposed to choose dynamically between the traditional LRU policy and BIP, depending on the performance improvement each can provide.

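To illustrate how these extensions differ from LRU only in where an incoming line is placed, here is a hedged sketch building on the single-set model above; the 1/32 bimodal probability is our illustrative choice, and eviction still removes the block in the LRU position.

    #include <cstdint>
    #include <list>
    #include <random>

    // Illustrative insertion policies; the list is ordered MRU (front)
    // to LRU (back), and eviction still removes the back element.
    enum class InsertPolicy { LRU, LIP, BIP };

    void insert_line(std::list<std::uint64_t>& stack, std::uint64_t tag,
                     InsertPolicy p) {
        static std::mt19937 rng{42};
        // BIP promotes an incoming line to MRU only occasionally
        // (1/32 here is our illustrative choice), otherwise acting like LIP.
        static std::bernoulli_distribution bimodal(1.0 / 32.0);

        switch (p) {
        case InsertPolicy::LRU:
            stack.push_front(tag);     // always insert at the MRU position
            break;
        case InsertPolicy::LIP:
            stack.push_back(tag);      // always insert at the LRU position
            break;
        case InsertPolicy::BIP:
            if (bimodal(rng)) stack.push_front(tag);
            else              stack.push_back(tag);
            break;
        }
    }

DIP then chooses between LRU and BIP at run time, according to which of the two currently delivers fewer misses.
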
In our study, we select nine replacement policies from [11]. N1 is a novel adaptive cache replacement policy named SCORE, which uses a score system to select a cache line to replace. N2 is a high-performance cache replacement algorithm called Dueling Segmented LRU replacement with adaptive Bypassing (DSB); it randomly provides better protection for a newly allocated line and uses an aging algorithm to remove stale cache lines. N3 is named MadCache; it is a cache insertion policy that uses memory access history based on the Program Counter (PC) to determine the appropriate policy for the L3 cache: either LRU for locality-friendly applications, or bypass for streaming memory accesses. N4 is the ASRP policy, in which each set in the Last-Level Cache (LLC) is divided into multiple subsets, with one subset active and the others inactive at any given time; the victim block for a miss is chosen only from the active subset using the LRU policy. N5 estimates the reuse possibility of data on the basis of its reuse history and evicts the line with the least reuse possibility. N6 is a combination of cache replacement and bypass policies driven by dead block prediction. N7 uses Decision Tree Analysis (DTA) based insertion policy selection to find the best position for a newly loaded data block. N8 is the 3P policy, which uses bimodal insertion to improve memory-level parallelism and reduce the impact of long-latency memory accesses. N9 is a protected LRU algorithm whose Competitor provides access to the class structure holding the metadata and the functions responsible for selecting the evicted line.

3. Methodology

In this section, we describe our experimental methodology, including the simulation platform and the benchmarks. We use a simulation framework based on CMP$im [6], a binary-instrumentation-based cache simulator from Intel. It can be used to study the cache performance of many kinds of workloads, including single-threaded, multi-threaded, and multi-programmed applications. To model a modern computing system, we use the parameters listed in Table 1: an 8-stage, 4-wide pipelined processor coupled with a four-layer memory subsystem and a perfect branch predictor.

Table 1. Simulation Parameters
Processor: 8-stage, 4-wide pipeline
Instruction window size: 128 entries
Branch predictor: perfect
L1 instruction cache: 32KB, 64B line size, 4-way, LRU, 1-cycle hit
L1 data cache: 32KB, 64B line size, 8-way, LRU, 1-cycle hit
L2 cache: 256KB, 64B line size, 8-way, LRU, 10-cycle hit
L3 cache: 4MB, 64B line size, 16-way, 30-cycle hit
Main memory: 200-cycle latency
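
For readers who want to reproduce the setup, the parameters of Table 1 map directly onto a configuration structure such as the following minimal sketch; the struct and field names are ours and do not reflect the CMP$im API.

    #include <cstddef>

    // Illustrative configuration mirroring Table 1 (field names are ours,
    // not the CMP$im API). sets = size / (line size x associativity),
    // e.g. the L3 has 4MB / (64B x 16) = 4096 sets.
    struct CacheLevel {
        std::size_t size_bytes;
        std::size_t line_bytes;
        unsigned    ways;
        unsigned    hit_cycles;
    };

    struct SimConfig {
        CacheLevel l1i {32 * 1024, 64, 4, 1};
        CacheLevel l1d {32 * 1024, 64, 8, 1};
        CacheLevel l2  {256 * 1024, 64, 8, 10};
        CacheLevel l3  {4 * 1024 * 1024, 64, 16, 30};
        unsigned memory_cycles  = 200;  // main memory latency
        unsigned window_entries = 128;  // instruction window size
        unsigned pipeline_width = 4;    // 8-stage, 4-wide pipeline
    };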
A wide variety of workloads, including entertainment, image processing, and data processing, run on modern embedded mobile systems. To represent these embedded mobile workloads, we select the following set of benchmark programs. SAX and DOM are XML data processing benchmarks taken from Xerces-C++ [16]. SAX implements the push XML parsing model, in which the parser sends (pushes) XML data to the client as it encounters elements in an XML infoset. DOM implements the tree-based XML parsing model, in which the parser reads the entire content of an XML document into memory and creates in-memory objects to represent it. In our experiments, we use both the SAX and DOM models to parse four XML files of varying size and complexity, and then take the average of their performance results.

JPEG2000 Encode and JPEG2000 Decode are taken from MediaBench II, which represents multimedia and entertainment workloads [17]. They implement an encoder and a decoder, respectively, based on the ISO JPEG-2000 standard for wavelet-based image compression. Fluidanimate and Freqmine are taken from the PARSEC benchmark suite [18]. Fluidanimate simulates an incompressible fluid for interactive animation purposes and is commonly used in gaming applications, whereas Freqmine employs an array-based version of the FP-growth method for frequent itemset mining, which is often used to mine multimedia content. In this paper, we also use analytical modeling techniques [23-26] to model the cache energy consumption behavior; the details appear in Section 5.

4. Performance analysis

Using CMP$im, we measure how the different cache replacement algorithms impact system performance and present the results in Figure 1 and Figure 2. Figure 1 compares Last-Level Cache (LLC) misses, while Figure 2 shows how the performance speedup varies. In both figures, we use the performance under LRU replacement as the baseline, so a positive number in Figure 1 means an LLC miss reduction and a positive number in Figure 2 means a performance improvement, and vice versa.

As Figure 1 shows, for benchmarks with streaming character, such as JPEG2000 Encode, JPEG2000 Decode, and SAX, almost all the selected cache replacement algorithms greatly alleviate the performance loss from long-latency memory accesses. For JPEG2000 Encode, there is a 25% LLC miss reduction on average, and in the best case up to 35% of memory accesses are removed. That is because a workload with streaming nature usually has a large working set and very low locality. Under the LRU algorithm, most cache blocks get stuffed with one-time-used data, leaving little room for locality-friendly data and producing more long-latency memory accesses. All the selected algorithms avoid this situation by limiting the space that streaming data occupies and leaving more room for locality-friendly data. As a result, the number of unnecessary long-latency accesses is greatly reduced. Among the selected algorithms, N3, N6, and N1 provide the largest LLC miss reductions. In particular, N3 gives the top result on five of the six benchmarks; on JPEG2000 Encode it achieves a 35% LLC miss reduction. That is because N3 is a PC-aware cache insertion policy: it uses a PC-based history table to record information about cache accesses and to decide whether to apply the LRU policy, for accesses that exhibit good locality, or to bypass the cache, for streaming memory accesses. We also observe a miss increase on the benchmark freqmine, which benefits least from all the non-LRU algorithms.

As Figure 2 shows, most of the variations follow the same trend as in Figure 1. For JPEG2000 Encode and JPEG2000 Decode, almost all the selected algorithms provide performance gains in terms of CPI reduction. For JPEG2000 Encode, there is an average 6% performance improvement, and up to 8% in the best case. Recalling Figure 1, we can deduce that this is the result of the up-to-35% LLC miss reductions.
Because the selected replacement algorithms preserve the space that streaming data would otherwise occupy and leave more room for locality-friendly data, a considerable number of unnecessary long-latency accesses are eliminated, which leads to the up-to-8% CPI reduction. We can also conclude that benchmarks like JPEG2000 Encode are cache-replacement-sensitive, so careful selection of the replacement policy determines how they perform. On the other hand, recalling Figure 1, the LLC miss increase on freqmine produces some performance degradation; with algorithm N1 there is nearly a 1% performance decrease. On average, however, its performance variation is not significant. We can say that freqmine is a cache-replacement-insensitive benchmark, whose performance stays steady regardless of the replacement policy applied.
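
To state precisely how we read the two figures, the sketch below gives one natural way to compute the plotted quantities relative to the LRU baseline; the function names and sample numbers are our illustration, chosen to echo the JPEG2000 Encode results.

    #include <cstdio>

    // One natural reading of the plotted quantities (names are ours).
    // Positive values mean the candidate policy beats the LRU baseline.
    double llc_miss_reduction(double misses_lru, double misses_policy) {
        return (misses_lru - misses_policy) / misses_lru;   // Figure 1
    }

    double speedup(double cpi_lru, double cpi_policy) {
        return cpi_lru / cpi_policy - 1.0;                  // Figure 2
    }

    int main() {
        // Hypothetical numbers echoing JPEG2000 Encode: ~35% fewer
        // LLC misses alongside roughly an 8% speedup.
        std::printf("miss reduction: %.2f\n", llc_miss_reduction(100.0, 65.0));
        std::printf("speedup:        %.2f\n", speedup(1.30, 1.204));
        return 0;
    }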

Figure 1. Miss Reduction of LLC

Figure 2. Performance Speedup

5. Analysis of energy consumption

In this study, we aim to find out how different cache replacement policies applied to the LLC impact system performance and energy consumption. In this section, we turn to energy consumption. The energy consumption we discuss here is that of the memory subsystem: as more and more on-chip transistors are devoted to the memory subsystem, its energy dissipation has become a major part of the whole system's. Since only the last-level policy changes, we make some simplifying assumptions. As in our previous work [9], the overall energy consumption consists of two parts: the dynamic energy (E_dynamic) and the static energy (E_static). As Equation 1 shows, the dynamic part is the number of times the memory subsystem is accessed (n_m) multiplied by the dynamic energy consumed per access (E'_m); the static part is the product of the static power (P_static) and the overall execution time (t):

    E = E_static + E_dynamic = P_static * t + n_m * E'_m        (1)
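
To make Equation 1 concrete, the following minimal sketch evaluates the two terms under the stated assumptions; all numeric values are placeholders rather than measured data.

    #include <cstdio>

    // Minimal sketch of Equation 1; all values are placeholders.
    struct EnergyModel {
        double p_static_watts;    // P_static of the memory subsystem
        double e_access_joules;   // E'_m, dynamic energy per memory access
    };

    // E = E_static + E_dynamic = P_static * t + n_m * E'_m
    double total_energy(const EnergyModel& m, double t_seconds, double n_m) {
        return m.p_static_watts * t_seconds + n_m * m.e_access_joules;
    }

    int main() {
        EnergyModel m{0.5, 20e-9};  // hypothetical: 0.5 W static, 20 nJ/access
        // A policy that cuts both memory accesses and execution time
        // shrinks both terms, the two-fold benefit discussed in the text.
        std::printf("baseline LRU:  %.3f J\n", total_energy(m, 1.00, 1.0e7));
        std::printf("better policy: %.3f J\n", total_energy(m, 0.93, 6.5e6));
        return 0;
    }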

To analyze the energy consumption under different policies in more detail, note that the parts invariant across policies are the number of accesses to the L1, L2, and L3 caches and their static power consumption; the variable parts are the number of accesses to main memory and the total execution time. Therefore, when we discuss how energy consumption varies with the policy, we only need to focus on the variable components. The accesses to main memory are exactly the last-level cache misses, i.e., the L3 cache misses. Recalling Figure 1, for most benchmarks the nine selected replacement policies reduce the number of memory accesses, leading to a reduction in dynamic energy consumption. Recalling Figure 2, most benchmarks are sped up by the selected policies, and the reduced execution time results in a reduction in static energy. We can therefore conclude that applying a different replacement policy brings a two-fold benefit to the system. On the one hand, a proper policy keeps useful data resident longer in the cache, reducing the number of memory accesses and the dynamic energy consumption. On the other hand, the overall execution time is shortened by eliminating long-latency memory accesses, reducing the static energy consumption.

Table 2. Hardware cost of selected policies
Cache replacement policy | Hardware cost (K bits)
N1  | 69
N2  | 125.4
N3  | 159
N4  | 45
N5  | 125.3848
N6  | 129
N7  | 2.020508
N8  | 20.04395
N9  | 37
LRU | 64.28571

The only overhead we still have to consider is that introduced by the implementation of the replacement policy itself, including the hardware cost and the energy that this hardware consumes. In Table 2, we list the hardware cost of each policy under our experimental configuration. All policies require fewer than 160K bits, which is quite reasonable considering that the last-level cache is 4MB with 16-way associativity and a 64B line size. Compared to the LRU policy, with about 64K bits of overhead, half of the policies require less hardware, so their energy consumption is lower than that of an LRU implementation. For the remaining policies that need more hardware, the hardware and energy costs are still tolerable.

6. Conclusion

In this paper, we study nine selected cache replacement policies, including their performance, energy consumption, and hardware cost. We find that in embedded mobile systems, the traditional LRU algorithm is not the best candidate for last-level cache replacement. For most mobile applications, the nine selected policies provide speedups as well as reductions in overall energy consumption, with tolerable hardware cost. This means we can make more replacement policies available for embedded mobile systems, dynamically serving different applications. In future work, since we have already done research on the energy efficiency of mobile computing [9][22], garbage collection in the Java virtual machine [12][13][15][21], and XML parsing acceleration [14][19], we hope to provide a complete, powerful, and energy-efficient mechanism for mobile systems, equipped with well-built cache replacement, speculative memory assistance, and hardware-assisted middleware including garbage collection and XML parsing.

7. References

[1] D.G. Perez, G. Mouchard, and O. Temam. Microlib: A case for the quantitative comparison of micro-architecture mechanisms. In Proceedings of the International Symposium on Microarchitecture (MICRO), 2007.
[2] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J., 5(2):78-101, 1966.
[3] M. Qureshi, A. Jaleel, Y.N. Patt, S.C. Steely Jr., and J. Emer. Adaptive insertion policies for high performance caching. In Proc. of the 34th Int. Symp. on Computer Architecture, 2007.

[4] S. Kaxiras, Z. Hu, and M. Martonosi. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. In Proc. of the International Symposium on Computer Architecture, 2001.
[5] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 381-391, New York, NY, USA, 2007. ACM.
[6] A. Jaleel, R. S. Cohn, C. K. Luk, and B. Jacob. CMP$im: A Pin-Based On-The-Fly Multi-Core Cache Simulator. In MoBS, 2008.
[7] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer. Adaptive insertion policies for high-performance caching. In Proc. of the International Symposium on Computer Architecture, 2007.
[8] Shaoshan Liu and Jean-Luc Gaudiot. Value Prediction in Modern Many-Core Systems. In Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS 2009), TCPP Ph.D. Forum, Rome, Italy, May 25-29, 2009.
[9] Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot. Prefetching in mobile embedded systems can be energy efficient. IEEE Computer Architecture Letters, Volume 10, Issue 1 (2011), pages 8-11.
[10] Shaoshan Liu, Christine Eisenbeis, and Jean-Luc Gaudiot. A Theoretical Framework for Value Prediction in Parallel Systems. In Proceedings of the 39th International Conference on Parallel Processing (ICPP 2010), San Diego, California, September 13-16, 2010.
[11] JWAC-1: Cache Replacement Championship Program, June 20, 2010. http://www.jilp.org/jwac-1/
[12] Jie Tang, Shaoshan Liu, Zhimin Gu, Xiao-Feng Li, and Jean-Luc Gaudiot. Achieving middleware execution efficiency: Hardware-assisted Garbage Collection Operations. Journal of Supercomputing, Volume 59, Issue 3 (2012), pages 1101-1119.
[13] Jie Tang, Shaoshan Liu, Zhimin Gu, Xiao-Feng Li, and Jean-Luc Gaudiot. Hardware-Assisted Middleware: Acceleration of Garbage Collection Operations. In Proceedings of the 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP 2010), Rennes, France, 2010, pages 281-284.
[14] Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot. Memory-Side Acceleration for XML Parsing. In Proceedings of the 8th IFIP International Conference on Network and Parallel Computing (NPC 2011), Changsha, China, 2011, pages 277-292.
[15] Shaoshan Liu, Jie Tang, Ligang Wang, Xiao-Feng Li, and Jean-Luc Gaudiot. Packer: Parallel Garbage Collection Based on Virtual Spaces. IEEE Transactions on Computers.
[16] Xerces-C++ XML Parser: http://xerces.apache.org/xerces-c/
[17] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the International Symposium on Microarchitecture (MICRO), 1997.
[18] C. Bienia, S. Kumar, J.P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. Princeton University Technical Report TR-811-08, 2008.
[19] Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot. Acceleration of XML Parsing Through Prefetching. IEEE Transactions on Computers.
[20] Jie Tang, Pollawat Thanarungroj, Chen Liu, Shaoshan Liu, Zhimin Gu, and Jean-Luc Gaudiot. Pinned OS/Services: A Case Study of XML Parsing on Intel SCC. Journal of Computer Science and Technology.
[21] Shaoshan Liu, Jie Tang, Chengrui Deng, Xiao-Feng Li, and Jean-Luc Gaudiot. RHE: A JVM Courseware. IEEE Transactions on Education,
Volume 54, Issue 1 (2011), pages 141-148.
[22] Huang, Jie Tang, Zhimin Gu, Min Cai, Jianxun Zhang, and Ninghan Zheng. The Performance Optimization of Threaded Prefetching for Linked Data Structures. International Journal of Parallel Programming, Vol. 40, No. 2, pages 141-163.
[23] Zhefu Shi, Cory Beard, and Ken Mitchell. Analytical Models for Understanding Misbehavior and MAC Friendliness in CSMA Networks. Performance Evaluation (2009), Volume 66, Issue 9-10, pages 469-487.
[24] Zhefu Shi, Cory Beard, and Ken Mitchell. Misbehavior and MAC Friendliness in CSMA Networks. In IEEE Wireless Communications and Networking Conference (WCNC 2007), March 2007, pages 355-360.

[25] Zhefu Shi, Cory Beard, and Ken Mitchell. Tunable Traffic Control for Multihop CSMA Networks. In IEEE Military Communications Conference (MILCOM 2008), November 2008, pages 1-7.
[26] Zhefu Shi, Cory Beard, and Ken Mitchell. Competition, Cooperation, and Optimization in Multi-Hop CSMA Networks. In PE-WASUN '11: Proceedings of the 8th ACM Symposium on Performance Evaluation of Wireless Ad Hoc, Sensor, and Ubiquitous Networks.