Institutionen för datavetenskap
Department of Computer and Information Science

Final thesis

Measuring the effect of memory bandwidth contention in applications on multi-core processors

by

Emil Lindberg

LIU-IDA/LITH-EX-A--15/002--SE

Linköpings universitet, SE Linköping, Sweden


Final Thesis

Measuring the effect of memory bandwidth contention in applications on multi-core processors

by

Emil Lindberg

LIU-IDA/LITH-EX-A--15/002--SE

Supervisor: Erik Hansson
Examiner: Christoph Kessler


Abstract

In this thesis we design and implement a benchmarking tool for measuring applications' sensitivity to main memory bandwidth contention, in a multi-core environment, on an ARM Cortex-A15 CPU. The tool is supposed to minimize its usage of shared resources other than main memory bandwidth, allowing it to isolate the effects of bandwidth contention only. The difficulty lies in using a correct memory access pattern for this purpose, i.e. choosing which memory addresses to access, in which order and at what rate, so as to minimize cache usage while generating a high and controllable main memory bandwidth usage. We manage to implement a tool with low cache memory usage that is still able to saturate the main memory bandwidth. The tool uses a proportional-integral controller to control the amount of bandwidth it uses. We then use the tool to investigate the memory behaviour of the platform and of some applications while the tool is using a variable amount of bandwidth. However, we have some difficulties in analyzing the results due to the lack of support for hardware performance counters in the operating system we are using, and we are forced to rely on hardware timers for our data gathering. Another difficulty is the platform's limited L2 cache bandwidth, which means that the tool has a heavy impact on L2 cache read latency. Despite this, we are able to draw some conclusions on the bandwidth usage of other applications in optimal cases with the help of the tool.


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Description
  1.3 Results
  1.4 Structure

2 Background
  2.1 Memory Hierarchy
  2.2 SDRAM
  2.3 Bandwidth Usage Properties
  2.4 Paged Virtual Memory
  2.5 Experimental Platform

3 Our Bandit
  3.1 Design
    3.1.1 Main objectives
    3.1.2 Floating point data
    3.1.3 Location of Memory Accesses
    3.1.4 Controlling Bandwidth Usage
    3.1.5 Measuring Target Application Bandwidth Usage
    3.1.6 Multithreading the Bandit
    3.1.7 Portability
    3.1.8 Usability
  3.2 Implementation
    3.2.1 Location of Memory Accesses
    3.2.2 Bandwidth delay implementation
    3.2.3 PI-Controller implementation
    3.2.4 Measuring Target Application Bandwidth Usage
    3.2.5 Multithreading the Bandit
    3.2.6 Portability
    3.2.7 User interface
    3.2.8 Low Overhead Bandit

4 Evaluation
  4.1 Method
  4.2 Platform Performance
    4.2.1 Memory Access Latency
    4.2.2 Off-Chip Memory Bandwidth
  4.3 Bandit Performance
    4.3.1 Cache Miss Generation
    4.3.2 PI-Controller Performance
    4.3.3 Measuring Memory Latency
    4.3.4 Measuring Bandwidth
  4.4 Bandit's Effect on Programs
    4.4.1 Overview
    4.4.2 Micro-Benchmarks
    4.4.3 Telecom Application
    4.4.4 MiBench
  4.5 Summary
    4.5.1 Platform evaluation
    4.5.2 Bandit evaluation
    4.5.3 Closing remarks

5 Related Work
  5.1 Prior Work
  5.2 Main Memory Bandwidth Contention Measurement
  5.3 Main Memory Bandwidth Contention Mitigation

6 Conclusions
  6.1 Limitations
  6.2 Future work

Appendices
  A Cache hit simulator
  B Physical address lookup
  C Memory Access Iteration Code
  D PI-Controller Code
  E The Reference Program
  F Synthetic benchmarks

Chapter 1

Introduction

1.1 Motivation

Chip multiprocessors, CMPs, have rapidly become the standard in laptop and desktop PCs. This is due to multiple reasons, one being the unfeasible energy and cooling requirements of running a processor at high frequencies, as energy consumption grows approximately with the cube of the frequency [3]. For the same reason, a decrease in frequency leads to large energy savings, allowing processor manufacturers to keep up with Moore's law by constructing CMPs with several cores that each run at a lower frequency, resulting in a higher total theoretical computational power. While the potential gains of utilizing CMPs are high, they present several new challenges and considerations, which prevent or delay them from being adopted and utilized in every computer. Applications using multiple cores are more difficult to develop and require more from programmers, but there is another major issue introduced by CMPs, namely shared resource contention. This means that different cores contend for limited resources, such as caches, main memory bandwidth and network resources. An application running on one core can therefore affect the execution time of an application running on a different core by contending for the same shared resources. Due to this contention, running two seemingly independent tasks on different cores in a real-time system (a system where processes have deadlines) can be a problem [11]. To mitigate resource contention, we need to know the applications' characteristics with regard to shared resources. With that information, we can decide if we can run them together or if we need to redesign either our platform or our applications.

1.2 Problem Description

In this thesis we are going to look specifically at the contention for main memory bandwidth, using methods previously described by Eklöv et al. [7].

They generated traffic on the main memory bus with an application they called the Bandwidth Bandit and then measured how various applications' performance was affected. The Bandit is a program that generates a variable amount of load on the main memory bus. The load it generates should be as realistic as possible when compared to real applications and it should have minimal effect on the other shared resources, which in this case mostly means the shared caches. We have some additional demands on the Bandit compared to our predecessors. First, we want to investigate if it is possible to gather memory bandwidth usage data on the application affected by the Bandit, directly with the Bandit. The previous Bandit relied on external monitoring with hardware performance counters to gather data, but we want to rely as little as possible on other software and specific hardware. Secondly, we want the Bandit to be a usable tool in regression testing to verify an application's characteristics regarding memory bandwidth usage. Finally, our target platform is an embedded system with an ARM Cortex-A15 CPU, while the previous Bandit used an Intel system. As an added requirement, we want to look into the portability aspects of the Bandit and try to isolate its hardware dependent features. This would allow for easier porting to different architectures and, with less effort, keep the Bandit relevant in an environment of ever-changing hardware. The platform we are using is intended for running telecom applications, which we need to take into consideration when evaluating the Bandit.

1.3 Results

We managed to create an application that is able to control the amount of bandwidth it uses, while also keeping the amount of used cache memory low. This Bandit can then be used to test how other applications function under different contention environments. As the execution time can be heavily affected by this contention, as we have shown, it can be a useful tool to verify the robustness of various real-time applications. Porting the Bandit to different architectures is doable; however, our reliance on a fast hardware timing function is a weakness. It can, though, be replaced by other mechanisms for controlling high precision delays. The attempt to make the Bandit able to measure other applications' bandwidth usage did not go entirely as we had wanted. We were able to measure synthetic applications with highly stable bandwidth usage, but our attempt to measure other applications did not work very well. With more time, it could have been possible to implement it. Our target architecture's low bandwidth beyond the L2 cache, and the fact that this path is a bottleneck, made it difficult to separate pure main memory accesses from L2 cache accesses, as the latter suffered heavily from the contention. This was also very obvious when we attempted to implement

a bandwidth measuring function in our bandit, as the measurements were affected by L2 cache accesses.

1.4 Structure

In Chapter 2 we present concepts needed for understanding our work. These concepts are mainly focused on the different types of memory in a CPU and their function. We also present our experimental platform here. In Chapter 3 we present the design and implementation of our bandit and show details of how it works. In Chapter 4 we evaluate the performance of our platform and the function of the Bandit, and test how it affects other applications. In Chapter 5 we present work related to our thesis, mostly articles that focus on memory bandwidth contention in different ways. We conclude the thesis in Chapter 6 with some outlook on the future of the Bandit.

Chapter 2

Background

2.1 Memory Hierarchy

Modern processors utilize a hierarchy of memories. First, there are several layers of volatile storage, from the registers in the processor to a couple of cache memories, usually two or three (denoted L1, L2, etc.), and finally the main memory. Behind all those is the permanent but slow storage. The speed of CPUs has been increasing at a much higher rate than that of main memory for a while now [15] and still is [14]. At first the two operated at the same speed, but they are now separated by orders of magnitude. This is the reason for the memory hierarchy. Data often has the following two properties: temporal locality (data recently used is likely to be used again) and spatial locality (data close to other recently used data is likely to be used). By having faster caches, processors get faster access to the data that is likely to be used, and even though the caches are small, they are large enough to gain the benefits of spatial and temporal locality.

Figure 2.1: Organization and mapping of cache sets in main memory to cache memory.

When data is fetched from memory for the first time, it is read from the main memory and then stored in the different layers of cache. More data than requested, a cache line, is often fetched to take advantage of the spatial locality. In some cases, when the access pattern can be predicted, even more data can be read into the cache, which is called prefetching. When any of this data is read again and it still resides in a cache, a cache hit has occurred. The memory system searches for the requested data in order from the fastest cache to the slowest and returns the first hit. When a new cache line is installed, it usually cannot be stored at any place in the cache, i.e. the cache is usually not fully associative. Instead, cache memories can be n-way, where n denotes in how many places a block of data can be stored in the cache. These n places that can store the same subset of memory blocks are called a cache set. An example of how cache sets map from main memory to cache memory is found in Figure 2.1. If there is no free spot to install the data in, some other cache line in the same cache set must be evicted. The selection of that cache line can be done with different algorithms such as least recently used (LRU) or random selection [12]. In CMPs, caches can be private to a processor or shared between some or all of the processors, as shown in Figure 2.2. Shared caches introduce contention for space in the cache, a phenomenon Eklöv et al. also examined [6]. They also noted that decreased performance of the caches led to an increase in main memory bandwidth usage. This happens because every cache hit removes the need for a read from the main memory. The memory hierarchy is, in other words, coupled: if one part of it is affected, the effect ripples upward through the hierarchy.
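To make the set mapping concrete, the following small sketch (our own illustration, not taken from the thesis) computes the line offset and set index of an address for a cache with 64-byte lines and 2048 sets; both values are example parameters only.

    #include <stdint.h>

    /* Example parameters: 64 B cache lines and 2048 cache sets. */
    #define LINE_SIZE 64u
    #define NUM_SETS  2048u

    /* Offset of the address within its cache line (bits [5:0] here). */
    static inline uint32_t line_offset(uintptr_t addr)
    {
        return (uint32_t)(addr % LINE_SIZE);
    }

    /* Index of the cache set the address maps to (bits [16:6] here). */
    static inline uint32_t cache_set_index(uintptr_t addr)
    {
        return (uint32_t)((addr / LINE_SIZE) % NUM_SETS);
    }

Two addresses that differ only in bits above the set index map to the same cache set and therefore compete for the same n ways.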

Figure 2.2: Organization of L1 and L2 caches. The back side is internal to the cores and the front side is external to the CPU.

2.2 SDRAM

Synchronous dynamic random access memory, SDRAM, is the most common type of main memory in computers today [12]. Access to the SDRAM is controlled by a memory controller which translates a memory address in the operating system into an address in the SDRAM. An address has to be translated into several different signals. With more than one SDRAM module, a memory channel has to be selected, and every SDRAM is partitioned into banks, rows and columns. All these parts together translate into a unique memory address. A bank contains different rows and columns. The different banks allow for some parallelism since they can prepare reads independently. When data has to be read, the selected bank must first load the selected row, also known as a page, into a buffer unique to each bank. If this page is ready when the request is made (page hit), the request will be serviced significantly faster than if the wrong page is loaded (page miss) or if no page is loaded (page empty). Due to the parallel nature of the channels and banks, the SDRAM has great potential for parallelism [13].

2.3 Bandwidth Usage Properties

The various levels in the memory hierarchy usually allow some form of parallel operation, or at least queues, in order to increase the utilization of the buses in question. An ARM Cortex-A15, for example, can have 16 outstanding loads and stores at any given time, while its cache is able to keep delivering data that is present in the cache while waiting for data that earlier

caused a cache miss. The memory controller for the main memory can direct requests to the different banks in an SDRAM, and if multiple channels are available, it can direct the requests over both channels in parallel. The parallel operation in the hierarchy is key to understanding how an application behaves under different contention for the memory bandwidth resources, as Eklöv et al. demonstrated [7]. They classified applications as either bandwidth sensitive or latency sensitive. When the load on the memory system increases, the latency will gradually increase, even if there is available bandwidth left. This is due to contention for specific parts of the memory system, which results in requests being placed in different queues. A bandwidth sensitive application is good at utilizing prefetching and is able to perform calculations while new data is being fetched. Latency sensitive applications are the opposite and have to stall while waiting for new data.

2.4 Paged Virtual Memory

Modern operating systems usually employ a memory management technique called paged virtual memory [12]. This works by giving each process its own memory space, a virtual memory space, independent of the actual physical memory space, in which only that process operates. A process then sees the available memory as one contiguous block segmented into pages and can usually address 4 GB of memory or more, independent of the actual available memory of the system. The size of these pages is usually 4 KB, but can differ depending on support from hardware and the operating system. When memory is allocated by a process, a binding is created between virtual pages and physical pages, i.e. pages in the physical memory. The operating system keeps track of all these mappings in a page table on a per process basis. The page tables reside in the main memory. One of the great benefits of paged virtual memory is the ability to scatter the memory space of a process arbitrarily in the physical memory, thus reducing the impact of memory fragmentation. If a process allocates a chunk of memory the size of 1,000 pages, the addresses of the chunk will be contiguous in the virtual memory space, but 1,000 contiguous physical pages are not necessary for the allocation. The 1,000 pages may be scattered throughout the physical memory. When a process accesses a virtual memory address, a lookup is done in the page table using the higher bits of the virtual memory address. The lower bits, the 12 least significant bits in the case of a 4 KB (2^12 bytes) page size, represent the offset within the page. If the mapping exists, the virtual address is translated into the corresponding physical address by taking the physical page address and using the same offset within it. Modern processors provide hardware support for this type of virtual memory with a memory management unit, MMU. It performs the translation from virtual memory addresses to physical memory addresses. To speed

Figure 2.3: Mapping from virtual addresses to physical addresses.

up the process a translation lookaside buffer, TLB, is used, which is a kind of cache memory that stores recently accessed pages and their corresponding mappings to physical memory. These can operate in a hierarchy in the same way as the normal caches. When a process performs a memory access and hardware support exists, a lookup is first done in the TLB. If the address exists there, the physical address is returned and the memory access is performed. Otherwise a so-called page walk is performed, which is the operation of accessing the page table in main memory, retrieving the requested mapping and storing it in the TLB. This is usually performed by dedicated hardware. If no such mapping exists, an exception is raised.

2.5 Experimental Platform

The platform our bandit is being tested on runs two clusters of ARM Cortex-A15 processors with four cores in each, all within a single die. Each core has a private L1 cache memory, each cluster shares an L2 cache memory and the entire system shares an L3 cache memory. The clusters are connected via the L2-cache system to a CoreLink CCN-504 Cache Coherent Network [1] which connects the clusters to each other and to the rest of the system. The main memory is connected to a CoreLink DMC-520 Dynamic Memory Controller [2] supporting two DDR3 modules. The documented bandwidth of the memory controller is around 15 GB/s. The L1 cache is 32 KB in size, uses 64 B cache lines and is 2-way associative. Its cache replacement policy is least recently used and it can have up to 6 different outstanding memory requests at any time.

Figure 2.4: Architecture of the experimental platform.

The L2 cache is 2 MB in size, uses 64 B cache lines and is 16-way associative. Its cache replacement policy is random selection and it can have 16 outstanding writes and 11 outstanding reads. The L3 cache has the same properties as the L2 cache except that it is 8 MB in size. The operating system is a custom Linux distribution patched with real-time patches. The Bandit is cross-compiled with GCC with -O2 optimizations enabled.

Chapter 3

Our Bandit

3.1 Design

3.1.1 Main objectives

In short, the objective of our implementation is to use as few shared resources as possible except for main memory bandwidth, to have a sufficiently realistic bandwidth usage pattern, to be easy to use and, finally, to be easy to port to different architectures. The most relevant shared resource to minimize in our case is the shared L2 cache memory. We do not focus on L3 cache usage, both to simplify the implementation and evaluation and because it is not relevant for the final use case of the platform. The Bandit is completely implemented in C, as it is close enough to the hardware for our needs.

3.1.2 Floating point data

Floating point operations should be avoided if possible in time critical sections, since they are significantly slower than normal integer operations. The data of ours that would benefit from being represented as floating point is time. In order to use integers instead, we have to represent time in a small enough unit not to lose any significant precision. We therefore chose to represent time as nanoseconds. Since we have a 32-bit platform, it is faster for us to store the time in a 32-bit integer. This, however, puts a limit on how long we can time things. Using an unsigned integer we get a maximum value of 4,294,967,295 ns, or roughly 4.3 seconds. Due to the high speed nature of computers, 4.3 seconds should be enough for our needs.
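As an illustration of this representation (a sketch of our own, not the thesis code; the name elapsed_ns32 is hypothetical), an elapsed time measured with clock_gettime() can be stored as a 32-bit nanosecond count like this:

    #include <stdint.h>
    #include <time.h>

    /* Elapsed time between two timespecs, truncated to 32-bit nanoseconds.
     * Wraps after 2^32 ns, i.e. roughly 4.3 seconds, which is acceptable
     * for timing short bursts of memory accesses. */
    static inline uint32_t elapsed_ns32(struct timespec start, struct timespec end)
    {
        int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
                   + (int64_t)(end.tv_nsec - start.tv_nsec);
        return (uint32_t)ns;
    }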

3.1.3 Location of Memory Accesses

In order to achieve the first two objectives, it is necessary to access specific memory addresses in a specific order. To generate a memory access, we need to cause a cache miss. A cache miss can be guaranteed in two ways: either by trying to access memory of a greater size than the target cache, or, if the number of ways in a set is limited, by accessing more cache lines that belong to the same cache set than there are ways. By doing the latter, we do not need to use up the entire cache memory. The usage pattern we aim for is one that utilizes the parallelism in the main memory as uniformly as possible, i.e. utilizes the channels and banks equally, and, if possible, controls the number of page hits and misses in the SDRAM. To access memory in specific cache sets we need to know how the mapping to cache sets is done. The least significant bits of the address denote the offset in the cache line, in our case the 6 least significant bits. Some of the following bits then denote the cache set. To calculate how many sets there are, we use the following equation:

    cache sets = cache size / (ways × line size)        (3.1)

The L2 cache in our case is 2 MB in size, has 16 ways and its line size is 64 bytes. Equation 3.1 gives us 2048 different cache sets, which need 11 bits to address them all. This means that the 11 bits after the cache line offset encode the cache set. Since they are offset by 6 bits for the addressing within the cache line, the cache sets will repeat every 2^17 bytes, or 128 KB, of memory. If we align memory allocations to this 128 KB boundary we can confine ourselves to a few cache sets and therefore generate cache misses without thrashing the entire cache. An easy way to step through this memory is to create a circularly linked list for each cache set that we use, with each element at a constant offset from the 128 KB boundary. Each element contains only a pointer to the next element. By stepping through these lists we generate the memory traffic. Our platform utilizes, as described in Section 2.5, a random eviction strategy for the cache. A perfect LRU strategy would allow us to use one more element than there are ways for each cache set, i.e. 17 elements in our L2 example. However, the random eviction strategy forces us to use significantly more elements than that. Using a simple simulator we created, described in Appendix A, we can determine that with twice as many elements as there are ways we get a hit rate of about 20%, and with four times as many elements the hit rate drops to about 2%, which should be enough for our needs. Now we know what we need to do in order to generate reads beyond the L2 cache, but we also need to read beyond the L3 cache. The requirement not to thrash the L3 cache is not as strict, though. Since the L3 cache has the same properties as the L2 cache, except that it is four times the

size, it should suffice to allocate four times as many elements as we anticipated earlier when calculating for the L2 cache. Taking all this into account, the circularly linked lists belonging to the cache sets should have at least 4 · 4 · 16, or 256, elements. To generate uniform accesses to the main memory we need to know the mapping from memory addresses to the different parts of the SDRAM. This translation is done by the memory controller; however, we cannot find this specific mapping in the technical reference manual for the memory controller [2]. A workaround for this problem that improves our chances of uniform accesses is simply to allocate more memory than we need for our cache miss purposes.
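To make the addressing concrete, the following sketch (our own, assuming one large contiguous 128 KB-aligned allocation; the thesis implementation instead finds suitable pages one by one, as described in Section 3.2.1) chains elements that all fall in the same cache set by stepping in 128 KB strides:

    #include <stdint.h>
    #include <stddef.h>

    #define STRIDE (128 * 1024)  /* the cache set layout repeats every 128 KB */
    #define NODES  256           /* at least 4 * 4 * 16 elements, as derived above */

    struct node { struct node *next; };  /* each element only holds a pointer to the next */

    /* Build a circular list whose elements all map to the same L2 cache set.
     * base points to NODES consecutive 128 KB blocks; set_offset selects which
     * cache line within each block the element is placed on. */
    static struct node *build_ring(uint8_t *base, size_t set_offset)
    {
        struct node *first = (struct node *)(base + set_offset);
        struct node *prev  = first;
        for (int i = 1; i < NODES; ++i) {
            struct node *cur = (struct node *)(base + (size_t)i * STRIDE + set_offset);
            prev->next = cur;
            prev = cur;
        }
        prev->next = first;  /* close the circle */
        return first;
    }

Stepping through such a ring with p = p->next then produces one read per element, each of which misses in the targeted cache with high probability.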

3.1.4 Controlling Bandwidth Usage

In order to generate different amounts of bandwidth usage, we take a set number of steps through the linked lists and then delay for a varying amount of time depending on how much bandwidth we want to use. The number of steps we take before the delay is important because there is an overhead associated with each delay. Fewer steps give us a finer degree of control over the bandwidth usage, but they limit our ability to generate a higher amount of traffic due to the overhead associated with the delays. By using a hardware timer in the ARM Cortex-A15 that has a very low overhead, we get very fine control over this delay. An alternative would be to use a normal busy loop instead. That solution would, however, reduce observability, as the delay would not be based directly on time but on loop iterations. This could be solved by calculating how long different iterations take to run, but it would take more effort. An upside would be that the overhead of the delays would be lower because of the simpler construct, but observability is prioritized in this case due to our lack of hardware counter support. We also want to be able to control the total bandwidth usage. This can be done in a number of different ways, although it is very intuitive to just enter the usage in absolute numbers, in our case the number of MB/s. This, in turn, can be achieved either by using pre-calculated values for the delay corresponding to different bandwidth usages or, as we do, by a proportional-integral controller, PI-controller. The PI-controller uses the current bandwidth usage as the process variable and the delay as the manipulated variable. A controller works by comparing the process variable, the bandwidth usage, to the target value, the target bandwidth usage, and then taking action depending on the difference. The difference between the observed value and the target value is called the error. The action is to change the manipulated variable so that the process variable approaches the target value. In our case that means modifying the delay so we achieve the target bandwidth usage. The proportional part and the integral part give us the change in our manipulated variable:

    u(t) = K_p e(t) + K_i ∫_a^b e(τ) dτ        (3.2)

where u(t) is the output, e(t) is the current error, K_p is the proportional gain constant and K_i is the integral gain constant. The proportional part provides a proportional change to the output, while the integral part provides a change depending on the accumulated error, thus compensating for long term errors [8]. The advantage of the controller is high flexibility and low sensitivity to changes in the memory access latency. This is very helpful when the Bandit is run in multiple threads, as we will look into in the next section. By timing each full iteration through all the 64 lists, and combining this time information with how much data we access during an iteration, we know how much bandwidth we are currently using. This is the only data we need in order to have a functioning controller. Because we know the length of the delay, we can also separate out the time taken for the actual memory accesses. This allows us to evaluate the current memory access latency in the system.

3.1.5 Measuring Target Application Bandwidth Usage

The basic idea behind measuring the bandwidth usage of the target application is that if the maximum bandwidth that the Bandit can expect to use is known, then the difference between that and the actually accessed bandwidth is the amount that the target is using. The problem is that the target application gets less bandwidth when the Bandit uses bandwidth compared to when the system is silent. However, if we know how much bandwidth the Bandit is using, then we can compare the target's bandwidth with a bandit's bandwidth in the same circumstances, and from this it can be possible to gain some information about the application's actual bandwidth usage. It would work as follows:

1. Run an application with the Bandit and get measurements.

2. Perform another run and replace the application with another bandit, configured so that we get the same measurements as in the previous step.

3. Run a bandit with the same configuration and get information on how much memory bandwidth it is using. This bandwidth usage should be the same as the original application's bandwidth usage.

3.1.6 Multithreading the Bandit

Multithreading is useful to generate even more bandwidth usage. We use the pthreads library for this, which lets us run all the Bandit threads under one process. The benefit of this is that the threads can communicate with each other via the shared memory space of the process; this helps the Bandit monitor its total memory bandwidth usage, and only one controller is needed to regulate it. However, it also works if the Bandits are run as different processes, in which case each bandit gets its own controller. The controller is then able to handle the effect of the other bandits running in parallel. When started with multiple threads, the Bandit pins the different threads to their own cores so they don't interfere with each other.

3.1.7 Portability

The main hardware dependencies in the Bandit are the memory layout and the hardware timer used for the delay. In order to avoid thrashing the shared caches, we need to adapt our memory placement to the architecture's cache size, number of ways and line width. Our implementation has hard coded values for those parameters, so a change of platform should, from the memory layout perspective, only require a correct setting of the parameters, but we do not investigate this in practice. The hardware timer used in the ARM CPU has a very low overhead associated with it, which allows it to be used to time the delay. However, if another architecture does not have a timer with similar properties, we have to either switch to the busy loop discussed earlier or accept the increased overhead and therefore a reduced maximum possible stolen bandwidth per bandit.

3.1.8 Usability

The goal of the interface to the Bandit is for it to be usable in a script environment so it can be used for automated tests. A simple command line interface that takes different parameters fulfills this goal. It can be used to control the number of executing threads, the amount of bandwidth that the threads should use in total and a verbosity setting that controls the amount of data output.

3.2 Implementation

3.2.1 Location of Memory Accesses

To implement our design from Section 3.1.3 we have to carefully choose the memory we use for our circularly linked lists.

In order to find addresses fulfilling the cache usage and main memory access demands, Eklöv et al. used a feature called huge pages in Linux to allocate 8 MB chunks of contiguous memory, significantly larger chunks compared to the standard 4 KB pages. In those chunks they placed several linked lists, which they then iterated through to generate memory traffic. However, our Linux installation does not support the huge pages feature. Luckily, we have another feature that allows us to find the physical address corresponding to our virtual addresses. The complete mapping of a process's virtual addresses can be looked up through the proc filesystem, specifically /proc/pid/pagemap. The specific way we did it can be found in Appendix B. The first step is to allocate memory that we know is at least aligned to the page size of the system. It is done with the following call:

    posix_memalign(&memory_array[k], page_alignment, size);

With the help of that function we can allocate as many pages as we need to find pages with the correct alignment. Finding pages with the correct alignment is then simply a matter of evaluating a modulo operation:

    if ((lookup_address(memory) % alignment) == offset)

The current method we use is to perform one allocation for each memory page, which is very slow as it results in a lot of system calls. The reason for doing this was to allow us to free the pages we did not need. A major drawback of the freeing was that it fragmented the memory of the system and resulted in a severe slowdown of the system, even after the Bandit had finished executing. The Bandit now keeps all the memory it allocates, which means that the Bandit uses around 32 MB of memory at the 256 element minimum we calculated in Section 3.1.3. A better alternative, given that we can't free memory, could have been to allocate several pages at once and then test them, thus saving system calls, but we did not do this due to time constraints. Once we have the memory, we partition the pages into 64 pieces, one for each cache line in the page, and create 64 different circularly linked lists with one element in each page, each list belonging to a separate cache set. This means that we use 64 out of the 2048 available cache sets, or about 3% of the available L2 cache. By having these 64 lists we are hopefully able to utilize the parallelism in the various levels of the memory hierarchy; however, we have no control over the page hits and misses in the SDRAM.
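For reference, the following is a minimal sketch of how a physical address can be looked up through /proc/self/pagemap. It is our own illustration of the mechanism that the lookup_address function from Appendix B relies on, not the thesis code itself, and error handling is kept to a bare minimum; each 64-bit pagemap entry holds the page frame number in bits 0-54 and a "page present" flag in bit 63, and reading it typically requires elevated privileges.

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Translate a virtual address of the calling process into a physical address.
     * Returns 0 if the page is not present or the lookup fails. */
    static uint64_t virt_to_phys(const void *vaddr)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;

        FILE *f = fopen("/proc/self/pagemap", "rb");
        if (f == NULL)
            return 0;

        long index = (long)((uintptr_t)vaddr / (uintptr_t)page_size);
        if (fseek(f, index * (long)sizeof(entry), SEEK_SET) != 0 ||
            fread(&entry, sizeof(entry), 1, f) != 1)
            entry = 0;
        fclose(f);

        if (!(entry & (1ULL << 63)))                 /* bit 63: page present */
            return 0;
        uint64_t pfn = entry & ((1ULL << 55) - 1);   /* bits 0-54: page frame number */
        return pfn * (uint64_t)page_size + ((uintptr_t)vaddr % (uintptr_t)page_size);
    }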

3.2.2 Bandwidth delay implementation

We use the implementation described in Section 3.1.4: the hardware timer is used for the delay between reads and we always perform 16 memory reads at a time, resulting in 64 * 16 bytes, or 1 KB, of memory read at a time. The reason we use this number is that it is the lowest number we found that can still effectively generate large enough amounts of bandwidth usage. With a lower number, we would not be able to leverage the parallelism in the memory hierarchy. The main function for generating bandwidth usage is shown in Listing C.1. It is mainly made up of two parts, the memory read part, shown in Listing C.3, and the delay part, shown in Listing C.2. The memory reading utilizes the circularly linked lists constructed in Section 3.2.1 and keeps an iterator for each list in order to remember the positions. As we can see in Listing C.4, the compiler unrolls the memory read loop, thereby improving the Bandit's bandwidth usage performance.

3.2.3 PI-Controller implementation

The controller is implemented as described in Section 3.1.4. At first we tried to implement it using only integers, but that was more difficult than using floating point numbers. In the end the controller is not run very frequently, so the extra clock cycles required for floating point numbers are acceptable. The controller also has two modifications in order to improve its speed and reduce oscillations. The first one handles the fact that the delay has to increase exponentially in order to decrease the bandwidth usage. The ratio in the following listing is used to scale the delay modification up or down depending on the delay's size compared to the current memory read step time:

    float wait_step_ratio = new_wait_time / (float) step_time;
    if (wait_step_ratio > RATIO_MAX)
        wait_step_ratio = RATIO_MAX
            + (wait_step_ratio - RATIO_MAX) * RATIO_STEP_DOWN;
    else if (wait_step_ratio < RATIO_MIN)
        wait_step_ratio = RATIO_MIN;

The integral part in a PI-controller can lead to overshooting and oscillations [8]. In order to minimize this effect, we reduce and flip the sign of the accumulated error and apply an adjustment to the new delay at the moment that we pass the target value, as shown in the following listing:

    const float INTEGRAL_MODIFIER = -0.1;

    if ((difference < EPSILON && difference > -EPSILON) ||
        !((difference >= 0) ^ (old_difference < 0))) {
        int_diff = int_diff * INTEGRAL_MODIFIER;
        float overshoot_factor = difference / (float) total_usage;
        int wait_modification =
            (float)(wait_time + step_time) * overshoot_factor;
        new_wait_time -= wait_modification;
    }
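For completeness, a minimal sketch of the basic PI update from Equation 3.2, before the two modifications above, could look as follows. This is our own illustration with hypothetical names and gain values, not the code from Appendix D; the controller output is applied as a change to the delay, as described in Section 3.1.4.

    /* One PI-controller step: adjust the delay (manipulated variable) so that
     * the measured bandwidth (process variable) approaches the target bandwidth.
     * KP and KI are illustrative gains only. */
    #define KP 0.5f
    #define KI 0.05f

    static float integral_error = 0.0f;   /* accumulated error across iterations */

    static int pi_update(int current_delay_ns, int target_mbps, int measured_mbps)
    {
        float error = (float)(measured_mbps - target_mbps);  /* too much bandwidth -> lengthen delay */
        integral_error += error;
        int new_delay = current_delay_ns + (int)(KP * error + KI * integral_error);
        return new_delay > 0 ? new_delay : 0;                /* the delay can never be negative */
    }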

3.2.4 Measuring Target Application Bandwidth Usage

In order to get the target application's bandwidth usage as described in Section 3.1.5, we first need the bandwidth usage of the Bandit. We get that as follows:

    unsigned int mem_usage = number_kilobytes * 1000 * 1000
                             / (median_time);                     // KB per ms
    my_data->memory_usage[my_data->id] = mem_usage * 1000 / 1024; // MB per s

Care has been taken to avoid integer overflows in the different operations. However, it was most probably an unnecessary optimization to only use 32-bit integers here and not 64-bit integers or floating point operations, since the operation is not very frequent. We then use these values to get the target application's usage:

    app_usage = baseline_memory_usage[my_data->bandit_count - 1] - total_usage;
    app_usage = app_usage > 0 ? app_usage : 0;
    app_need = (app_usage * my_data->bandit_count * 100 / total_usage);
    printf("Total usage: %u MB, App usage: %d MB, "
           "App need in bandit terms: %d %%\n",
           total_usage, app_usage, app_need);

The "app need in bandit terms" in the code refers to how much, in percent, the application is using compared to a bandit. The baseline memory usage is the value produced by the benchmark described in Section

3.2.5 Multithreading the Bandit

The Bandit automatically pins each thread to a separate core in order to isolate the threads from each other. How this is done is shown in the following code snippet:

    pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    for (current_core = 0; current_core < cores; ++current_core) {
        if (CPU_ISSET(current_core, &cpuset)) {
            if (nr_cores_found == id)
                found_core = 1;
            else
                ++nr_cores_found;
        }
        if (found_core)
            break;
    }
    if (found_core) {
        CPU_ZERO(&cpuset);
        CPU_SET(current_core, &cpuset);
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    } else {
        fprintf(stderr, "Not enough cores assigned to the process\n");
        exit(-1);
    }

The CPU affinity has to be set beforehand, and the process must be assigned at least as many cores as there are threads. Each thread handles its own measurements of bandwidth usage, but only one thread uses the data for the regulator and any printouts to the user. No locks are used for this shared data, as they slowed down the bandit noticeably. Since each value is written by only one thread and read by only one thread, and the penalty for occasionally using a stale value is insignificant, this is not a problem.

3.2.6 Portability

The identified portability issues are, as pointed out in Section 3.1.7, the memory layout and the hardware timer. A software portability issue is also created by the method we use to acquire the physical memory addresses, as seen in Appendix B. The memory layout can at least be broken down into parameters that can be tuned to match the current system. However, for time reasons we only do this for one specific layer of cache, as seen in the code snippet below:

    #define LLC_LINESIZE 64

    #define LLC_ASSOC 16
    #define LLC_SIZE_IN_MB 2
    #define LLC_SIZE (LLC_SIZE_IN_MB * MB)
    #define LLC_SET_SIZE (LLC_ASSOC * LLC_LINESIZE)
    #define LLC_NB_SETS (LLC_SIZE / LLC_SET_SIZE)

LLC_ASSOC in this case refers to the associativity, or the number of ways of the cache as we have referred to it. The hardware timer is used in the following manner:

    static inline uint32_t get_cpu_time32(void)
    {
        u32 cvall, cvalh;
        asm volatile("mrrc p15, 0, %0, %1, c14" : "=r" (cvall), "=r" (cvalh));
        return cvall;
    }

If a low overhead timer is available for the platform, it is just a matter of replacing the contents of this function. But that may not always be the case. We still decided that this hardware timer is better for us due to the increased observability we get by using it, especially since the delay is represented in actual nanoseconds as described in Section 3.1.2. The method used to get the physical memory addresses is not portable to older Linux versions or other operating systems in general, and it would have to be replaced in its entirety when compiling for another system. We did, however, not find another way to do this on our platform, so it is good enough for us.

3.2.7 User interface

The implementation of the user interface uses getopt for parsing of the arguments due to its simplicity. The result is that arguments are always prefaced with a flag, which avoids any required ordering of the arguments. The Bandit is able to produce different output, but the default setting is to be silent. Example usage follows. The Bandit help string:

    root@du1:~# ./pbandit --help
    Usage:
      --benchmark
      -m --measure
      -v --verbose=LEVEL
      -t --threads=NUM
      -b --bandwidth=TARGET

Usage of a Bandit with 2 threads with the target usage of 2000 MB/s, outputting the total used bandwidth:

    root@du1:~# taskset -c 1,2 ./pbandit -t 2 -b 2000 -v 1
    Total usage: 1957 MB
    Total usage: 2000 MB
    Total usage: 2028 MB

Usage of a Bandit with 3 threads with the target usage of 1000 MB/s, outputting the total used bandwidth, the time of each memory read step, the wait time between read steps and the individual thread bandwidth usage:

    root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -b 1000 -v 3
    Thread 2 affinity: 40
    Thread 0 affinity: 10
    Thread 1 affinity: 20
    Thread 2: ns, step time: ns, wait time 2500 ns, memory usage 333 MB
    Thread 0: ns, step time: ns, wait time 2500 ns, memory usage 334 MB
    Thread 1: ns, step time: ns, wait time 2500 ns, memory usage 328 MB
    Total usage: 1000 MB
    Thread 2: ns, step time: ns, wait time 2496 ns, memory usage 335 MB
    Thread 0: ns, step time: ns, wait time 2496 ns, memory usage 333 MB
    Thread 1: ns, step time: ns, wait time 2496 ns, memory usage 334 MB
    Total usage: 1001 MB

Usage of a Bandit with the measuring feature active:

    root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -v 1 --measure
    Total usage: 2198 MB, App usage: 543 MB, App need in bandit terms: 74 %
    Total usage: 2244 MB, App usage: 497 MB, App need in bandit terms: 66 %
    Total usage: 2280 MB, App usage: 461 MB, App need in bandit terms: 60 %

3.2.8 Low Overhead Bandit

At regular intervals, the Bandit performs tasks other than just using bandwidth. The controller, the measuring and the printing all cost the Bandit time. In our evaluation, we therefore use an alternate bandit that has all those features stripped away; we call it the Low Overhead Bandit. The normal Bandit's main loop looks like the following:

    void *bandit(void *thread_options)
    {
        struct bandit_options *options = thread_options;
        struct bandit_data my_data;

        init_bandit(&my_data, options);
        pthread_barrier_wait(&barrier);
        while (1) {
            normal_bandit_iteration(&my_data);
            bandit_update(&my_data);
            bandit_print(&my_data);
            bandit_commit(&my_data);
        }
        free_bandit(&my_data);
    }

While the Low Overhead Bandit looks like this:

    void *low_overhead_bandit(void *thread_options)
    {
        struct bandit_options *options = thread_options;
        struct bandit_data my_data;

        init_bandit(&my_data, options);
        pthread_barrier_wait(&barrier);
        while (1)
            normal_bandit_iteration(&my_data);
        free_bandit(&my_data);
    }

By comparing these two bandits with each other we can find out whether the overhead of the tasks other than the Bandit's main function interferes with its operation.
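Relating back to the portability discussion in Section 3.2.6: on a platform without a comparable low-overhead hardware timer, the contents of get_cpu_time32 could, as a sketch, be replaced by a clock_gettime-based fallback like the one below. This is our own suggestion, not part of the thesis implementation; note that it returns nanoseconds rather than raw counter ticks, which the delay calculation would have to take into account, and that each call is more expensive than reading the CP15 counter.

    #include <stdint.h>
    #include <time.h>

    /* Fallback timer: monotonic time truncated to 32-bit nanoseconds.
     * Higher overhead than the ARM generic timer, but portable to any
     * POSIX system that provides CLOCK_MONOTONIC. */
    static inline uint32_t get_cpu_time32_fallback(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint32_t)((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
    }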

Chapter 4

Evaluation

4.1 Method

Using hardware performance counters to measure the performance and function of the Bandit would have been optimal, as they provide accurate counts of many different performance metrics. The ARM Cortex-A15 does support hardware counters that can measure cache hits and misses, off-chip memory accesses and more; however, for software that uses them to work, the operating system must be compiled with support for them. The Linux installation we are using lacks this feature. Instead, we rely on micro-benchmarks that use the timing functions also used by the Bandit in order to highlight different effects of the Bandit. In order to verify that the Bandit is indeed using memory bandwidth whilst minimizing the impact on the L2 cache, we use a micro-benchmark that allocates memory normally and then iterates through that memory with read operations while timing it. The program can be run with different sizes of the allocated memory, allowing us to run it in a way that only accesses the different caches, or to force it to only access the main memory by iterating over enough memory. It has virtually no overhead when compared to the Bandit, as the access loop is unrolled and performs unrelated reads that can be run in parallel. By measuring the total time to iterate over the entire memory block we get the access time to memory and the total bandwidth when parallel accesses are performed. Since the memory accesses are sequential, prefetching can also occur; however, as long as all the tests are performed this way, the prefetcher should not be a problem since it is present in all the tests. The memory access loop of the reference program is shown in Listing E.1. We refer to this micro-benchmark as the reference program throughout the evaluation. To be able to execute the Bandit and the reference program on different cores, and to contain them in the separate clusters of the platform, we use the taskset command. The platform has, as described in Section 2.5, two

CPU clusters, and in the first one most of the Linux housekeeping processes are contained. All of the tests, with the exception of a cross-cluster test, are performed in the second cluster in order to minimize noise from other applications. All reported measurement values are averages of 10 runs. When the slowdown of an application's execution time is of interest, we present it as a relative execution time compared to the original execution time. We calculate it as follows:

    slowdown = new time / base time        (4.1)

The advantage of this value is that it keeps the appearance of the original graph while giving us normalized values that allow us to compare different tests more easily. It is equally applicable when comparing latencies.

4.2 Platform Performance

4.2.1 Memory Access Latency

By using the reference program we can get the numbers for the different memory access latencies of the platform. We just need to select the correct size for the memory block in the reference program. Due to the random replacement policy of the L2 and L3 caches, the memory block needs to be a little larger compared to a case with an LRU replacement policy. Using the same simulator as in Section 3.1.3, described in Appendix A, again gives us that the hit rate for a memory block twice the size of a cache is about 20%, and if we double once more, to a total of four times the size of the cache, we reach a hit rate around 2%, which is low enough for our purposes. There is also another problem: if the allocated memory is of the same size as the cache, we cannot guarantee that all pages will map uniformly to the different cache sets, which leads to some sets with too many cache lines, which in turn leads to misses even though the cache is large enough. To compensate for this we use a smaller memory block that is half the size of the target cache.

    Access type    Memory block size    Access latency
    L1 Cache       16 KB                1.8 ns
    L2 Cache       1024 KB              5.2 ns
    L3 Cache       4096 KB              12.1 ns
    Main memory    64 MB                24.1 ns

Table 4.1: Observed memory access latencies of the different memory types.

Taking these things into consideration, we select the sizes 16 KB for L1 access speed tests, 512 KB for L2 access speed tests, 6 MB for L3 access

speed tests and 32 MB for main memory access speed tests. The results are shown in Table 4.1. One thing to note is that these results show the access latency for a series of reads, which means that prefetching is a factor in the measurements, especially in the main memory case.

4.2.2 Off-Chip Memory Bandwidth

By running multiple instances of the reference program, each iterating over memory blocks of size 32 MB, we can measure the limits of the main memory bandwidth for the cores and clusters. By first testing the available bandwidth for one cluster, we find that one core saturates almost all the available bandwidth in one cluster, which is about 2700 MB/s. Next we find out how the two clusters of four cores each affect each other's bandwidth. In an ideal case with no contention between the clusters, we should be able to achieve about 5400 MB/s. The results are in Table 4.2 and a more gradual bandwidth usage pattern is shown in Figure 4.1.

Figure 4.1: Accessed bandwidth when load is generated in both CPU clusters (axes: bandwidth usage of cluster 1 and cluster 2, in MB/s).

An interesting thing in Table 4.2 is how little the two clusters affect each other. The bandwidth per cluster is only reduced by approximately 6% from 5400 MB/s when both clusters are saturated. It is also worth noting how relatively little bandwidth we have access to, especially if we limit ourselves to one core. If we compare this to the platform used by Eklöv et al. [7] they


Memory Management Outline. Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging Memory Management Outline Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging 1 Background Memory is a large array of bytes memory and registers are only storage CPU can

More information

Computer Systems Structure Main Memory Organization

Computer Systems Structure Main Memory Organization Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory

More information

On Benchmarking Popular File Systems

On Benchmarking Popular File Systems On Benchmarking Popular File Systems Matti Vanninen James Z. Wang Department of Computer Science Clemson University, Clemson, SC 2963 Emails: {mvannin, jzwang}@cs.clemson.edu Abstract In recent years,

More information

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

PERFORMANCE TUNING ORACLE RAC ON LINUX

PERFORMANCE TUNING ORACLE RAC ON LINUX PERFORMANCE TUNING ORACLE RAC ON LINUX By: Edward Whalen Performance Tuning Corporation INTRODUCTION Performance tuning is an integral part of the maintenance and administration of the Oracle database

More information

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

More information

Testing Database Performance with HelperCore on Multi-Core Processors

Testing Database Performance with HelperCore on Multi-Core Processors Project Report on Testing Database Performance with HelperCore on Multi-Core Processors Submitted by Mayuresh P. Kunjir M.E. (CSA) Mahesh R. Bale M.E. (CSA) Under Guidance of Dr. T. Matthew Jacob Problem

More information

Report Paper: MatLab/Database Connectivity

Report Paper: MatLab/Database Connectivity Report Paper: MatLab/Database Connectivity Samuel Moyle March 2003 Experiment Introduction This experiment was run following a visit to the University of Queensland, where a simulation engine has been

More information

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in

More information

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

Laboratory Report. An Appendix to SELinux & grsecurity: A Side-by-Side Comparison of Mandatory Access Control & Access Control List Implementations

Laboratory Report. An Appendix to SELinux & grsecurity: A Side-by-Side Comparison of Mandatory Access Control & Access Control List Implementations Laboratory Report An Appendix to SELinux & grsecurity: A Side-by-Side Comparison of Mandatory Access Control & Access Control List Implementations 1. Hardware Configuration We configured our testbed on

More information

Mass Storage Structure

Mass Storage Structure Mass Storage Structure 12 CHAPTER Practice Exercises 12.1 The accelerating seek described in Exercise 12.3 is typical of hard-disk drives. By contrast, floppy disks (and many hard disks manufactured before

More information

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann Andre.Brinkmann@uni-paderborn.de Universität Paderborn PC² Agenda Multiprocessor and

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

Motivation: Smartphone Market

Motivation: Smartphone Market Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details Thomas Fahrig Senior Developer Hypervisor Team Hypervisor Architecture Terminology Goals Basics Details Scheduling Interval External Interrupt Handling Reserves, Weights and Caps Context Switch Waiting

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

CPU Scheduling Outline

CPU Scheduling Outline CPU Scheduling Outline What is scheduling in the OS? What are common scheduling criteria? How to evaluate scheduling algorithms? What are common scheduling algorithms? How is thread scheduling different

More information

Fast Arithmetic Coding (FastAC) Implementations

Fast Arithmetic Coding (FastAC) Implementations Fast Arithmetic Coding (FastAC) Implementations Amir Said 1 Introduction This document describes our fast implementations of arithmetic coding, which achieve optimal compression and higher throughput by

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

Capacity Planning Process Estimating the load Initial configuration

Capacity Planning Process Estimating the load Initial configuration Capacity Planning Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting

More information

Data Storage - I: Memory Hierarchies & Disks

Data Storage - I: Memory Hierarchies & Disks Data Storage - I: Memory Hierarchies & Disks W7-C, Spring 2005 Updated by M. Naci Akkøk, 27.02.2004 and 23.02.2005, based upon slides by Pål Halvorsen, 11.3.2002. Contains slides from: Hector Garcia-Molina,

More information

Communication Protocol

Communication Protocol Analysis of the NXT Bluetooth Communication Protocol By Sivan Toledo September 2006 The NXT supports Bluetooth communication between a program running on the NXT and a program running on some other Bluetooth

More information

OpenFlow Based Load Balancing

OpenFlow Based Load Balancing OpenFlow Based Load Balancing Hardeep Uppal and Dane Brandon University of Washington CSE561: Networking Project Report Abstract: In today s high-traffic internet, it is often desirable to have multiple

More information

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux White Paper Real-time Capabilities for Linux SGI REACT Real-Time for Linux Abstract This white paper describes the real-time capabilities provided by SGI REACT Real-Time for Linux. software. REACT enables

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

GPU Hardware Performance. Fall 2015

GPU Hardware Performance. Fall 2015 Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

COS 318: Operating Systems. Virtual Memory and Address Translation

COS 318: Operating Systems. Virtual Memory and Address Translation COS 318: Operating Systems Virtual Memory and Address Translation Today s Topics Midterm Results Virtual Memory Virtualization Protection Address Translation Base and bound Segmentation Paging Translation

More information

Improving the performance of data servers on multicore architectures. Fabien Gaud

Improving the performance of data servers on multicore architectures. Fabien Gaud Improving the performance of data servers on multicore architectures Fabien Gaud Grenoble University Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma Sardes (INRIA/LIG) December 2, 2010

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

SYSTEM ecos Embedded Configurable Operating System

SYSTEM ecos Embedded Configurable Operating System BELONGS TO THE CYGNUS SOLUTIONS founded about 1989 initiative connected with an idea of free software ( commercial support for the free software ). Recently merged with RedHat. CYGNUS was also the original

More information

Exploring RAID Configurations

Exploring RAID Configurations Exploring RAID Configurations J. Ryan Fishel Florida State University August 6, 2008 Abstract To address the limits of today s slow mechanical disks, we explored a number of data layouts to improve RAID

More information

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING OBJECTIVE ANALYSIS WHITE PAPER MATCH ATCHING FLASH TO THE PROCESSOR Why Multithreading Requires Parallelized Flash T he computing community is at an important juncture: flash memory is now generally accepted

More information

I3: Maximizing Packet Capture Performance. Andrew Brown

I3: Maximizing Packet Capture Performance. Andrew Brown I3: Maximizing Packet Capture Performance Andrew Brown Agenda Why do captures drop packets, how can you tell? Software considerations Hardware considerations Potential hardware improvements Test configurations/parameters

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

Eight Ways to Increase GPIB System Performance

Eight Ways to Increase GPIB System Performance Application Note 133 Eight Ways to Increase GPIB System Performance Amar Patel Introduction When building an automated measurement system, you can never have too much performance. Increasing performance

More information

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to: 55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................

More information

Garbage Collection in the Java HotSpot Virtual Machine

Garbage Collection in the Java HotSpot Virtual Machine http://www.devx.com Printed from http://www.devx.com/java/article/21977/1954 Garbage Collection in the Java HotSpot Virtual Machine Gain a better understanding of how garbage collection in the Java HotSpot

More information

Lecture 17: Virtual Memory II. Goals of virtual memory

Lecture 17: Virtual Memory II. Goals of virtual memory Lecture 17: Virtual Memory II Last Lecture: Introduction to virtual memory Today Review and continue virtual memory discussion Lecture 17 1 Goals of virtual memory Make it appear as if each process has:

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

x64 Servers: Do you want 64 or 32 bit apps with that server?

x64 Servers: Do you want 64 or 32 bit apps with that server? TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

SMock A Test Platform for the Evaluation of Monitoring Tools

SMock A Test Platform for the Evaluation of Monitoring Tools SMock A Test Platform for the Evaluation of Monitoring Tools User Manual Ruth Mizzi Faculty of ICT University of Malta June 20, 2013 Contents 1 Introduction 3 1.1 The Architecture and Design of SMock................

More information

Why Relative Share Does Not Work

Why Relative Share Does Not Work Why Relative Share Does Not Work Introduction Velocity Software, Inc March 2010 Rob van der Heij rvdheij @ velocitysoftware.com Installations that run their production and development Linux servers on

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Introduction to Embedded Systems. Software Update Problem

Introduction to Embedded Systems. Software Update Problem Introduction to Embedded Systems CS/ECE 6780/5780 Al Davis logistics minor Today s topics: more software development issues 1 CS 5780 Software Update Problem Lab machines work let us know if they don t

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System CS341: Operating System Lect 36: 1 st Nov 2014 Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati File System & Device Drive Mass Storage Disk Structure Disk Arm Scheduling RAID

More information

Reliable Systolic Computing through Redundancy

Reliable Systolic Computing through Redundancy Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Quantum Support for Multiprocessor Pfair Scheduling in Linux

Quantum Support for Multiprocessor Pfair Scheduling in Linux Quantum Support for Multiprocessor fair Scheduling in Linux John M. Calandrino and James H. Anderson Department of Computer Science, The University of North Carolina at Chapel Hill Abstract This paper

More information

Multicore Programming with LabVIEW Technical Resource Guide

Multicore Programming with LabVIEW Technical Resource Guide Multicore Programming with LabVIEW Technical Resource Guide 2 INTRODUCTORY TOPICS UNDERSTANDING PARALLEL HARDWARE: MULTIPROCESSORS, HYPERTHREADING, DUAL- CORE, MULTICORE AND FPGAS... 5 DIFFERENCES BETWEEN

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

POSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2)

POSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2) RTOSes Part I Christopher Kenna September 24, 2010 POSIX Portable Operating System for UnIX Application portability at source-code level POSIX Family formally known as IEEE 1003 Originally 17 separate

More information