Institutionen för datavetenskap
Department of Computer and Information Science

Final thesis

Measuring the effect of memory bandwidth contention in applications on multi-core processors

by

Emil Lindberg

LIU-IDA/LITH-EX-A--15/002--SE

Linköpings universitet, SE Linköping, Sweden


Final Thesis

Measuring the effect of memory bandwidth contention in applications on multi-core processors

by

Emil Lindberg

LIU-IDA/LITH-EX-A--15/002--SE

Supervisor: Erik Hansson
Examiner: Christoph Kessler


Abstract

In this thesis we design and implement a benchmarking tool for measuring applications' sensitivity to main memory bandwidth contention, in a multi-core environment, on an ARM Cortex-A15 CPU. The tool is supposed to minimize its usage of shared resources other than main memory bandwidth, allowing it to isolate the effects of bandwidth contention only. The difficulty lies in using a correct memory access pattern for this purpose, i.e. choosing which memory addresses to access, in which order and at what rate, so as to minimize cache usage while generating a high and controllable main memory bandwidth usage. We manage to implement a tool with low cache memory usage that is still able to saturate the main memory bandwidth. The tool uses a proportional-integral controller to control the amount of bandwidth it uses. We then use the tool to investigate the memory behaviour of the platform and of some applications while the tool is using a variable amount of bandwidth. However, we have some difficulties in analyzing the results due to the lack of support for hardware performance counters in the operating system we are using, and we are forced to rely on hardware timers for our data gathering. Another difficulty is the platform's limited L2 cache bandwidth, which means that the tool has a heavy impact on L2 cache read latency. Despite this, we are able to draw some conclusions on the bandwidth usage of other applications in optimal cases with the help of the tool.


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Description
  1.3 Results
  1.4 Structure

2 Background
  2.1 Memory Hierarchy
  2.2 SDRAM
  2.3 Bandwidth Usage Properties
  2.4 Paged Virtual Memory
  2.5 Experimental Platform

3 Our Bandit
  3.1 Design
    3.1.1 Main objectives
    3.1.2 Floating point data
    3.1.3 Location of Memory Accesses
    3.1.4 Controlling Bandwidth Usage
    3.1.5 Measuring Target Application Bandwidth Usage
    3.1.6 Multithreading the Bandit
    3.1.7 Portability
    3.1.8 Usability
  3.2 Implementation
    3.2.1 Location of Memory Accesses
    3.2.2 Bandwidth delay implementation
    3.2.3 PI-Controller implementation
    3.2.4 Measuring Target Application Bandwidth Usage
    3.2.5 Multithreading the Bandit
    3.2.6 Portability
    3.2.7 User interface
    3.2.8 Low Overhead Bandit

4 Evaluation
  4.1 Method
  4.2 Platform Performance
    4.2.1 Memory Access Latency
    4.2.2 Off-Chip Memory Bandwidth
  4.3 Bandit Performance
    4.3.1 Cache Miss Generation
    4.3.2 PI-Controller Performance
    4.3.3 Measuring Memory Latency
    4.3.4 Measuring Bandwidth
  4.4 Bandit's Effect on Programs
    4.4.1 Overview
    4.4.2 Micro-Benchmarks
    4.4.3 Telecom Application
    4.4.4 MiBench
  4.5 Summary
    4.5.1 Platform evaluation
    4.5.2 Bandit evaluation
    4.5.3 Closing remarks

5 Related Work
  5.1 Prior Work
  5.2 Main Memory Bandwidth Contention Measurement
  5.3 Main Memory Bandwidth Contention Mitigation

6 Conclusions
  6.1 Limitations
  6.2 Future work

Appendices
  A Cache hit simulator
  B Physical address lookup
  C Memory Access Iteration Code
  D PI-Controller Code
  E The Reference Program
  F Synthetic benchmarks

Chapter 1

Introduction

1.1 Motivation

Chip multiprocessors, CMPs, have rapidly become the standard in laptop and desktop PCs. This is due to multiple reasons, one being the unfeasible energy and cooling requirements of running a processor at high frequencies, as energy consumption grows approximately with the cube of the frequency [3]. For the same reason, a decrease in frequency leads to large energy savings, allowing processor manufacturers to keep up with Moore's law by constructing CMPs with several cores that each run at a lower frequency, resulting in a higher total theoretical computational power. While the potential gains of utilizing CMPs are high, they present several new challenges and considerations, which prevent or delay them from being adopted and utilized in every computer. Applications using multiple cores are more difficult to develop and require more from programmers, but there is another major issue introduced by CMPs, namely shared resource contention. This means that different cores contend for limited resources, such as caches, main memory bandwidth and network resources. An application running on one core can therefore affect the execution time of an application running on a different core by contending for the same shared resources. Due to this contention, running two seemingly independent tasks on different cores in a real-time system (a system where processes have deadlines) can be a problem [11]. To mitigate resource contention, we need to know the applications' characteristics with regard to shared resources. With that information, we can decide if we can run them together or if we need to redesign either our platform or our applications.

1.2 Problem Description

In this thesis we are going to look specifically at the contention for main memory bandwidth, using methods previously described by Eklöv et al. [7].

They generated traffic on the main memory bus with an application they called the Bandwidth Bandit and then measured how various applications' performance was affected. The Bandit is a program that generates a variable amount of load on the main memory bus. The load it generates should be as realistic as possible when compared to real applications and it should have minimal effect on the other shared resources, which in this case mostly means the shared caches. We have some additional demands on the Bandit compared to our predecessors. First, we want to investigate if it is possible to gather memory bandwidth usage data on the application affected by the Bandit, directly with the Bandit. The previous Bandit relied on external monitoring with hardware performance counters to gather data, but we want to rely as little as possible on other software and specific hardware. Secondly, we want the Bandit to be a usable tool in regression testing to verify an application's characteristics regarding memory bandwidth usage. Finally, our target platform is an embedded system with an ARM Cortex-A15 CPU, while the previous Bandit used an Intel system. As an added requirement, we want to look into the portability aspects of the Bandit and try to isolate its hardware dependent features. This would allow for easier porting to different architectures and, with less effort, keep the Bandit relevant in an environment of ever-changing hardware. The platform we are using is intended for running telecom applications, which we need to take into consideration when evaluating the Bandit.

1.3 Results

We managed to create an application that is able to control the amount of bandwidth it uses, while also keeping the amount of used cache memory low. This Bandit can then be used to test how other applications function under different contention environments. As the execution time can be heavily affected by this contention, as we have shown, it can be a useful tool to verify the robustness of various real-time applications. Porting the Bandit to different architectures is doable; however, our reliance on a fast hardware timing function is a weakness. It can, though, be replaced by other mechanisms for controlling high precision delays. The attempt to make the Bandit able to measure other applications' bandwidth usage did not go entirely as we had wanted. We were able to measure synthetic applications with highly stable bandwidth usage, but our attempt to measure other applications did not work very well. With more time, it could have been possible to implement it. Our target architecture's low bandwidth beyond the L2 cache, and the fact that this path is a bottleneck, made it difficult to separate pure main memory accesses from L2 cache accesses, as the latter suffered heavily from the contention. This was also very obvious when we attempted to implement

a bandwidth measuring function in our bandit, as the measurements were affected by L2 cache accesses.

1.4 Structure

In Chapter 2 we present concepts needed for understanding our work. These concepts are mainly focused on the different types of memory in a CPU and their function. We also present our experimental platform here. In Chapter 3 we present the design and implementation of our bandit and show details of how it works. In Chapter 4 we evaluate the performance of our platform and the function of the Bandit, and test how it affects other applications. In Chapter 5 we present work related to our thesis, mostly articles that focus on memory bandwidth contention in different ways. We conclude the thesis in Chapter 6 with some outlook on the future of the Bandit.

Chapter 2

Background

2.1 Memory Hierarchy

Modern processors utilize a hierarchy of memories. First, there are several layers of volatile storage, from the registers in the processor to a couple of cache memories, usually two or three (denoted L1, L2, etc.), and finally the main memory. Behind all those is the permanent but slow storage. The speed of CPUs has been increasing at a much higher rate than that of main memory for a while now [15] and still is [14]. At first the two operated at the same speed, but they are now separated by orders of magnitude. This is the reason for the memory hierarchy. Data often has the following two properties: temporal locality (data recently used is likely to be used again) and spatial locality (data close to other recently used data is likely to be used). By having faster caches, processors get faster access to the data that is likely to be used, and even though the caches are small, they are large enough to gain the benefits of spatial and temporal locality.

Figure 2.1: Organization and mapping of cache sets in main memory to cache memory.

When data is fetched from memory for the first time, it is read from the main memory and then stored in the different layers of cache. More data than requested, a cache line, is often fetched to take advantage of the spatial locality. In some cases, when the access pattern can be predicted, even more data can be read into the cache, which is called prefetching. When any of this data is read again and it still resides in a cache, a cache hit has occurred. The memory system searches for the requested data in order from the fastest cache to the slowest and returns the first hit. When a new cache line is installed, it usually cannot be stored at any place in the cache, i.e. the cache is usually not fully associative. Instead, cache memories can be n-way, where n denotes in how many places a block of data can be stored in the cache. These n places that can store the same subset of memory blocks are called a cache set. An example of how cache sets map from main memory to cache memory is found in Figure 2.1. If there is no free spot to install the data in, some other cache line in the same cache set must be evicted. The selection of that cache line can be done with different algorithms such as least recently used (LRU) or random selection [12]. In CMPs, caches can be private to a processor or shared between some or all of the processors, as shown in Figure 2.2. Shared caches introduce contention for space in the cache, a phenomenon Eklöv et al. also examined [6]. They also noted that decreased performance of the caches led to an increase in main memory bandwidth usage. This happens because every cache hit removes the need for a read from the main memory. The memory hierarchy is, in other words, coupled: if one part of it is affected, the effect ripples upward through the hierarchy.
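To make the set mapping concrete, the following small sketch (our own illustration, not taken from the thesis) computes the line offset and set index of an address for a cache with 64-byte lines and 2048 sets; both values are example parameters only.

    #include <stdint.h>

    /* Example parameters: 64 B cache lines and 2048 cache sets. */
    #define LINE_SIZE 64u
    #define NUM_SETS  2048u

    /* Offset of the address within its cache line (bits [5:0] here). */
    static inline uint32_t line_offset(uintptr_t addr)
    {
        return (uint32_t)(addr % LINE_SIZE);
    }

    /* Index of the cache set the address maps to (bits [16:6] here). */
    static inline uint32_t cache_set_index(uintptr_t addr)
    {
        return (uint32_t)((addr / LINE_SIZE) % NUM_SETS);
    }

Two addresses that differ only in bits above the set index map to the same cache set and therefore compete for the same n ways.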

Figure 2.2: Organization of L1 and L2 caches. The back side is internal to the cores and the front side is external to the CPU.

2.2 SDRAM

Synchronous dynamic random access memory, SDRAM, is the most common type of main memory in computers today [12]. Access to the SDRAM is controlled by a memory controller which translates a memory address in the operating system into an address in the SDRAM. An address has to be translated into several different signals. With more than one SDRAM module, a memory channel has to be selected, and every SDRAM is partitioned into banks, rows and columns. All these parts together translate into a unique memory address. A bank contains different rows and columns. The different banks allow for some parallelism since they can prepare reads independently. When data has to be read, the selected bank must first load the selected row, also known as a page, into a buffer unique to each bank. If this page is ready when the request is made (page hit), the request will be serviced significantly faster than if the wrong page is loaded (page miss) or if no page is loaded (page empty). Due to the parallel nature of the channels and banks, the SDRAM has great potential for parallelism [13].

2.3 Bandwidth Usage Properties

The various levels in the memory hierarchy usually allow some form of parallel operation, or at least queues, in order to increase the utilization of the buses in question. An ARM Cortex-A15, for example, can have 16 outstanding loads and stores at any given time, while its cache is able to keep delivering data that is present in the cache while waiting for data that earlier

caused a cache miss. The memory controller for the main memory can direct requests to the different banks in an SDRAM, and if multiple channels are available, it can direct the requests over both channels in parallel. The parallel operation in the hierarchy is key to understanding how an application behaves under different contention for the memory bandwidth resources, as Eklöv et al. demonstrated [7]. They classified applications as either bandwidth sensitive or latency sensitive. When the load on the memory system increases, the latency will gradually increase, even if there is available bandwidth left. This is due to contention for specific parts of the memory system, which results in requests being placed in different queues. A bandwidth sensitive application is good at utilizing prefetching and is able to perform calculations while new data is being fetched. Latency sensitive applications are the opposite and have to stall while waiting for new data.

2.4 Paged Virtual Memory

Modern operating systems usually employ a memory management technique called paged virtual memory [12]. This works by giving each process its own memory space, a virtual memory space, independent of the actual physical memory space, in which only that process operates. A process then sees the available memory as one contiguous block segmented into pages and can usually address 4 GB of memory or more, independent of the actual available memory of the system. The size of these pages is usually 4 KB, but can differ depending on support from hardware and the operating system. When memory is allocated by a process, a binding is created between virtual pages and physical pages, i.e. pages in the physical memory. The operating system keeps track of all these mappings in a page table on a per process basis. The page tables reside in the main memory. One of the great benefits of paged virtual memory is the ability to scatter the memory space of a process arbitrarily in the physical memory, thus reducing the impact of memory fragmentation. If a process allocates a chunk of memory the size of 1,000 pages, the addresses of the chunk will be contiguous in the virtual memory space, but 1,000 contiguous physical pages are not necessary for the allocation. The 1,000 pages may be scattered throughout the physical memory. When a process accesses a virtual memory address, a lookup is done in the page table using the higher bits of the virtual memory address. The lower bits, the 12 least significant bits in the case of a 4 KB (2^12 bytes) page size, represent the offset within the page. If the mapping exists, the virtual address is translated into the corresponding physical address by taking the physical page address and using the same offset within it. Modern processors provide hardware support for this type of virtual memory with a memory management unit, MMU. It performs the translation from virtual memory addresses to physical memory addresses. To speed

Figure 2.3: Mapping from virtual addresses to physical addresses.

up the process a translation lookaside buffer, TLB, is used, which is a kind of cache memory that stores recently accessed pages and their corresponding mappings to physical memory. These can operate in a hierarchy in the same way as the normal caches. When a process performs a memory access and hardware support exists, a lookup is first done in the TLB. If the address exists there, the physical address is returned and the memory access is performed. Otherwise a so-called page walk is performed, which is the operation of accessing the page table in main memory, retrieving the requested mapping and storing it in the TLB. This is usually performed by dedicated hardware. If no such mapping exists, an exception is raised.

2.5 Experimental Platform

The platform our bandit is being tested on runs two clusters of ARM Cortex-A15 processors with four cores in each, all within a single die. Each core has a private L1 cache memory, each cluster shares an L2 cache memory and the entire system shares an L3 cache memory. The clusters are connected via the L2-cache system to a CoreLink CCN-504 Cache Coherent Network [1] which connects the clusters to each other and to the rest of the system. The main memory is connected to a CoreLink DMC-520 Dynamic Memory Controller [2] supporting two DDR3 modules. The documented bandwidth of the memory controller is around 15 GB/s. The L1 cache is 32 KB in size, uses 64 B cache lines and is 2-way associative. Its cache replacement policy is least recently used and it can have up to 6 different outstanding memory requests at any time.

Figure 2.4: Architecture of the experimental platform.

The L2 cache is 2 MB in size, uses 64 B cache lines and is 16-way associative. Its cache replacement policy is random selection and it can have 16 outstanding writes and 11 outstanding reads. The L3 cache has the same properties as the L2 cache except that it is 8 MB in size. The operating system is a custom Linux distribution patched with real-time patches. The Bandit is cross-compiled with GCC with -O2 optimizations enabled.

Chapter 3

Our Bandit

3.1 Design

3.1.1 Main objectives

In short, the objective of our implementation is to use as few shared resources as possible except for main memory bandwidth, to have a sufficiently realistic bandwidth usage pattern, to be easy to use and, finally, to be easy to port to different architectures. The most relevant shared resource to minimize in our case is the shared L2 cache memory. We do not focus on L3 cache usage, both to simplify the implementation and evaluation and because it is not relevant for the final use case of the platform. The Bandit is completely implemented in C, as it is close enough to the hardware for our needs.

3.1.2 Floating point data

Floating point operations should be avoided if possible in time critical sections, since they are significantly slower than normal integer operations. The data of ours that would benefit from being represented as floating point is time. In order to use integers instead, we have to represent time in a small enough unit not to lose any significant precision. We therefore chose to represent time as nanoseconds. Since we have a 32-bit platform, it is faster for us to store the time in a 32-bit integer. This, however, puts a limit on how long we can time things. Using an unsigned integer we get a maximum value of 4,294,967,295 ns, or roughly 4.3 seconds. Due to the high speed nature of computers, 4.3 seconds should be enough for our needs.
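As an illustration of this representation (a sketch of our own, not the thesis code; the name elapsed_ns32 is hypothetical), an elapsed time measured with clock_gettime() can be stored as a 32-bit nanosecond count like this:

    #include <stdint.h>
    #include <time.h>

    /* Elapsed time between two timespecs, truncated to 32-bit nanoseconds.
     * Wraps after 2^32 ns, i.e. roughly 4.3 seconds, which is acceptable
     * for timing short bursts of memory accesses. */
    static inline uint32_t elapsed_ns32(struct timespec start, struct timespec end)
    {
        int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
                   + (int64_t)(end.tv_nsec - start.tv_nsec);
        return (uint32_t)ns;
    }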

3.1.3 Location of Memory Accesses

In order to achieve the first two objectives, it is necessary to access specific memory addresses in a specific order. To generate a memory access, we need to cause a cache miss. A cache miss can be guaranteed in two ways: either by trying to access memory of a greater size than the target cache, or, if the number of ways in a set is limited, by accessing more cache lines that belong to the same cache set than there are ways. By doing the latter, we do not need to use up the entire cache memory. The usage pattern we aim for is one that utilizes the parallelism in the main memory as uniformly as possible, i.e. utilizes the channels and banks equally, and, if possible, controls the number of page hits and misses in the SDRAM. To access memory in specific cache sets we need to know how the mapping to cache sets is done. The least significant bits of the address denote the offset in the cache line, in our case the 6 least significant bits. Some of the following bits then denote the cache set. To calculate how many sets there are, we use the following equation:

    cache sets = cache size / (ways × line size)        (3.1)

The L2 cache in our case is 2 MB in size, has 16 ways and its line size is 64 bytes. Equation 3.1 gives us 2048 different cache sets, which need 11 bits to address them all. This means that the 11 bits after the cache line offset encode the cache set. Since they are offset by 6 bits for the addressing within the cache line, the cache sets will repeat every 2^17 bytes, or 128 KB, of memory. If we align memory allocations to this 128 KB boundary we can confine ourselves to a few cache sets and therefore generate cache misses without thrashing the entire cache. An easy way to step through this memory is to create a circularly linked list for each cache set that we use, with each element at a constant offset from the 128 KB boundary. Each element contains only a pointer to the next element. By stepping through these lists we generate the memory traffic. Our platform utilizes, as described in Section 2.5, a random eviction strategy for the cache. A perfect LRU strategy would allow us to use one more element than there are ways for each cache set, i.e. 17 elements in our L2 example. However, the random eviction strategy forces us to use significantly more elements than that. Using a simple simulator we created, described in Appendix A, we can determine that with twice as many elements as there are ways we get a hit rate of about 20%, and with four times as many elements the hit rate drops to about 2%, which should be enough for our needs. Now we know what we need to do in order to generate reads beyond the L2 cache, but we also need to read beyond the L3 cache. The requirement not to thrash the L3 cache is not as strict, though. Since the L3 cache has the same properties as the L2 cache, except that it is four times the

size, it should suffice to allocate four times as many elements as we anticipated earlier when calculating for the L2 cache. Taking all this into account, the circularly linked lists belonging to the cache sets should have at least 4 · 4 · 16, or 256, elements. To generate uniform accesses to the main memory we need to know the mapping from memory addresses to the different parts of the SDRAM. This translation is done by the memory controller; however, we cannot find this specific mapping in the technical reference manual for the memory controller [2]. A workaround for this problem that improves our chances of uniform accesses is simply to allocate more memory than we need for our cache miss purposes.
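To make the addressing concrete, the following sketch (our own, assuming one large contiguous 128 KB-aligned allocation; the thesis implementation instead finds suitable pages one by one, as described in Section 3.2.1) chains elements that all fall in the same cache set by stepping in 128 KB strides:

    #include <stdint.h>
    #include <stddef.h>

    #define STRIDE (128 * 1024)  /* the cache set layout repeats every 128 KB */
    #define NODES  256           /* at least 4 * 4 * 16 elements, as derived above */

    struct node { struct node *next; };  /* each element only holds a pointer to the next */

    /* Build a circular list whose elements all map to the same L2 cache set.
     * base points to NODES consecutive 128 KB blocks; set_offset selects which
     * cache line within each block the element is placed on. */
    static struct node *build_ring(uint8_t *base, size_t set_offset)
    {
        struct node *first = (struct node *)(base + set_offset);
        struct node *prev  = first;
        for (int i = 1; i < NODES; ++i) {
            struct node *cur = (struct node *)(base + (size_t)i * STRIDE + set_offset);
            prev->next = cur;
            prev = cur;
        }
        prev->next = first;  /* close the circle */
        return first;
    }

Stepping through such a ring with p = p->next then produces one read per element, each of which misses in the targeted cache with high probability.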

3.1.4 Controlling Bandwidth Usage

In order to generate different amounts of bandwidth usage, we take a set number of steps through the linked lists and then delay for a varying amount of time depending on how much bandwidth we want to use. The number of steps we take before the delay is important because there is an overhead associated with each delay. Fewer steps give us a finer degree of control over the bandwidth usage, but they limit our ability to generate a higher amount of traffic due to the overhead associated with the delays. By using a hardware timer in the ARM Cortex-A15 that has a very low overhead, we get very fine control over this delay. An alternative would be to use a normal busy loop instead. That solution would, however, reduce observability, as the delay would not be based directly on time but on loop iterations. This could be solved by calculating how long different iterations take to run, but it would take more effort. An upside would be that the overhead of the delays would be lower because of the simpler construct, but observability is prioritized in this case due to our lack of hardware counter support. We also want to be able to control the total bandwidth usage. This can be done in a number of different ways, although it is very intuitive to just enter the usage in absolute numbers, in our case the number of MB/s. This, in turn, can be achieved either by using pre-calculated values for the delay corresponding to different bandwidth usages or, as we do, by a proportional-integral controller, PI-controller. The PI-controller uses the current bandwidth usage as the process variable and the delay as the manipulated variable. A controller works by comparing the process variable, the bandwidth usage, to the target value, the target bandwidth usage, and then taking action depending on the difference. The difference between the observed value and the target value is called the error. The action is to change the manipulated variable so that the process variable approaches the target value. In our case that means modifying the delay so we achieve the target bandwidth usage. The proportional part and the integral part give us the change in our manipulated variable:

    u(t) = K_p e(t) + K_i ∫_a^b e(τ) dτ        (3.2)

where u(t) is the output, e(t) is the current error, K_p is the proportional gain constant and K_i is the integral gain constant. The proportional part provides a proportional change to the output, while the integral part provides a change depending on the accumulated error, thus compensating for long term errors [8]. The advantage of the controller is high flexibility and low sensitivity to changes in the memory access latency. This is very helpful when the Bandit is run in multiple threads, as we will look into in the next section. By timing each full iteration through all the 64 lists, and combining this time information with how much data we access during an iteration, we know how much bandwidth we are currently using. This is the only data we need in order to have a functioning controller. Because we know the length of the delay, we can also separate out the time taken for the actual memory accesses. This allows us to evaluate the current memory access latency in the system.

3.1.5 Measuring Target Application Bandwidth Usage

The basic idea behind measuring the bandwidth usage of the target application is that if the maximum bandwidth that the Bandit can expect to use is known, then the difference between that and the actually accessed bandwidth is the amount that the target is using. The problem is that the target application gets less bandwidth when the Bandit uses bandwidth compared to when the system is silent. However, if we know how much bandwidth the Bandit is using, then we can compare the target's bandwidth with a bandit's bandwidth in the same circumstances, and from this it can be possible to gain some information about the application's actual bandwidth usage. It would work as follows:

1. Run an application with the Bandit and get measurements.

2. Perform another run and replace the application with another bandit, configured so that we get the same measurements as in the previous step.

3. Run a bandit with the same configuration and get information on how much memory bandwidth it is using. This bandwidth usage should be the same as the original application's bandwidth usage.

3.1.6 Multithreading the Bandit

Multithreading is useful to generate even more bandwidth usage. We use the pthreads library for this, which lets us run all the Bandit threads under one process. The benefit of this is that the threads can communicate with each other via the shared memory space of the process; this helps the Bandit monitor its total memory bandwidth usage, and only one controller is needed to regulate it. However, it also works if the Bandits are run as different processes, in which case each bandit gets its own controller. The controller is then able to handle the effect of the other bandits running in parallel. When started with multiple threads, the Bandit pins the different threads to their own cores so they don't interfere with each other.

3.1.7 Portability

The main hardware dependencies in the Bandit are the memory layout and the hardware timer used for the delay. In order to avoid thrashing the shared caches, we need to adapt our memory placement to the architecture's cache size, number of ways and line width. Our implementation has hard coded values for those parameters, so a change of platform should, from the memory layout perspective, only require a correct setting of the parameters, but we do not investigate this in practice. The hardware timer used in the ARM CPU has a very low overhead associated with it, which allows it to be used to time the delay. However, if another architecture does not have a timer with similar properties, we have to either switch to the busy loop discussed earlier or accept the increased overhead and therefore a reduced maximum possible stolen bandwidth per bandit.

3.1.8 Usability

The goal of the interface to the Bandit is for it to be usable in a script environment so it can be used for automated tests. A simple command line interface that takes different parameters fulfills this goal. It can be used to control the number of executing threads, the amount of bandwidth that the threads should use in total and a verbosity setting that controls the amount of data output.

3.2 Implementation

3.2.1 Location of Memory Accesses

To implement our design from Section 3.1.3 we have to carefully choose the memory we use for our circularly linked lists.

In order to find addresses fulfilling the cache usage and main memory access demands, Eklöv et al. used a feature called huge pages in Linux to allocate 8 MB chunks of contiguous memory, significantly larger chunks compared to the standard 4 KB pages. In those chunks they placed several linked lists, which they then iterated through to generate memory traffic. However, our Linux installation does not support the huge pages feature. Luckily, we have another feature that allows us to find the physical address corresponding to our virtual addresses. The complete mapping of a process's virtual addresses can be looked up through the proc filesystem, specifically /proc/pid/pagemap. The specific way we did it can be found in Appendix B. The first step is to allocate memory that we know is at least aligned to the page size of the system. It is done with the following call:

    posix_memalign(&memory_array[k], page_alignment, size);

With the help of that function we can allocate as many pages as we need to find pages with the correct alignment. Finding pages with the correct alignment is then simply a matter of evaluating a modulo operation:

    if ((lookup_address(memory) % alignment) == offset)

The current method we use is to perform one allocation for each memory page, which is very slow as it results in a lot of system calls. The reason for doing this was to allow us to free the pages we did not need. A major drawback of the freeing was that it fragmented the memory of the system and resulted in a severe slowdown of the system, even after the Bandit had finished executing. The Bandit now keeps all the memory it allocates, which means that the Bandit uses around 32 MB of memory at the 256 element minimum we calculated in Section 3.1.3. A better alternative, given that we can't free memory, could have been to allocate several pages at once and then test them, thus saving system calls, but we did not do this due to time constraints. Once we have the memory, we partition the pages into 64 pieces, one for each cache line in the page, and create 64 different circularly linked lists with one element in each page, each list belonging to a separate cache set. This means that we use 64 out of the 2048 available cache sets, or about 3% of the available L2 cache. By having these 64 lists we are hopefully able to utilize the parallelism in the various levels of the memory hierarchy; however, we have no control over the page hits and misses in the SDRAM.
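For reference, the following is a minimal sketch of how a physical address can be looked up through /proc/self/pagemap. It is our own illustration of the mechanism that the lookup_address function from Appendix B relies on, not the thesis code itself, and error handling is kept to a bare minimum; each 64-bit pagemap entry holds the page frame number in bits 0-54 and a "page present" flag in bit 63, and reading it typically requires elevated privileges.

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Translate a virtual address of the calling process into a physical address.
     * Returns 0 if the page is not present or the lookup fails. */
    static uint64_t virt_to_phys(const void *vaddr)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;

        FILE *f = fopen("/proc/self/pagemap", "rb");
        if (f == NULL)
            return 0;

        long index = (long)((uintptr_t)vaddr / (uintptr_t)page_size);
        if (fseek(f, index * (long)sizeof(entry), SEEK_SET) != 0 ||
            fread(&entry, sizeof(entry), 1, f) != 1)
            entry = 0;
        fclose(f);

        if (!(entry & (1ULL << 63)))                 /* bit 63: page present */
            return 0;
        uint64_t pfn = entry & ((1ULL << 55) - 1);   /* bits 0-54: page frame number */
        return pfn * (uint64_t)page_size + ((uintptr_t)vaddr % (uintptr_t)page_size);
    }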

3.2.2 Bandwidth delay implementation

We use the implementation described in Section 3.1.4: the hardware timer is used for the delay between reads and we always perform 16 memory reads at a time, resulting in 64 * 16 bytes, or 1 KB, of memory read at a time. The reason we use this number is that it is the lowest number we found that can still effectively generate large enough amounts of bandwidth usage. With a lower number, we would not be able to leverage the parallelism in the memory hierarchy. The main function for generating bandwidth usage is shown in Listing C.1. It is mainly made up of two parts, the memory read part, shown in Listing C.3, and the delay part, shown in Listing C.2. The memory reading utilizes the circularly linked lists constructed in Section 3.2.1 and keeps an iterator for each list in order to remember the positions. As we can see in Listing C.4, the compiler unrolls the memory read loop, thereby improving the Bandit's bandwidth usage performance.

3.2.3 PI-Controller implementation

The controller is implemented as described in Section 3.1.4. At first we tried to implement it using only integers, but that was more difficult than using floating point numbers. In the end the controller is not run very frequently, so the extra clock cycles required for floating point numbers are acceptable. The controller also has two modifications in order to improve its speed and reduce oscillations. The first one handles the fact that the delay has to increase exponentially in order to decrease the bandwidth usage. The ratio in the following listing is used to scale the delay modification up or down depending on the delay's size compared to the current memory read step time:

    float wait_step_ratio = new_wait_time / (float) step_time;
    if (wait_step_ratio > RATIO_MAX)
        wait_step_ratio = RATIO_MAX
            + (wait_step_ratio - RATIO_MAX) * RATIO_STEP_DOWN;
    else if (wait_step_ratio < RATIO_MIN)
        wait_step_ratio = RATIO_MIN;

The integral part in a PI-controller can lead to overshooting and oscillations [8]. In order to minimize this effect, we reduce and flip the sign of the accumulated error and apply an adjustment to the new delay at the moment that we pass the target value, as shown in the following listing:

    const float INTEGRAL_MODIFIER = -0.1;

    if ((difference < EPSILON && difference > -EPSILON) ||
        !((difference >= 0) ^ (old_difference < 0))) {
        int_diff = int_diff * INTEGRAL_MODIFIER;
        float overshoot_factor = difference / (float) total_usage;
        int wait_modification =
            (float)(wait_time + step_time) * overshoot_factor;
        new_wait_time -= wait_modification;
    }
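For completeness, a minimal sketch of the basic PI update from Equation 3.2, before the two modifications above, could look as follows. This is our own illustration with hypothetical names and gain values, not the code from Appendix D; the controller output is applied as a change to the delay, as described in Section 3.1.4.

    /* One PI-controller step: adjust the delay (manipulated variable) so that
     * the measured bandwidth (process variable) approaches the target bandwidth.
     * KP and KI are illustrative gains only. */
    #define KP 0.5f
    #define KI 0.05f

    static float integral_error = 0.0f;   /* accumulated error across iterations */

    static int pi_update(int current_delay_ns, int target_mbps, int measured_mbps)
    {
        float error = (float)(measured_mbps - target_mbps);  /* too much bandwidth -> lengthen delay */
        integral_error += error;
        int new_delay = current_delay_ns + (int)(KP * error + KI * integral_error);
        return new_delay > 0 ? new_delay : 0;                /* the delay can never be negative */
    }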

3.2.4 Measuring Target Application Bandwidth Usage

In order to get the target application's bandwidth usage as described in Section 3.1.5, we first need the bandwidth usage of the Bandit. We get that as follows:

    unsigned int mem_usage = number_kilobytes * 1000 * 1000
                             / (median_time);                     // KB per ms
    my_data->memory_usage[my_data->id] = mem_usage * 1000 / 1024; // MB per s

Care has been taken to avoid integer overflows in the different operations. However, it was most probably an unnecessary optimization to only use 32-bit integers here and not 64-bit integers or floating point operations, since the operation is not very frequent. We then use these values to get the target application's usage:

    app_usage = baseline_memory_usage[my_data->bandit_count - 1] - total_usage;
    app_usage = app_usage > 0 ? app_usage : 0;
    app_need = (app_usage * my_data->bandit_count * 100 / total_usage);
    printf("Total usage: %u MB, App usage: %d MB, "
           "App need in bandit terms: %d %%\n",
           total_usage, app_usage, app_need);

The "app need in bandit terms" in the code refers to how much, in percent, the application is using compared to a bandit. The baseline memory usage is the value produced by the benchmark described in Section

3.2.5 Multithreading the Bandit

The Bandit automatically pins each thread to a separate core in order to isolate the threads from each other. How this is done is shown in the following code snippet:

    pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    for (current_core = 0; current_core < cores; ++current_core) {
        if (CPU_ISSET(current_core, &cpuset)) {
            if (nr_cores_found == id)
                found_core = 1;
            else
                ++nr_cores_found;
        }
        if (found_core)
            break;
    }
    if (found_core) {
        CPU_ZERO(&cpuset);
        CPU_SET(current_core, &cpuset);
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    } else {
        fprintf(stderr, "Not enough cores assigned to the process\n");
        exit(-1);
    }

The CPU affinity has to be set beforehand, and the process must be assigned at least as many cores as there are threads. Each thread handles its own measurements of bandwidth usage, but only one thread uses the data for the regulator and any printouts to the user. No locks are used for this shared data, as they slowed down the bandit noticeably. Since each value is written by only one thread and read by only one thread, and the penalty for occasionally using a stale value is insignificant, this is not a problem.

3.2.6 Portability

The identified portability issues are, as pointed out in Section 3.1.7, the memory layout and the hardware timer. A software portability issue is also created by the method we use to acquire the physical memory addresses, as seen in Appendix B. The memory layout can at least be broken down into parameters that can be tuned to match the current system. However, for time reasons we only do this for one specific layer of cache, as seen in the code snippet below:

    #define LLC_LINESIZE 64

    #define LLC_ASSOC 16
    #define LLC_SIZE_IN_MB 2
    #define LLC_SIZE (LLC_SIZE_IN_MB * MB)
    #define LLC_SET_SIZE (LLC_ASSOC * LLC_LINESIZE)
    #define LLC_NB_SETS (LLC_SIZE / LLC_SET_SIZE)

LLC_ASSOC in this case refers to the associativity, or the number of ways of the cache as we have referred to it. The hardware timer is used in the following manner:

    static inline uint32_t get_cpu_time32(void)
    {
        u32 cvall, cvalh;
        asm volatile("mrrc p15, 0, %0, %1, c14" : "=r" (cvall), "=r" (cvalh));
        return cvall;
    }

If a low overhead timer is available for the platform, it is just a matter of replacing the contents of this function. But that may not always be the case. We still decided that this hardware timer is better for us due to the increased observability we get by using it, especially since the delay is represented in actual nanoseconds as described in Section 3.1.2. The method used to get the physical memory addresses is not portable to older Linux versions or other operating systems in general, and it would have to be replaced in its entirety when compiling for another system. We did, however, not find another way to do this on our platform, so it is good enough for us.

3.2.7 User interface

The implementation of the user interface uses getopt for parsing of the arguments due to its simplicity. The result is that arguments are always prefaced with a flag, which avoids any required ordering of the arguments. The Bandit is able to produce different output, but the default setting is to be silent. Example usage follows. The Bandit help string:

    root@du1:~# ./pbandit --help
    Usage:
      --benchmark
      -m --measure
      -v --verbose=LEVEL
      -t --threads=NUM
      -b --bandwidth=TARGET

Usage of a Bandit with 2 threads with the target usage of 2000 MB/s, outputting the total used bandwidth:

    root@du1:~# taskset -c 1,2 ./pbandit -t 2 -b 2000 -v 1
    Total usage: 1957 MB
    Total usage: 2000 MB
    Total usage: 2028 MB

Usage of a Bandit with 3 threads with the target usage of 1000 MB/s, outputting the total used bandwidth, the time of each memory read step, the wait time between read steps and the individual thread bandwidth usage:

    root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -b 1000 -v 3
    Thread 2 affinity: 40
    Thread 0 affinity: 10
    Thread 1 affinity: 20
    Thread 2: ns, step time: ns, wait time 2500 ns, memory usage 333 MB
    Thread 0: ns, step time: ns, wait time 2500 ns, memory usage 334 MB
    Thread 1: ns, step time: ns, wait time 2500 ns, memory usage 328 MB
    Total usage: 1000 MB
    Thread 2: ns, step time: ns, wait time 2496 ns, memory usage 335 MB
    Thread 0: ns, step time: ns, wait time 2496 ns, memory usage 333 MB
    Thread 1: ns, step time: ns, wait time 2496 ns, memory usage 334 MB
    Total usage: 1001 MB

Usage of a Bandit with the measuring feature active:

    root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -v 1 --measure
    Total usage: 2198 MB, App usage: 543 MB, App need in bandit terms: 74 %
    Total usage: 2244 MB, App usage: 497 MB, App need in bandit terms: 66 %
    Total usage: 2280 MB, App usage: 461 MB, App need in bandit terms: 60 %

3.2.8 Low Overhead Bandit

At regular intervals, the Bandit performs tasks other than just using bandwidth. The controller, the measuring and the printing all cost the Bandit time. In our evaluation, we therefore use an alternate bandit that has all those features stripped away; we call it the Low Overhead Bandit. The normal Bandit's main loop looks like the following:

    void *bandit(void *thread_options)
    {
        struct bandit_options *options = thread_options;
        struct bandit_data my_data;

        init_bandit(&my_data, options);
        pthread_barrier_wait(&barrier);
        while (1) {
            normal_bandit_iteration(&my_data);
            bandit_update(&my_data);
            bandit_print(&my_data);
            bandit_commit(&my_data);
        }
        free_bandit(&my_data);
    }

While the Low Overhead Bandit looks like this:

    void *low_overhead_bandit(void *thread_options)
    {
        struct bandit_options *options = thread_options;
        struct bandit_data my_data;

        init_bandit(&my_data, options);
        pthread_barrier_wait(&barrier);
        while (1)
            normal_bandit_iteration(&my_data);
        free_bandit(&my_data);
    }

By comparing these two bandits with each other we can find out whether the overhead of the tasks other than the Bandit's main function interferes with its operation.
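Relating back to the portability discussion in Section 3.2.6: on a platform without a comparable low-overhead hardware timer, the contents of get_cpu_time32 could, as a sketch, be replaced by a clock_gettime-based fallback like the one below. This is our own suggestion, not part of the thesis implementation; note that it returns nanoseconds rather than raw counter ticks, which the delay calculation would have to take into account, and that each call is more expensive than reading the CP15 counter.

    #include <stdint.h>
    #include <time.h>

    /* Fallback timer: monotonic time truncated to 32-bit nanoseconds.
     * Higher overhead than the ARM generic timer, but portable to any
     * POSIX system that provides CLOCK_MONOTONIC. */
    static inline uint32_t get_cpu_time32_fallback(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint32_t)((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
    }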

Chapter 4

Evaluation

4.1 Method

Using hardware performance counters to measure the performance and function of the Bandit would have been optimal, as they provide accurate counts of many different performance metrics. The ARM Cortex-A15 does support hardware counters that can measure cache hits and misses, off-chip memory accesses and more; however, for software that uses them to work, the operating system must be compiled with support for them. The Linux installation we are using lacks this feature. Instead, we rely on micro-benchmarks that use the timing functions also used by the Bandit in order to highlight different effects of the Bandit. In order to verify that the Bandit is indeed using memory bandwidth whilst minimizing the impact on the L2 cache, we use a micro-benchmark that allocates memory normally and then iterates through that memory with read operations while timing it. The program can be run with different sizes of the allocated memory, allowing us to run it in a way that only accesses the different caches, or to force it to only access the main memory by iterating over enough memory. It has virtually no overhead when compared to the Bandit, as the access loop is unrolled and performs unrelated reads that can be run in parallel. By measuring the total time to iterate over the entire memory block we get the access time to memory and the total bandwidth when parallel accesses are performed. Since the memory accesses are sequential, prefetching can also occur; however, as long as all the tests are performed this way, the prefetcher should not be a problem since it is present in all the tests. The memory access loop of the reference program is shown in Listing E.1. We refer to this micro-benchmark as the reference program throughout the evaluation. To be able to execute the Bandit and the reference program on different cores, and to contain them in the separate clusters of the platform, we use the taskset command. The platform has, as described in Section 2.5, two

CPU clusters, and in the first one most of the Linux housekeeping processes are contained. All of the tests, with the exception of a cross-cluster test, are performed in the second cluster in order to minimize noise from other applications. All reported measurement values are averages of 10 runs. When the slowdown of an application's execution time is of interest, we present it as a relative execution time compared to the original execution time. We calculate it as follows:

    slowdown = new time / base time        (4.1)

The advantage of this value is that it keeps the appearance of the original graph while giving us normalized values that allow us to compare different tests more easily. It is equally applicable when comparing latencies.

4.2 Platform Performance

4.2.1 Memory Access Latency

By using the reference program we can get the numbers for the different memory access latencies of the platform. We just need to select the correct size for the memory block in the reference program. Due to the random replacement policy of the L2 and L3 caches, the memory block needs to be a little larger compared to a case with an LRU replacement policy. Using the same simulator as in Section 3.1.3, described in Appendix A, again gives us that the hit rate for a memory block twice the size of a cache is about 20%, and if we double once more, to a total of four times the size of the cache, we reach a hit rate around 2%, which is low enough for our purposes. There is also another problem: if the allocated memory is of the same size as the cache, we cannot guarantee that all pages will map uniformly to the different cache sets, which leads to some sets with too many cache lines, which in turn leads to misses even though the cache is large enough. To compensate for this we use a smaller memory block that is half the size of the target cache.

    Access type    Memory block size    Access latency
    L1 Cache       16 KB                1.8 ns
    L2 Cache       1024 KB              5.2 ns
    L3 Cache       4096 KB              12.1 ns
    Main memory    64 MB                24.1 ns

Table 4.1: Observed memory access latencies of the different memory types.

Taking these things into consideration, we select the sizes 16 KB for L1 access speed tests, 512 KB for L2 access speed tests, 6 MB for L3 access

speed tests and 32 MB for main memory access speed tests. The results are shown in Table 4.1. One thing to note is that these results show the access latency for a series of reads, which means that prefetching is a factor in the measurements, especially in the main memory case.

4.2.2 Off-Chip Memory Bandwidth

By running multiple instances of the reference program, each iterating over memory blocks of size 32 MB, we can measure the limits of the main memory bandwidth for the cores and clusters. By first testing the available bandwidth for one cluster, we find that one core saturates almost all the available bandwidth in one cluster, which is about 2700 MB/s. Next we find out how the two clusters of four cores each affect each other's bandwidth. In an ideal case with no contention between the clusters, we should be able to achieve about 5400 MB/s. The results are in Table 4.2 and a more gradual bandwidth usage pattern is shown in Figure 4.1.

Figure 4.1: Accessed bandwidth when load is generated in both CPU clusters (axes: bandwidth usage of cluster 1 and cluster 2, in MB/s).

An interesting thing in Table 4.2 is how little the two clusters affect each other. The bandwidth per cluster is only reduced by approximately 6% from 5400 MB/s when both clusters are saturated. It is also worth noting how relatively little bandwidth we have access to, especially if we limit ourselves to one core. If we compare this to the platform used by Eklöv et al. [7] they


Memory Management Outline. Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging Memory Management Outline Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging 1 Background Memory is a large array of bytes memory and registers are only storage CPU can

More information

Computer Systems Structure Main Memory Organization

Computer Systems Structure Main Memory Organization Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory

More information

On Benchmarking Popular File Systems

On Benchmarking Popular File Systems On Benchmarking Popular File Systems Matti Vanninen James Z. Wang Department of Computer Science Clemson University, Clemson, SC 2963 Emails: {mvannin, jzwang}@cs.clemson.edu Abstract In recent years,

More information

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

PERFORMANCE TUNING ORACLE RAC ON LINUX

PERFORMANCE TUNING ORACLE RAC ON LINUX PERFORMANCE TUNING ORACLE RAC ON LINUX By: Edward Whalen Performance Tuning Corporation INTRODUCTION Performance tuning is an integral part of the maintenance and administration of the Oracle database

More information

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

More information

Testing Database Performance with HelperCore on Multi-Core Processors

Testing Database Performance with HelperCore on Multi-Core Processors Project Report on Testing Database Performance with HelperCore on Multi-Core Processors Submitted by Mayuresh P. Kunjir M.E. (CSA) Mahesh R. Bale M.E. (CSA) Under Guidance of Dr. T. Matthew Jacob Problem

More information

Report Paper: MatLab/Database Connectivity

Report Paper: MatLab/Database Connectivity Report Paper: MatLab/Database Connectivity Samuel Moyle March 2003 Experiment Introduction This experiment was run following a visit to the University of Queensland, where a simulation engine has been

More information

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in

More information

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

Laboratory Report. An Appendix to SELinux & grsecurity: A Side-by-Side Comparison of Mandatory Access Control & Access Control List Implementations

Laboratory Report. An Appendix to SELinux & grsecurity: A Side-by-Side Comparison of Mandatory Access Control & Access Control List Implementations Laboratory Report An Appendix to SELinux & grsecurity: A Side-by-Side Comparison of Mandatory Access Control & Access Control List Implementations 1. Hardware Configuration We configured our testbed on

More information

Mass Storage Structure

Mass Storage Structure Mass Storage Structure 12 CHAPTER Practice Exercises 12.1 The accelerating seek described in Exercise 12.3 is typical of hard-disk drives. By contrast, floppy disks (and many hard disks manufactured before

More information

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann Andre.Brinkmann@uni-paderborn.de Universität Paderborn PC² Agenda Multiprocessor and

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

Motivation: Smartphone Market

Motivation: Smartphone Market Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details Thomas Fahrig Senior Developer Hypervisor Team Hypervisor Architecture Terminology Goals Basics Details Scheduling Interval External Interrupt Handling Reserves, Weights and Caps Context Switch Waiting

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

CPU Scheduling Outline

CPU Scheduling Outline CPU Scheduling Outline What is scheduling in the OS? What are common scheduling criteria? How to evaluate scheduling algorithms? What are common scheduling algorithms? How is thread scheduling different

More information

Fast Arithmetic Coding (FastAC) Implementations

Fast Arithmetic Coding (FastAC) Implementations Fast Arithmetic Coding (FastAC) Implementations Amir Said 1 Introduction This document describes our fast implementations of arithmetic coding, which achieve optimal compression and higher throughput by

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

Capacity Planning Process Estimating the load Initial configuration

Capacity Planning Process Estimating the load Initial configuration Capacity Planning Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting

More information

Data Storage - I: Memory Hierarchies & Disks

Data Storage - I: Memory Hierarchies & Disks Data Storage - I: Memory Hierarchies & Disks W7-C, Spring 2005 Updated by M. Naci Akkøk, 27.02.2004 and 23.02.2005, based upon slides by Pål Halvorsen, 11.3.2002. Contains slides from: Hector Garcia-Molina,

More information

Communication Protocol

Communication Protocol Analysis of the NXT Bluetooth Communication Protocol By Sivan Toledo September 2006 The NXT supports Bluetooth communication between a program running on the NXT and a program running on some other Bluetooth

More information

OpenFlow Based Load Balancing

OpenFlow Based Load Balancing OpenFlow Based Load Balancing Hardeep Uppal and Dane Brandon University of Washington CSE561: Networking Project Report Abstract: In today s high-traffic internet, it is often desirable to have multiple

More information

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux White Paper Real-time Capabilities for Linux SGI REACT Real-Time for Linux Abstract This white paper describes the real-time capabilities provided by SGI REACT Real-Time for Linux. software. REACT enables

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

GPU Hardware Performance. Fall 2015

GPU Hardware Performance. Fall 2015 Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

COS 318: Operating Systems. Virtual Memory and Address Translation

COS 318: Operating Systems. Virtual Memory and Address Translation COS 318: Operating Systems Virtual Memory and Address Translation Today s Topics Midterm Results Virtual Memory Virtualization Protection Address Translation Base and bound Segmentation Paging Translation

More information

Improving the performance of data servers on multicore architectures. Fabien Gaud

Improving the performance of data servers on multicore architectures. Fabien Gaud Improving the performance of data servers on multicore architectures Fabien Gaud Grenoble University Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma Sardes (INRIA/LIG) December 2, 2010

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

SYSTEM ecos Embedded Configurable Operating System

SYSTEM ecos Embedded Configurable Operating System BELONGS TO THE CYGNUS SOLUTIONS founded about 1989 initiative connected with an idea of free software ( commercial support for the free software ). Recently merged with RedHat. CYGNUS was also the original

More information

Exploring RAID Configurations

Exploring RAID Configurations Exploring RAID Configurations J. Ryan Fishel Florida State University August 6, 2008 Abstract To address the limits of today s slow mechanical disks, we explored a number of data layouts to improve RAID

More information

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING OBJECTIVE ANALYSIS WHITE PAPER MATCH ATCHING FLASH TO THE PROCESSOR Why Multithreading Requires Parallelized Flash T he computing community is at an important juncture: flash memory is now generally accepted

More information

I3: Maximizing Packet Capture Performance. Andrew Brown

I3: Maximizing Packet Capture Performance. Andrew Brown I3: Maximizing Packet Capture Performance Andrew Brown Agenda Why do captures drop packets, how can you tell? Software considerations Hardware considerations Potential hardware improvements Test configurations/parameters

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

Eight Ways to Increase GPIB System Performance

Eight Ways to Increase GPIB System Performance Application Note 133 Eight Ways to Increase GPIB System Performance Amar Patel Introduction When building an automated measurement system, you can never have too much performance. Increasing performance

More information

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to: 55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................

More information

Garbage Collection in the Java HotSpot Virtual Machine

Garbage Collection in the Java HotSpot Virtual Machine http://www.devx.com Printed from http://www.devx.com/java/article/21977/1954 Garbage Collection in the Java HotSpot Virtual Machine Gain a better understanding of how garbage collection in the Java HotSpot

More information

Lecture 17: Virtual Memory II. Goals of virtual memory

Lecture 17: Virtual Memory II. Goals of virtual memory Lecture 17: Virtual Memory II Last Lecture: Introduction to virtual memory Today Review and continue virtual memory discussion Lecture 17 1 Goals of virtual memory Make it appear as if each process has:

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

x64 Servers: Do you want 64 or 32 bit apps with that server?

x64 Servers: Do you want 64 or 32 bit apps with that server? TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

SMock A Test Platform for the Evaluation of Monitoring Tools

SMock A Test Platform for the Evaluation of Monitoring Tools SMock A Test Platform for the Evaluation of Monitoring Tools User Manual Ruth Mizzi Faculty of ICT University of Malta June 20, 2013 Contents 1 Introduction 3 1.1 The Architecture and Design of SMock................

More information

Why Relative Share Does Not Work

Why Relative Share Does Not Work Why Relative Share Does Not Work Introduction Velocity Software, Inc March 2010 Rob van der Heij rvdheij @ velocitysoftware.com Installations that run their production and development Linux servers on

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Introduction to Embedded Systems. Software Update Problem

Introduction to Embedded Systems. Software Update Problem Introduction to Embedded Systems CS/ECE 6780/5780 Al Davis logistics minor Today s topics: more software development issues 1 CS 5780 Software Update Problem Lab machines work let us know if they don t

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System

File System & Device Drive. Overview of Mass Storage Structure. Moving head Disk Mechanism. HDD Pictures 11/13/2014. CS341: Operating System CS341: Operating System Lect 36: 1 st Nov 2014 Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati File System & Device Drive Mass Storage Disk Structure Disk Arm Scheduling RAID

More information

Reliable Systolic Computing through Redundancy

Reliable Systolic Computing through Redundancy Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Quantum Support for Multiprocessor Pfair Scheduling in Linux

Quantum Support for Multiprocessor Pfair Scheduling in Linux Quantum Support for Multiprocessor fair Scheduling in Linux John M. Calandrino and James H. Anderson Department of Computer Science, The University of North Carolina at Chapel Hill Abstract This paper

More information

Multicore Programming with LabVIEW Technical Resource Guide

Multicore Programming with LabVIEW Technical Resource Guide Multicore Programming with LabVIEW Technical Resource Guide 2 INTRODUCTORY TOPICS UNDERSTANDING PARALLEL HARDWARE: MULTIPROCESSORS, HYPERTHREADING, DUAL- CORE, MULTICORE AND FPGAS... 5 DIFFERENCES BETWEEN

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

POSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2)

POSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2) RTOSes Part I Christopher Kenna September 24, 2010 POSIX Portable Operating System for UnIX Application portability at source-code level POSIX Family formally known as IEEE 1003 Originally 17 separate

More information