SIMULATING NON-UNIFORM MEMORY ACCESS ARCHITECTURE

SIMULATING NON-UNIFORM MEMORY ACCESS ARCHITECTURE FOR CLOUD SERVER APPLICATIONS

Joakim Nylund

Master of Science Thesis
Supervisor: Prof. Johan Lilius
Advisor: Dr. Sébastien Lafond
Embedded Systems Laboratory
Department of Information Technologies
Åbo Akademi University
Autumn 2011

ABSTRACT

The purpose of this thesis is to evaluate and define architectural candidates for cloud based servers. The research focuses on the interconnect and memory topology of multi-core systems. One specific memory design is investigated, and the Linux support for the architecture is tested and analyzed with the help of a full-system simulator with a modified memory architecture. The results demonstrate how available tools in Linux can be used to efficiently run tasks on separate CPUs on large systems with many processing elements.

Keywords: Interconnect, Cloud Computing, NUMA, Linux, Simics

LIST OF FIGURES

2.1  The Memory Hierarchy [35]
2.2  Quad-core AMD Opteron Processor [1]
2.3  Simple SMP System
2.4  Cache Coherence in a Dual-core CPU
2.5  Write invalidate bus snooping protocol [2]
2.6  Write broadcast bus snooping protocol [2]
3.1  Common Network Topologies [3]
3.2  6D Mesh/Torus Architecture [25]
3.3  Open Compute Motherboard based on Intel CPU and QuickPath Interconnect [4]
3.4  Open Compute Motherboard based on AMD CPU and HyperTransport Interconnect [4]
3.5  Next Generation ARM SoC [5]
4.1  The Simplest NUMA System [42]
4.2  Different Motherboard Topologies for the Quad-core AMD Opteron CPU
4.3  ACPI System Locality Information Table (SLIT) [34]
4.4  ACPI Static Resource Affinity Table (SRAT) [34]
5.1  Cache Read and Write of Core-0 and Core-1
5.2  Cluster NUMA Architecture
5.3  A Two Level Cache System [43]
5.4  The hardware for Simics NUMA
5.5  All to all message passing on one core
5.6  All to all message passing on four cores
5.7  Comparison of one core and four cores (Big)
5.8  All to one message passing on one core
5.9  All to one message passing on four cores
5.10 Comparison of one core and four cores (Bang)
6.1  Future Work Illustrated
B.1  Emark Benchmark Results

CONTENTS

Abstract
List of Figures
Contents

1 Introduction
  1.1 Cloud Software Program
  1.2 Thesis Structure

2 Memory Architecture
  2.1 Introduction
  2.2 The Memory Hierarchy
    2.2.1 Primary Storage
    2.2.2 Secondary Storage
  2.3 Locality
  2.4 Shared Memory
    2.4.1 Symmetric Multiprocessing
    2.4.2 Coherence

3 Interconnect
  3.1 Introduction
  3.2 Multiprocessing
  3.3 Network Topology
    3.3.1 High-performance Computing
    3.3.2 Intel QuickPath Interconnect and HyperTransport
    3.3.3 Arteris Interconnect
    3.3.4 Open Compute
  3.4 ARM Architecture
    3.4.1 Cortex-A Series
    3.4.2 Advanced Microcontroller Bus Architecture (AMBA)
    3.4.3 Calxeda 120 x Quad-core ARM Server Chip

4 Non-Uniform Memory Access (NUMA)
  4.1 Introduction
  4.2 Hardware
  4.3 Software
  4.4 Advanced Configuration and Power Interface
    4.4.1 System Locality Information Table
    4.4.2 Static Resource Affinity Table
  4.5 The Linux Kernel
    4.5.1 NUMA Aware Scheduler
    4.5.2 Memory Affinity
    4.5.3 Processor Affinity
    4.5.4 Fake NUMA Nodes
    4.5.5 CPUSET
    4.5.6 NUMACTL

5 Implementation
  5.1 Approach
  5.2 Simics Full-system Simulator
    5.2.1 History and Features
    5.2.2 Working With ARM Architecture
    5.2.3 gcache and trans-staller
  5.3 Oracle VM VirtualBox
  5.4 Erlang
    5.4.1 Distributed Programming
    5.4.2 Erlang Nodes
    5.4.3 Asynchronous Message Passing
  5.5 Results
  5.6 Analysis of Emark Benchmarks

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

Bibliography
Swedish Summary
A Simics Staller Module Implementation
B Emark Results

CHAPTER ONE

INTRODUCTION

As the core count in many architectures is constantly growing, the rest of the system also has to be updated to meet the higher demands the processing units set. One of these demands concerns the interconnect. As more processing power is available, the interconnect has to be able to move significantly more data than before. Another demand concerns the memory. With a traditional setup, there is only one memory available via one interconnect. When several processing elements are continuously asking for data, they are going to spend most of their time waiting, unless memory is available in several places through several paths. This work investigates the Non-Uniform Memory Access (NUMA) design, a memory architecture tailored for many-core systems, and presents a method to simulate this architecture for the evaluation of cloud based server applications. The work also introduces and uses the NUMA capabilities found in the Linux kernel, and results from tests running on a simulated NUMA interconnect topology are presented and analyzed.

1.1 Cloud Software Program

The Cloud Software Program is a four-year research program organized by Tieto- ja viestintäteollisuuden tutkimus TIVIT Oy [6]. The goal of the program is to increase the international competitiveness of Finnish cloud based software. A Cloud Server From Finland could contribute with, among other things, a skillful implementation of a sustainable open system [7]. At the Embedded Systems Laboratory at Åbo Akademi University we provide

the Cloud project with research related to energy efficiency, with the most current interest in both power-smart scheduling and effective interconnects between processing units and memory.

1.2 Thesis Structure

Chapter 2 presents to the reader the common memory architecture of modern computers and embedded systems, with descriptions of different factors affecting the memory in terms of speed and scalability. In Chapter 3 we explain and compare different interconnect topologies and present how the interconnect is implemented in different architectures. A deeper analysis of the Non-Uniform Memory Access system is included in Chapter 4, together with an examination of the Linux kernel's NUMA capabilities. Chapter 5 describes the work process of this thesis and presents the results acquired from Emark Erlang benchmarks running in Simics. Finally, Chapter 6 sums up the presented material and examines possible future work in this field.

CHAPTER TWO

MEMORY ARCHITECTURE

2.1 Introduction

Memory is, and has always been, a fundamental part of every computer system. Today, an increasing number of electronic devices have become more advanced and therefore act more or less like computers, consisting of a processing unit with a specific memory setup. At the same time, as devices designed to do simple tasks have become more complex, the already advanced computer systems have also progressed even further. Computer systems with the latest technology and fast processing elements have become more efficient in many ways over the years, and one of the components that has evolved and grown in complexity is the memory architecture. Still, the principal memory architecture has remained the same for all these years; already in the 1940s, von Neumann stated that a computer memory has to be engineered in a hierarchical manner [24].

2.2 The Memory Hierarchy

Even though the idea and usage of a memory hierarchy has remained the same for years, the size of the hierarchy has increased and the number of layers seems to be slowly but constantly growing. Most recently, new layers of cache memory [33] and even Non-Uniform Memory Access have become present in the memory hierarchy of some general-purpose CPUs [35]. These new layers

of memory have been added to the hierarchy in order to manage the higher demands that the increasing speed and growing number of CPUs place on the whole memory subsystem. The computer memory can basically be divided into two groups. The first one is the primary storage, which consists of fast memory directly accessible by the CPU. This type of memory is volatile, meaning it cannot save its state unless it is powered on. Consequently, when a computer system is powered off or rebooted, all data in the primary storage is lost. The primary storage is also, other than fast, typically small and expensive [35]. Secondary storage is in many ways the opposite of primary storage. Firstly, it is a non-volatile memory. This means it can save its state even when powered off. Accordingly, when a computer system is powered off or rebooted, all the data stored in the secondary storage is saved and accessible the next time the system is started. Secondary storage is often called external memory, and it is characterized by being large, slow and cheap. Secondary storage cannot be accessed by the CPU directly, which makes it even more complicated and slower to access than primary memory [35]. The memory hierarchy of a computer is illustrated in Figure 2.1.

Figure 2.1: The Memory Hierarchy [35].

2.2.1 Primary Storage

The purpose of the hierarchy is to make it possible for the CPU to quickly get access to data, so that the CPU can spend more time working instead of waiting for data. A modern computer system moves data that is needed now, and in the near future, from the slower memory up to the faster memory. This is why the small and expensive fast memory is located near the CPU, so that the processor can quickly access the needed data. The top half of the memory hierarchy in Figure 2.1 is composed of the primary storage, which consists of the following parts:

CPU Registers

The CPU registers lie at the very top of the memory hierarchy. There are only a few processor registers in a CPU, but these are very fast. A computer moves data from other memory into the registers, where it is used for different calculations. Most new computers are based on the x86 instruction set architecture (ISA) and have both hardware and software support for the most recent 64-bit x86 instruction set. This architecture consists of 16 x 64-bit general-purpose registers. In comparison, older computers using the 32-bit x86 instruction set only have 8 x 32-bit general-purpose registers [39]. The popular ARMv7 instruction set architecture, used in basically all smartphones and tablets, has 16 general-purpose registers, all 32 bits long, making it basically a compromise, in terms of CPU registers, between the 32- and 64-bit x86 architectures [23]. When writing a program for some architecture, the compiler usually takes care of which variables are to be saved in CPU registers, but to optimize a system manually, register variables can be declared by the developer in programming languages like C with: register int n; [8].

Cache

Cache memory is the fast memory which lies between the main memory and the CPU. Cache memory stores the most recently accessed data and instructions, allowing future accesses by the processor to the particular data and instructions to be handled much faster. Some architectures, like AMD's Quad-Core Opteron presented in Figure 2.2, have up to three different cache levels (L1-L2-L3). The lower the level, the smaller and faster the memory is. The Opteron processor has separate instruction (64 KB) and data (64 KB) L1

caches for every processing core. We can see illustrated in Figure 2.2 that the 512 KB L2 caches are also private for every core, but the biggest, 8 MB L3 cache is a memory shared by all cores [1].

Figure 2.2: Quad-core AMD Opteron Processor [1].

Main Memory

After the caches comes the main memory in the hierarchy. It usually consists of Dynamic Random-Access Memory (DRAM), which means it is relatively cheap, but still quite fast [35]. Existing mainstream smartphones and computers implementing either the ARM or x86 architecture typically have a main memory of between 1-4 GB [9][10]. The DRAM can easily be upgraded on most computer systems, making it a quick and cheap way of increasing the performance of the system. This may, for instance, allow older computers to be upgraded with modern operating systems, as the newer operating systems often have higher main memory size requirements.

NUMA

This layer of the memory hierarchy is usually not present in a mainstream computer architecture, but will most likely be in the future, as the core count

is constantly growing also for home computers. Today, systems like the AMD Opteron (see Figure 2.2), which is designed for high-performance computing and servers, implement a Non-Uniform Memory Access (NUMA) architecture. This means that every core has local main memory that is near the processing unit, which results in fast memory access times for all cores to their local memory. But all memory nodes are also accessible by the other, more distant cores. To those cores the memory is remote, and the access times are much slower compared to accessing local memory. The speed of both local and remote accesses is always dependent on the particular setup, with the different physical locations of the CPUs and memories [42].

Virtual Memory

Virtual memory makes use of both primary and secondary storage. It simulates a bigger main memory with the help of the slower and cheaper secondary storage. This makes it easier to develop applications, as the programs have access to one big chunk of virtual memory and all fragmentation is hidden [35].

2.2.2 Secondary Storage

The bottom half of the memory hierarchy in Figure 2.1 is composed of the secondary storage. We will only describe the next two layers of the hierarchy, because the rest of the layers are more distant memory systems that are not immediately accessible by the computer. So, the next parts in the hierarchy are:

File Storage

The File Storage layer in computers usually consists of the Hard Disk Drive (HDD), which is a magnetic data storage device. The capacity of a file storage device is significantly higher than the capacity of the primary storage, but accessing the file storage is substantially slower than accessing any primary storage [35].

Network Storage

The last layer is the Network Storage. Here data is stored on an entirely different system, but is still immediately accessible by the computer. Now the bandwidth of the network plays a significant role in the speed and throughput of data transfer [35].

2.3 Locality

The principle of locality is what makes the use of cache memory worthwhile, as the cache saves the most recently used data in a fast memory close to the CPU [35]. There are two common types of locality of reference used in computer architectures: temporal locality and spatial locality. The concept of temporal locality is that if a value is referenced, it is probably going to be referenced again in the near future, as this is the standard case in most running programs. Spatial locality, on the other hand, occurs when one arbitrary memory address is referenced; the physical locations close to it are then probably also going to be referenced soon, because this again is usually the case in an executing program. Therefore, data is moved to the faster memory of a computer system when these conditions are met.

2.4 Shared Memory

Shared memory exists already at the second cache level, as seen with the Opteron processor in Figure 2.2. If we are dealing with multi- or many-core architectures, the main memory will also be a shared memory. Communication and synchronization between programs running on two or several CPU cores is done with the help of shared memories, but as there are several independent systems acting on one space, some issues, which are described next, occur involving both symmetric multiprocessing and coherence.

2.4.1 Symmetric Multiprocessing

If a computer hardware architecture is designed and built as a Symmetric Multiprocessing (SMP) system, one shared main memory is used, which is seen by all cores. All cores are equally connected to the shared memory, as seen in Figure 2.3, and as the number of CPU cores grows, the communication between the CPUs and the single shared memory will also grow. This leads to an interconnect bottleneck, as the CPUs have to wait for the memory and connection to be ready before they can continue working [42].

Figure 2.3: Simple SMP System.

2.4.2 Coherence

The other problematic part of shared memory is coherence. The memory is accessed and modified by several cores, which most likely cache data. This means that a CPU copies data to its cache memory, and when the data in the memory is later modified (see Figure 2.4), all the cache memories that have a copy of that memory location must also be updated in order to keep all the copies of the data the same and the whole system up-to-date. This is handled with the help of a separate system that makes sure that the consistency between all the memories is maintained. The system has a coherency protocol, and depending on the system, the protocol can be implemented in different ways. The hardware coherency protocol found in some systems, like the ARM Cortex-A9 MPCore, has a snoop control unit that is "sniffing" the bus to keep the cache memory lines updated with the correct values. The ARM CPU has a separate Snoop Control Unit (SCU) that handles and maintains the cache coherency between all the processors [27]. Figure 2.5 and Figure 2.6 exemplify a situation where the CPUs cache data from the main memory (X). The setup is a dual-core system with a separate unit snooping the bus, as illustrated in Figure 2.4. There are two common types of bus snooping methods: write invalidate and write broadcast (write update). The most common protocol is write invalidate, and a scenario with this method is described in Figure 2.5. The other method, write broadcast (see Figure 2.6), updates all the cached copies as a write occurs to the data that is cached. The figures describe the processor and bus activity, and the contents of the memory and the caches are shown after each step [2].

Figure 2.4: Cache Coherence in a Dual-core CPU.

1. CPU A reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU A's cache.
2. CPU B reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU B's cache.
3. CPU A writes 1. CPU A's cache is updated to 1. A cache invalidate is sent to CPU B's cached copy.
4. CPU B reads X. A cache miss occurs as the cached copy has been invalidated. The content of CPU A's cache (1) is copied to memory X and to CPU B's cache.

Figure 2.5: Write invalidate bus snooping protocol [2].
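The four steps above can be traced in a few lines of C. The following sketch is only an illustration added for this text, not an implementation of a real hardware protocol: each cache is reduced to a single line holding a copy of X, and the snooping bus is replaced by direct manipulation of the other core's valid bit.

#include <stdio.h>
#include <stdbool.h>

/* One cached copy of memory location X: a value and a valid bit. */
struct cache_line { int value; bool valid; };

static int mem_x = 0;                       /* main memory location X     */
static struct cache_line cache[2];          /* private caches of CPU A, B */

static int cpu_read(int cpu)
{
    if (!cache[cpu].valid) {                /* cache miss                  */
        int other = 1 - cpu;
        if (cache[other].valid)             /* the other cache supplies a  */
            mem_x = cache[other].value;     /* modified copy and memory is */
        cache[cpu].value = mem_x;           /* updated (step 4 above)      */
        cache[cpu].valid = true;
    }
    return cache[cpu].value;
}

static void cpu_write(int cpu, int value)
{
    cache[cpu].value = value;               /* update the local copy       */
    cache[cpu].valid = true;
    cache[1 - cpu].valid = false;           /* invalidate the remote copy  */
}

int main(void)
{
    printf("1. A reads X: %d\n", cpu_read(0));  /* miss, loads 0                */
    printf("2. B reads X: %d\n", cpu_read(1));  /* miss, loads 0                */
    cpu_write(0, 1);                            /* 3. A writes 1, B invalidated */
    printf("4. B reads X: %d\n", cpu_read(1));  /* miss, receives 1 from A      */
    return 0;
}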

1. CPU A reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU A's cache.
2. CPU B reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU B's cache.
3. CPU A writes 1. CPU A's cache is updated to 1. A bus broadcast occurs, and the content of CPU B's cache and memory X is updated to 1.
4. CPU B reads X. The data is located in CPU B's local cache.

Figure 2.6: Write broadcast bus snooping protocol [2].
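The cost of this coherence traffic can be observed from ordinary user code. The program below is an illustration added for this text, not part of the thesis experiments: two threads increment two counters, first placed in the same cache line and then padded onto separate 64-byte lines. On a typical multi-core machine the padded version is clearly faster, because the shared cache line no longer bounces between the cores through invalidations. Compile with gcc -O2 -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ITERS 100000000ULL

/* Two counters that share one cache line (false sharing)... */
static struct { volatile uint64_t a, b; } same_line;
/* ...and two counters padded onto separate 64-byte cache lines. */
static struct { volatile uint64_t a; char pad[56]; volatile uint64_t b; } own_line;

static void *bump(void *arg)
{
    volatile uint64_t *counter = arg;
    for (uint64_t i = 0; i < ITERS; i++)
        (*counter)++;
    return NULL;
}

static double timed_run(volatile uint64_t *x, volatile uint64_t *y)
{
    pthread_t t1, t2;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line:      %.2f s\n", timed_run(&same_line.a, &same_line.b));
    printf("separate cache lines: %.2f s\n", timed_run(&own_line.a, &own_line.b));
    return 0;
}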

CHAPTER THREE

INTERCONNECT

3.1 Introduction

A computer system consists of many hardware parts that are physically connected to each other so they may exchange data. These electrical connections inside the circuits are called the interconnects. As the traffic between some of the hardware parts is constant, the interconnect needs to be able to move a huge amount of data quickly. Energy efficiency is also a common interconnect requirement, making the desired design even harder to accomplish, as it also has to be able to move all the data with a low amount of energy. The high-performance interconnect connecting the CPU and memory together is what we are mostly interested in, and research concerning the interconnect is very important performance-wise. In fact, the interconnect is the single most important factor of a computer architecture when dealing with high-performance computers [31].

3.2 Multiprocessing

Multiprocessing became a natural step in the evolution of computer architecture, as single-core computers were reaching the limits of their performance, with extremely high clock frequencies that were hard to top. A high frequency also makes the CPU less power efficient, and heat dissipation harder and more expensive to handle. The next step was simply to put several cores on one CPU in

order to achieve higher performance. The effect of many-core architecture can be seen as a stress on the interconnect and memory on a whole new scale, forcing the whole system to adapt technologically if the full advantage of the processing power of all the cores is to be utilized. Today, multi-core microprocessors have been used in desktop computers for some five years. The technology in mainstream computers has advanced from two to six cores on a single die. In recent times, multi-core architecture has also been adopted in low-power embedded systems and mobile devices, like smartphones and tablets [5].

3.3 Network Topology

When discussing the distinct arrangement of some computer parts that are interconnected, like memory and CPUs, the hardware parts are usually represented as nodes and the physical relationships, i.e. the interconnect, between the nodes are drawn as branches. A specific setup of nodes and branches represents one network topology. Figure 3.1 shows some general topologies used in computer architectures, and depending on system requirements one topology might be much more efficient and suitable than another [3]. The network topology of the memory hierarchy we discussed in Section 2.2 could be seen as the Linear topology illustrated in Figure 3.1, since the memory hierarchy connects different memory subsystems in a linear fashion. Another example of a network topology is the SMP system discussed and illustrated in Section 2.4.1. The SMP topology can be seen in the topology figure as the Bus topology.

Figure 3.1: Common Network Topologies [3].

3.3.1 High-performance Computing

The fastest computers in the world are ranked in the TOP500 project. Currently the most powerful computer is a Japanese supercomputer known as the K Computer [11]. The K Computer, produced by Fujitsu at the RIKEN Advanced Institute for Computational Science, implements a Tofu interconnect (see Figure 3.2), which is a 6-dimensional (6D) custom mesh/torus architecture [25]. This means that one node somewhere in the center of the topology is connected to the nodes on its left and right (2D), front and back (2D), and also to the ones above and below it (2D), hence the 6D.
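The practical difference between a plain mesh and a torus can be made concrete with a small routing calculation. The sketch below is an illustration with assumed node numbering, reduced to one dimension: in a linear chain a message from node 0 to node n-1 has to cross n-1 links, while the wrap-around link of a torus (ring) roughly halves the worst case. A multi-dimensional torus such as Tofu applies the same idea once per dimension.

#include <stdio.h>
#include <stdlib.h>

/* Hops between two nodes in a 1-D mesh (a simple chain of nodes). */
static int mesh_hops(int a, int b)
{
    return abs(a - b);
}

/* Hops in a 1-D torus (ring) of n nodes: the message may also take
   the wrap-around link, so the shorter direction counts. */
static int torus_hops(int a, int b, int n)
{
    int d = abs(a - b);
    return d < n - d ? d : n - d;
}

int main(void)
{
    int n = 8;                                 /* assumed 8-node ring */
    printf("node 0 -> node 7, mesh:  %d hops\n", mesh_hops(0, 7));     /* 7 */
    printf("node 0 -> node 7, torus: %d hops\n", torus_hops(0, 7, n)); /* 1 */
    return 0;
}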

Figure 3.2: 6D Mesh/Torus Architecture [25].

3.3.2 Intel QuickPath Interconnect and HyperTransport

The world's largest semiconductor chip maker, Intel [12], relies on its own interconnect technology (see Figure 3.3), which it calls QuickPath Interconnect (QPI). This is the key technology maximizing the performance of Intel's x86 multi-core processors. The QuickPath interconnect uses fast point-to-point links between CPU cores inside the processor, and outside the processor the same links are used to connect memory and other CPUs [13]. Another similar interconnect using point-to-point links is HyperTransport (HT). The technology is developed and maintained by The HyperTransport Consortium. AMD's Opteron processor (Figure 3.4) uses HyperTransport bidirectional links to interconnect several CPUs and memory [31].

3.3.3 Arteris Interconnect

The Arteris interconnect is a NoC (Network-on-Chip) interconnect for SoC (System-on-Chip) designs. Its key characteristics are low power and high bandwidth, making it suitable for modern mobile devices implementing complex SoC dies with an increasing number of IP blocks. Therefore many companies have chosen Arteris as the interconnect for their devices. Also, ARM Holdings has invested in Arteris, making it perhaps even more interesting when taking, other

than architectural specifications, also the business and partner perspective into account. One of Arteris' big customers is Samsung, which has selected Arteris interconnect solutions for multiple chips. One of Arteris' interconnect products is called FlexNoC, which is designed for SoC interconnects with low latency and high throughput requirements, and it supports the broadly used ARM AMBA protocols [14]. Another vendor using the Arteris interconnect is Texas Instruments. The L3 interconnect inside the TI OMAP 4 processors connects high throughput IP blocks, like the dual-core ARM Cortex-A9 CPU [41].

3.3.4 Open Compute

Some interesting server architecture work is done by Facebook under a project named Open Compute. They have designed and built an energy efficient data center which is cheaper and more powerful than other data centers. The Open Compute project uses both Intel and AMD motherboards in the servers, and both motherboards are stripped of many features that are otherwise found in traditional motherboards. Still, perhaps the most exciting part is that this project is open source, meaning they share everything [4]. The functional block diagrams and board placements of the Intel and AMD motherboards are illustrated in Figure 3.3 and Figure 3.4. Both the Intel and AMD board diagrams show that the different processors have separate main memory located near the processing units, so they may quickly access the data. QuickPath Interconnect (QPI) and HyperTransport (HT) technologies are used on the boards as the physical connection linking the separate processors and memory.

24 Figure 3.3: Open Compute Motherboard based on Intel CPU and QuickPath Interconnect [4]. 17

25 Figure 3.4: Open Compute Motherboard based on AMD CPU and Hyper- Transport Interconnect [4]. 3.4 ARM Architecture ARM Holdings is a Cambridge based company designing the popular 32-bit ARM instruction set architecture. The current instruction set is ARMv7 and it is implemented in most smartphones and tablets on the market today. One of the key features of the ARM architecture is the excellent power efficiency which makes the architecture suitable for portable devices. ARM operates by licensing its design as IP rather than manufacturing the processors themselves. Today there are several companies building ARM processors: Among others, Nvidia, Samsung and Texas Instruments [15]. 18

3.4.1 Cortex-A Series

The next version of ARM's popular Cortex-A series SoC is described on ARM's webpage as: "The ARM Cortex-A15 MPCore processor is the highest-performance licensable processor the industry has ever seen. It delivers unprecedented processing capability, combined with low power consumption to enable compelling products in markets ranging from smartphones, tablets, mobile computing, high-end digital home, servers and wireless infrastructure. The unique combination of performance, functionality, and power-efficiency provided by the Cortex-A15 MPCore processor reinforces ARM's clear leadership position in these high-value and high-volume application segments." [16]

Figure 3.5 shows an image of the upcoming processor from ARM Holdings. Some of the new features found in the new processor are shown in the block diagram: the Snoop Control Unit (SCU) enabling the 128-bit AMBA 4 coherent bus is perhaps one of the most interesting. Also, what is particularly exciting about the Cortex-A15 is that the 1.5 GHz quad-core configuration of the architecture is specifically designed for low-power servers [16].

Figure 3.5: Next Generation ARM SoC [5].

3.4.2 Advanced Microcontroller Bus Architecture (AMBA)

All System-on-Chip (SoC) integrated circuits from ARM use the Advanced Microcontroller Bus Architecture (AMBA) as the on-chip bus interconnect. The most recent AMBA protocol specification is the Advanced eXtensible Interface (AXI), which is targeted at high-performance systems with high-frequency and low-latency requirements. The AMBA AXI protocol is backward-compatible with the earlier AHB and APB interfaces. The latest AMBA 4 version adds several new interface protocols. Some of these are: the AMBA 4 Coherency Extension (ACE), which allows full cache coherency between processors, ACE-Lite for I/O coherency, and AXI 4, which is designed to increase performance and power efficiency [28].

3.4.3 Calxeda 120 x Quad-core ARM Server Chip

One promising attempt, other than the unreleased future Cortex-A15 processor, to change the x86-dominated server market is made by a company named Calxeda, in which ARM Holdings has shown interest by investing in it. They

28 are building a server chip based on ARM Cortex-A9 MPCore processors. The architecture is based on a standard 2 rack unit (2U) server with 120 quad-core ARM processors. Each ARM node will only consume 5W of power, which is a lot less than any other x86 server [17]. 21

CHAPTER FOUR

NON-UNIFORM MEMORY ACCESS (NUMA)

4.1 Introduction

The Non-Uniform Memory Access architecture is designed for systems with many CPUs. The problem with traditional Symmetric Multiprocessing (SMP) is that it does not scale very well, because all traffic between the cores and memory goes through one place. NUMA is specifically designed to address this issue that occurs in large SMP systems, and solves it with an architecture where separate memory nodes are directly connected to the CPUs [42]. A simple NUMA system is illustrated below in Figure 4.1.

Figure 4.1: The Simplest NUMA System [42].

4.2 Hardware

A full NUMA system consists of special hardware (CPU, motherboard) that supports NUMA. There are many different types of motherboard architectures for one CPU family. Below, in Figure 4.2, we can see four different topologies for the AMD Opteron CPU. The block diagram is an approximation of the real motherboards. The interconnect between the processors follows a pattern where every processor is connected to two other processors and its local memory. For some architectures the interconnect is obvious (see Tyan K8QS Thunder Pro S4882), but for the other architectures with a more irregular setup, the interconnect could be manufactured in different ways. As the exact block diagrams of the interconnect were not found, the interconnect has been left out of the figure. As the distance between the cores and their remote nodes varies a lot between the different architectures, the performance is also going to be different. Some motherboards might be suitable for a specific application that does not need a significant amount of memory, but has much traffic between the memory and the processing unit. Another architecture might be optimal for a specific server that has separate computing-intensive applications running on all the CPUs, where all the applications need a lot of shared memory.

31 Figure 4.2: Different Motherboard Topologies for the Quad-core AMD Opteron CPU. A NUMA system without the NUMA hardware could basically be implemented with the help of virtual memory (see Section 2.2.1). Most systems have a Memory Management Unit (MMU), which is a hardware part that all memory transactions from the CPU s go through. The virtual addresses from the CPU s are translated by the MMU to physical addresses. This way a computer cluster without NUMA hardware could take advantage of a programmable MMU and virtual memory to run a software NUMA system which uses both local and remote memory, where remote memory would be the memory of another computer connected to the same cluster. 24

4.3 Software

To achieve a fast NUMA system, the Operating System (OS) running the software part also has to be NUMA aware. It is equally important to have NUMA aware software as it is to have the physical NUMA hardware. The kernel, which is the main component of an OS, has to allocate memory to a process in the most efficient way. To do this, the kernel needs to know the distances between all the nodes and then calculate and apply an efficient memory policy for all the processes. The scheduler is a software part of every OS kernel that handles access to the different resources in the system. Some schedulers, like the most recent Linux scheduler, use different priority levels that are given to tasks. This way important and real-time tasks can access resources like the processor before other tasks that are less important. The scheduler also uses load balancing in order to evenly distribute the workload to all the processors. This basically means that for a NUMA aware OS to work efficiently on NUMA hardware, the scheduler also needs to be able to parse the distance information of the underlying NUMA topology. The tasks should then be distributed accordingly and executed with NUMA scheduling and an efficient memory placement policy [30]. As an example, if a fair scheduler were not aware of an underlying quad-core NUMA hardware with 2 GB of local main memory per CPU, the tasks would still be evenly distributed to the four different processors, making them all work. As the OS would not be aware of the different physical locations of the main memory, the tasks would be executed on the processor with the most free time. This would result in inefficient memory usage, as a random memory access would be remote, and therefore slow, 75 % of the time (three of the four memory nodes are remote to any given CPU).

4.4 Advanced Configuration and Power Interface

Some major companies (HP, Intel, Microsoft, Phoenix and Toshiba) have together engineered a standard specification called the Advanced Configuration and Power Interface (ACPI). It is an open standard, for the x86 architecture, which describes the computer hardware and power management to the operating system. The ACPI defines many different tables that are filled with useful information that the OS reads and uses. Some of these tables hold information about the memory and CPUs and the distances between these on a NUMA machine. These tables are what we are interested in, and they are described next [34].

4.4.1 System Locality Information Table

One of the two important tables in the ACPI specification concerning NUMA hardware is the System Locality Information Table (SLIT). The SLIT is an optional ACPI table that holds information about the distances between all the processors, memory controllers and host bridges. The table holds the distances in a matrix covering all the nodes. The unit of distance is relative to the local node, which has the value of 10. The distance between node 1 and node 4 could for instance be 34. This would mean it takes 3.4 times longer for node 1 to access node 4 than to access its own local memory. Figure 4.3 gives the SLIT specification [34].
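On a Linux machine the distance matrix exported through the SLIT can be read back from user space, either from /sys/devices/system/node/node*/distance or through the libnuma library. The short program below is a sketch using libnuma (it assumes the library is installed and is linked with -lnuma); on the example above it would print 10 on the diagonal and values such as 34 for the more distant node pairs.

#include <stdio.h>
#include <numa.h>                   /* libnuma, link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;

    /* Print the SLIT-style distance matrix; 10 means "local". */
    for (int from = 0; from < nodes; from++) {
        for (int to = 0; to < nodes; to++)
            printf("%4d", numa_distance(from, to));
        printf("\n");
    }
    return 0;
}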

Figure 4.3: ACPI System Locality Information Table (SLIT) [34].

4.4.2 Static Resource Affinity Table

The other vital ACPI table needed to export NUMA information is the Static Resource Affinity Table (SRAT). The SRAT (Figure 4.4) describes and stores a list of the physical location information about the CPUs and memory. As the

Figure 4.4: ACPI Static Resource Affinity Table (SRAT) [34].

SLIT holds the distances between all the nodes, the SRAT actually describes which and where these nodes physically are [34].

4.5 The Linux Kernel

The Linux kernel has had support for Non-Uniform Memory Access on some architectures since 2004 [30]. The kernel uses the ACPI SLIT definition to get the correct NUMA distances and then applies a certain NUMA policy to achieve optimal performance [26].

4.5.1 NUMA Aware Scheduler

A fundamental part of every modern OS kernel is the scheduler. In a multi-core system the scheduler has to decide which process should run on which CPU, and dynamic load balancing is done by the scheduler as it can migrate

processes from one core to another. On a NUMA system, where memory access times depend on which CPU accesses which memory, scheduling becomes even more complex, yet more important [18]. As of kernel 2.5 the scheduler has been a multi-queue scheduler which implements separate runqueues for every CPU core. It was called the O(1) scheduler, but it still had no awareness of NUMA nodes until later, when parts of the O(1) scheduler and parts of another NUMA aware scheduler were combined into a new, more optimal scheduler [18]. The current Linux scheduler is named the Completely Fair Scheduler (CFS). The CFS scheduler maximizes CPU utilization and schedules tasks fairly among all the available cores in order to maximize performance. In situations where the number of running tasks is less than the number of logical processors, the scheduling can be tuned with a power saving option. The sched_mc_power_savings option is disabled by default, but can easily be enabled with:

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

This will change the scheduler behavior, so that new tasks are distributed to other processors only when all the cores of the first processor are fully loaded and can not handle any new tasks [19].

4.5.2 Memory Affinity

Memory affinity is done when the memory is split into spaces, and these spaces are then made accessible to predefined CPUs. In a NUMA system this affinity is called node affinity, where the kernel tries to keep a process and its children running on a local node [26].

4.5.3 Processor Affinity

In Linux a program called taskset can be used to retrieve or set a process's CPU affinity. Using the Process IDentifier (PID) of a process, the taskset utility can be used to bypass the default scheduling applied by the Linux scheduler. The program can also be used to run a command with a given CPU affinity [37]. As an example, the command below sets a CPU affinity for program1, forcing it

to use only CPU 3:

taskset -c 3 program1

4.5.4 Fake NUMA Nodes

If a system lacks NUMA hardware, the 64-bit Linux kernel can be built with options that enable a fake NUMA configuration. The kernel does not have fake NUMA enabled by default, but users can manually compile the kernel with the two following options:

CONFIG_NUMA=y
CONFIG_NUMA_EMU=y

The final step to start the Linux kernel with a fake NUMA system is to modify the boot loader with: numa=fake=x, where x is the number of NUMA nodes. This way the kernel splits the memory into x equally large parts. Alternatively, the size of the NUMA nodes' memory can be specified with: numa=fake=x*y, where y is the size of the memory nodes in MB. As an example, we could start a system with 4 CPU cores and 3 GB of memory. If we want to split the memory into 2 nodes of 512 MB each and 2 other nodes with 1 GB each, we start the kernel with:

numa=fake=2*512,2*1024

4.5.5 CPUSET

The Linux kernel includes a feature called cpuset. Cpusets provide a useful mechanism for assigning a group of processors and memory nodes to certain defined tasks. A task has a cpuset which forces the CPU and memory placement policy to follow the current cpuset's resources. Cpusets are especially useful on large many-core systems with complex memory hierarchies and NUMA architecture, as scheduling and memory management become increasingly hard on these systems. Cpusets represent different sized subsystems that are especially useful on web servers running several instances of the same application. The default kernel scheduler uses load balancing across all CPUs, which actually is not a good option for at least two specific types of systems [32]:

1. Large systems: "On large systems, load balancing across many CPUs is expensive. If the system is managed using cpusets to place independent jobs on separate sets of CPUs, full load balancing is unnecessary." [32]

2. Real-time systems: "Systems supporting realtime on some CPUs need to minimize system overhead on those CPUs, including avoiding task load balancing if that is not needed." [32]

Below is an example from the documentation where a sequence of commands will set up a cpuset named "Charlie" containing CPUs 2 and 3 and Memory Node 1, and after that start a subshell sh in that cpuset [32]:

mount -t cgroup -ocpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpus
/bin/echo 1 > mems
/bin/echo $$ > tasks
sh
# The subshell sh is now running in cpuset Charlie
# The next line should display /Charlie
cat /proc/self/cpuset

4.5.6 NUMACTL

When running a NUMA configured machine in Linux, the cpuset feature can be extended with another program called numactl. As one NUMA node typically consists of one CPU and one memory part, the separate NUMA policy feature is necessary, since with cpusets the CPU does not necessarily have local memory. Using numactl, one can set a certain NUMA policy for a file or process within the current cpuset and in that way tune the memory management for a certain application on a NUMA architecture [36]. An example of using numactl, where the process execute is run on node 3 with memory allocated on nodes 1 and 5:

numactl --cpubind=3 --membind=1,5 execute
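The same policy can also be applied from inside a program through libnuma, the shared library that numactl itself builds on. The fragment below is a sketch of a rough equivalent of the command above (the node numbers are taken from the example and the program name execute is only a placeholder): it binds execution to node 3, restricts allocations to nodes 1 and 5, and then starts the workload, which inherits the policy.

#include <stdio.h>
#include <unistd.h>
#include <numa.h>                   /* libnuma, link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* --cpubind=3 : run only on the CPUs of node 3. */
    if (numa_run_on_node(3) != 0)
        perror("numa_run_on_node");

    /* --membind=1,5 : allocate memory only on nodes 1 and 5. */
    struct bitmask *mem_nodes = numa_parse_nodestring("1,5");
    if (mem_nodes == NULL) {
        fprintf(stderr, "invalid node string\n");
        return 1;
    }
    numa_set_membind(mem_nodes);
    numa_free_nodemask(mem_nodes);

    /* Start the placeholder workload; it inherits the NUMA policy. */
    execlp("./execute", "execute", (char *)NULL);
    perror("execlp");
    return 1;
}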

CHAPTER FIVE

IMPLEMENTATION

This chapter explains the work process of this thesis: a short description of what we tried to accomplish and which tools we used, ending with the obtained results. Conclusions and discussion are included in the next and final chapter.

5.1 Approach

The approach has from the beginning been to explore the capabilities of the Simics simulator, to set up a suitable target machine with Simics, and to evaluate the performance of certain multi-core architectures with a focus on the hardware interconnect and memory topology. Later, as the NUMA design was analyzed more closely, the work concentrated on the Linux kernel and its NUMA features (Section 4.5).

5.2 Simics Full-system Simulator

Wind River Simics is a simulator capable of fully emulating the behavior of computer hardware, meaning one can install and run unmodified operating systems and software on supported architectures (x86, ARM, MIPS, ...) [20].

5.2.1 History and Features

Wind River, a subsidiary of Intel Corporation, has been producing software for embedded systems since 1981 and has its technology in more than one billion products. Simics was first developed at the Swedish Institute of Computer Science, and later, in 1998, Virtutech was established so that commercial development of Simics could be started. In 2010 Virtutech was acquired by Intel, and Simics is now marketed through Wind River Systems [21]. Simics is a fast simulator that has the capability to save the state of a simulation. This state can later be opened again and the simulation continued from the same moment. Every register and memory location can easily be viewed and modified, and the whole system can even be run backwards in order to find a bug. A system can conveniently be built with a script file which can include information like memory size, CPU frequency, core count and network definitions like IP and MAC addresses. In Figure 5.1, a system with two cores and cache memory running a program in Linux is being monitored with the help of the data caches. Both cores have private data caches, and the read and write transactions to the caches are plotted. This way a user can visualize the system better and see which CPU core is doing the work. In Figure 5.1 the program is single threaded, so all calculations seem to first be executed on Core-0 and later, at approximately simulated seconds, the thread is migrated to Core-1.

Figure 5.1: Cache Read and Write of Core-0 and Core-1.

It is possible to run several targets at the same time in Simics, and these can, for instance, easily be interconnected with Ethernet. The minimum latency for a distributed system can be changed on the Simics command line, and if we want a system with 50 milliseconds latency, the following command should be

used:

running> set-min-latency 0.05

The traffic in a distributed system can be monitored in Simics by enabling pcapdump capture. At one point we looked at the traffic between two Erlang nodes running on different targets, communicating with message passing. The pcapdump was enabled on the service node which provided the virtual network. The traffic data was written to a file, which we analyzed on the host OS in a program called Wireshark. The idea was to get a more accurate understanding of the traffic that occurs in a distributed Erlang system. In order to research and evaluate different interconnects, the whole system and its traffic need to be carefully examined. But, as it would have required a lot more time and resources to fully achieve the task of correctly analyzing the traffic in a distributed Erlang system running cloud based server applications, this work was dropped. Still, as the feature was found and tested to work, the knowledge could help give directions for future work.

5.2.2 Working With ARM Architecture

As earlier research in the Cloud project has proposed a low-power server architecture implementing the ARMv7 instruction set, we used ARM targets in Simics at the beginning. Some difficulties were encountered during the work with the ARM target machine, and these are described below:

1. Full-featured Distribution

The need for a full-featured operating system running on the ARM target was apparent, as the benchmarks we are using in our research are implemented in the Erlang programming language, which requires, among other things, GNU components and Perl. The target machine available was a Single-Board Computer (SBC) from Technologic Systems (TS-7200) running a compact Linux distribution with some Unix tools available through BusyBox. The TS-7200 is based on an old ARM architecture with the ARM9 CPU. Booting a full Linux distribution is still possible from the CompactFlash (CF) card, and Debian 3.1 Sarge is available to download for the TS-7200. The installation of the Debian Sarge distribution was not trivial. Therefore, the different phases of booting a full distribution in the Simics

ARM target are described next. The first step was to create a new empty raw file. This was done in Linux with the dd tool:

$ dd if=/dev/zero of=/home/joaknylu/image bs=1k count=

Next, a file system has to be created:

$ /sbin/mkfs.ext3 -N /home/joaknylu/image

After this we loopback mount it with:

$ sudo mount -o loop,rw /home/joaknylu/image /home/joaknylu/mount-folder/

Now the downloaded distribution can be extracted into the mount:

$ sudo tar jxvf /home/joaknylu/debian-sarge-256mb.tar.bz2 -C /home/joaknylu/mount-folder/

After this, we have to unmount the file system inside the file:

$ sudo umount /home/joaknylu/mount-folder/

And finally use the Simics craff tool to compress and create the image file needed:

$ ./craff /image -o debian-256mb-sarge.craff

This way we have successfully compressed and converted the downloaded distribution to a file that should be bootable in Simics. As the distribution could not be successfully started in Simics, it turned out that more files were needed. The OS could not be started, as the image file needed by Simics to correctly boot the distribution has to be made up of several files, including not only the distribution image but also the kernel (bzimage) and the bootloader. This made a simple sounding task hard, and it was never fully achieved.

2. Memory Management Unit (MMU)

The TS-7200 SBC target has an Ethernet interface implemented, allowing several machines to be easily interconnected with Ethernet in Simics. And, as the ARM9 CPU has a Memory Management Unit (MMU), we looked at taking advantage of virtual memory (more detailed information in Section 4.2), and the implementation of a cluster NUMA architecture in software was analyzed. Memory could quickly be shared among several interconnected targets running in Simics with the use of the Simics command line utility. Running the following commands edits the memory space and forces one of the two targets to use both local and remote memory, where remote memory is the memory located on the other board:

running> board_cmp0.phys_mem.del-map base = 0x0 device = board_cmp0.sdram
running> board_cmp0.phys_mem.add-map base = 0x0 device = board_cmp1.sdram offset = 0x0 length = 0x

Running these commands makes the target crash, so a cluster NUMA architecture can not be simulated this easily. The obvious problem here is that the operating systems running on each target are not aware of each other, so the crashing of the kernel is expected. In order for it to work, some sort of device would have to be made, acting as a physical device monitoring the traffic to the memory. Figure 5.2 below illustrates how a cluster NUMA system would work, but regardless of the issues with the hardware simulations, the problem here still remained the evident lack of a full-featured distribution.

Figure 5.2: Cluster NUMA Architecture.

3. Multiprocessing

A multiprocessing architecture is a logical part of every cloud based system, so the need to simulate such a system is a necessity. As the ARM

TS-7200 target only has one CPU, and no models with ARM SMP are available in the Simics (4.2) academic package, an ARM SMP model would have to be built in order to simulate a multi-core architecture. This could be done, for instance, by including a custom control unit that handles the synchronization of the cores. The need for a full-featured distribution still remains, and a Linux Board Support Package (BSP) would also have to be written in order to make the custom ARM multi-core architecture run. This could behave like a newer regular ARM SMP, but the architecture would regardless not be the same.

5.2.3 gcache and trans-staller

The g-cache module is a standard cache model used in Simics. The g-cache simulates the behavior of the cache memory with the help of the timing model interface. Simics has to be run in stall mode in order for the staller to have an effect. Stall mode is supported for some architectures in Simics, like x86, but it is not supported for the ARM instruction set. In Figure 5.3 the g-cache module represents a memory hierarchy with a two level cache structure. The id-splitter is a separate module that splits the instruction and data accesses and sends them to separate L1 caches. The splitters are there to avoid accesses that can cross a cache-line boundary, by splitting these into two accesses. The L1 caches are connected to the L2 cache, which is finally connected to the trans-staller, which simulates the memory latency [43]. This module is important, as we are interested in the delays or latencies that always occur when dealing with interconnects. The trans-staller module source is available in Simics, so modifications can be made to it. This allows us to change the standard trans-staller behavior to a more complex one. In order to simulate NUMA behavior, the trans-staller needs to be modified so that it returns different memory latency times depending on which core or CPU accesses which memory space, as different memory spaces are physically located in different places.

5.3 Oracle VM VirtualBox

Oracle VM VirtualBox is an x86 virtualization tool that allows users to install different guest operating systems within the program. VirtualBox allowed us to quickly install and recompile a Linux kernel. We installed

Figure 5.3: A Two Level Cache System [43].

Ubuntu Server on VirtualBox and recompiled the kernel with the NUMA emulation set. This meant that the standard configuration had to be modified and NUMA and NUMA_EMU had to be set before compilation. Some programs and benchmarks, including Erlang and numactl, were also installed. One of the most important reasons to use a virtualization tool for this task is that the whole disk image is available and can easily be converted. In this case the VirtualBox Disk Image (.vdi) was converted with the VirtualBox command-line interface, vboxmanage, to a raw disk image (.raw). This makes it possible to use the Simics craff tool to convert and compress the raw disk image to a Simics .craff image file. A Simics script is then modified to open the new craff image file, allowing the simulated machine to be started with the desired distribution and kernel.

5.4 Erlang

This section presents the Erlang programming language and explains some features of the language. Erlang is designed by Ericsson and is well suited for distributed concurrent systems running fault-tolerant applications. Erlang runs programs fast on multi-core computers and has the ability to modify a running application without closing the program [29]. These are some interesting features which make Erlang suitable for cloud based servers.

5.4.1 Distributed Programming

Distributed programming is done in Erlang with the help of libraries. Distributed programs are implemented to run on a distributed system that consists of a number of Erlang runtime systems communicating with each other. The running processes use message passing to communicate and synchronize with each other [29].

5.4.2 Erlang Nodes

Distributed Erlang is implemented with nodes. One can start several Erlang nodes on one machine or on different machines interconnected over Ethernet. As an example, two nodes are started on the same machine with two terminals open. Each node is running in a different terminal and both nodes have their own name [29]:

Terminal-1:
$ erl -sname node1

Terminal-2:
$ erl -sname node2

With this setup, the user can put to use the Remote Procedure Call service (rpc) in order to, from one node, call a function located on the other node. The rpc call will obtain the answer from the remote node and return it to the calling node.

5.4.3 Asynchronous Message Passing

As Erlang processes share no memory, they communicate with message passing. The message passing is asynchronous, as the sender does not wait for the receiver to be ready. Each process has a mailbox queue where incoming messages are saved until received by the process. In Erlang, the operator "!" is used to send messages. The syntax of "!" is [22]:

Pid ! Message

5.5 Results

In our simulations the inclusion of cache memory was irrelevant, so the full g-cache module with separate timing models was not used. Instead, a simpler script was written that directly uses the trans-staller for all memory accesses:

# Connect phys_mem with the NUMA staller
staller = pre_conf_object("staller", "trans-staller")
SIM_add_configuration([staller], None)
conf.phys_mem.timing_model = conf.staller

The last line connects the staller to the memory space. Simics is then started in stall execution mode, in order to make it possible to stall memory accesses. The trans-staller has been modified to simulate the NUMA behavior of the system. An algorithm was added to the trans-staller that checks which CPU id accesses which memory space (see Appendix A; a simplified sketch of the idea is shown below). The stall times returned by the trans-staller are now adjusted to the corresponding latency of a certain NUMA hardware configuration. In our simulations, the system is a 64-bit x86 machine with 4 CPUs. The operating system is the 64-bit Ubuntu Server with a recompiled Linux kernel with NUMA emulation set. The Linux kernel is started with 4 x 128 MB NUMA nodes. The NUMA scheduling and memory placement policy is set with numactl, and in order to ensure memory locality, the policy is simply set to --localalloc, so only the fastest local memory is used by the CPUs. In comparison, the tests are also run without a NUMA policy, where allocations are made on an arbitrary node. The hardware is a symmetric NUMA system with three different node latencies. Below in Figure 5.4 is the NUMA hardware with the SLIT-like table showing distance information between the nodes.
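The actual module code is listed in Appendix A. The fragment below is only a simplified sketch of the idea in plain C, with node count, memory layout and cycle values assumed for illustration: every memory transaction is mapped from its physical address to a NUMA node, and the stall time returned to Simics depends on the distance between the initiating CPU and that node.

#include <stdint.h>

#define NODES     4
#define NODE_SIZE (128u * 1024 * 1024)      /* assumed 128 MB of RAM per node */

/* Assumed stall times in cycles, indexed like a SLIT matrix:
   [node of the CPU][node of the memory]. 20 = local, larger = more remote. */
static const int stall_cycles[NODES][NODES] = {
    {  20,  60,  60, 100 },
    {  60,  20, 100,  60 },
    {  60, 100,  20,  60 },
    { 100,  60,  60,  20 },
};

/* One CPU per node in this simple configuration. */
static int cpu_to_node(int cpu_id)           { return cpu_id % NODES; }
static int addr_to_node(uint64_t phys_addr)  { return (int)(phys_addr / NODE_SIZE) % NODES; }

/* Called for every memory transaction: how many cycles should it stall? */
int numa_stall_time(int cpu_id, uint64_t phys_addr)
{
    return stall_cycles[cpu_to_node(cpu_id)][addr_to_node(phys_addr)];
}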

Figure 5.4: The hardware for Simics NUMA.

5.6 Analysis of Emark Benchmarks

A set of benchmarks called Emark was used to evaluate and show how changes in the Linux NUMA memory allocation policy impact the computational performance of a computer system. The Emark benchmarks are used to evaluate the performance of different versions of the Erlang Run-Time System (ERTS), and to see how it performs on different machines. The benchmarks have also been used to evaluate the performance of different platforms running the same version of ERTS. In our simulations two of the Emark benchmarks are used: Bang and Big. The first implements all to one message passing and the second all to all message passing. These benchmarks measure the performance in elapsed time (ms) [40]. The results seen below show the performance of the ERTS running on our simulated hardware. All figures include three different simulations, where the first and fastest simulation is run in Simics under normal execution mode, meaning no timing model is in use for the memory accesses. The two other simulations are run under Simics stall mode, where all the memory accesses are stalled for a certain number of cycles, depending on which CPU accesses which memory space. The difference between the two latter simulations run under stall mode is that the first is run without any NUMA policy and the

other is run using the Linux user space shared library interface called libnuma, and in particular the command line utility numactl, to control the NUMA policy. The NUMA policy we are using for the second simulation is to allocate on the local node. The benchmarks have been run several times in order to identify odd or inconsistent behavior. The results presented here are the data acquired from one arbitrary run of the benchmarks, as the different runs did not show any significant variance between themselves. Below, in Figure 5.5, we can see results from the Emark Big benchmark. The benchmark is run on only one core with the use of the Linux taskset CPU affinity utility, and we can see that the benchmark finishes much faster under normal Simics mode, just as expected. Under stall mode the difference is clear between the two different executions. Without any NUMA policy the benchmark takes approximately two times longer to finish, as it allocates memory from slower nodes; the other run uses the NUMA policy and allocates only from the fast local node, so the benchmark finishes much faster.

Figure 5.5: All to all message passing on one core.

Figure 5.6 shows how fast the benchmark finishes with all 4 CPUs in use. Here we can see a similar ratio between the speeds of the two executions as in the previous figure with only one core.

Figure 5.6: All to all message passing on four cores.

A comparison between the first two figures can be seen below in Figure 5.7. The results indicate that, under stall execution mode, the Big benchmark finishes about five to six times faster on the quad-core machine when all four cores are used than when only one core is used. With the local memory allocation policy set, using four cores is about four times faster than using only one core. This is almost exactly the same as under normal execution mode, where memory accesses happen without any stalling. The lines are almost identical because the stalling factor cancels out of the performance ratio:

\[
\text{Performance Ratio} = \frac{run1_{\text{single\_core}}}{run1_{\text{quad\_core}}} \approx \frac{run2_{\text{single\_core}} \times stall_{\text{local}}}{run2_{\text{quad\_core}} \times stall_{\text{local}}}
\]

Since stall_local multiplies both the numerator and the denominator, it cancels, and the speedup ratio is the same as in the unstalled case.

Similar results can be seen with the Bang benchmark. On one core the result is clear and consistent: the benchmark executes twice as fast when allocations are made on the local node. Again, Figure 5.9 shows that local allocation gives the same performance advantage over arbitrary allocation when all four cores are used.

Figure 5.7: Comparison of one core and four cores (Big).

Figure 5.8: All to one message passing on one core.

Comparing the speed of the single-core setup with that of the quad-core setup shows a performance increase by a factor between 1.9 and

Figure 5.9: All to one message passing on four cores.

Figure 5.10: Comparison of one core and four cores (Bang).

From these results we can see that the modifications made to the function returning stall times in the trans-staller module work somewhat as expected. The NUMA policy set with the command-line utility numactl also works as
