SIMULATING NON-UNIFORM MEMORY ACCESS ARCHITECTURE

SIMULATING NON-UNIFORM MEMORY ACCESS ARCHITECTURE FOR CLOUD SERVER APPLICATIONS

Joakim Nylund

Master of Science Thesis
Supervisor: Prof. Johan Lilius
Advisor: Dr. Sébastien Lafond
Embedded Systems Laboratory
Department of Information Technologies
Åbo Akademi University
Autumn 2011

ABSTRACT

The purpose of this thesis is to evaluate and define architectural candidates for cloud based servers. The research focuses on the interconnect and memory topology of multi-core systems. One specific memory design is investigated, and the Linux support for the architecture is tested and analyzed with the help of a full-system simulator with a modified memory architecture. The results demonstrate how available tools in Linux can be used to efficiently run tasks on separate CPUs on large systems with many processing elements.

Keywords: Interconnect, Cloud Computing, NUMA, Linux, Simics

LIST OF FIGURES

2.1  The Memory Hierarchy [35]
2.2  Quad-core AMD Opteron Processor [1]
2.3  Simple SMP System
2.4  Cache Coherence in a Dual-core CPU
2.5  Write invalidate bus snooping protocol [2]
2.6  Write broadcast bus snooping protocol [2]
3.1  Common Network Topologies [3]
3.2  6D Mesh/Torus Architecture [25]
3.3  Open Compute Motherboard based on Intel CPU and QuickPath Interconnect [4]
3.4  Open Compute Motherboard based on AMD CPU and HyperTransport Interconnect [4]
3.5  Next Generation ARM SoC [5]
4.1  The Simplest NUMA System [42]
4.2  Different Motherboard Topologies for the Quad-core AMD Opteron CPU
4.3  ACPI System Locality Information Table (SLIT) [34]
4.4  ACPI Static Resource Affinity Table (SRAT) [34]
5.1  Cache Read and Write of Core-0 and Core-1
5.2  Cluster NUMA Architecture
5.3  A Two Level Cache System [43]
5.4  The hardware for Simics NUMA
5.5  All to all message passing on one core
5.6  All to all message passing on four cores
5.7  Comparison of one core and four cores (Big)
5.8  All to one message passing on one core
5.9  All to one message passing on four cores
5.10 Comparison of one core and four cores (Bang)
6.1  Future Work Illustrated
B.1  Emark Benchmark Results

CONTENTS

Abstract
List of Figures
Contents

1 Introduction
  1.1 Cloud Software Program
  1.2 Thesis Structure

2 Memory Architecture
  2.1 Introduction
  2.2 The Memory Hierarchy
    2.2.1 Primary Storage
    2.2.2 Secondary Storage
  2.3 Locality
  2.4 Shared Memory
    2.4.1 Symmetric Multiprocessing
    2.4.2 Coherence

3 Interconnect
  3.1 Introduction
  3.2 Multiprocessing
  3.3 Network Topology
    3.3.1 High-performance Computing
    3.3.2 Intel QuickPath Interconnect and HyperTransport
    3.3.3 Arteris Interconnect
    3.3.4 Open Compute
  3.4 ARM Architecture
    3.4.1 Cortex-A Series
    3.4.2 Advanced Microcontroller Bus Architecture (AMBA)
    3.4.3 Calxeda 120 x Quad-core ARM Server Chip

4 Non-Uniform Memory Access (NUMA)
  4.1 Introduction
  4.2 Hardware
  4.3 Software
  4.4 Advanced Configuration and Power Interface
    4.4.1 System Locality Information Table
    4.4.2 Static Resource Affinity Table
  4.5 The Linux Kernel
    4.5.1 NUMA Aware Scheduler
    4.5.2 Memory Affinity
    4.5.3 Processor Affinity
    4.5.4 Fake NUMA Nodes
    4.5.5 CPUSET
    4.5.6 NUMACTL

5 Implementation
  5.1 Approach
  5.2 Simics Full-system Simulator
    5.2.1 History and Features
    5.2.2 Working With ARM Architecture
    5.2.3 gcache and trans-staller
  5.3 Oracle VM VirtualBox
  5.4 Erlang
    5.4.1 Distributed Programming
    5.4.2 Erlang Nodes
    5.4.3 Asynchronous Message Passing
  5.5 Results
  5.6 Analysis of Emark Benchmarks

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

Bibliography
Swedish Summary
A Simics Staller Module Implementation
B Emark Results

CHAPTER ONE

INTRODUCTION

As the core count in many architectures is constantly growing, the rest of the system also has to be updated to meet the higher demands the processing units set. One of these demands concerns the interconnect. As more processing power is available, the interconnect has to be able to move significantly more data than before. Another demand concerns the memory. With a traditional setup, there is only one memory available via one interconnect. When several processing elements are continuously asking for data, they are going to spend most of their time waiting, unless memory is available in several places through several paths. This work investigates the Non-Uniform Memory Access (NUMA) design, a memory architecture tailored for many-core systems, and presents a method to simulate this architecture for the evaluation of cloud based server applications. The work also introduces and uses the NUMA capabilities found in the Linux kernel, and results from tests running on a simulated NUMA interconnect topology are presented and analyzed.

1.1 Cloud Software Program

The Cloud Software Program is a four-year research program organized by Tieto- ja viestintäteollisuuden tutkimus TIVIT Oy [6]. The goal of the program is to increase the international competitiveness of Finnish cloud based software. A Cloud Server From Finland could contribute with, among other things, a skillful implementation of a sustainable open system [7]. At the Embedded Systems Laboratory at Åbo Akademi University we provide

the Cloud project with research related to energy efficiency, with the most current interest in both power-smart scheduling and effective interconnects between processing units and memory.

1.2 Thesis Structure

Chapter 2 presents to the reader the common memory architecture of modern computers and embedded systems, with descriptions of different factors affecting the memory in terms of speed and scalability. In Chapter 3 we explain and compare different interconnect topologies and present how the interconnect is implemented in different architectures. A deeper analysis of the Non-Uniform Memory Access system is included in Chapter 4, together with an examination of the Linux kernel's NUMA capabilities. Chapter 5 describes the work process of this thesis and presents the results acquired from Emark Erlang benchmarks running in Simics. Finally, Chapter 6 sums up the presented material and examines possible future work in this field.

CHAPTER TWO

MEMORY ARCHITECTURE

2.1 Introduction

Memory is, and has always been, a fundamental part of every computer system. Today, an increasing number of electronic devices have become more advanced and therefore act more or less like computers, consisting of a processing unit with a specific memory setup. At the same time, as devices designed to do simple tasks have become more complex, the already advanced computer systems have also progressed even further. Computer systems with the latest technology and fast processing elements have become more efficient in many ways over the years, and one of the components that has evolved and grown in complexity is the memory architecture. Still, the principal memory architecture has remained the same for all these years; already in the 1940s, von Neumann stated that a computer memory has to be engineered in a hierarchical manner [24].

2.2 The Memory Hierarchy

Even though the idea and usage of a memory hierarchy has remained the same for years, the size of the hierarchy has increased and the number of layers seems to be slowly but constantly growing. Most recently, new layers of cache memory [33] and even Non-Uniform Memory Access have become present in the memory hierarchy of some general-purpose CPUs [35]. These new layers

of memory have been added to the hierarchy in order to manage the higher demands that the increasing speed and growing number of CPUs place on the whole memory subsystem. The computer memory can basically be divided into two groups. The first one is the primary storage, which consists of fast memory directly accessible by the CPU. This type of memory is volatile, meaning it cannot save its state unless it is powered on. Consequently, when a computer system is powered off or rebooted, all data in the primary storage is lost. The primary storage is also, other than fast, typically small and expensive [35]. Secondary storage is in many ways the opposite of primary storage. Firstly, it is a non-volatile memory. This means it can save its state even when powered off. Accordingly, when a computer system is powered off or rebooted, all the data stored in the secondary storage is saved and accessible the next time the system is started. Secondary storage is often called external memory, and it is characterized by being large, slow and cheap. Secondary storage cannot be accessed by the CPU directly, which makes it even more complicated and slower to access than primary memory [35]. The memory hierarchy of a computer is illustrated in Figure 2.1.

Figure 2.1: The Memory Hierarchy [35].

2.2.1 Primary Storage

The purpose of the hierarchy is to make it possible for the CPU to quickly get access to data, so that the CPU can spend more time working instead of waiting for data. A modern computer system moves data that is needed now, and in the near future, from the slower memory up to the faster memory. This is why the small and expensive fast memory is located near the CPU, so that the processor can quickly access the needed data. The top half of the memory hierarchy in Figure 2.1 is composed of the primary storage, which consists of the following parts:

CPU Registers

The CPU registers lie at the very top of the memory hierarchy. There are only a few processor registers in a CPU, but these are very fast. A computer moves data from other memory into the registers, where it is used for different calculations. Most new computers are based on the x86 instruction set architecture (ISA) and have both hardware and software support for the most recent 64-bit x86 instruction set. This architecture consists of 16 x 64-bit general-purpose registers. In comparison, older computers using the 32-bit x86 instruction set only have 8 x 32-bit general-purpose registers [39]. The popular ARMv7 instruction set architecture, used in basically all smartphones and tablets, has 16 general-purpose registers, all 32 bits long, making it basically a compromise, in terms of CPU registers, between the 32- and 64-bit x86 architectures [23]. When writing a program for some architecture, the compiler usually takes care of which variables are to be saved in CPU registers, but to optimize a system manually, register variables can be declared by the developer in programming languages like C with: register int n; [8].

Cache

Cache memory is the fast memory which lies between the main memory and the CPU. Cache memory stores the most recently accessed data and instructions, allowing future accesses by the processor to the particular data and instructions to be handled much faster. Some architectures, like AMD's Quad-Core Opteron presented in Figure 2.2, have up to three different cache levels (L1-L2-L3). The lower the level, the smaller and faster the memory is. The Opteron processor has separate instruction (64 KB) and data (64 KB) L1

caches for every processing core. We can see illustrated in Figure 2.2 that the 512 KB L2 caches are also private for every core, but the biggest, 8 MB L3 cache is a memory shared by all cores [1].

Figure 2.2: Quad-core AMD Opteron Processor [1].

Main Memory

After the caches comes the main memory in the hierarchy. It usually consists of Dynamic Random-Access Memory (DRAM), which means it is relatively cheap, but still quite fast [35]. Existing mainstream smartphones and computers implementing either the ARM or x86 architecture typically have a main memory of between 1-4 GB [9][10]. The DRAM can easily be upgraded on most computer systems, making it a quick and cheap way of increasing the performance of the system. This may, for instance, allow older computers to be upgraded with modern operating systems, as the newer operating systems often have higher main memory size requirements.

NUMA

This layer of the memory hierarchy is usually not present in a mainstream computer architecture, but will most likely be in the future, as the core count

is constantly growing also for home computers. Today, systems like the AMD Opteron (see Figure 2.2), which is designed for high-performance computing and servers, implement a Non-Uniform Memory Access (NUMA) architecture. This means that every core has local main memory that is near the processing unit, which results in fast memory access times for all cores to their local memory. But all memory nodes are also accessible by the other, more distant cores. To those cores the memory is remote, and the access times are much slower compared to accessing local memory. The speed of both local and remote accesses is always dependent on the particular setup, with the different physical locations of the CPUs and memories [42].

Virtual Memory

Virtual memory makes use of both primary and secondary storage. It simulates a bigger main memory with the help of the slower and cheaper secondary storage. This makes it easier to develop applications, as the programs have access to one big chunk of virtual memory and all fragmentation is hidden [35].

2.2.2 Secondary Storage

The bottom half of the memory hierarchy in Figure 2.1 is composed of the secondary storage. We will only describe the next two layers of the hierarchy, because the rest of the layers are more distant memory systems that are not immediately accessible by the computer. So, the next parts in the hierarchy are:

File Storage

The File Storage layer in computers usually consists of the Hard Disk Drive (HDD), which is a magnetic data storage device. The capacity of a file storage device is significantly higher than the capacity of the primary storage, but accessing the file storage is substantially slower than accessing any primary storage [35].

Network Storage

The last layer is the Network Storage. Here data is stored on an entirely different system, but is still immediately accessible by the computer. Now the bandwidth of the network plays a significant role in the speed and throughput of data transfer [35].

2.3 Locality

The principle of locality is what makes the use of cache memory worthwhile, as the cache saves the most recently used data in a fast memory close to the CPU [35]. There are two common types of locality of reference used in computer architectures: temporal locality and spatial locality. The concept of temporal locality is that if a value is referenced, it is probably going to be referenced again in the near future, as this is the standard case in most running programs. Spatial locality, on the other hand, occurs when one arbitrary memory address is referenced; the physical locations close to it are then probably also going to be referenced soon, because this again is usually the case in an executing program. Therefore, data is moved to the faster memory of a computer system when these conditions are met.

2.4 Shared Memory

Shared memory exists already at the second cache level, as seen with the Opteron processor in Figure 2.2. If we are dealing with multi- or many-core architectures, the main memory will also be a shared memory. Communication and synchronization between programs running on two or several CPU cores is done with the help of shared memories, but as there are several independent systems acting on one space, some issues, which are described next, occur involving both symmetric multiprocessing and coherence.

2.4.1 Symmetric Multiprocessing

If a computer hardware architecture is designed and built as a Symmetric Multiprocessing (SMP) system, one shared main memory is used, which is seen by all cores. All cores are equally connected to the shared memory, as seen in Figure 2.3, and as the number of CPU cores grows, the communication between the CPUs and the single shared memory will also grow. This leads to an interconnect bottleneck, as the CPUs have to wait for the memory and connection to be ready before they can continue working [42].

Figure 2.3: Simple SMP System.

2.4.2 Coherence

The other problematic part of shared memory is coherence. The memory is accessed and modified by several cores, which most likely cache data. This means that a CPU copies data to its cache memory, and when the data in the memory is later modified (see Figure 2.4), all the cache memories that have a copy of that memory location must also be updated in order to keep all the copies of the data the same and the whole system up-to-date. This is handled with the help of a separate system that makes sure that the consistency between all the memories is maintained. The system has a coherency protocol, and depending on the system, the protocol can be implemented in different ways. The hardware coherency protocol found in some systems, like the ARM Cortex-A9 MPCore, has a snoop control unit that is "sniffing" the bus to keep the cache memory lines updated with the correct values. The ARM CPU has a separate Snoop Control Unit (SCU) that handles and maintains the cache coherency between all the processors [27]. Figure 2.5 and Figure 2.6 exemplify a situation where the CPUs cache data from the main memory (X). The setup is a dual-core system with a separate unit snooping the bus, as illustrated in Figure 2.4. There are two common types of bus snooping methods: write invalidate and write broadcast (write update). The most common protocol is write invalidate, and a scenario with this method is described in Figure 2.5. The other method, write broadcast (see Figure 2.6), updates all the cached copies as a write occurs to the data that is cached. The figures describe the processor and bus activity, and the contents of the memory and the caches are shown after each step [2].

Figure 2.4: Cache Coherence in a Dual-core CPU.

1. CPU A reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU A's cache.
2. CPU B reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU B's cache.
3. CPU A writes 1. CPU A's cache is updated to 1. A cache invalidate is sent to CPU B's cached copy.
4. CPU B reads X. A cache miss occurs as the cached copy has been invalidated. The content of CPU A's cache (1) is copied to memory X and to CPU B's cache.

Figure 2.5: Write invalidate bus snooping protocol [2].
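The four steps above can be traced in a few lines of C. The following sketch is only an illustration added for this text, not an implementation of a real hardware protocol: each cache is reduced to a single line holding a copy of X, and the snooping bus is replaced by direct manipulation of the other core's valid bit.

#include <stdio.h>
#include <stdbool.h>

/* One cached copy of memory location X: a value and a valid bit. */
struct cache_line { int value; bool valid; };

static int mem_x = 0;                       /* main memory location X     */
static struct cache_line cache[2];          /* private caches of CPU A, B */

static int cpu_read(int cpu)
{
    if (!cache[cpu].valid) {                /* cache miss                  */
        int other = 1 - cpu;
        if (cache[other].valid)             /* the other cache supplies a  */
            mem_x = cache[other].value;     /* modified copy and memory is */
        cache[cpu].value = mem_x;           /* updated (step 4 above)      */
        cache[cpu].valid = true;
    }
    return cache[cpu].value;
}

static void cpu_write(int cpu, int value)
{
    cache[cpu].value = value;               /* update the local copy       */
    cache[cpu].valid = true;
    cache[1 - cpu].valid = false;           /* invalidate the remote copy  */
}

int main(void)
{
    printf("1. A reads X: %d\n", cpu_read(0));  /* miss, loads 0                */
    printf("2. B reads X: %d\n", cpu_read(1));  /* miss, loads 0                */
    cpu_write(0, 1);                            /* 3. A writes 1, B invalidated */
    printf("4. B reads X: %d\n", cpu_read(1));  /* miss, receives 1 from A      */
    return 0;
}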

1. CPU A reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU A's cache.
2. CPU B reads memory X. A cache miss occurs as the cache is empty. The content of X (0) is copied to CPU B's cache.
3. CPU A writes 1. CPU A's cache is updated to 1. A bus broadcast occurs, and the content of CPU B's cache and memory X is updated to 1.
4. CPU B reads X. The data is located in CPU B's local cache.

Figure 2.6: Write broadcast bus snooping protocol [2].
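The cost of this coherence traffic can be observed from ordinary user code. The program below is an illustration added for this text, not part of the thesis experiments: two threads increment two counters, first placed in the same cache line and then padded onto separate 64-byte lines. On a typical multi-core machine the padded version is clearly faster, because the shared cache line no longer bounces between the cores through invalidations. Compile with gcc -O2 -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ITERS 100000000ULL

/* Two counters that share one cache line (false sharing)... */
static struct { volatile uint64_t a, b; } same_line;
/* ...and two counters padded onto separate 64-byte cache lines. */
static struct { volatile uint64_t a; char pad[56]; volatile uint64_t b; } own_line;

static void *bump(void *arg)
{
    volatile uint64_t *counter = arg;
    for (uint64_t i = 0; i < ITERS; i++)
        (*counter)++;
    return NULL;
}

static double timed_run(volatile uint64_t *x, volatile uint64_t *y)
{
    pthread_t t1, t2;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line:      %.2f s\n", timed_run(&same_line.a, &same_line.b));
    printf("separate cache lines: %.2f s\n", timed_run(&own_line.a, &own_line.b));
    return 0;
}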

CHAPTER THREE

INTERCONNECT

3.1 Introduction

A computer system consists of many hardware parts that are physically connected to each other so they may exchange data. These electrical connections inside the circuits are called the interconnects. As the traffic between some of the hardware parts is constant, the interconnect needs to be able to move a huge amount of data quickly. Energy efficiency is also a common interconnect requirement, making the desired design even harder to accomplish, as it also has to be able to move all the data with a low amount of energy. The high-performance interconnect connecting the CPU and memory together is what we are mostly interested in, and research concerning the interconnect is very important performance-wise. In fact, the interconnect is the single most important factor of a computer architecture when dealing with high-performance computers [31].

3.2 Multiprocessing

Multiprocessing became a natural step in the evolution of computer architecture, as single-core computers were reaching the limits of their performance, with extremely high clock frequencies that were hard to top. A high frequency also makes the CPU less power efficient, and heat dissipation harder and more expensive to handle. The next step was simply to put several cores on one CPU in

order to achieve higher performance. The effect of many-core architecture can be seen as a stress on the interconnect and memory on a whole new scale, forcing the whole system to adapt technologically if the full advantage of the processing power of all the cores is to be utilized. Today, multi-core microprocessors have been used in desktop computers for some five years. The technology in mainstream computers has advanced from two to six cores on a single die. In recent times, multi-core architecture has also been adopted in low-power embedded systems and mobile devices, like smartphones and tablets [5].

3.3 Network Topology

When discussing the distinct arrangement of some computer parts that are interconnected, like memory and CPUs, the hardware parts are usually represented as nodes and the physical relationships, i.e. the interconnect, between the nodes are drawn as branches. A specific setup of nodes and branches represents one network topology. Figure 3.1 shows some general topologies used in computer architectures, and depending on system requirements one topology might be much more efficient and suitable than another [3]. The network topology of the memory hierarchy we discussed in Section 2.2 could be seen as the Linear topology illustrated in Figure 3.1, since the memory hierarchy connects different memory subsystems in a linear fashion. Another example of a network topology is the SMP system discussed and illustrated in Section 2.4.1. The SMP topology can be seen in the topology figure as the Bus topology.

Figure 3.1: Common Network Topologies [3].

3.3.1 High-performance Computing

The fastest computers in the world are ranked in the TOP500 project. Currently the most powerful computer is a Japanese supercomputer known as the K Computer [11]. The K Computer, produced by Fujitsu at the RIKEN Advanced Institute for Computational Science, implements a Tofu interconnect (see Figure 3.2), which is a 6-dimensional (6D) custom mesh/torus architecture [25]. This means that one node somewhere in the center of the topology is connected to the nodes on its left and right (2D), front and back (2D), and also to the ones above and below it (2D), hence the 6D.
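The practical difference between a plain mesh and a torus can be made concrete with a small routing calculation. The sketch below is an illustration with assumed node numbering, reduced to one dimension: in a linear chain a message from node 0 to node n-1 has to cross n-1 links, while the wrap-around link of a torus (ring) roughly halves the worst case. A multi-dimensional torus such as Tofu applies the same idea once per dimension.

#include <stdio.h>
#include <stdlib.h>

/* Hops between two nodes in a 1-D mesh (a simple chain of nodes). */
static int mesh_hops(int a, int b)
{
    return abs(a - b);
}

/* Hops in a 1-D torus (ring) of n nodes: the message may also take
   the wrap-around link, so the shorter direction counts. */
static int torus_hops(int a, int b, int n)
{
    int d = abs(a - b);
    return d < n - d ? d : n - d;
}

int main(void)
{
    int n = 8;                                 /* assumed 8-node ring */
    printf("node 0 -> node 7, mesh:  %d hops\n", mesh_hops(0, 7));     /* 7 */
    printf("node 0 -> node 7, torus: %d hops\n", torus_hops(0, 7, n)); /* 1 */
    return 0;
}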

Figure 3.2: 6D Mesh/Torus Architecture [25].

3.3.2 Intel QuickPath Interconnect and HyperTransport

The world's largest semiconductor chip maker, Intel [12], relies on its own interconnect technology (see Figure 3.3), which it calls QuickPath Interconnect (QPI). This is the key technology maximizing the performance of Intel's x86 multi-core processors. The QuickPath interconnect uses fast point-to-point links between CPU cores inside the processor, and outside the processor the same links are used to connect memory and other CPUs [13]. Another similar interconnect using point-to-point links is HyperTransport (HT). The technology is developed and maintained by The HyperTransport Consortium. AMD's Opteron processor (Figure 3.4) uses HyperTransport bidirectional links to interconnect several CPUs and memory [31].

3.3.3 Arteris Interconnect

The Arteris interconnect is a NoC (Network-on-Chip) interconnect for SoC (System-on-Chip) designs. Its key characteristics are low power and high bandwidth, making it suitable for modern mobile devices implementing complex SoC dies with an increasing number of IP blocks. Therefore many companies have chosen Arteris as the interconnect for their devices. Also, ARM Holdings has invested in Arteris, making it perhaps even more interesting when taking, other

than architectural specifications, also the business and partner perspective into account. One of Arteris' big customers is Samsung, which has selected Arteris interconnect solutions for multiple chips. One of Arteris' interconnect products is called FlexNoC, which is designed for SoC interconnects with low latency and high throughput requirements, and it supports the broadly used ARM AMBA protocols [14]. Another vendor using the Arteris interconnect is Texas Instruments. The L3 interconnect inside the TI OMAP 4 processors connects high throughput IP blocks, like the dual-core ARM Cortex-A9 CPU [41].

3.3.4 Open Compute

Some interesting server architecture work is done by Facebook under a project named Open Compute. They have designed and built an energy efficient data center which is cheaper and more powerful than other data centers. The Open Compute project uses both Intel and AMD motherboards in the servers, and both motherboards are stripped of many features that are otherwise found in traditional motherboards. Still, perhaps the most exciting part is that this project is open source, meaning they share everything [4]. The functional block diagrams and board placements of the Intel and AMD motherboards are illustrated in Figure 3.3 and Figure 3.4. Both the Intel and AMD board diagrams show that the different processors have separate main memory located near the processing units, so they may quickly access the data. QuickPath Interconnect (QPI) and HyperTransport (HT) technologies are used on the boards as the physical connection linking the separate processors and memory.

24 Figure 3.3: Open Compute Motherboard based on Intel CPU and QuickPath Interconnect [4]. 17

25 Figure 3.4: Open Compute Motherboard based on AMD CPU and Hyper- Transport Interconnect [4]. 3.4 ARM Architecture ARM Holdings is a Cambridge based company designing the popular 32-bit ARM instruction set architecture. The current instruction set is ARMv7 and it is implemented in most smartphones and tablets on the market today. One of the key features of the ARM architecture is the excellent power efficiency which makes the architecture suitable for portable devices. ARM operates by licensing its design as IP rather than manufacturing the processors themselves. Today there are several companies building ARM processors: Among others, Nvidia, Samsung and Texas Instruments [15]. 18

3.4.1 Cortex-A Series

The next version of ARM's popular Cortex-A series SoC is described on ARM's webpage as: "The ARM Cortex-A15 MPCore processor is the highest-performance licensable processor the industry has ever seen. It delivers unprecedented processing capability, combined with low power consumption to enable compelling products in markets ranging from smartphones, tablets, mobile computing, high-end digital home, servers and wireless infrastructure. The unique combination of performance, functionality, and power-efficiency provided by the Cortex-A15 MPCore processor reinforces ARM's clear leadership position in these high-value and high-volume application segments." [16]

Figure 3.5 shows an image of the upcoming processor from ARM Holdings. Some of the new features found in the new processor are shown in the block diagram: the Snoop Control Unit (SCU) enabling the 128-bit AMBA 4 coherent bus is perhaps one of the most interesting. Also, what is particularly exciting about the Cortex-A15 is that the 1.5 GHz quad-core configuration of the architecture is specifically designed for low-power servers [16].

Figure 3.5: Next Generation ARM SoC [5].

3.4.2 Advanced Microcontroller Bus Architecture (AMBA)

All System-on-Chip (SoC) integrated circuits from ARM use the Advanced Microcontroller Bus Architecture (AMBA) as the on-chip bus interconnect. The most recent AMBA protocol specification is the Advanced eXtensible Interface (AXI), which is targeted at high-performance systems with high-frequency and low-latency requirements. The AMBA AXI protocol is backward-compatible with the earlier AHB and APB interfaces. The latest AMBA 4 version adds several new interface protocols. Some of these are: the AMBA 4 Coherency Extension (ACE), which allows full cache coherency between processors, ACE-Lite for I/O coherency, and AXI 4, which is designed to increase performance and power efficiency [28].

3.4.3 Calxeda 120 x Quad-core ARM Server Chip

One promising attempt, other than the unreleased future Cortex-A15 processor, to change the x86-dominated server market is made by a company named Calxeda, in which ARM Holdings has shown interest by investing in it. They

28 are building a server chip based on ARM Cortex-A9 MPCore processors. The architecture is based on a standard 2 rack unit (2U) server with 120 quad-core ARM processors. Each ARM node will only consume 5W of power, which is a lot less than any other x86 server [17]. 21

CHAPTER FOUR

NON-UNIFORM MEMORY ACCESS (NUMA)

4.1 Introduction

The Non-Uniform Memory Access architecture is designed for systems with many CPUs. The problem with traditional Symmetric Multiprocessing (SMP) is that it does not scale very well, because all traffic between the cores and memory goes through one place. NUMA is specifically designed to address this issue that occurs in large SMP systems, and solves it with an architecture where separate memory nodes are directly connected to the CPUs [42]. A simple NUMA system is illustrated below in Figure 4.1.

Figure 4.1: The Simplest NUMA System [42].

4.2 Hardware

A full NUMA system consists of special hardware (CPU, motherboard) that supports NUMA. There are many different types of motherboard architectures for one CPU family. Below, in Figure 4.2, we can see four different topologies for the AMD Opteron CPU. The block diagram is an approximation of the real motherboards. The interconnect between the processors follows a pattern where every processor is connected to two other processors and its local memory. For some architectures the interconnect is obvious (see Tyan K8QS Thunder Pro S4882), but for the other architectures with a more irregular setup, the interconnect could be manufactured in different ways. As the exact block diagrams of the interconnect were not found, the interconnect has been left out of the figure. As the distance between the cores and their remote nodes varies a lot between the different architectures, the performance is also going to be different. Some motherboards might be suitable for a specific application that does not need a significant amount of memory, but has much traffic between the memory and the processing unit. Another architecture might be optimal for a specific server that has separate computing-intensive applications running on all the CPUs, where all the applications need a lot of shared memory.

31 Figure 4.2: Different Motherboard Topologies for the Quad-core AMD Opteron CPU. A NUMA system without the NUMA hardware could basically be implemented with the help of virtual memory (see Section 2.2.1). Most systems have a Memory Management Unit (MMU), which is a hardware part that all memory transactions from the CPU s go through. The virtual addresses from the CPU s are translated by the MMU to physical addresses. This way a computer cluster without NUMA hardware could take advantage of a programmable MMU and virtual memory to run a software NUMA system which uses both local and remote memory, where remote memory would be the memory of another computer connected to the same cluster. 24

4.3 Software

To achieve a fast NUMA system, the Operating System (OS) running the software part also has to be NUMA aware. It is equally important to have NUMA aware software as it is to have the physical NUMA hardware. The kernel, which is the main component of an OS, has to allocate memory to a process in the most efficient way. To do this, the kernel needs to know the distances between all the nodes and then calculate and apply an efficient memory policy for all the processes. The scheduler is a software part of every OS kernel that handles access to the different resources in the system. Some schedulers, like the most recent Linux scheduler, use different priority levels that are given to tasks. This way important and real-time tasks can access resources like the processor before other tasks that are less important. The scheduler also uses load balancing in order to evenly distribute the workload to all the processors. This basically means that for a NUMA aware OS to work efficiently on NUMA hardware, the scheduler also needs to be able to parse the distance information of the underlying NUMA topology. The tasks should then be distributed accordingly and executed with NUMA scheduling and an efficient memory placement policy [30]. As an example, if a fair scheduler were not aware of an underlying quad-core NUMA hardware with 2 GB of local main memory per CPU, the tasks would still be evenly distributed to the four different processors, making them all work. As the OS would not be aware of the different physical locations of the main memory, the tasks would be executed on the processor with the most free time. This would result in inefficient memory usage, as a random memory access would be remote, and therefore slow, 75 % of the time (three of the four memory nodes are remote to any given CPU).

4.4 Advanced Configuration and Power Interface

Some major companies (HP, Intel, Microsoft, Phoenix and Toshiba) have together engineered a standard specification called the Advanced Configuration and Power Interface (ACPI). It is an open standard, for the x86 architecture, which describes the computer hardware and power management to the operating system. The ACPI defines many different tables that are filled with useful information that the OS reads and uses. Some of these tables hold information about the memory and CPUs and the distances between these on a NUMA machine. These tables are what we are interested in, and they are described next [34].

4.4.1 System Locality Information Table

One of the two important tables in the ACPI specification concerning NUMA hardware is the System Locality Information Table (SLIT). The SLIT is an optional ACPI table that holds information about the distances between all the processors, memory controllers and host bridges. The table holds the distances in a matrix covering all the nodes. The unit of distance is relative to the local node, which has the value of 10. The distance between node 1 and node 4 could for instance be 34. This would mean it takes 3.4 times longer for node 1 to access node 4 than to access its own local memory. Figure 4.3 gives the SLIT specification [34].
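On a Linux machine the distance matrix exported through the SLIT can be read back from user space, either from /sys/devices/system/node/node*/distance or through the libnuma library. The short program below is a sketch using libnuma (it assumes the library is installed and is linked with -lnuma); on the example above it would print 10 on the diagonal and values such as 34 for the more distant node pairs.

#include <stdio.h>
#include <numa.h>                   /* libnuma, link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;

    /* Print the SLIT-style distance matrix; 10 means "local". */
    for (int from = 0; from < nodes; from++) {
        for (int to = 0; to < nodes; to++)
            printf("%4d", numa_distance(from, to));
        printf("\n");
    }
    return 0;
}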

Figure 4.3: ACPI System Locality Information Table (SLIT) [34].

4.4.2 Static Resource Affinity Table

The other vital ACPI table needed to export NUMA information is the Static Resource Affinity Table (SRAT). The SRAT (Figure 4.4) describes and stores a list of the physical location information about the CPUs and memory. As the

Figure 4.4: ACPI Static Resource Affinity Table (SRAT) [34].

SLIT holds the distances between all the nodes, the SRAT actually describes which and where these nodes physically are [34].

4.5 The Linux Kernel

The Linux kernel has had support for Non-Uniform Memory Access on some architectures since 2004 [30]. The kernel uses the ACPI SLIT definition to get the correct NUMA distances and then applies a certain NUMA policy to achieve optimal performance [26].

4.5.1 NUMA Aware Scheduler

A fundamental part of every modern OS kernel is the scheduler. In a multi-core system the scheduler has to decide which process should run on which CPU, and dynamic load balancing is done by the scheduler as it can migrate

processes from one core to another. On a NUMA system, where memory access times depend on which CPU accesses which memory, scheduling becomes even more complex, yet more important [18]. As of kernel 2.5 the scheduler has been a multi-queue scheduler which implements separate runqueues for every CPU core. It was called the O(1) scheduler, but it still had no awareness of NUMA nodes until later, when parts of the O(1) scheduler and parts of another NUMA aware scheduler were combined into a new, more optimal scheduler [18]. The current Linux scheduler is named the Completely Fair Scheduler (CFS). The CFS scheduler maximizes CPU utilization and schedules tasks fairly among all the available cores in order to maximize performance. In situations where the number of running tasks is less than the number of logical processors, the scheduling can be tuned with a power saving option. The sched_mc_power_savings option is disabled by default, but can easily be enabled with:

echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

This will change the scheduler behavior, so that new tasks are distributed to other processors only when all the cores of the first processor are fully loaded and can not handle any new tasks [19].

4.5.2 Memory Affinity

Memory affinity is done when the memory is split into spaces, and these spaces are then made accessible to predefined CPUs. In a NUMA system this affinity is called node affinity, where the kernel tries to keep a process and its children running on a local node [26].

4.5.3 Processor Affinity

In Linux a program called taskset can be used to retrieve or set a process's CPU affinity. Using the Process IDentifier (PID) of a process, the taskset utility can be used to bypass the default scheduling applied by the Linux scheduler. The program can also be used to run a command with a given CPU affinity [37]. As an example, the command below sets a CPU affinity for program1, forcing it

to use only CPU 3:

taskset -c 3 program1

4.5.4 Fake NUMA Nodes

If a system lacks NUMA hardware, the 64-bit Linux kernel can be built with options that enable a fake NUMA configuration. The kernel does not have fake NUMA enabled by default, but users can manually compile the kernel with the two following options:

CONFIG_NUMA=y
CONFIG_NUMA_EMU=y

The final step to start the Linux kernel with a fake NUMA system is to modify the boot loader with: numa=fake=x, where x is the number of NUMA nodes. This way the kernel splits the memory into x equally large parts. Alternatively, the size of the NUMA nodes' memory can be specified with: numa=fake=x*y, where y is the size of the memory nodes in MB. As an example, we could start a system with 4 CPU cores and 3 GB of memory. If we want to split the memory into 2 nodes of 512 MB each and 2 other nodes with 1 GB each, we start the kernel with:

numa=fake=2*512,2*1024

4.5.5 CPUSET

The Linux kernel includes a feature called cpuset. Cpusets provide a useful mechanism for assigning a group of processors and memory nodes to certain defined tasks. A task has a cpuset which forces the CPU and memory placement policy to follow the current cpuset's resources. Cpusets are especially useful on large many-core systems with complex memory hierarchies and NUMA architecture, as scheduling and memory management become increasingly hard on these systems. Cpusets represent different sized subsystems that are especially useful on web servers running several instances of the same application. The default kernel scheduler uses load balancing across all CPUs, which actually is not a good option for at least two specific types of systems [32]:

1. Large systems: "On large systems, load balancing across many CPUs is expensive. If the system is managed using cpusets to place independent jobs on separate sets of CPUs, full load balancing is unnecessary." [32]

2. Real-time systems: "Systems supporting realtime on some CPUs need to minimize system overhead on those CPUs, including avoiding task load balancing if that is not needed." [32]

Below is an example from the documentation where a sequence of commands will set up a cpuset named "Charlie" containing CPUs 2 and 3 and Memory Node 1, and after that start a subshell sh in that cpuset [32]:

mount -t cgroup -ocpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpus
/bin/echo 1 > mems
/bin/echo $$ > tasks
sh
# The subshell sh is now running in cpuset Charlie
# The next line should display /Charlie
cat /proc/self/cpuset

4.5.6 NUMACTL

When running a NUMA configured machine in Linux, the cpuset feature can be extended with another program called numactl. As one NUMA node typically consists of one CPU and one memory part, the separate NUMA policy feature is necessary, since with cpusets the CPU does not necessarily have local memory. Using numactl, one can set a certain NUMA policy for a file or process within the current cpuset and in that way tune the memory management for a certain application on a NUMA architecture [36]. An example of using numactl, where the process execute is run on node 3 with memory allocated on nodes 1 and 5:

numactl --cpubind=3 --membind=1,5 execute
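The same policy can also be applied from inside a program through libnuma, the shared library that numactl itself builds on. The fragment below is a sketch of a rough equivalent of the command above (the node numbers are taken from the example and the program name execute is only a placeholder): it binds execution to node 3, restricts allocations to nodes 1 and 5, and then starts the workload, which inherits the policy.

#include <stdio.h>
#include <unistd.h>
#include <numa.h>                   /* libnuma, link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* --cpubind=3 : run only on the CPUs of node 3. */
    if (numa_run_on_node(3) != 0)
        perror("numa_run_on_node");

    /* --membind=1,5 : allocate memory only on nodes 1 and 5. */
    struct bitmask *mem_nodes = numa_parse_nodestring("1,5");
    if (mem_nodes == NULL) {
        fprintf(stderr, "invalid node string\n");
        return 1;
    }
    numa_set_membind(mem_nodes);
    numa_free_nodemask(mem_nodes);

    /* Start the placeholder workload; it inherits the NUMA policy. */
    execlp("./execute", "execute", (char *)NULL);
    perror("execlp");
    return 1;
}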

CHAPTER FIVE

IMPLEMENTATION

This chapter explains the work process of this thesis: a short description of what we tried to accomplish and which tools we used, ending with the obtained results. Conclusions and discussion are included in the next and final chapter.

5.1 Approach

The approach has from the beginning been to explore the capabilities of the Simics simulator, to set up a suitable target machine with Simics, and to evaluate the performance of certain multi-core architectures with a focus on the hardware interconnect and memory topology. Later, as the NUMA design was analyzed more closely, the work concentrated on the Linux kernel and its NUMA features (Section 4.5).

5.2 Simics Full-system Simulator

Wind River Simics is a simulator capable of fully emulating the behavior of computer hardware, meaning one can install and run unmodified operating systems and software on supported architectures (x86, ARM, MIPS, ...) [20].

5.2.1 History and Features

Wind River, a subsidiary of Intel Corporation, has been producing software for embedded systems since 1981 and has its technology in more than one billion products. Simics was first developed at the Swedish Institute of Computer Science, and later, in 1998, Virtutech was established so that commercial development of Simics could be started. In 2010 Virtutech was acquired by Intel, and Simics is now marketed through Wind River Systems [21]. Simics is a fast simulator that has the capability to save the state of a simulation. This state can later be opened again and the simulation continued from the same moment. Every register and memory location can easily be viewed and modified, and the whole system can even be run backwards in order to find a bug. A system can conveniently be built with a script file which can include information like memory size, CPU frequency, core count and network definitions like IP and MAC addresses. In Figure 5.1, a system with two cores and cache memory running a program in Linux is being monitored with the help of the data caches. Both cores have private data caches, and the read and write transactions to the caches are plotted. This way a user can visualize the system better and see which CPU core is doing the work. In Figure 5.1 the program is single threaded, so all calculations seem to first be executed on Core-0 and later, at approximately simulated seconds, the thread is migrated to Core-1.

Figure 5.1: Cache Read and Write of Core-0 and Core-1.

It is possible to run several targets at the same time in Simics, and these can, for instance, easily be interconnected with Ethernet. The minimum latency for a distributed system can be changed on the Simics command line, and if we want a system with 50 milliseconds latency, the following command should be

used:

running> set-min-latency 0.05

The traffic in a distributed system can be monitored in Simics by enabling pcapdump capture. At one point we looked at the traffic between two Erlang nodes running on different targets, communicating with message passing. The pcapdump was enabled on the service node which provided the virtual network. The traffic data was written to a file, which we analyzed on the host OS in a program called Wireshark. The idea was to get a more accurate understanding of the traffic that occurs in a distributed Erlang system. In order to research and evaluate different interconnects, the whole system and its traffic need to be carefully examined. But, as it would have required a lot more time and resources to fully achieve the task of correctly analyzing the traffic in a distributed Erlang system running cloud based server applications, this work was dropped. Still, as the feature was found and tested to work, the knowledge could help give directions for future work.

5.2.2 Working With ARM Architecture

As earlier research in the Cloud project has proposed a low-power server architecture implementing the ARMv7 instruction set, we used ARM targets in Simics at the beginning. Some difficulties were encountered during the work with the ARM target machine, and these are described below:

1. Full-featured Distribution

The need for a full-featured operating system running on the ARM target was apparent, as the benchmarks we are using in our research are implemented in the Erlang programming language, which requires, among other things, GNU components and Perl. The target machine available was a Single-Board Computer (SBC) from Technologic Systems (TS-7200) running a compact Linux distribution with some Unix tools available through BusyBox. The TS-7200 is based on an old ARM architecture with the ARM9 CPU. Booting a full Linux distribution is still possible from the CompactFlash (CF) card, and Debian 3.1 Sarge is available to download for the TS-7200. The installation of the Debian Sarge distribution was not trivial. Therefore, the different phases of booting a full distribution in the Simics

ARM target are described next. The first step was to create a new empty raw file. This was done in Linux with the dd tool:

$ dd if=/dev/zero of=/home/joaknylu/image bs=1k count=

Next, a file system has to be created:

$ /sbin/mkfs.ext3 -N /home/joaknylu/image

After this we loopback mount it with:

$ sudo mount -o loop,rw /home/joaknylu/image /home/joaknylu/mount-folder/

Now the downloaded distribution can be extracted into the mount:

$ sudo tar jxvf /home/joaknylu/debian-sarge-256mb.tar.bz2 -C /home/joaknylu/mount-folder/

After this, we have to unmount the file system inside the file:

$ sudo umount /home/joaknylu/mount-folder/

And finally use the Simics craff tool to compress and create the image file needed:

$ ./craff /image -o debian-256mb-sarge.craff

This way we have successfully compressed and converted the downloaded distribution to a file that should be bootable in Simics. As the distribution could not be successfully started in Simics, it turned out that more files were needed. The OS could not be started, as the image file needed by Simics to correctly boot the distribution has to be made up of several files, including not only the distribution image but also the kernel (bzimage) and the bootloader. This made a simple sounding task hard, and it was never fully achieved.

2. Memory Management Unit (MMU)

The TS-7200 SBC target has an Ethernet interface implemented, allowing several machines to be easily interconnected with Ethernet in Simics. And, as the ARM9 CPU has a Memory Management Unit (MMU), we looked at taking advantage of virtual memory (more detailed information in Section 4.2), and the implementation of a cluster NUMA architecture in software was analyzed. Memory could quickly be shared among several interconnected targets running in Simics with the use of the Simics command line utility. Running the following commands edits the memory space and forces one of the two targets to use both local and remote memory, where remote memory is the memory located on the other board:

running> board_cmp0.phys_mem.del-map base = 0x0 device = board_cmp0.sdram
running> board_cmp0.phys_mem.add-map base = 0x0 device = board_cmp1.sdram offset = 0x0 length = 0x

Running these commands makes the target crash, so a cluster NUMA architecture can not be simulated this easily. The obvious problem here is that the operating systems running on each target are not aware of each other, so the crashing of the kernel is expected. In order for it to work, some sort of device would have to be made, acting as a physical device monitoring the traffic to the memory. Figure 5.2 below illustrates how a cluster NUMA system would work, but regardless of the issues with the hardware simulations, the problem here still remained the evident lack of a full-featured distribution.

Figure 5.2: Cluster NUMA Architecture.

3. Multiprocessing

A multiprocessing architecture is a logical part of every cloud based system, so the need to simulate such a system is a necessity. As the ARM

TS-7200 target only has one CPU, and no models with ARM SMP are available in the Simics (4.2) academic package, an ARM SMP model would have to be built in order to simulate a multi-core architecture. This could be done, for instance, by including a custom control unit that handles the synchronization of the cores. The need for a full-featured distribution still remains, and a Linux Board Support Package (BSP) would also have to be written in order to make the custom ARM multi-core architecture run. This could behave like a newer regular ARM SMP, but the architecture would regardless not be the same.

5.2.3 gcache and trans-staller

The g-cache module is a standard cache model used in Simics. The g-cache simulates the behavior of the cache memory with the help of the timing model interface. Simics has to be run in stall mode in order for the staller to have an effect. Stall mode is supported for some architectures in Simics, like x86, but it is not supported for the ARM instruction set. In Figure 5.3 the g-cache module represents a memory hierarchy with a two level cache structure. The id-splitter is a separate module that splits the instruction and data accesses and sends them to separate L1 caches. The splitters are there to avoid accesses that can cross a cache-line boundary, by splitting these into two accesses. The L1 caches are connected to the L2 cache, which is finally connected to the trans-staller, which simulates the memory latency [43]. This module is important, as we are interested in the delays or latencies that always occur when dealing with interconnects. The trans-staller module source is available in Simics, so modifications can be made to it. This allows us to change the standard trans-staller behavior to a more complex one. In order to simulate NUMA behavior, the trans-staller needs to be modified so that it returns different memory latency times depending on which core or CPU accesses which memory space, as different memory spaces are physically located in different places.

5.3 Oracle VM VirtualBox

Oracle VM VirtualBox is an x86 virtualization tool that allows users to install different guest operating systems within the program. VirtualBox allowed us to quickly install and recompile a Linux kernel. We installed

Figure 5.3: A Two Level Cache System [43].

Ubuntu Server on VirtualBox and recompiled the kernel with the NUMA emulation set. This meant that the standard configuration had to be modified and NUMA and NUMA_EMU had to be set before compilation. Some programs and benchmarks, including Erlang and numactl, were also installed. One of the most important reasons to use a virtualization tool for this task is that the whole disk image is available and can easily be converted. In this case the VirtualBox Disk Image (.vdi) was converted with the VirtualBox command-line interface, vboxmanage, to a raw disk image (.raw). This makes it possible to use the Simics craff tool to convert and compress the raw disk image to a Simics .craff image file. A Simics script is then modified to open the new craff image file, allowing the simulated machine to be started with the desired distribution and kernel.

5.4 Erlang

This section presents the Erlang programming language and explains some features of the language. Erlang is designed by Ericsson and is well suited for distributed concurrent systems running fault-tolerant applications. Erlang runs programs fast on multi-core computers and has the ability to modify a running application without closing the program [29]. These are some interesting features which make Erlang suitable for cloud based servers.

5.4.1 Distributed Programming

Distributed programming is done in Erlang with the help of libraries. Distributed programs are implemented to run on a distributed system that consists of a number of Erlang runtime systems communicating with each other. The running processes use message passing to communicate and synchronize with each other [29].

5.4.2 Erlang Nodes

Distributed Erlang is implemented with nodes. One can start several Erlang nodes on one machine or on different machines interconnected over Ethernet. As an example, two nodes are started on the same machine with two terminals open. Each node is running in a different terminal and both nodes have their own name [29]:

Terminal-1:
$ erl -sname node1

Terminal-2:
$ erl -sname node2

With this setup, the user can put to use the Remote Procedure Call service (rpc) in order to, from one node, call a function located on the other node. The rpc call will obtain the answer from the remote node and return it to the calling node.

5.4.3 Asynchronous Message Passing

As Erlang processes share no memory, they communicate with message passing. The message passing is asynchronous, as the sender does not wait for the receiver to be ready. Each process has a mailbox queue where incoming messages are saved until received by the process. In Erlang, the operator "!" is used to send messages. The syntax of "!" is [22]:

Pid ! Message

5.5 Results

In our simulations the inclusion of cache memory was irrelevant, so the full g-cache module with separate timing models was not used. Instead, a simpler script was written that directly uses the trans-staller for all memory accesses:

# Connect phys_mem with the NUMA staller
staller = pre_conf_object("staller", "trans-staller")
SIM_add_configuration([staller], None)
conf.phys_mem.timing_model = conf.staller

The last line connects the staller to the memory space. Simics is then started in stall execution mode, in order to make it possible to stall memory accesses. The trans-staller has been modified to simulate the NUMA behavior of the system. An algorithm was added to the trans-staller that checks which CPU id accesses which memory space (see Appendix A; a simplified sketch of the idea is shown below). The stall times returned by the trans-staller are now adjusted to the corresponding latency of a certain NUMA hardware configuration. In our simulations, the system is a 64-bit x86 machine with 4 CPUs. The operating system is the 64-bit Ubuntu Server with a recompiled Linux kernel with NUMA emulation set. The Linux kernel is started with 4 x 128 MB NUMA nodes. The NUMA scheduling and memory placement policy is set with numactl, and in order to ensure memory locality, the policy is simply set to --localalloc, so only the fastest local memory is used by the CPUs. In comparison, the tests are also run without a NUMA policy, where allocations are made on an arbitrary node. The hardware is a symmetric NUMA system with three different node latencies. Below in Figure 5.4 is the NUMA hardware with the SLIT-like table showing distance information between the nodes.
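The actual module code is listed in Appendix A. The fragment below is only a simplified sketch of the idea in plain C, with node count, memory layout and cycle values assumed for illustration: every memory transaction is mapped from its physical address to a NUMA node, and the stall time returned to Simics depends on the distance between the initiating CPU and that node.

#include <stdint.h>

#define NODES     4
#define NODE_SIZE (128u * 1024 * 1024)      /* assumed 128 MB of RAM per node */

/* Assumed stall times in cycles, indexed like a SLIT matrix:
   [node of the CPU][node of the memory]. 20 = local, larger = more remote. */
static const int stall_cycles[NODES][NODES] = {
    {  20,  60,  60, 100 },
    {  60,  20, 100,  60 },
    {  60, 100,  20,  60 },
    { 100,  60,  60,  20 },
};

/* One CPU per node in this simple configuration. */
static int cpu_to_node(int cpu_id)           { return cpu_id % NODES; }
static int addr_to_node(uint64_t phys_addr)  { return (int)(phys_addr / NODE_SIZE) % NODES; }

/* Called for every memory transaction: how many cycles should it stall? */
int numa_stall_time(int cpu_id, uint64_t phys_addr)
{
    return stall_cycles[cpu_to_node(cpu_id)][addr_to_node(phys_addr)];
}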

Figure 5.4: The hardware for Simics NUMA.

5.6 Analysis of Emark Benchmarks

A set of benchmarks called Emark was used to evaluate and show how changes in the Linux NUMA memory allocation policy impact the computational performance of a computer system. The Emark benchmarks are used to evaluate the performance of different versions of the Erlang Run-Time System (ERTS), and to see how it performs on different machines. The benchmarks have also been used to evaluate the performance of different platforms running the same version of ERTS. In our simulations two of the Emark benchmarks are used: Bang and Big. The first implements all to one message passing and the second all to all message passing. These benchmarks measure the performance in elapsed time (ms) [40]. The results seen below show the performance of the ERTS running on our simulated hardware. All figures include three different simulations, where the first and fastest simulation is run in Simics under normal execution mode, meaning no timing model is in use for the memory accesses. The two other simulations are run under Simics stall mode, where all the memory accesses are stalled for a certain number of cycles, depending on which CPU accesses which memory space. The difference between the two latter simulations run under stall mode is that the first is run without any NUMA policy and the

other is run using the Linux user space shared library interface called libnuma, and in particular the command line utility numactl, to control the NUMA policy. The NUMA policy we are using for the second simulation is to allocate on the local node. The benchmarks have been run several times in order to identify odd or inconsistent behavior. The results presented here are the data acquired from one arbitrary run of the benchmarks, as the different runs did not show any significant variance between themselves. Below, in Figure 5.5, we can see results from the Emark Big benchmark. The benchmark is run on only one core with the use of the Linux taskset CPU affinity utility, and we can see that the benchmark finishes much faster under normal Simics mode, just as expected. Under stall mode the difference is clear between the two different executions. Without any NUMA policy the benchmark takes approximately two times longer to finish, as it allocates memory from slower nodes; the other run uses the NUMA policy and allocates only from the fast local node, so the benchmark finishes much faster.

Figure 5.5: All to all message passing on one core.

Figure 5.6 shows how fast the benchmark finishes with all 4 CPUs in use. Here we can see a similar ratio between the speeds of the two executions as in the previous figure with only one core.

Figure 5.6: All to all message passing on four cores.

A comparison between the first two figures can be seen below in Figure 5.7. The results indicate that, under stall execution mode, the Big benchmark finishes about five to six times faster on the quad-core machine when all four cores are used than when only one core is used. With the local memory allocation policy set, using four cores is about four times faster than using only one core. This is almost exactly the same as under normal execution mode, where memory accesses happen without any stalling. The lines are almost identical because the stalling factor cancels out of the performance ratio:

\[
\text{Performance Ratio} = \frac{run1_{\text{single\_core}}}{run1_{\text{quad\_core}}} \approx \frac{run2_{\text{single\_core}} \times stall_{\text{local}}}{run2_{\text{quad\_core}} \times stall_{\text{local}}}
\]

Since stall_local multiplies both the numerator and the denominator, it cancels, and the speedup ratio is the same as in the unstalled case.

Similar results can be seen with the Bang benchmark. On one core the result is clear and consistent: the benchmark executes twice as fast when allocations are made on the local node. Again, Figure 5.9 shows that local allocation gives the same performance advantage over arbitrary allocation when all four cores are used.

Figure 5.7: Comparison of one core and four cores (Big).

Figure 5.8: All to one message passing on one core.

Comparing the speed of the single-core setup with that of the quad-core setup shows a performance increase by a factor between 1.9 and

Figure 5.9: All to one message passing on four cores.

Figure 5.10: Comparison of one core and four cores (Bang).

From these results we can see that the modifications made to the function returning stall times in the trans-staller module work somewhat as expected. The NUMA policy set with the command-line utility numactl also works as
