Measuring Cache and Memory Latency and CPU to Memory Bandwidth




White Paper
Joshua Ruggiero
Computer Systems Engineer
Intel Corporation

Measuring Cache and Memory Latency and CPU to Memory Bandwidth
For use with Intel Architecture

December 2008
321074

Executive Summary

With today's computer architecture industry focused on clock speed and processing core count, it is easy to lose sight of other factors that play important roles in overall system performance. Memory latency and bandwidth are two major factors of performance, and they apply to two areas of system architecture: processor cache and system memory. The open source LMBench benchmarking suite provides utilities to accurately measure both. These measurements are good indicators of the impact of tuning changes, allow data from one platform to be compared to another, and can identify other potential areas of performance improvement. The open source model adds no extra cost to benchmarking and makes measurements portable between customers and the manufacturer, between different teams within the same organization, or even between separate members of the same team.

Contents

Background
Measuring Latency and Bandwidth
    Introduction to the Tools
    How to Take Measurements
    Understanding the Results
Conclusion

Background

In an industry so dependent on performance, it is easy to focus on the speed of the CPU or the number of cores per socket. However, memory performance is equally critical to total system performance. If a CPU is not properly matched with the correct memory, the CPU can go underutilized, causing decreased efficiency and poor overall performance. Latency and bandwidth are the two key factors governing how a computer system retrieves data, and both play large roles in system performance and data manipulation.

In the Intel architecture three-chip design, the CPU connects to the Memory Controller Hub (MCH) through a bus called the system bus or Front Side Bus (FSB). The MCH then connects to the I/O Controller Hub (ICH) through a link called the Enterprise South Bridge Interface (ESI). The block diagram in Figure 1 shows an example of these connections, with a dual-processor configuration (referred to as P1 and P2) connected to the Intel 5100 MCH. The CPU requests data by issuing a memory read on the FSB. This read travels from the CPU, through the FSB, into the MCH, and over to the memory location determined by the memory controller. Latency refers to the time it takes for this request to be processed and for the first portion of data to arrive back at the requesting agent, in this case the CPU. Latency applies to many different paths within computer architecture, including reads from I/O devices; this paper discusses latency only in terms of CPU-to-memory and CPU-to-cache accesses.

Intel also supports a System-on-a-Chip (SoC) design, in which the classic three-chip architecture is integrated into a single package. As shown in Figure 2, the CPU continues to connect to the memory controller through the Front Side Bus. The methodology used in determining latency is the same for both the three-chip and SoC architectures.
For the purpose of this paper, bandwidth is the measured amount of data that passes through any given point per unit of time. Represented as a rate, bandwidth can also be referred to as throughput and is given in units of megabytes per second (MB/sec), where a megabyte is defined as 2^20, or 1,048,576, bytes.
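As a worked illustration of this convention, the sketch below (a hypothetical helper, not part of LMBench) converts a byte count and an elapsed time into MB/sec using the 2^20-byte megabyte:

```python
MEGABYTE = 2 ** 20  # 1,048,576 bytes, the convention used throughout this paper

def bandwidth_mb_per_sec(nbytes: int, seconds: float) -> float:
    """Return throughput in MB/sec for `nbytes` transferred in `seconds`."""
    return nbytes / MEGABYTE / seconds

# Example: 512 MB moved in 0.25 s is 2048 MB/sec.
print(bandwidth_mb_per_sec(512 * MEGABYTE, 0.25))  # -> 2048.0
```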

Figure 1. Intel 3-Chip design with Multiprocessor

Figure 2. Block Diagram for the Intel EP80579 Integrated Processor

Numerous factors affect latency and bandwidth. Intel architecture uses a hierarchy to organize the memory structure. In the current architecture, the instruction and data caches reside at the highest level. This region, generally small and referred to as Level 1 (L1) cache, resides at the core level. The L1 cache is divided between data and instruction (code) caches, which are typically of equal size. A larger section, referred to as Level 2 (L2) cache, has slower access times than L1; it resides in the same package as the cores and is shared among them. Beyond cache is system memory. Because system memory, or DRAM, resides outside the processor, across the Front Side Bus and behind the memory controller hub, it has greater latency than the local caches inside the CPU package. Refer to Figure 3 for a block diagram of the layout inside an Intel Core 2 Duo processor.

Figure 3. Simple block diagram of major components of the Intel Core 2 Duo Processor (each core contains execution units and pipelines, a data cache, and a code cache; the cores share a Level 2 cache and a Front Side Bus unit)

Measuring Latency and Bandwidth

Introduction to the Tools

To get a better understanding of system performance, software can measure latency and bandwidth. By utilizing tools from the open source community, measurements can be taken and then reproduced at no added cost, making benchmarking portable across all parties involved in a project. Using open source Linux in combination with these utilities, a team can present easily reproducible data without the hassle of proprietary software or other tools.

A specific suite of open source utilities measures cache latency, memory latency, cache bandwidth, and memory bandwidth. This suite is called LMBench. LMBench contains utilities for many different benchmarks, but two in particular are of concern when taking these measurements. Background information on the tools can be found here: http://www.bitmover.com/lmbench. The LMBench suite itself can be found here: http://www.bitmover.com/lmbench/lmbench3.tar.gz

It is important to measure the whole data path. If a utility measured latency to local processor cache only, it would provide just a portion of the picture. The lat_mem_rd utility from LMBench measures all levels of cache and into main memory, as specified by the arguments passed to it. It is equally important to measure bandwidth at the different levels; an LMBench utility called bw_mem takes these measurements.

How to Take Measurements

Measuring Latency. Download and unzip the LMBench utility, then install it by executing the makefile. Use the executable binary called lat_mem_rd, found in the bin folder of the utility's directory, with the following command line:

># taskset 0x1 ./lat_mem_rd -N [x] -P [y] [depth] [stride]

Here [x] is the number of times the process is run before reporting latency; setting this to 1 is typically sufficient for accurate measurements. For the -P option, [y] is the number of processes invoked to run the benchmark; the recommendation is always 1, since it is sufficient to measure access latency with only one processing core or thread. The [depth] parameter indicates how far into memory the utility will measure. To ensure an accurate measurement, specify an amount that goes far enough beyond the cache that cache does not factor into the latency measurements. Finally, [stride] is the amount of memory skipped before the next access. If [stride] is not large enough, modern processors can prefetch the data, producing artificially low latencies for the system memory region.
If a stride is too large, the utility will not report correct latencies because it will skip past measured intervals. The utility defaults to 128 bytes when [stride] is not specified. Refer to Figure 4 for a graph of the results. Binding to the first core and accessing only up to 256 MB of RAM, the command line looks as follows:

taskset 0x1 ./lat_mem_rd -N 1 -P 1 256M 512

This runs through each level of memory and reports the latency in nanoseconds. Applying the taskset feature of Linux in a multi-core environment ensures that the utility uses the same processing core for each individual run. If the numbers are unrealistic, increase the stride to keep CPU prefetchers from artificially lowering the measured latencies. Stride should be between 128 and 512 bytes; depth can be anywhere between 64 MB and 256 MB, ensuring the utility goes deep enough into memory that cache does not affect the result.

Measuring Bandwidth. To measure bandwidth, LMBench provides a utility called bw_mem. The following command line measures bandwidth:

># ./bw_mem -N [x] -P [y] [size] [traffic type]

The command line structure is similar to the latency utility: [x] is the number of times the process runs before reporting the result. The [y] parameter specifies the number of running processes used to measure memory access, and is more critical with this utility. The [size] parameter is the amount of memory used for the measurement; the actual allocation is double this, because the utility allocates twice the specified amount and zeroes it out before performing the operations. Finally, [traffic type] specifies read, write, or combined read/write traffic from CPU to memory; the options discussed here are rd, wr, and rdwr. Other operations are also available, but are not discussed.

When measuring bandwidth of system memory on a multi-core or multi-threaded processor, adjusting the [y] parameter completely exercises the system memory. The amount specified by [size] is multiplied by the number of processes specified with [y]. So when measuring 1024 MB of memory using four cores, [y] should be specified as 4 and [size] as 256M. If measuring cache bandwidth, take note of the memory hierarchy. Shared cache, such as the L2 on Intel Xeon 5400 series processors, behaves similarly to system memory: so as not to go beyond the amount of cache, take care to divide the shared cache among the number of processes, remembering that twice the specified amount will be allocated. For unshared cache, the [size] parameter does not change with the number of processes, and results scale with the number of cores used. Take caution when using the taskset function here.
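The sizing rules above — bw_mem allocates twice the requested [size], and a shared region must additionally be divided among the processes — can be sketched as a small helper (hypothetical, for illustration only; the E5440 and 2 GB figures below match the worked examples in this paper):

```python
def bw_mem_size_per_process(region_bytes: int, nprocs: int, shared: bool = True) -> int:
    """Per-process [size] argument for bw_mem.

    bw_mem allocates twice what is requested, so at most half the region
    may be specified. A shared region (shared L2, system memory) is further
    divided among the measuring processes; an unshared per-core cache is not.
    """
    usable = region_bytes // 2          # utility allocates 2x the requested size
    return usable // nprocs if shared else usable

KB, MB, GB = 2 ** 10, 2 ** 20, 2 ** 30

print(bw_mem_size_per_process(32 * KB, 4, shared=False))  # -> 16384 (16 KB of L1 data per core)
print(bw_mem_size_per_process(12 * MB, 4))                # -> 1572864 (1536 KB of shared L2 per core)
print(bw_mem_size_per_process(2 * GB, 4))                 # -> 268435456 (256 MB of system memory per core)
```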
When running multi-core measurements, isolating particular cores may be useful. For example, consider measuring bandwidth on the Quad-Core Intel Xeon Processor E5440: the L1 data cache is 32 KB per core, and the 12 MB L2 cache is shared among all cores. L1 cache is divided between data and instruction caches, so each core has 32 KB (data) + 32 KB (instruction), and across four cores this totals 256 KB of L1 cache. Since each core has 32 KB of L1 data cache, and bw_mem uses twice the amount specified to take its measurements, only measure 16 KB per core. The command line for measuring L1 reads looks as follows:

./bw_mem -N 10 -P 4 16384 rd

where 16384 is 16 KB, half the amount available in L1 data cache. To measure L2 cache, the E5440 has a shared 12 MB available. With the utility using twice what is specified, the total should be at most 6 MB; distributing this between four cores gives 1.5 MB, or 1536 KB, per core. The read bandwidth command line looks as follows:

./bw_mem -N 10 -P 4 1536K rd

To measure system memory, the command line is very similar to measuring L2, because system memory is also shared. If a total of 2 GB of memory is available, half of this amount is distributed among all cores; for a quad-core processor, the amount specified would need to be 256 MB. The command line is:

./bw_mem -N 10 -P 4 256M rd

Understanding the Results

In the output of the latency measurements, the first column shows where in memory the measurement happened and the second column shows the latency in nanoseconds. The bandwidth measurements produce a similar output, but with only a single row of two columns: the first figure is the size, in MB, and the second is the bandwidth in MB/sec.

Latency Results. As seen in Figure 4, graphing this data makes the different levels of the memory hierarchy apparent. The Y-axis represents latency in ns and the X-axis represents memory depth in megabytes (MB). Since L1 and L2 cache latency is tied to the core clock, CPU frequency plays a role in how fast memory accesses happen in real time; the number of core clocks stays the same independent of the core frequency. For a comparable result, it is best to convert the latency given by LMBench from nanoseconds into CPU clocks. To do this, multiply the latency by the processor frequency:

Time (seconds) * Frequency (Hz) = Clocks of latency

Therefore, if a 2.4 GHz processor takes 17 ns to access a certain level of cache, this converts to:

17 x 10^-9 seconds * 2,400,000,000 Hz = 17 ns * 2.4 GHz ≈ 41 clocks
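The conversion above is easy to automate; the following sketch (a hypothetical helper) converts a latency reported by lat_mem_rd into core clocks:

```python
def latency_in_clocks(latency_ns: float, freq_ghz: float) -> float:
    """Clocks of latency = time (s) * frequency (Hz).
    With ns and GHz, the 1e-9 and 1e9 factors cancel."""
    return latency_ns * freq_ghz

# 17 ns on a 2.4 GHz processor is about 41 clocks.
print(round(latency_in_clocks(17, 2.4)))  # -> 41
```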

Figure 4. Graphed Memory Latency

Bandwidth Results. Bandwidth results are given in MB/sec. Note that when running write traffic, LMBench measures only the write portion and does not account for the reads-for-ownership and snoop traffic that also occur during a write. If the boundary between cache and memory regions is unknown, use the latency measurements to identify the different regions of memory.

Parameters that Affect Results. Other factors that affect these results are CPU frequency, Front Side Bus speed, number of cores, DRAM speed, number of DRAM channels, DRAM ranks, and CPU prefetchers. All of these can change the outcome of the benchmarks discussed, so when comparing results, ensure these parameters are the same. For example, using dual-ranked DIMMs often provides better utilization of the memory channel, resulting in better bandwidth. If the number of DRAM channels is increased, the workload can be distributed between them for an added increase in bandwidth as well. CPU prefetching allows the CPU to begin pulling adjacent data into local cache while it is processing previous data, effectively reducing the access time to the total data.
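When the cache and memory regions of a platform are unknown, the latency output itself can be used to locate them programmatically. The sketch below is a hypothetical parser, assuming the two-column depth/latency output format described above; it flags depths where latency jumps sharply, a rough indicator of crossing into the next level of the hierarchy:

```python
def parse_lat_mem_rd(output: str) -> list:
    """Parse lat_mem_rd output lines of the form '<depth MB> <latency ns>'."""
    points = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            points.append((float(fields[0]), float(fields[1])))
    return points

def level_boundaries(points, jump_factor=2.0):
    """Return depths (MB) where latency grows by at least jump_factor
    over the previous sample -- a rough sign of leaving a cache level."""
    return [d1 for (d0, l0), (d1, l1) in zip(points, points[1:])
            if l0 > 0 and l1 / l0 >= jump_factor]

# Synthetic example: flat within L1, a jump into L2, a jump into DRAM.
sample = "0.00391 1.2\n0.03125 1.2\n0.06250 5.1\n4.00000 5.3\n16.00000 62.0\n"
print(level_boundaries(parse_lat_mem_rd(sample)))  # -> [0.0625, 16.0]
```

The 2x jump factor is an arbitrary illustrative threshold; on real output, choose it based on the observed plateaus.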

Conclusion

Using the utilities provided by the LMBench suite allows for proper measurement of latency and bandwidth across cache and system memory. The sound methodology provided by lat_mem_rd and bw_mem measures cache and memory latency and bandwidth, and knowing these values provides a better understanding of system performance. This is useful for comparing setups between different teams or for verifying a customer setup. By confirming these values are consistent between two separate platforms, additional benchmarking can be done knowing that these major factors of performance do not skew the comparison. By converting latency measurements into CPU clocks, the speed of the processor no longer factors into system memory accesses, making the number comparable across system configurations. Because the utilities are open source and run on the Linux operating system, there is no added cost to taking these measurements. The open source model also makes it possible to recreate the benchmarks at a customer's site, between two groups in the same organization, or even within the same team on two separate platforms. There is no need for proprietary tools or Non-Disclosure Agreements to get in the way of troubleshooting a problem.

Author

Joshua Ruggiero is a Computer Systems Engineer on the Performance Measurement and Analysis Team with the Embedded and Communications Group at Intel Corporation. He has a Bachelor of Science in Computer Systems Engineering from Arizona State University.

Acronyms

CPU   Central Processing Unit
FSB   Front Side Bus
GB    Gigabyte (2^30 bytes)
GHz   Gigahertz
KB    Kilobyte (2^10 bytes)
L1    Level 1 Cache
L2    Level 2 Cache
MB    Megabyte (2^20 bytes)
MHz   Megahertz
ns    nanoseconds
SoC   System on Chip

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. This paper is for informational purposes only.

THIS DOCUMENT IS PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including liability for infringement of any proprietary rights, relating to use of information in this specification. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted herein.

Intel, the Intel logo, Intel leap ahead, Intel leap ahead logo, Quad-Core Intel Xeon Processor E5440, Intel 5100 Chipset, Xeon, EP80579, and Intel Core 2 Duo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2008 Intel Corporation. All rights reserved.