System Level Benchmarking Analysis of the Cortex-A9 MPCore


Roberto Mijat, Software Solutions Architect, ARM
This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development.

Agenda. Introduction to the ARM Cortex-A9 MPCore processor; ARM Cortex-A9 MPCore RealView Development Platforms; analysis of benchmark results (memory bandwidth, memory latency, multiple outstanding transactions, context switching, cache-to-cache transfers); understanding the implications of system-wide benchmarks in order to design an optimal multi-core system.

Acknowledgements. ICT-eMuCo: Embedded multi-core processing for mobile communication systems. Research program sponsored by the EU; participants are industrial partners (incl. ARM) and top university. From ARM: Anirban Lahiri, Technology Researcher; John Goodacre, Director of Program Management, Processor Division; Adya Shrotriya, Intern; Nicolas Zea, Intern.

The Cortex-A9 MPCore Processor. [Block diagram: four Cortex-A9 CPUs, each with NEON/VFP and its own I$ and D$; Generic Interrupt Controller handling interrupt distribution and external interrupts; Snoop Control Unit with ACP for external un-cached AXI masters; AXI interface(s) to the L2 cache controller and unified L2 cache; AXI interface(s) to external memory.] Key properties: homogeneous; configurable from 1 to 4 cores; supports SMP/AMP/BMP; cache-coherent shared memory; distributed GIC; scalable.

The Cortex-A9 MPCore Processor. [Same block diagram, with additional callouts.] ARMv7 architecture; Accelerator Coherence Port (ACP); address filtering; NEON multimedia SIMD engine.

Cortex-A9: Technology Leadership. Superscalar, out-of-order instruction execution. Pre-fetching of up to 4 instruction cache lines. Decodes up to two full instructions per cycle and dispatches up to four. Register renaming for speculative execution and loop unrolling. FPU/NEON. Counters for performance monitoring, and PTM (Program Trace Macrocell). Recently announced hard macro: 40nm G (TSMC) implementation targeting 2GHz.

ARM Versatile-PBX Cortex-A9 Platform. Dual-core ARM Cortex-A9 on a structured ASIC. CPU @ 70 to 140MHz. 1 NEON and 1 VFP. Fast memory system: 32KB I&D L1 caches, 128KB L2 cache, 1GB RAM. Ethernet, USB, Flash, PrimeCells. Same peripheral memory map as PB/11MPCore and PB/A8. File system on Compact Flash.

ARM Versatile Express Cortex-A9 Platform. Motherboard Express µATX: support for two Express daughterboards (processor or FPGA); backwards-compatible peripherals (PB11MPCore, PB-A8, PBX-A9 & EB); Ethernet, USB, Flash, PrimeCells, DVI/HDMI. CoreTile Express 4xA9: quad-core ARM Cortex-A9 silicon test chip; 4 NEON/VFP; CPU @ ~400MHz, Core:Bus ~2:1; 32KB I&D L1 caches; 512KB L2 cache; 1GB 32-bit DDR @ 500Mbps.

Software Framework for Benchmarks. Linux kernel 2.6.28 from kernel.org; pre-built images, boot-loaders, patches, file systems, etc. are available from the ARM website. Debian 5.0 (Lenny) Linux file system, compiled for v4t.

lmbench3 benchmark. A comprehensive system benchmark that includes micro-benchmarks focusing on bandwidth, latency, context switching and other measurements (system info, diagnostics, etc.), plus the STREAM benchmark. Version 3 provides infrastructure to measure the scalability of multi-processor systems: concurrent execution, accurate timing infrastructure, and the ability to break out of the boundaries of the L1 cache subsystem. For the purpose of this presentation we'll only look at a small subset of these benchmarks.

Memory Bandwidth: PBX-A9. [Plots: single instance; 2 instances.] Consider a 2-core platform. Knees in the curves indicate cache sizes (small [128KB] L2 RAM for PBX-A9). Effective memory bandwidth increases for multicore (2 cores): cache bandwidth doubles and DDR2 memory bandwidth doubles. Agnostic to alignment. Note: pre-fetching disabled for normalization.
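The working-set sweep behind curves like these can be sketched as follows. This is a minimal illustration, not lmbench's own bandwidth test: the function names `copy_bandwidth_mb_s` and `sweep` are hypothetical, and the copy is done with Python's bulk bytearray assignment rather than a tuned C loop. The idea it shows is the one on the slide: bandwidth is measured for growing buffer sizes, and the points where it drops mark the cache sizes.

```python
import time


def copy_bandwidth_mb_s(size_bytes, min_seconds=0.05):
    """Return the copy bandwidth in MB/s for one working-set size."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    copied = 0
    start = time.perf_counter()
    while True:
        dst[:] = src  # bulk copy of the whole buffer
        copied += size_bytes
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    return copied / elapsed / 1e6


def sweep(sizes):
    """Measure copy bandwidth for each working-set size (in bytes)."""
    return {size: copy_bandwidth_mb_s(size) for size in sizes}


if __name__ == "__main__":
    # 4 KB to 16 MB; drops in bandwidth suggest a cache boundary.
    for size, mb_s in sweep([1 << k for k in range(12, 25, 2)]).items():
        print(f"{size >> 10:8d} KB  {mb_s:10.1f} MB/s")
```

Running several copies of such a script at once, one per core, gives a rough analogue of the multi-instance runs described above.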

Memory Bandwidth: V2-A9. [Plots: single instance; 4 instances.] Consider a 4-core platform running 4 concurrent benchmarks (instead of 2), also at 4 times the frequency of the PBX-A9. Bandwidth shows good 4-core scalability: effective memory bandwidth increases for higher parallel load. L1 cache bandwidth becomes 4 times higher, while DDR2 memory bandwidth only doubles. In the single-instance case, writes (WR) benefit more from out-of-order execution (OO), the write buffer and outstanding transactions. Note: pre-fetching disabled for normalization.

Example: Misconfigured System! Write bandwidth is greatly affected if caches are configured as write-through. Remember to configure caches as write-back, with allocate-on-write.
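Why the write policy matters so much can be seen in a toy model. The sketch below is an assumption-laden illustration, not a model of the Cortex-A9 caches: it uses a hypothetical direct-mapped cache (`memory_writes`, 32-byte lines, 1024 lines) and simply counts how many write transactions reach memory for a stream of stores under each policy.

```python
def memory_writes(addresses, policy, line_size=32, num_lines=1024):
    """Count write transactions that reach memory for a stream of stores."""
    if policy not in ("write-through", "write-back"):
        raise ValueError(f"unknown policy: {policy}")
    if policy == "write-through":
        # Every store is forwarded to memory, hit or miss.
        return len(addresses)

    # Write-back with allocate-on-write, direct-mapped.
    tags = [None] * num_lines
    dirty = [False] * num_lines
    writes = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_lines
        if tags[index] != line:
            if dirty[index]:
                writes += 1  # evicted dirty line is written back
            tags[index] = line
        dirty[index] = True  # the store only dirties the cached line
    writes += sum(dirty)  # final flush of the remaining dirty lines
    return writes


if __name__ == "__main__":
    # Two passes of 4-byte stores over a 16 KB buffer.
    stores = list(range(0, 16384, 4)) * 2
    for policy in ("write-through", "write-back"):
        print(policy, memory_writes(stores, policy))
```

In this model the write-back cache turns thousands of individual stores into a few hundred line-sized writes, which is the effect the slide warns about losing.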

Bandwidth-Latency Relation. Latency determines the response time for applications on a multicore. Applications requiring short bursts of memory accesses can run concurrently with bandwidth-heavy applications without any observable degradation, if latency remains constant. [Figure labels: Internet Browser; Video / Image Processing; Core0; Core1.]

Memory Latency: PBX-A9. Small but visible L2. Similar latencies for single (S) and two (M) instances of LMBench running concurrently: memory latency is almost unaffected by the presence of multiple (2) cores. The 32-byte cache line acts as a pre-fetch for 16-byte strides. Cortex-A9 supports prefetching for both forward and backward striding; it is disabled in these tests for result normalization. Backward striding is less common in real-life applications. LMBench tries to use backward striding to defeat prefetching.
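The remark that a 32-byte line acts as a pre-fetch for 16-byte strides is simple arithmetic, sketched below. The helper `miss_rate` is hypothetical and assumes a cold cache with no evictions; it only counts how many accesses touch a line that has not been touched before.

```python
def miss_rate(stride, line_size=32, accesses=1024):
    """Fraction of strided accesses that touch a not-yet-seen cache line."""
    seen = set()
    misses = 0
    for i in range(accesses):
        line = (i * stride) // line_size
        if line not in seen:
            seen.add(line)
            misses += 1
    return misses / accesses


if __name__ == "__main__":
    for stride in (8, 16, 32, 64):
        print(f"stride {stride:3d} bytes: miss rate {miss_rate(stride):.2f}")
```

With a 16-byte stride every second access lands in a line that the previous access already fetched, so half the accesses see cache latency even with hardware prefetching disabled.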

Memory Latency: V2. [Plots: single instance; 4 instances.] With 4 instances of LMBench running (4 times the application load), memory latency goes up by only about 20%. The 32-byte cache line acts as a pre-fetch for 16-byte strides. An application on one CPU is mostly unaffected by execution on other CPUs, within the limits of memory bandwidth to DDR memory.

STREAM Benchmarks: PBX-A9. Bandwidth almost doubles for multiple (2) instances compared to the execution of a single instance. The corresponding penalty on latency is marginal. Good for streaming, data-intensive applications.

L2 Latency Configuration. The PL310 allows configuring the latencies for the L2 cache data and tag RAMs. Optimization: find the minimal latency value for which the system still works. The difference in performance can be double or more. Remember that DDR memory controllers (PL34x) have similar settings. [Plot axis: additional latency (cycles).]

Memory Load Parallelism. Indicates the number of possible outstanding reads. Memory system design determines the ability of the processor to hide memory latency. Support for a number of outstanding reads/writes is essential for multicores, and is fully supported by PL310 / PL34x. L1 supports 4 linefill requests on average, while the implemented DDR2 memory system supports 2. Systems should support as much memory parallelization as possible.
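The link between outstanding requests and achievable bandwidth follows from Little's law: sustained bandwidth is roughly the number of requests in flight times the line size, divided by the latency. The sketch below is illustrative only; the function name is hypothetical and the 100 ns latency is an assumed round number, not a measurement from these platforms.

```python
def sustained_bandwidth_mb_s(outstanding, line_bytes, latency_ns):
    """Little's law estimate: bytes in flight divided by memory latency."""
    bytes_per_ns = outstanding * line_bytes / latency_ns
    return bytes_per_ns * 1e3  # 1 byte/ns equals 1000 MB/s


if __name__ == "__main__":
    ASSUMED_LATENCY_NS = 100.0  # hypothetical value for illustration
    for outstanding in (1, 2, 4):
        mb_s = sustained_bandwidth_mb_s(outstanding, 32, ASSUMED_LATENCY_NS)
        print(f"{outstanding} outstanding linefills: {mb_s:.0f} MB/s")
```

Under this estimate, a memory system that accepts 2 outstanding linefills caps a core that could issue 4 at half the bandwidth it could otherwise sustain.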

Context Switch Time: PBX-A9. Context switch time is defined here as the time needed to save the state of one process and restore the state of another process. When all the processes fit in the cache, the context switch time remains relatively low. Beyond this it approaches a saturation level determined by the available main memory bandwidth. Keeping the number of active processes on a processor low vastly improves the response time.
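A common way to estimate context switch cost is to pass a token between processes over pipes, so that each hand-off forces the kernel to switch to the other process. The sketch below shows that idea with two processes; it is a simplified stand-in, not lmbench's implementation. The name `pingpong_round_trip_us` is hypothetical, the code assumes a POSIX system (`os.fork`), and the result includes pipe overhead. On a multi-core machine the two processes may run on different cores unless they are pinned to one.

```python
import os
import time


def pingpong_round_trip_us(rounds=2000):
    """Average microseconds for a token to go parent -> child -> parent."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: echo each token straight back to the parent.
        try:
            os.close(to_child_w)
            os.close(to_parent_r)
            for _ in range(rounds):
                token = os.read(to_child_r, 1)
                os.write(to_parent_w, token)
        finally:
            os._exit(0)  # leave without running the parent's cleanup code

    # Parent: send a token and block until it comes back.
    os.close(to_child_r)
    os.close(to_parent_w)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(to_child_w, b"x")
        os.read(to_parent_r, 1)
    elapsed = time.perf_counter() - start
    os.close(to_child_w)
    os.close(to_parent_r)
    os.waitpid(pid, 0)
    return elapsed / rounds * 1e6


if __name__ == "__main__":
    print(f"round trip: {pingpong_round_trip_us():.1f} us")
```

To see the cache effect the slide describes, each process would also have to touch a working set of a chosen size between hand-offs; that part is left out here.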

Context Switching Time: PBX-A9. [Plots: single instance; 2 instances.] Peak context switch time increases by a small fraction (< 20%). This indicates that context switches on separate processors are almost mutually orthogonal, which enables the MPCore to support more active tasks than a single-core time-sliced processor before the system becomes unresponsive.

A memory system optimized for MP. In a MESI-compliant SMP system, every cache line is marked with one of the following four states. MODIFIED: coherent cache line is not up to date with main memory. EXCLUSIVE: up to date and no other copies exist. SHARED: coherent cache line which is up to date with main memory. INVALID: this coherent cache line is not present in the cache. ARM MPCore processors implement optimizations to the MESI protocol. DUPLICATED TAG RAMs: stored in the Snoop Control Unit for quicker access, so that checking whether requested data is in other CPUs' caches is performed without accessing them. DIRECT DATA INTERVENTION (cache-2-cache transfer): copy clean data from one CPU cache to another. MIGRATORY LINES: move dirty data from one CPU to another and skip the MESI shared state, which avoids writing to L2/L3 and reading the data back from external memory.

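[Diagram: Modified to Shared. One CPU reads a cache line that the other CPU holds as Modified. The holder's copy goes from M to S and the reader's from I to S; the steps shown are request, writeback to memory, and linefill from memory.]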

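[Diagram: Exclusive to Shared. One CPU reads a cache line that the other CPU holds as Exclusive. The holder's copy goes from E to S and the reader's from I to S; the steps shown are request and linefill from memory.]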
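The two transitions above, and the effect of the MPCore optimizations on them, can be put side by side in a toy model. This is a sketch built only from the slide text, not a description of the Snoop Control Unit's actual logic; the function `remote_read` and its flags are hypothetical names.

```python
def remote_read(owner_state, direct_intervention=False, migratory=False):
    """Model one CPU reading a line whose state in another CPU is given.

    Returns (owner_state_after, reader_state_after, memory_operations).
    """
    if owner_state == "M":
        if migratory:
            # Dirty line moves to the reader; the shared state is skipped.
            return "I", "M", []
        # Plain MESI: write the dirty line back, then fill from memory.
        return "S", "S", ["writeback", "linefill"]
    if owner_state in ("E", "S"):
        if direct_intervention:
            # Clean data is copied cache to cache.
            return "S", "S", []
        return "S", "S", ["linefill"]
    if owner_state == "I":
        # No other copy exists: fill from memory, reader holds it alone.
        return "I", "E", ["linefill"]
    raise ValueError(f"unknown MESI state: {owner_state}")


if __name__ == "__main__":
    print(remote_read("M"))
    print(remote_read("M", migratory=True))
    print(remote_read("E"))
    print(remote_read("E", direct_intervention=True))
```

The cases with an empty operation list are the ones in which the line never leaves the cache subsystem.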

Cache to Cache Latency. Significant benefits are achievable if the working set of the application, partitioned between the cores, can be contained within the sum of their caches. Helpful for streaming data between cores; may be used in conjunction with interrupts between cores. Though dirty lines have higher latency, they still have a 50% performance benefit.

Summary and Recommendations. Memory bandwidth may be increased by: increasing the size of cache lines; increasing the width of the memory bus/interface; interleaving accesses (and buffering reads/writes). Memory latency can be improved by: shortening paths; increasing the rate of successful pre-fetching; understanding the current latencies in a system. Spreading memory-intensive and CPU-intensive applications over the multiple cores provides better performance, subject to the bandwidth and latencies of the memory system; the ARM MPCore architecture mitigates migration overheads. Running multiple memory-intensive applications on a single CPU can be detrimental due to cache conflicts.