Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719
Intel Nehalem
- Review the entire term by looking at the most recent microprocessor from Intel
- Nehalem is the code name for the microarchitecture at the heart of the Core i7 and the Xeon 5500/5600 series server chips
- First released at the end of 2008
Nehalem System Example: Apple Mac Pro Desktop 2009
- Two Nehalem chips ("sockets"), each containing four processors ("cores") running at up to 2.93 GHz
- Each chip has three DRAM channels attached, each 8 bytes wide at 1.066 Gb/s (3 x 8.5 GB/s)
- Up to two DIMMs on each channel (up to 4 GB/DIMM)
- QuickPath point-to-point system interconnect between CPUs and I/O; up to 25.6 GB/s per link
- PCI Express connections for graphics cards and other extension boards; up to 8 GB/s per slot
- Disk drives attached with 3 Gb/s Serial ATA links
- Slower peripherals (Ethernet, USB, FireWire, WiFi, Bluetooth, audio)
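The bandwidth figures on this slide follow directly from the width and transfer rate of each link; a quick back-of-the-envelope check (assuming decimal GB and rates in gigatransfers/s):

```python
# Back-of-the-envelope check of the bandwidth figures quoted above.

def link_bw_gbs(gtransfers_per_s, bytes_per_transfer):
    """Peak bandwidth of one link/channel in GB/s."""
    return gtransfers_per_s * bytes_per_transfer

# One DDR3-1066 channel: 8 bytes wide at 1.066 GT/s
dram_channel = link_bw_gbs(1.066, 8)      # ~8.5 GB/s
dram_socket = 3 * dram_channel            # three channels per socket, ~25.6 GB/s

# One QPI link: ~2 data bytes per direction at 6.4 GT/s, both directions
qpi_link = link_bw_gbs(6.4, 2) * 2        # ~25.6 GB/s

print(f"DRAM per channel: {dram_channel:.1f} GB/s")
print(f"DRAM per socket:  {dram_socket:.1f} GB/s")
print(f"QPI per link:     {qpi_link:.1f} GB/s")
```

Note the coincidence the slide relies on: one fully populated three-channel memory controller and one QPI link both peak near 25.6 GB/s.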
Building Blocks to Support a Family of Processors

Nehalem Die Photo

Nehalem-EX (8-core)

MEMORY SYSTEM
Nehalem Memory Hierarchy Overview
- Private L1/L2 per core: each CPU core has a 32KB L1 I$, a 32KB L1 D$, and a 256KB L2$ (4-8 cores per chip)
- 8MB-24MB shared L3$; L3 is fully inclusive of the higher levels (but L2 is not inclusive of L1)
- 1-2 DDR3 DRAM memory controllers; each DRAM channel is 64/72 bits wide at up to 1.33 Gb/s
- Local memory access latency ~60 ns
- QuickPath system interconnect: caches in other sockets are kept coherent using QuickPath messages; each direction is 20 bits at 6.4 Gb/s
Cache Hierarchy Latencies
- L1: 32KB, 8-way, latency 4 cycles
- L2: 256KB, 8-way, latency <12 cycles
- L3: 8MB shared, 16-way, latency 30-40 cycles (4-core system)
- L3: 24MB shared, 24-way, latency 30-60 cycles (8-core Nehalem-EX)
- DRAM: latency ~180-200 cycles
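These per-level latencies combine into an average memory access time (AMAT) once miss rates are known. A minimal sketch, using the cycle counts above with purely hypothetical miss rates for illustration:

```python
# AMAT = L1 + m1*(L2 + m2*(L3 + m3*DRAM)), computed from the innermost
# level outward. The miss rates below are hypothetical examples.

def amat(latencies, miss_rates):
    """latencies: hit latency in cycles per level, last entry is memory.
    miss_rates: miss rate at each cache level (one fewer than latencies)."""
    t = latencies[-1]                      # memory always "hits"
    for lat, miss in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        t = lat + miss * t                 # fold in the next level up
    return t

# L1=4, L2=12, L3=40, DRAM=200 cycles; 5%/20%/10% miss rates (made up)
print(amat([4, 12, 40, 200], [0.05, 0.20, 0.10]))   # 5.2 cycles
```

Even with a 200-cycle DRAM, the multiplicative miss rates keep the average close to the L1 latency, which is the whole point of the hierarchy.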
Nehalem Virtual Memory Details
- Implements a 48-bit virtual address space and a 40-bit physical address space
- Two-level TLB:
- I-TLB (L1) has 128 shared entries, 4-way associative, for 4KB pages, plus 7 dedicated fully-associative entries per SMT thread for large-page (2/4MB) entries
- D-TLB (L1) has 64 entries for 4KB pages and 32 entries for 2/4MB pages, both 4-way associative, dynamically shared between SMT threads
- Unified L2 TLB has 512 entries for 4KB pages only, also 4-way associative
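The practical consequence of these entry counts is the TLB "reach": how much address space can be touched without a TLB miss. A small sketch using the L1 D-TLB figures above:

```python
# TLB reach = entries x page size, for the Nehalem L1 D-TLB figures above.

KB, MB = 1024, 1024 * 1024

def tlb_reach(entries, page_bytes):
    return entries * page_bytes

small = tlb_reach(64, 4 * KB)   # 64 entries x 4KB pages  = 256 KB
large = tlb_reach(32, 2 * MB)   # 32 entries x 2MB pages  = 64 MB

print(f"4KB-page D-TLB reach: {small // KB} KB")
print(f"2MB-page D-TLB reach: {large // MB} MB")
```

This is why large pages matter: half as many entries cover 256x more memory, which is the motivation for the dedicated large-page entries.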
Cache Coherency
- On-chip snoop-based cache coherence
- MESIF protocol: (M)odified, (E)xclusive, (S)hared, (I)nvalid, (F)orward
- Ring as the interconnection network; up to 250 GB/s
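The F state is what distinguishes MESIF from plain MESI: exactly one sharer is designated the Forwarder and answers read requests, avoiding either a memory access or multiple caches responding. A highly simplified sketch of that idea (one cache line, read misses only; not the real QPI protocol, which also handles writes, writebacks, and races):

```python
# Toy MESIF sketch: track one line's state in N caches and show how the
# single Forward (F) copy moves to the most recent requester on read misses.

M, E, S, I, F = "MESIF"

def read_miss(states, requester):
    """Handle a read miss by cache `requester`. Any M/E/F holder supplies
    the data and demotes to S (M's writeback is elided here); the requester
    becomes the new Forwarder, or Exclusive if no one had a copy."""
    had_copy = any(st != I for st in states)
    for i, st in enumerate(states):
        if st in (M, E, F):
            states[i] = S            # former owner/forwarder demotes to Shared
    states[requester] = F if had_copy else E
    return states

caches = [E, I, I, I]                # cache 0 holds the line exclusively
read_miss(caches, 2)                 # cache 2 reads: 0 -> S, 2 -> F
print(caches)                        # ['S', 'I', 'F', 'I']
```

Note that after a second read miss by another cache, F migrates again; the newest sharer is always the responder, which keeps the forwarded copy in a recently-active cache.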
All Sockets Can Access All Data
- Local memory access ~60 ns; remote (other-socket) access ~100 ns
Remote Cache Requests
- Directory-like scheme to keep a coherent shared-memory view
- The router can connect up to 4 QPI links
- Complex topologies are possible, e.g., a twisted hypercube

Large-Scale Systems
- Intel architecture capable of QPI-connected 8 sockets / 128 threads
- Max 2 QPI hops between any two sockets
CORE ARCHITECTURE

Core Area Breakdown
Front-End: Instruction Fetch & Decode
- Translates x86 instruction bits into internal µop bits
Front End I: Branch Prediction
- Part of the instruction fetch unit
- Several different types of branch predictor; details not public
- Loop count predictor: how many backwards-taken branches before loop exit (also a predictor for the length of microcode loops, e.g., string move)
- Return Stack Buffer: holds subroutine return targets; the stack buffer is renamed so that it is repaired after mispredicted returns
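The Return Stack Buffer is conceptually just a small circular stack of predicted return addresses. A minimal sketch (the renaming/checkpointing of the top-of-stack pointer that the slide mentions is only noted in a comment, not modeled):

```python
# Minimal Return Stack Buffer (RSB) sketch: a small circular stack of
# predicted return addresses. Real hardware also checkpoints/renames the
# top-of-stack pointer so mispredicted returns can be repaired.

class ReturnStackBuffer:
    def __init__(self, size=16):
        self.size = size
        self.stack = [0] * size
        self.top = 0                     # renamed/checkpointed in real HW

    def push(self, return_addr):         # on a predicted 'call'
        self.stack[self.top % self.size] = return_addr
        self.top += 1

    def pop(self):                       # on a predicted 'ret'
        self.top -= 1
        return self.stack[self.top % self.size]

rsb = ReturnStackBuffer()
rsb.push(0x401005)                       # call f: return to 0x401005
rsb.push(0x402010)                       # nested call g
print(hex(rsb.pop()))                    # 0x402010
print(hex(rsb.pop()))                    # 0x401005
```

Because calls and returns pair up, this tiny structure predicts return targets far better than a generic branch target buffer would.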
Front End II: x86 Decoding
- Translates up to 4 x86 instructions into µops each cycle
- Only the first x86 instruction in a group can be complex (maps to 1-4 µops); the rest must be simple (map to one µop each)
- Even more complex instructions jump into the microcode engine, which emits a stream of µops
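The grouping rule above (one complex decoder plus simple decoders) can be sketched as a simple partitioning: a complex instruction can only lead a group, so hitting one mid-group ends the current decode cycle. This is an illustrative model of the stated constraint, not Intel's actual decoder logic:

```python
# Sketch of the 4-wide decode constraint: per cycle, one "complex" x86
# instruction (1-4 uops) may lead, followed by up to three "simple" ones
# (1 uop each); a complex instruction in a later slot starts a new group.

def decode_groups(instrs):
    """instrs: list of uop counts per x86 instruction (microcoded
    instructions ignored here). Returns the group decoded each cycle."""
    groups, current = [], []
    for uops in instrs:
        is_complex = uops > 1
        if len(current) == 4 or (is_complex and current):
            groups.append(current)       # group full, or complex not in slot 1
            current = []
        current.append(uops)
    if current:
        groups.append(current)
    return groups

# complex(3), simple, simple, simple, complex(2), simple
print(decode_groups([3, 1, 1, 1, 2, 1]))   # [[3, 1, 1, 1], [2, 1]]
```

This is why compilers targeting such decoders try to schedule complex instructions first in each aligned group: two adjacent complex instructions cost two decode cycles.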
Back End: Out-of-Order Execution Engine
- Renaming happens at the µop level (not on the original macro-x86 instructions)
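Renaming maps each µop's architectural registers to fresh physical registers, eliminating write-after-write and write-after-read hazards. A minimal register-alias-table sketch (illustrative only; Nehalem's actual implementation uses a ROB-based scheme with more machinery):

```python
# Register renaming sketch at the uop level: each write to an architectural
# register gets a fresh physical register via a register alias table (RAT).

def rename(uops, num_arch_regs=16):
    """uops: list of (dest, src1, src2) architectural register numbers.
    Returns the same uops with physical register numbers."""
    rat = list(range(num_arch_regs))     # arch reg -> current physical reg
    next_phys = num_arch_regs
    out = []
    for dest, s1, s2 in uops:
        p1, p2 = rat[s1], rat[s2]        # read source mappings first
        rat[dest] = next_phys            # fresh physical reg for the dest
        out.append((next_phys, p1, p2))
        next_phys += 1
    return out

# r1 = r2+r3 ; r1 = r4+r5 ; r6 = r1+r1 -- the two writes to r1 get distinct
# physical registers, so the first two uops can execute in either order.
print(rename([(1, 2, 3), (1, 4, 5), (6, 1, 1)]))
```

Only the true data dependence survives: the third µop reads the physical register written by the second, exactly the dataflow order the out-of-order engine needs.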
Load-Store Queues (LSQ)
- Load queue: 48 entries
- Store queue: 32 entries
- Intelligent load scheduling
- Up to 16 outstanding misses in flight
Pipeline Overview
- In-order fetch (with a Return Address Stack), in-order decode and register renaming
- Out-of-order execution and out-of-order completion; in-order retire
- 2 SMT threads per core
SMT with OoO Execution Core
- Reorder buffer (remembers program order and exception status for in-order commit) has 128 entries, divided statically and equally between both SMT threads
- ICOUNT fetch policy (a guess; details not public; cf. BRCOUNT, MISSCOUNT, etc.)
- Reservation stations (instructions waiting for operands before execution) have 36 entries, competitively shared by the SMT threads
- Load & store queues are statically and equally partitioned between both SMT threads
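An ICOUNT-style fetch policy, as guessed above, picks each cycle the thread with the fewest instructions in the front end, keeping neither thread from monopolizing the shared queues. A one-line sketch of the selection rule:

```python
# ICOUNT-style SMT fetch selection: fetch for the thread with the fewest
# instructions currently in the pre-execute pipeline stages.

def icount_pick(in_flight):
    """in_flight: per-thread count of uops in the front end / queues.
    Returns the thread id to fetch from this cycle."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

print(icount_pick([20, 5]))    # thread 1 is behind, so fetch for it: 1
print(icount_pick([3, 30]))    # 0
```

The effect is self-balancing: a thread stalled on a cache miss accumulates in-flight µops and automatically stops receiving fetch slots until it drains.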
Final
"Any sufficiently advanced technology is indistinguishable from magic" - Arthur C. Clarke