OC By Arsene Fansi T. POLIMI 2008 1




IBM POWER6 MICROPROCESSOR
OC. By Arsene Fansi T., POLIMI 2008

WHAT IS THE IBM POWER6 MICROPROCESSOR?
The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems.
It is based on IBM POWER5 microprocessor technology (SMT, dual core), with extensions that increase performance.
Its core is fabricated in 65-nm silicon-on-insulator (SOI) technology and operates at frequencies of more than 4 GHz.
The microprocessor is a 13-FO4 design containing more than 790 million transistors, 1,953 signal I/Os, and more than 4.5 km of wire on ten copper metal layers.

ARCHITECTURE OF THE POWER6 MICROPROCESSOR
The POWER6 chip operates at twice the frequency of the POWER5.
In place of speculative out-of-order execution, which requires costly renaming circuitry, the design concentrates on providing data prefetch.
Limited out-of-order execution is implemented for FP instructions.
Improved dispatch and completion: up to 7 instructions from both threads simultaneously.
Better SMT speedup due to increased cache size.
Designed to consume less power.

THE PROCESSOR CORE
Structured as a pipeline.
Developed to minimize the logic content of the pipeline stages.
Introduces decimal arithmetic as well as vector multimedia arithmetic.
Implements checkpoint retry and processor sparing.
Instruction fetching and branch handling are performed in the instruction fetch pipe.
Instructions from the L2 cache are decoded in pre-decode stages P1 through P4 before they are written into the L1 I-cache.
Branch prediction is performed using a branch history table (BHT) whose entries contain 2 bits to indicate the direction of the branch.
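The slide only says that each BHT entry holds 2 bits; it does not say how those bits are used in POWER6. A common textbook scheme, assumed here purely for illustration, treats the 2 bits as a saturating counter. The table size, the indexing by program counter, and the names `BranchHistoryTable` and `simulate` are hypothetical.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters (textbook scheme, not POWER6-specific)."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values: 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc):
        # Assumes 4-byte instructions, so the two low address bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def simulate(outcomes, pc=0x1000):
    """Return how many outcomes of a single branch were predicted correctly."""
    bht = BranchHistoryTable()
    correct = 0
    for taken in outcomes:
        if bht.predict(pc) == taken:
            correct += 1
        bht.update(pc, taken)
    return correct


# A loop branch taken nine times, then falling through once.
print(simulate([True] * 9 + [False]))
```

With this scheme a loop-closing branch is mispredicted only while the counter warms up and once at loop exit, which is why two bits are usually preferred over one.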

THE PROCESSOR PIPELINE

USAGE OF THE PIPELINE
Branch and logical condition instructions are executed in the branch and conditional pipeline.
FX (fixed-point) instructions are executed in the FX pipeline, load/store instructions in the load pipeline, FP instructions in the FP pipeline, and decimal and vector multimedia extension instructions in the decimal and vector multimedia execution unit.
Data generated by the execution units is staged through the checkpoint recovery (CR) pipeline and saved in an error-correction code (ECC)-protected buffer for recovery.
The FX unit (FXU, fixed-point unit) is designed to execute dependent instructions back to back.

USAGE OF THE PIPELINE
Instruction fetching and branch handling: a dedicated 64-KB four-way set-associative L1 I-cache, with fast address translation. The POWER6 processor also recodes some instructions in the pre-decode stages to help optimize the implementation of later pipeline stages.
Instruction sequencing: handled by the IDU. For high dispatch bandwidth, the IDU employs two parallel instruction dataflow paths, one for each thread. Both threads can be dispatched simultaneously.
FX instruction execution: the core implements two FXUs to handle FX instructions and generate addresses for the LSUs. The signature feature of these FXUs is that they support back-to-back execution of dependent instructions, with no intervening cycles required to bypass the data to the dependent instruction.
Binary FP instruction execution: the core includes two BFUs, essentially mirrored copies, whose register files sit next to each other to reduce wiring. In general, the POWER6 processor is an in-order machine, but BFU instructions can execute slightly out of order because some FP instructions (divide, square root) leave multiple empty slots in the pipeline. The BFU notifies the IDU when these slots will occur, and the IDU can dispatch into the middle of them.
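As a rough illustration of what "64-KB four-way set-associative" implies, the sketch below splits an address into tag, set index, and line offset. The slide does not give the cache line size; the 128-byte value is an assumption made for this example, and `split_address` is a hypothetical helper name.

```python
CACHE_BYTES = 64 * 1024   # from the slide: 64-KB L1 I-cache
WAYS = 4                  # from the slide: four-way set-associative
LINE_BYTES = 128          # assumption: line size is not stated on the slide

# Each set holds WAYS lines, so the number of sets follows directly.
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2 of a power of two
INDEX_BITS = NUM_SETS.bit_length() - 1


def split_address(addr):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(NUM_SETS, OFFSET_BITS, INDEX_BITS)
print(split_address(0x12345))
```

Under the assumed line size, a lookup selects one of 128 sets with the index bits and then compares the tag against the four lines stored in that set.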

USAGE OF THE PIPELINE
Data fetching: performed by the LSU. The LSU contains two load/store execution pipelines, each able to execute a load or store operation in every cycle. The LSU contains several subunits: load/store address generation and execution, the L1 D-cache array with its supporting set-predict and directory arrays, address translation, the store queue, the load miss queue (LMQ), and the data prefetch engine.
Accelerator: the POWER6 core implements a vector unit to support the PowerPC VMX instruction set architecture (ISA) and a decimal execution unit to support the decimal ISA.
Simultaneous multithreading: each core implements two-thread SMT. In addition to the software thread priority, hardware automatically lowers the priority of a thread when it detects that the thread is stalled.
Cache hierarchy: three levels of cache.
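The interaction between software-set priority and the hardware's automatic lowering for a stalled thread can be pictured with a toy model. Everything numeric here is an assumption: the slide does not say how much the priority drops or how priority maps to dispatch bandwidth, so `stall_penalty`, `floor`, and the proportional split in `dispatch_shares` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    software_priority: int   # set by software; higher means a larger share
    stalled: bool = False    # set by (simulated) hardware stall detection


def effective_priority(thread, stall_penalty=2, floor=1):
    """Priority after the hardware adjustment (penalty value is hypothetical)."""
    if thread.stalled:
        return max(floor, thread.software_priority - stall_penalty)
    return thread.software_priority


def dispatch_shares(t0, t1):
    """Toy model: split dispatch bandwidth in proportion to effective priority."""
    p0, p1 = effective_priority(t0), effective_priority(t1)
    total = p0 + p1
    return p0 / total, p1 / total


a = Thread("A", software_priority=4)
b = Thread("B", software_priority=4)
print(dispatch_shares(a, b))   # equal priorities, equal shares

b.stalled = True               # e.g. thread B is waiting on a cache miss
print(dispatch_shares(a, b))   # thread A now receives the larger share
```

The point of the mechanism is that a stalled thread cannot use dispatch slots productively, so handing them to the other thread raises overall throughput.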

MEMORY & I/O SUBSYSTEMS
MEMORY: each POWER6 chip includes two integrated memory controllers, each of which drives up to 4 parallel channels. A channel supports a 2-byte read data path, a 1-byte write data path, and a command path that operates at four times the DRAM frequency.
Each memory controller is divided into two regions that operate at different frequencies: the asynchronous region (four times the frequency of the attached DRAM) and the synchronous region (half of the core frequency).
Memory is protected by SECDED ECCs. Scrubbing is employed to find and correct soft, correctable errors.
I/O FEATURES: the I/O controller is the same as for POWER5. A pipelined high-throughput I/O mode was added, in which DMA write operations initiated by the I/O controller are speculatively pipelined. This ensures that, in the largest systems, inbound I/O throughput is not limited by the tenure of the coherence phase of the DMA write operations.
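SECDED means single-error correction, double-error detection. The slide does not describe the code POWER6 actually uses, which protects much wider words; the sketch below uses the smallest classic example, an extended Hamming(8,4) code, only to show the principle. The function names `encode` and `decode` are chosen for this example.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into an 8-bit SECDED codeword.

    Index 0 holds the overall parity bit; indices 1-7 are the Hamming(7,4)
    positions, with parity bits at positions 1, 2, and 4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    word = [0, p1, p2, d1, p4, d2, d3, d4]
    word[0] = sum(word) % 2  # overall parity makes the total number of ones even
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos          # XOR of the positions that hold a one
    overall = sum(word) % 2          # 0 if total parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: exactly one bit flipped, and the syndrome is its position
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    return [word[3], word[5], word[6], word[7]], status


codeword = encode([1, 0, 1, 1])
damaged = list(codeword)
damaged[5] ^= 1                      # inject a single-bit soft error
print(decode(damaged))
```

Scrubbing, as mentioned on the slide, amounts to periodically reading memory through such a decoder and writing corrected words back, so that single-bit errors are repaired before a second error can accumulate in the same word.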

SMP INTERCONNECTION (Symmetric Multiprocessing)
Relying on the traffic reduction, coherence and data traffic share the same physical links by using a time-division-multiplexing (TDM) approach.
The ring-based topology is ideal for facilitating a non-blocking broadcast coherence-transport mechanism, since it involves every node in the operation of all the others.
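Time-division multiplexing can be sketched as a fixed, repeating slot pattern in which each slot belongs to one traffic class. The slide does not give the real slot allocation, so the one-coherence-to-two-data pattern and the function name `tdm_schedule` are assumptions for illustration.

```python
from collections import deque


def tdm_schedule(coherence, data, pattern=("C", "D", "D")):
    """Interleave two traffic classes on one link using a fixed slot pattern.

    'C' slots carry coherence packets and 'D' slots carry data packets.
    A slot whose class has nothing queued stays idle (None), which is what
    distinguishes strict TDM from work-conserving arbitration.
    """
    coh, dat = deque(coherence), deque(data)
    link = []
    slot = 0
    while coh or dat:
        queue = coh if pattern[slot % len(pattern)] == "C" else dat
        link.append(queue.popleft() if queue else None)
        slot += 1
    return link


print(tdm_schedule(["c0", "c1"], ["d0", "d1", "d2"]))
```

Because every slot has a fixed owner, coherence packets get guaranteed access to the link regardless of how much data traffic is queued.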

CONCLUSION
With its high-frequency core architecture, enhanced SMT capabilities, balanced system throughput, and scalability extensions, the POWER6 microprocessor provides higher levels of performance than the predecessor POWER5 microprocessor-based systems while offering greater flexibility in system packaging trade-offs.
Additionally, improvements in functionality, RAS, and power management have resulted in valuable new characteristics of POWER6 processor-based systems.

THANKS FOR YOUR ATTENTION
REFERENCES:
- IBM POWER6 Microarchitecture, H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, M. T. Vaden