ADVANCED COMPUTER ARCHITECTURE

Similar documents

High Performance Computing Systems and Enabling Platforms

Week 1 out-of-class notes, discussions and sample problems

LSN 2 Computer Processors

Program Optimization for Multi-core Architectures

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Generations of the computer. processors.

An Introduction to Parallel Computing/ Programming

GPUs for Scientific Computing

Course Development of Programming for General-Purpose Multicore Processors

Parallel Programming Survey

CHAPTER 1 INTRODUCTION

Introduction to Cloud Computing

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

How To Understand The Design Of A Microprocessor

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

A Lab Course on Computer Architecture

Weighted Total Mark. Weighted Exam Mark

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

Clustering Billions of Data Points Using GPUs

Introduction to GPU Programming Languages

GPU Parallel Computing Architecture and CUDA Programming Model

High Performance Computer Architecture

~ Greetings from WSU CAPPLab ~

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Control 2004, University of Bath, UK, September 2004

FPGA-based Multithreading for In-Memory Hash Joins

Parallelism and Cloud Computing

Master Degree in Computer Science and Networking

Introduction to GPU hardware and to CUDA

Embedded Parallel Computing

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

CUDA programming on NVIDIA GPUs

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

CS 394 Introduction to Computer Architecture Spring 2012

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Multi-Threading Performance on Commodity Multi-Core Processors

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Computer Architecture TDTS10

Embedded Systems: map to FPGA, GPU, CPU?

ultra fast SOM using CUDA

Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest

IA-64 Application Developer s Architecture Guide

Computer Organization

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Next Generation GPU Architecture Code-named Fermi

Operating System Impact on SMT Architecture

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Choosing a Computer for Running SLX, P3D, and P5

Driving force. What future software needs. Potential research topics

Gildart Haase School of Computer Sciences and Engineering

Architectures and Platforms

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Parallel Algorithm Engineering

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

CMSC 611: Advanced Computer Architecture

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

SPEEDUP - optimization and porting of path integral MC Code to new computing architectures

Application Performance Analysis of the Cortex-A9 MPCore

The Importance of Performance Assurance For E-Commerce Systems

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:

Optimizing Code for Accelerators: The Long Road to High Performance

Processor Architectures

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

A Comparison of Distributed Systems: ChorusOS and Amoeba

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Recommended hardware system configurations for ANSYS users

CFD Implementation with In-Socket FPGA Accelerators

The Motherboard Chapter #5

A graph-oriented task manager for small multiprocessor systems.

Speed Performance Improvement of Vehicle Blob Tracking System

Binary search tree with SIMD bandwidth optimization using SSE

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

Multicore Parallel Computing with OpenMP

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Router Architectures

Parallel Firewalls on General-Purpose Graphics Processing Units

Visualization Enables the Programmer to Reduce Cache Misses

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

CISC, RISC, and DSP Microprocessors

Introduction to Cloud Computing

Multicore Programming with LabVIEW Technical Resource Guide

Understanding the Benefits of IBM SPSS Statistics Server

Operating Systems 4 th Class

Computer Architectures

Performance evaluation

Hardware design for ray tracing

Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

HPC with Multicore and GPUs

Applying Parallel and Distributed Computing for Image Reconstruction in 3D Electrical Capacitance Tomography

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

Transcription:

ADVANCED COMPUTER ARCHITECTURE Marco Ferretti Tel. Ufficio: 0382 985365 E-mail: marco.ferretti@unipv.it Web: www.unipv.it/mferretti, eecs.unipv.it 1

Course syllabus and motivations This course covers the internal architecture and operation of modern processors, and the basics of parallel programming. The increase in performance of processors in the last years is mainly due to three factors: speed-up in instruction execution (what is actually the speed of execution, how can it be measured?) increase in the number of instructions executed in parallel (given a set of instructions, which are the properties they must satisfy to be executed in parallel) increase in the number of processors that are embedded in the same chip/board and that can be programmed to execute an algorithm 2

Course syllabus and motivations Increase in Cpu performance from 1978 to 2010 (Fig. 1.1 H-P5) 3

Course syllabus and motivations Increase in clock speed 1978 to 2010 (Fig. 1.1 H-P5) 4

Course syllabus and motivations Since 2003 this trend has slowed down consistently (can you guess why?) Designers have tried alternative approaches: using multiple execution units. This can be realized in a few ways, also with a mixed approach: Multithreading Multi-core processors, enhanced integration with GPUs Shared-memory multiprocessors systems Multiprocessors with partitioned memory (Question: these solutions require a novel approach with respect to a uniprocessor, can you figure out?) 5

Part I : Basics in modern architectures PIPELINING basics Course syllabus (A) Instruction Level Parallelism (ILP): the DYNAMIC approach Instruction Level Parallelism: the STATIC approach Fundamentals of CACHING Part II: Multi-threading Shared memory multiprocessors Partitioned memory multiprocessors Multimedia SIMD extensions, data parallel support. Part III: parallel programming with openmp (laboratory) 6

Course syllabus schedule (B) Part I : Basics in modern architectures PIPELINING basics Fundamentals of CACHING Part II: Multi-threading Shared memory multiprocessors (intro) Part III: parallel programming with openmp (laboratory) 7

Course syllabus schedule (B) Part I : Instruction Level Parallelism (ILP): the DYNAMIC approach Instruction Level Parallelism: the STATIC approach Part II: Shared memory multiprocessors (details) Partitioned memory multiprocessors Multimedia SIMD extensions, data parallel support. 8

Books J. L. Hennessy & D. A. Patterson: Computer Architecture: A Quantitative Approach, Elsevier - Morgan Kaufmann. 3 rd ed. 2003: in the charts as Hennessy-Patterson 4 th ed. 2007: in the charts as H-P4 5 ed. 2011: in the charts as H-P5 D. A. Patterson & J. L. Hennessy : Computer Organization and Design, The Hardware/Softwae Interface, third edition, 2005, Elsevier. in the charts as Patterson-Hennessy A. S. Tanenbaum: Structured Computer Organization, 5 th ed. Pearson, 2005. 9

Books Italian editions J. L. Hennessy & D. A. Patterson: Architettura degli elaboratori Apogeo, 2008 (quarta edizione americana). 4 th ed. 2007: in the charts as H-P4 D. A. Patterson & J. L. Hennessy : Struttura e Progetto dei Calcolatori: l interfaccia Hardware-software, seconda edizione (terza edizione americana) Zanichelli, 2006. D. A. Patterson & J. L. Hennessy : Struttura e Progetto dei Calcolatori: terza edizione (quarta edizione americana) Zanichelli, 2010. A. S. Tanenbaum: Architettura dei Calcolatori, un approccio strutturale, 5 t ed. Pearson/Prenctice Hall, 2006. 10

Further material Most recent architectures are not properly described in textbooks. The course uses papers, benchmarks and graphical material drawn from web resources, mainly from INTEL: http://www.intel.com/technology/itj/ Wikipedia: en.wikipedia.org Anandtech: www.anandtech.com Tom's Hardware: www.tomshardware.com Hardware Upgrade: www.hwupgrade.it 11

Lessons charts Download charts used during class from the ACA section of the professor s web site Use the charts as a trace of the arguments Textbook are the course reference material; detailed instructions on which texbooks and which chapters cover the exam will be published in due course on the web site. 12

Course entry requirements Basic knowledge of the Von-neuman CPU and of virtual memory Basic knowledge of operating systems (processes, thread, mutual exclusion, synchronization) Elementary knowledge of assembly language Knowledge of the C-language 13

Testing and exams Parallel programming project assigned to groups of 2/3 persons Written test individual Oral test individual and discussion of group project 14

Course web site www.unipv.it/mferretti section on ACA to be enlarged with learning material in English Past exams and exercises: texts available also in the Italian section 15

Course outline The architecture from the programmer s view point 10000x10000 array, Intel Core 2 Duo @ 2.8 Ghz int sum1(int** m, int n) { int i,j,sum=0; for (i=0; i<n;i++) for (j=0; j<n; j++) sum += m[i][j]; returm sum; } int sum2(int** m, int n) { int i,j,sum=0; for (i=0; i<n;i++) for (j=0; j<n; j++) sum += m[j][i]; returm sum; } 0.4 seconds 1,7 seconds (4.2 times slower!!) 16

Course outline Understanding the operation of modern architectures requires a little bit of analysis of their evolution All modern processor have roughly the same operation mode, and quite similar internal architecture, the so-called RISC architecture This denomination is almost forgot, it was very much in use in the 80 s, it marked the distinction between the old CISC architectures and the new ones, RISC, that allowed to reach higher performances. 17

Course outline RISC architecture are simpler (and nicer) in concept, which allowed to deploy some fundamentals techniques to speed-up program execution: pipelining (partially overlaps instruction execution) parallel execution of independent instructions branch prediction (knowing in advance which instruction to fetch) speculation (instruction executed before knowing if they belong to the correct path, with possible undoing ) compile-time instruction re-ordering (static schedulation) to maximize processor utilization. 18

Course outline These techniques are used in all modern architectures, though their actual implementation differ from processor to processor. After the major CISC to RISC step, performance increase has been due mainly to technologies, rather than to new concepts (To be debated). A single core in a dual/quad core Intel CPU is still very similar to an old Pentium III. 19

Cosa studieremo Course outline questo corso Fabrication improvements however have almost stopped from 2003 onwards, so that uniprocessor performances have had no sensible increase any longer. The requirement of more and more computational power has led to alternative approaches. The basic idea was coupling two or more processors ( core ) in a multi-core architecture: modern dual and quad-core processors are small shared (and cache) multiprocessor systems. 20

Course outline Two or even more programs can be executed in parallel on the same multi-core processor (provided software can actually exploit available cores) Nothing new! this is the architecture of shared-memory multiprocessor systems (that featured private caches, yet) And even before multi-core processors, multi-threading tried already to carry out parallele execution of instructions belonging to different threads. 21

Course outline If even more computational power is required, it is unfeasible using processors that share the same main memory. It is mandatory using partitioned-memory multiprocessors, even thousands of them, each possibly a multi-core. Moder multiprocessors featuring the highest performances can be a synthesis of different architectural solutions: hundreds or thousands of computational units, linked with an ad-hoc network, each computational unit hosting multiple cores that share main memory, each core itself capable of multi-threading. 22

Course outline Old alternative architectures, such as vector processors, have almost completely disappeared. They left as their heritage multi-media SIMD instructions currently embedded in all processors, and in graphics processors (GPU). Currently, the development environments support more and more GPU, even in general-purpose, non-graphicals applications. 23