CPU Session 1. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14,
1 CPU Session 1: Praktikum Parallele Rechnerarchitekturen
2 Overview
- Types of Parallelism in Modern Multi-Core CPUs
  o Multicore Parallelism
  o Superscalarity
  o Simultaneous Multithreading (SMT)
  o SIMD Vectorization
- Introduction to the Execution-Cache-Memory (ECM) Model
- Accessing the Hardware via SLURM
- Introduction to Tools and Exercise 1
3 Intel's Tick-Tock Model
A tick is a shrink of the fabrication process; a tock is a new microarchitecture.
  Processor      Fabrication   Step   Microarchitecture
  Merom          65 nm         Tock   Core
  Penryn         45 nm         Tick   Core
  Nehalem        45 nm         Tock   Nehalem
  Westmere       32 nm         Tick   Nehalem
  Sandy Bridge   32 nm         Tock   Sandy Bridge
  Ivy Bridge     22 nm         Tick   Sandy Bridge
  Haswell        22 nm         Tock   Haswell
  Broadwell      14 nm         Tick   Haswell
4 Multicore Parallelism
Design of nomad:
- Two-socket Sandy Bridge-EP system (2x Xeon E5-2670)
- Two NUMA domains
- Eight cores per socket
- Nominal CPU frequency: 2.6 GHz
- 2-SMT
- Advanced Vector Extensions (AVX)
[Figure: two packages (Package 0 and Package 1), each with a shared LLC, a memory controller (MC) attached to its own memory, and PCIx links; the packages are connected via QPI.]
5 Superscalarity
- Superscalar design: six issue ports/execution units that can execute instructions simultaneously
  o E.g. floating-point add (Port 1) and floating-point multiply (Port 0)
- 4 uop/cycle frontend limit
[Figure: Sandy Bridge core block diagram. Front end: 32KB L1 instruction cache, predecode, instruction queue, one complex and three simple decoders, MSROM, 1536 uop (L0) cache, decoded instruction queue, renamer/scheduler/dispatcher. Memory side: memory control, 32KB L1 data cache, 256KB unified L2 cache. Issue ports:]
  Port 0: ALU, V-MUL, V-SHUF, Fdiv, 256-FP MUL, 256-FP Blend
  Port 1: ALU, V-ADD, V-SHUF, 256-FP ADD
  Port 5: ALU, JMP, 256-FP Shuf, 256-FP Bool, 256-FP Blend
  Port 2: Load Data, AGU
  Port 3: Load Data, AGU
  Port 4: Store Data
6 Simultaneous Multithreading
- Control logic per core is duplicated (program counter, registers, etc.)
- 2-SMT: core resources (execution units) are shared between two hardware threads
[Figure: Sandy Bridge core block diagram with front end, issue ports 0 to 5, and L1/L2 caches, as on slide 5.]
7 SIMD Vectorization
- Starting with Sandy Bridge we get Advanced Vector Extensions (AVX, 256-bit SIMD), which extend the previous Streaming SIMD Extensions (SSE, 128-bit SIMD).
- A 256-bit ymm register holds eight floats (32 bits each) or four doubles (64 bits each); a 128-bit xmm register holds four floats or two doubles.
- Using AVX, we can perform four double-precision operations with a single instruction.
[Figure: xmm and ymm register layouts; data paths labelled 16 B and 2x16 B between core and L1 cache, and 32 B between L1 and L2 cache.]
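To make the "four operations per instruction" idea concrete, here is a minimal sketch in plain C that processes an array in groups of four doubles, the amount one 256-bit register holds. The function name `add_blocked` and the macro `AVX_DOUBLES` are illustrative and not part of the course material; whether a compiler turns the inner group into a single AVX instruction depends on the compiler and its flags.

```c
#include <stddef.h>

/* Doubles per 256-bit AVX register: 256 bits / 64 bits = 4. */
#define AVX_DOUBLES ((size_t)(256 / 64))

/* a[i] = b[i] + c[i], written in blocks of one register width. */
void add_blocked(double *restrict a, const double *restrict b,
                 const double *restrict c, size_t n)
{
    size_t i = 0;

    /* Main loop: each outer iteration covers one full group of four doubles. */
    for (; i + AVX_DOUBLES <= n; i += AVX_DOUBLES)
        for (size_t lane = 0; lane < AVX_DOUBLES; ++lane)
            a[i + lane] = b[i + lane] + c[i + lane];

    /* Remainder loop: the last n % 4 elements are handled one by one. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Calling `add_blocked(a, b, c, 6)` handles elements 0 to 3 in the blocked loop and elements 4 and 5 in the remainder loop.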
8 Summary
Calculating the peak double-precision floating-point performance of nomad:
- Two sockets
- Eight cores per socket at 2.6 GHz
- Two-way superscalar with respect to floating point
  o AVX Mult on Port 0, AVX Add on Port 1
- 256-bit AVX
  o Four double-precision operations with a single instruction
2 x 8 x 2.6 GHz x 2 instr/cycle x 4 ops/instr = 332.8 Gflop/s
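The peak-performance product can be checked with a few lines of C. This is only a sketch: the function names `peak_gflops` and `nomad_peak_gflops` are made up for illustration, and the inputs are the factors from the summary slide.

```c
/* Theoretical peak in Gflop/s as the product of the hardware factors. */
double peak_gflops(int sockets, int cores_per_socket, double freq_ghz,
                   int fp_instr_per_cycle, int flops_per_instr)
{
    return sockets * cores_per_socket * freq_ghz
         * fp_instr_per_cycle * flops_per_instr;
}

/* Values for nomad: 2 sockets, 8 cores each, 2.6 GHz,
 * one AVX multiply plus one AVX add per cycle, 4 doubles per instruction. */
double nomad_peak_gflops(void)
{
    return peak_gflops(2, 8, 2.6, 2, 4);
}
```

`nomad_peak_gflops()` evaluates to roughly 332.8 (up to floating-point rounding). Dropping a factor shows what each form of parallelism contributes: a single core reaches 2.6 x 2 x 4 = 20.8 Gflop/s.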
10 Introduction to ECM Model
- Model to predict the performance of streaming kernels
- Collect all relevant information
  o TOL: cycles spent in the core that can be overlapped with L1-L2 communication
  o TnOL: cycles spent in the core that cannot be overlapped with L1-L2 communication
  o Tij: cycles spent transferring data between levels i and j (L1-L2, L2-L3, L3-MEM)
- Shorthand notation: {TOL || TnOL | TL1L2 | TL2L3 | TL3MEM}
- ECM for Sandy Bridge: H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Preprint: arxiv:
11 ECM on Sandy Bridge
- Schönauer Vector Triad as example:
    for (i=0; i<n; ++i)
        A[i] = B[i] + C[i] * D[i];
- Granularity will always be one cache line (CL), because it is the unit of transfer in the memory hierarchy
  o Cache lines are 64 B on Intel processors
- Work for the core per cache line using 256-bit AVX (one cache line holds eight doubles, i.e. two AVX registers per array):
  o Six loads (B, C, D)
  o Two multiplications
  o Two additions
  o Two stores (A)
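The triad kernel and the per-cache-line instruction counts can be written down as a small C sketch. The function names `triad` and `regs_per_cache_line` are illustrative choices, not taken from the course skeleton.

```c
#include <stddef.h>

/* Schoenauer vector triad: A[i] = B[i] + C[i] * D[i]. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, const double *restrict d, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}

/* Number of SIMD registers needed to cover one cache line,
 * e.g. 64 bytes / (256 bits / 8) = 2 for AVX. */
int regs_per_cache_line(int cache_line_bytes, int simd_bits)
{
    return cache_line_bytes / (simd_bits / 8);
}
```

With `regs_per_cache_line(64, 256) == 2`, the three input arrays need 3 x 2 = 6 loads, and the result array needs 2 multiplications, 2 additions, and 2 stores per cache line, matching the counts on the slide.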
12 Schönauer Triad on Sandy Bridge
  Cycle   Port 0           Port 1           Ports 2/3         Port 4    Port 5
          (ALU, AVX Mul)   (ALU, AVX Add)   (Load, AGU)       (Store)   (ALU, Branch)
  1       Mul              Add              Load B, Load B
  2       Mul              Add              Store             Store A
  3                                         Load C, Load C
  4                                         Store             Store A
  5                                         Load D, Load D
  6
13 ECM on Sandy Bridge
- Core: TOL = 2 cycles, TnOL = 6 cycles
  o See previous slide
- L1-L2: need to transfer five cache lines between L1 and L2 per cache line update in core (loads of B, C, and D, plus write-allocate and eviction of A)
  o Takes 10 cycles at 32 bytes/cycle bandwidth
- L2-L3: need to transfer five cache lines between L2 and L3 per cache line update in core
  o Takes 10 cycles at 32 bytes/cycle bandwidth
- L3-MEM: need to transfer five cache lines between L3 and MEM per cache line update in core
  o Measure sustained chip bandwidth for Schönauer Vector Triad using likwid
  o Sandy Bridge: 28.4 GB/s -> 64 B * 2.7 GHz / 28.4 GB/s = 6.08 cycles/CL
  o Takes ~30 cycles on Sandy Bridge
- Data for Sandy Bridge: {2 || 6 | 10 | 10 | 30}
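The transfer times above are simple arithmetic, sketched here in C. The helper names `transfer_cycles` and `mem_cycles_per_line` are hypothetical; the inputs (five cache lines, 64 B per line, 32 bytes/cycle, 2.7 GHz, 28.4 GB/s) are the ones given on the slide.

```c
/* Cycles needed to move `lines` cache lines over a link that
 * transfers `bytes_per_cycle` bytes each cycle. */
double transfer_cycles(int lines, int line_bytes, double bytes_per_cycle)
{
    return lines * line_bytes / bytes_per_cycle;
}

/* Cycles per cache line when main memory sustains `bw_gbs` GB/s
 * and the core runs at `freq_ghz` GHz. */
double mem_cycles_per_line(int line_bytes, double freq_ghz, double bw_gbs)
{
    return line_bytes * freq_ghz / bw_gbs;
}
```

`transfer_cycles(5, 64, 32.0)` gives 10 cycles for L1-L2 and L2-L3. `mem_cycles_per_line(64, 2.7, 28.4)` gives about 6.08 cycles per cache line, so five cache lines take about 30 cycles.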
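14 ECM on Sandy Bridge
- Performance prediction from the inputs {TOL || TnOL | TL1L2 | TL2L3 | TL3MEM}:
  o Tcore = max(TnOL, TOL)
  o TECM = max(TnOL + Tdata, TOL), where Tdata is the sum of the transfer times down to the memory level that holds the data
- Sandy Bridge
  o {2 || 6 | 10 | 10 | 30}
  o TECM = {6 ⌉ 16 ⌉ 26 ⌉ 56} cycles for data in L1, L2, L3, and memory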
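The prediction rule can be sketched as a short C function that accumulates the transfer times level by level. The struct `ecm_input` and the function `ecm_predict` are illustrative names, not part of any existing tool.

```c
/* ECM inputs in cycles per cache line. */
typedef struct {
    double t_ol;     /* core cycles that overlap with data transfers */
    double t_nol;    /* core cycles that do not overlap */
    double t_l1l2;   /* transfer time L1-L2 */
    double t_l2l3;   /* transfer time L2-L3 */
    double t_l3mem;  /* transfer time L3-memory */
} ecm_input;

static double max2(double x, double y) { return x > y ? x : y; }

/* Fills out[0..3] with the predicted cycles per cache line for
 * data residing in L1, L2, L3, and main memory. */
void ecm_predict(const ecm_input *in, double out[4])
{
    const double transfer[4] = { 0.0, in->t_l1l2, in->t_l2l3, in->t_l3mem };
    double t_data = 0.0;

    for (int level = 0; level < 4; ++level) {
        t_data += transfer[level];  /* Tdata grows with each level passed */
        out[level] = max2(in->t_nol + t_data, in->t_ol);
    }
}
```

For the Sandy Bridge triad inputs {2 || 6 | 10 | 10 | 30} this yields 6, 16, 26, and 56 cycles per cache line.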
15 ECM on Sandy Bridge
[Figure: Schönauer Vector Triad A(:) = B(:) + C(:) * D(:) on Xeon E (phinally). x-axis: Dataset Size [kB]; y-axis: Performance [cycles per cache line]. Annotated levels include 6 cycles, 26 cycles, and 56 cycles.]
17 SLURM
- Overview of systems
- Log in to the head node
  o ssh faui36b
- Submit an interactive job
  o srun -w nomad -c1 -t100:00 --pty bash
- Hints
  o -w nomad: host to request
  o -c1: number of requested cores (1 to develop/compile, 32 to measure)
  o -t100:00: reservation time; 100 minutes is the maximum (don't forget to save your work: if your time runs out, you'll be kicked from the system!)
  o --pty bash: interactive job (Problems with the terminal? Connect manually in another terminal: ssh nomad)
19 Tools: likwid
- Developed by Jan Treibig and Thomas Röhl at RRZE
- Swiss Army knife for HPC
  o likwid-topology: show node information
  o likwid-pin: set thread affinity
  o likwid-bench: microbenchmarks
  o likwid-perfctr: read hardware performance counters
  o likwid-powermeter: measure energy usage
- For any measurements, use:
  o likwid-pin -c 1 <your_binary> <your_parameters>
20 Tools: Setting up likwid
- Download from [link] or [link]
- Adjust config.mk after unpacking
  o compiler = MIC
  o PREFIX = installation path, e.g. $(HOME)/likwid/
- Compile & install
  o make && make install
- Test
    nomad:~ hofmann$ ./likwid/likwid-topology
    CPU type:... Intel Core SandyBridge EP processor
21 More Tools
- To plot your results, I recommend either grace or gnuplot
- seq/jot are useful tools to generate sequences of numbers
  o Useful in combination with for to generate plots, e.g.
      for i in $(seq 100)
      do
          likwid-pin -c 1 ./stencil $i
      done
- More useful tools for processing text output
  o grep, awk, cut, sed, head, tail, ...
22 Exercise 1
- Part 1 (Single-Core Implementation)
  o Implement 2D Jacobi for CPU (use skeleton code from website to get started)
  o Measure for matrix sizes up to 3 times the size of the last-level cache (L3 cache)
  o Plot results (dataset size on x-axis, MUp/s [Mega Updates/s] on y-axis)
  o You should see the effect of dropping out of the different cache levels
- Part 2 (Theory)
  o Try to come up with the ECM model for 2D Jacobi on Sandy Bridge
  o Advanced students can try to use IACA [1] to analyze the inner loop (update.c)
  o Insert model predictions into the plot: does your code achieve the theoretical peak performance?
[1]
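As a starting point for Part 1, the following is a minimal sketch of one 2D Jacobi sweep and of the MUp/s metric. It assumes the common four-point averaging stencil and a row-major grid with a fixed boundary; the skeleton code from the website may use a different stencil, layout, or interface, and the names `jacobi_sweep` and `mups` are made up here.

```c
#include <stddef.h>

/* One Jacobi sweep over the interior of an ny x nx row-major grid.
 * Assumption: each interior point becomes the average of its four
 * direct neighbours; boundary points are left untouched. */
void jacobi_sweep(const double *restrict src, double *restrict dst,
                  size_t ny, size_t nx)
{
    for (size_t y = 1; y + 1 < ny; ++y)
        for (size_t x = 1; x + 1 < nx; ++x)
            dst[y * nx + x] = 0.25 * (src[(y - 1) * nx + x]
                                    + src[(y + 1) * nx + x]
                                    + src[y * nx + x - 1]
                                    + src[y * nx + x + 1]);
}

/* Mega updates per second for `sweeps` sweeps that took `seconds`.
 * Only interior points count as updates; assumes ny >= 2 and nx >= 2. */
double mups(size_t ny, size_t nx, size_t sweeps, double seconds)
{
    return (double)((ny - 2) * (nx - 2)) * (double)sweeps / seconds / 1.0e6;
}
```

In a measurement loop, `src` and `dst` would be swapped after each sweep, and the elapsed time for all sweeps would be passed to `mups` to obtain the y-axis value of the plot.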
1 KONWIHR Project: Centre of Excellence for High Performance Computing Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture F. Deserno, G. Hager, F. Brechtefeld, G.
More informationParallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.
Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr GPU internals What makes a GPU tick? NVIDIA GeForce GTX 980 Maxwell GPU.
More informationTechnical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.
Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December
More informationIntroduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
More informationIntroduction to High Performance Cluster Computing. Cluster Training for UCL Part 1
Introduction to High Performance Cluster Computing Cluster Training for UCL Part 1 What is HPC HPC = High Performance Computing Includes Supercomputing HPCC = High Performance Cluster Computing Note: these
More informationPower System Probabilistic and Security Analysis on Commodity High Performance Computing Systems
Power System Probabilistic and Security Analysis on Commodity High Performance Computing Systems Tao Cui Carnegie Mellon University 5 Forbes Ave. Pittsburgh, PA 15213 tao.cui@ieee.org ABSTRACT Large scale
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
More informationMulti-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
More informationFPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
More informationMultithreading Lin Gao cs9244 report, 2006
Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............
More informationIntroduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it
t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationSPEEDUP - optimization and porting of path integral MC Code to new computing architectures
SPEEDUP - optimization and porting of path integral MC Code to new computing architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić, A. Bogojević Scientific Computing Laboratory, Institute of Physics
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationHETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
More information