High Performance Computing: The processor


Erik Saule, esaule@uncc.edu
ITCS 6010/8010: Special Topics: High Performance Computing
01/16/2014

Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources

Schematic (Nehalem: Intel Core i7)
Source: intel.com

A classic RISC pipeline
Typical architecture: instructions are pushed into the pipeline one after the other.
ADD R0, R1
LI R0, +42
LOAD R0, R1
BNZ R0, +3
Source: a Georgia Tech assignment by Hyesoon Kim
Instruction pipelining: one instruction per cycle (throughput).
Source: Wikipedia
Pipeline stall (bubbles).
Source: Wikipedia

How to avoid bubbles?
Control hazard: happens when the processor has not yet decided which instruction to execute next, typically because of a jump that might go either way:
Jump +12
JumpR R0
BNZ R0, +2
Branch prediction: keep a small table of what was chosen before, and guess which choice will be taken. If the guess is wrong, flush the pipeline.
Data hazard: an instruction uses the result of the previous instruction (a small C illustration follows this slide):
R0 = R1+R2
R3 = R0+R4
Loop in the ALU (forwarding): add an output from the ALU that feeds the next Instruction Decode stage in case it is needed. (But this gets complicated for more than one cycle of difference.)
Out-of-order execution: keep a pool of upcoming instructions and execute one that is safe.
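To see why data hazards matter in practice, here is a minimal C sketch (not from the slides): the first loop is one long dependency chain, while the second exposes independent work that an out-of-order core can overlap. (Reassociating floating-point sums can change the result slightly.)

#include <stddef.h>

double sum_chained(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* each iteration depends on the previous one */
    return s;
}

double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];             /* four independent dependency chains */
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}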

Core architecture (Nehalem)
Source: TAMU supercomputing facility. http://sc.tamu.edu/systems/eos/

Summary
Modern multicore processors are complicated. There are resources shared between cores: I/O ports, the L3 cache, the memory controller(s). It is VERY difficult to predict their behavior. One core can retire more than one instruction per cycle. The philosophy of RISC optimization still applies. Writing efficient code manually is going to be tedious.

Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources

Single Instruction Multiple Data
Flynn's taxonomy:
Single Instruction Single Data: simple processor model.
Single Instruction Multiple Data: vector processor.
Multiple Instruction Single Data: odd and uncommon.
Multiple Instruction Multiple Data: typically shared memory.
Principle: perform, in a single instruction, the same operation on multiple data. In a typical processor this uses special registers that contain multiple values, often called packed registers (packed integer, packed double, ...).

History
Intel processors:
MMX (1996): 64-bit integer packed registers (overlapping with the FPU).
SSE (1999): new 128-bit registers for single precision.
SSE2 (2001): extends SSE to integers.
SSE3 (2004): horizontal operations, misaligned reads, conversions.
SSSE3 (2006): new minor (mostly horizontal) operations.
SSE4 (2006): minor additions, string operations, bit counts.
AES (2008): to help AES encryption.
AVX (2008): introduces 256-bit registers and 3-operand instructions; overlaps with the SSE registers.
AVX2 (2013): expands AVX to most instructions; FMA; gather.
Some nuances with AMD, but mostly the same. Other classical ones: AltiVec for Power and NEON for ARM.
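A small related sketch, assuming GCC/Clang/ICC: these compilers predefine macros such as __SSE2__, __AVX__ and __AVX2__ when the corresponding instruction set is enabled (e.g. with -mavx2 or -march=native), which lets one source file select a code path at compile time.

#include <stdio.h>

int main(void)
{
#if defined(__AVX2__)
    puts("compiled with AVX2 support");       /* 256-bit integer ops, FMA, gather */
#elif defined(__AVX__)
    puts("compiled with AVX support");        /* 256-bit float/double ops */
#elif defined(__SSE2__)
    puts("compiled with SSE2 support");       /* 128-bit ops */
#else
    puts("compiled without SIMD extensions enabled");
#endif
    return 0;
}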

How to use vector instructions
Code and pray: the compiler can do some of it. It typically does some vectorization with -O2. It might be necessary to use some -march / -mtune options.
Guide the compiler (a sketch follows this slide):
#pragma simd
#pragma vector aligned
Intel Cilk Plus elemental functions: __declspec(vector)
Write assembly: long and tedious. The previous section showed how difficult understanding the architecture can be.
Use high-level types: use C++ types and operator overloading to achieve some abstraction.
Use intrinsics/built-ins: most compilers expose vector types on which built-in/intrinsic functions can be applied.
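A minimal sketch of "guiding the compiler", assuming the Intel compiler's #pragma simd mentioned above (other compilers ignore unknown pragmas and rely on restrict-qualified pointers plus -O2/-O3 auto-vectorization):

/* Scale-and-add over two float arrays; the pragma asserts the loop is
 * safe to vectorize (no aliasing, no loop-carried dependence). */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
#pragma simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}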

AVX (Sandy Bridge) and AVX2 (Haswell)
Registers: there are 16 registers of 256 bits called YMM[0-15]. They overlap with the XMM[0-15] registers used by SSE.
Declare them in C (with the proper include, immintrin.h):
__m256: single precision
__m256d: double precision
__m256i: integers
Notice the lack of signedness and size information for the integers.
All operations are suffixed with the type of data they operate on:
ps: packed single precision
pd: packed double precision
epi32: signed 32-bit integers (mostly AVX2)
epu8: unsigned 8-bit integers (mostly AVX2)
Note: read the documentation carefully, many of them are confusing.

Different kinds of instructions
Unary vertical operations:
__m256d _mm256_sqrt_pd(__m256d a);
__m256 _mm256_rsqrt_ps(__m256 a);
__m256 _mm256_rcp_ps(__m256 a);
For instance:
__m256d a = _mm256_set_pd(4.0, 9.0, 16.0, 25.0);   (4 9 16 25)
__m256d b = _mm256_sqrt_pd(a);                      (2 3 4 5)
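A self-contained version of the sqrt example above (a sketch; assumes an AVX-capable CPU and compilation with e.g. -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d a = _mm256_set_pd(4.0, 9.0, 16.0, 25.0);   /* packed doubles */
    __m256d b = _mm256_sqrt_pd(a);                      /* element-wise sqrt */

    double out[4];
    _mm256_storeu_pd(out, b);                           /* write back to memory */
    /* prints 5 4 3 2: _mm256_set_pd lists elements from highest to lowest,
     * so out[0] holds sqrt(25.0). */
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}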

Different kinds of instructions
Binary vertical operations:
__m256i _mm256_add_epi16(__m256i m1, __m256i m2); (AVX2)
__m256 _mm256_mul_ps(__m256 m1, __m256 m2);
__m256d _mm256_div_pd(__m256d m1, __m256d m2);
__m256 _mm256_max_ps(__m256 m1, __m256 m2);
__m256d _mm256_xor_pd(__m256d m1, __m256d m2);
__m256 _mm256_fmadd_ps(__m256 a, __m256 b, __m256 c); (AVX2/FMA)
For instance:
__m256d a = _mm256_set_pd(1., 2., 3., 4.);      (1 2 3 4)
__m256d b = _mm256_set_pd(10., 20., 30., 40.);  (10 20 30 40)
__m256d c = _mm256_add_pd(a, b);                (11 22 33 44)
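The slide lists _mm256_fmadd_ps; here is a complete sketch using the double-precision variant _mm256_fmadd_pd (assumes a Haswell-class CPU, compile with e.g. -mavx2 -mfma):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d a = _mm256_set1_pd(2.0);                 /* {2, 2, 2, 2}             */
    __m256d b = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);   /* highest element first    */
    __m256d c = _mm256_set1_pd(10.0);
    __m256d d = _mm256_fmadd_pd(a, b, c);            /* a*b + c in one instruction */

    double out[4];
    _mm256_storeu_pd(out, d);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  /* 12 14 16 18 */
    return 0;
}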

Different kinds of instructions
Horizontal operations:
__m256d _mm256_hadd_pd(__m256d m1, __m256d m2);
__m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);
For instance:
__m256d a = _mm256_set_pd(1., 2., 3., 4.);      (1 2 3 4)
__m256d b = _mm256_set_pd(10., 20., 30., 40.);  (10 20 30 40)
__m256d c = _mm256_hadd_pd(a, b);               (pairwise sums 3, 7, 30, 70)
Note that the hardware interleaves the results of m1 and m2 within each 128-bit lane, so the element order in c is not simply 3 7 30 70.
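A sketch of a full horizontal reduction built on _mm256_hadd_pd: it sums the four doubles of a 256-bit register into one scalar (assumes AVX).

#include <immintrin.h>

static double hsum256_pd(__m256d v)
{
    __m256d t  = _mm256_hadd_pd(v, v);           /* {v0+v1, v0+v1, v2+v3, v2+v3} */
    __m128d lo = _mm256_castpd256_pd128(t);      /* {v0+v1, v0+v1}               */
    __m128d hi = _mm256_extractf128_pd(t, 1);    /* {v2+v3, v2+v3}               */
    return _mm_cvtsd_f64(_mm_add_pd(lo, hi));    /* lowest lane holds the total  */
}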

Different kinds of instructions
Getting data from regular registers:
__m256i _mm256_setzero_si256(void);
__m256 _mm256_set1_ps(float);
__m256d _mm256_set_pd(double, double, double, double);
Getting data from vector registers:
__m128i _mm256_extractf128_si256(__m256i m1, const int offset);
__m256d _mm256_shuffle_pd(__m256d m1, __m256d m2, const int select);
__m256 _mm256_blend_ps(__m256 m1, __m256 m2, const int mask);
__m256 _mm256_permute_ps(__m256 m1, int control);

Different kinds of instructions
Getting data from memory:
__m256 _mm256_load_ps(float const *a);
__m256d _mm256_loadu_pd(double const *a); (unaligned)
__m256 _mm256_broadcast_ps(__m128 const *a);
__m256d _mm256_broadcast_sd(double const *a);
__m128d _mm_i64gather_pd(double const *base, __m128i vindex, const int scale); (AVX2)
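Putting loads and stores together, a sketch of a vectorized element-wise add using unaligned loads/stores (assumes AVX; the scalar loop handles the remainder when n is not a multiple of 4):

#include <immintrin.h>

void add_arrays(double *c, const double *a, const double *b, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);            /* 4 doubles from a */
        __m256d vb = _mm256_loadu_pd(b + i);            /* 4 doubles from b */
        _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb)); /* c[i..i+3] = a + b */
    }
    for (; i < n; i++)                                  /* leftover elements */
        c[i] = a[i] + b[i];
}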

Data Layout
Array of Structures vs. Structure of Arrays (AoS/SoA). A gaming example (a sketch follows this slide).
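A sketch of the AoS/SoA distinction with a game-like particle example (the names are illustrative, not from the slides). With the SoA layout, a single _mm256_loadu_ps can fetch 8 consecutive x coordinates; the AoS layout interleaves x, y and z and requires gathers or shuffles.

/* Array of Structures: the fields of one particle are contiguous. */
struct particle_aos { float x, y, z; };
struct particle_aos particles_aos[1024];

/* Structure of Arrays: each field is a contiguous array, SIMD-friendly. */
struct particles_soa {
    float x[1024];
    float y[1024];
    float z[1024];
};
struct particles_soa particles_soa;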

Capabilities
Being careful: of course, not all processors can use all the instructions. Using an instruction the CPU does not implement raises an illegal-instruction fault (most of the time). You can find out which processor you are using with the CPUID assembly instruction. Linux exposes the available instruction sets in /proc/cpuinfo.
On the software side: many libraries and programs probe the capabilities of the CPU and choose which code path to take (a sketch follows this slide).
FFmpeg: http://www.ffmpeg.org/doxygen/0.6/cpuid_8c-source.html
Intel MKL: it used to be way too restrictive.
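A sketch of runtime dispatch, assuming GCC (>= 4.8), which provides __builtin_cpu_supports(); real libraries such as FFmpeg or MKL do essentially this, often via CPUID directly. The two kernel functions here are hypothetical placeholders.

#include <stdio.h>

static void kernel_avx2(void)   { puts("running AVX2 path"); }   /* hypothetical */
static void kernel_scalar(void) { puts("running scalar path"); } /* hypothetical */

int main(void)
{
    __builtin_cpu_init();                    /* initialize CPU feature detection */
    if (__builtin_cpu_supports("avx2"))
        kernel_avx2();
    else
        kernel_scalar();
    return 0;
}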

Summary
Lots of performance is available by using vector operations. AVX can process in a single instruction:
4 double precision floating point values
8 single precision floating point values
8 32-bit integers (AVX2)
16 16-bit integers (AVX2)
32 8-bit integers (AVX2)
Fused multiply-add can potentially double that number again (AVX2).
Not only vertical operations: horizontal operations, data reordering, gather (AVX2).

Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources

Why Hardware Threading?
Context switching: sometimes the CPU idles because it waits for data:
there is a cache miss or page fault.
it waits for the system to wake it up (waiting on a condition to be met by another process).
it is waiting for I/O.
its time quantum has elapsed.
Low-level reasons: sometimes some resources of the processor are left idle:
waiting for data in the L2 or L3 cache to reach L1.
contention on the L1 cache.
pipeline bubbles.

What is it?
Multiple hardware contexts: the principle is to have multiple hardware contexts within a core:
Replicate the registers.
Do not replicate the execution components.
Switching from one context to the other is now virtually free. This keeps more potential work available for the execution components, so efficiency increases.
Multiple kinds:
One thread runs until it blocks, then switch.
Round robin on the threads.
Decode everything at once and let instructions execute as they can (SMT).
It is actually a very old idea, originally used to overlap disk I/O and computation. Now this (or similar ideas) is used by most computing hardware.

Interesting facts about threading
Often used to overlap data transfers and operations.
Can cause more contention on shared resources, leading to performance drops.
On compute kernels, about 30% can be gained.
On I/O-bound applications, large factors can be gained.
On x86 CPUs, rarely more than 2-way SMT (the Xeon Phi is 4-way).
SPARC had 16-way threading on some processors.

Conclusion (before talking about OpenMP)
Generalities: there is parallelism at many levels:
at the processor level (multiple cores)
at the core level with SMT
within a core (multiple instruction ports)
within a port (SIMD)
To understand performance, you need to understand how things are going to be executed. It is very difficult to predict actual performance.
Things that matter: frequency, number of cores, instruction set, architecture of the core.

Outline
1 General architecture
2 Vector processing
3 Hardware Threading
4 Resources

Resources
Classic RISC pipeline: http://en.wikipedia.org/wiki/classic_risc_pipeline
Detailed explanation of Nehalem: http://sc.tamu.edu/systems/eos/
Composer XE documentation for SIMD instructions: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-mac/guid-64e5bfbb-fe9e-47cb-82a7-76c2ad57ed9c.htm
On #pragma simd: http://software.intel.com/en-us/articles/requirements-for-vectorizing-loops-with-pragma-simd
Intel compiler ways to vectorize automatically: http://software.intel.com/en-us/articles/using-intel-avx-without-writing-avx
SIMD in gcc: http://ds9a.nl/gcc-simd/index.html (a little old)
Built-ins in gcc: http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/target-builtins.html
Intel developer manuals: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
SMT on Wikipedia: http://en.wikipedia.org/wiki/simultaneous_multithreading
CPUID: http://en.wikipedia.org/wiki/cpuid
On stream programming (vectorization): http://www.akkadia.org/drepper/summit09-stream.pdf
Sandy Bridge / Haswell architectural comparison: http://www.realworldtech.com/haswell-cpu/