Intel Processor Architectures. Kapil Agrawal Sadika Amreen Reazul Hoque Thananon Patinyasakdikul

Similar documents

A Powerful solution for next generation Pcs

Putting it all together: Intel Nehalem.

GPUs for Scientific Computing

Introduction to Cloud Computing

Haswell Cryptographic Performance

2

AMD PhenomII. Architecture for Multimedia System Prof. Cristina Silvano. Group Member: Nazanin Vahabi Kosar Tayebani

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14,

Computer-System Architecture

LOOKING FOR AN AMAZING PROCESSOR. Product Brief 6th Gen Intel Core Processors for Desktops: S-series

Intel Xeon Processor E5-2600

Intel Pentium 4 Processor on 90nm Technology

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Legal Notices and Important Information

Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche.

Fastboot Techniques for x86 Architectures. Marcus Bortel Field Application Engineer QNX Software Systems

Intel Core i Processor (3M Cache, 3.30 GHz)

What are the differences between Intel Core i3, i5 and i7

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

picojava TM : A Hardware Implementation of the Java Virtual Machine

LABS. Boston Solutions. Future Intel Xeon processor E v3 product families. September Powered by

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Accelerating Innovation in the Desktop. Rob Crooke VP and General Manager Business Client Group

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

PROBLEMS #20,R0,R1 #$3A,R2,R4

Intel Core i3-2310m Processor (3M Cache, 2.10 GHz)

CSE 6040 Computing for Data Analytics: Methods and Tools

Chapter 1 Computer System Overview

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Schedule. 9:00-9:10 Section 1 - Basic intro to power and energy. 9:30-9:45 Section 3 - Component specific measurement techniques

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Generations of the computer. processors.

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Operating System Impact on SMT Architecture

Performance monitoring of the software frameworks for LHC experiments

Performance evaluation

Performance Application Programming Interface

Full and Para Virtualization

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Instruction Set Architecture (ISA)

SOC architecture and design

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

İSTANBUL AYDIN UNIVERSITY

Introduction to GPU Programming Languages

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Introduction to GPU hardware and to CUDA

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

IOS110. Virtualization 5/27/2014 1

Glitch Free Frequency Shifting Simplifies Timing Design in Consumer Applications

Software implementation of Post-Quantum Cryptography

Performance Counter. Non-Uniform Memory Access Seminar Karsten Tausche

Parallel Algorithm Engineering

Architecture of Hitachi SR-8000

Next Generation GPU Architecture Code-named Fermi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Hard Disk Drive vs. Kingston SSDNow V+ 200 Series 240GB: Comparative Test

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

SYSTEM ecos Embedded Configurable Operating System

CSCI 4717 Computer Architecture. Function. Data Storage. Data Processing. Data movement to a peripheral. Data Movement

EC 362 Problem Set #2

<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing

Enterprise Applications

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

CS352H: Computer Systems Architecture

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

ECDF Infrastructure Refresh - Requirements Consultation Document

IMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP

Rethinking SIMD Vectorization for In-Memory Databases

Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007

White Paper on Consolidation Ratios for VDI implementations

Presentation Headline To Go Here

Choosing a Computer for Running SLX, P3D, and P5

INTEL IPP REALISTIC RENDERING MOBILE PLATFORM SOFTWARE DEVELOPMENT KIT

The Motherboard Chapter #5

White Paper. Intel Sandy Bridge Brings Many Benefits to the PC/104 Form Factor

Intel X58 Express Chipset

ASSEMBLY PROGRAMMING ON A VIRTUAL COMPUTER

- An Essential Building Block for Stable and Reliable Compute Clusters

Processor Architectures

High-speed image processing algorithms using MMX hardware

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Configuring Memory on the HP Business Desktop dx5150

CS 147: Computer Systems Performance Analysis

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Cycles for Competitiveness: A View of the Future HPC Landscape

Innovativste XEON Prozessortechnik für Cisco UCS

Transcription:

Intel Processor Architectures Kapil Agrawal Sadika Amreen Reazul Hoque Thananon Patinyasakdikul

Overview Analysis of Haswell and Ivy Bridge processor architecture by Intel Intel Core i5 Ivy Bridge 3570k Haswell 4670k Intel Core i7 Ivy Bridge 3770K Haswell 4770k

Overview Threading Intel s TICK TOCK Model Instruction Sets Cache Performance Power Consumption Overall Performance Price/performance ratio Branch prediction

Threads Thread in processing is another way of virtualizing programs running concurrently. Can be thought of as the software version of running multiple CPUs A thread is a path of execution through a program.

Single-Thread Vs. Multi-Thread Single threaded programs have one path of execution. Multi-threaded programs have two or more paths of executions Single threaded programs can perform only one task at a time, and have to finish each task in sequence before they can start another.

Some background Core i3 - Two cores, hyper-threaded - Acts like 4 cores Core i5 - Four cores - Turbo-boost feature Core i7 - Four cores, hyper-threaded - Acts like 8 cores - Turbo-boost feature

Ivy Bridge Haswell

Haswell Execution Units

Haswell s Compute Instructions Haswell includes the AVX2 instructions. - AVX2: Advanced Vector Extension 2 - FMA3: Fused Multiply-Add Extension 3 - - (allows numbers to be multiplied and added in one operation) These extensions are not extensively supported by applications yet.

Example AVX came with 12 new instructions some of which are suitable for 3 variables Example: AVX: C = A + B Before AVX: A = A + B and C = A AVX2 takes this a step further. The integer execution units now can work with 256-bit numbers.

FMA AMD introduced FMA instructions with the Bulldozer core, which can work with four variables. i.e. D = A x B + C. Intel went with FMA3, with a maximum of three variables. i.e. C = A x B + C. Intel's simpler version can still improve performance quite a bit for AVX2- compiled software.

Intel AVX2 Instruction Set Includes - 256 bit Integer vectors - FMA: Fused Multiply-Add Benefits - High performance computing - Audio & Video - Games New Integer Instructions - Indexing and Hashing - Cryptography - Endian Conversion

Haswell - TSX Performance-boost for software is the new Transactional Synchronization Extensions (TSX) Improves the way multiple threads of the same program handle data in the memory Multi-threaded software should scale better to multiple cores as a result.

Difference in Cache Metric Ivy Bridge Haswell L1 Load Bandwidth 32 Bytes/cycle 32 Bytes/cycle Store Bandwidth 16 Bytes/cycle 32 Bytes/cycle L2 Bandwidth to L1 32 Bytes/cycle 64 Bytes/cycle L2 Unified TLB 4K: 512, 4-way 4K+2M shared: 1024, 8-way

Haswell has a lower power consumption than Ivy Bridge in idle state.

Branch Prediction IVY BRIDGE Misprediction penalty 15 clock cycles or more for branches inside the μop cache and slightly more for branches in the level-1 code cache HASWELL It was measured to 15-20 clock cycles. Varies a lot Pattern recognition for conditional jumps Nested loops and loops with branches inside are not predicted particularly well Loops are successfully predicted up to a count of 32 or a little more. Nested loops and branches inside loops are predicted reasonably well

Branch Prediction Pattern recognition for indirect jumps and calls Prediction of function returns IVY BRIDGE Indirect jumps and indirect calls (but not returns) are predicted using the same two-level predictor as branch instructions The return stack buffer has 16 entries for near returns HASWELL Indirect jumps and indirect calls are predicted well The return stack buffer has 16 entries for near returns

Summary Ivy Bridge PROS Needs less power, Insignificantly cheaper Slightly better price / performance ratio CONS Performs a bit worse in all types of programs Lacks some instructions

Summary - Haswell PROS Insignificantly faster in all kinds of tasks Features AVX2 / FMA3 instructions CONS Requires slightly more power Priced a bit higher Insignificantly worse price / performance ratio

Single Thread Performance Multi-Thread Performance

Memory Intensive Application Discrete Graphics Performance

Price/Performance ratio

Questions?

References: The Haswell paradox: The best CPU in the world unless you re a PC enthusiast http://www.extremetech.com/computing/157337-thehaswell-paradox-the-best-cpu-in-the-world-unless-yourea-pc-enthusiast Fourth generation Intel Core preview: all about Haswell http://us.hardware.info/reviews/4308/4/fourthgeneration-intel-core-preview-all-about-haswell-avx2 CPU World- Intel Core i5-3570k vs i5-4670k http://www.cpu-world.com/compare/579/ Intel_Core_i5_i5-3570K_vs_Intel_Core_i5_i5-4670K.html Intel Core i7 4960X (Ivy Bridge) E review http://www.anandtech.com/show/7255/intel-corei7-4960x-ivy-bridge-e-review/2