Multicore Landscape
Intel: dual- and quad-core Pentium family; 80-core demonstration last year.
AMD: dual-, triple- (?!), and quad-core Opteron family.
IBM: dual- and quad-core Power family; Cell Broadband Engine.
Sun: (multithreaded) Niagara: 8 cores (64 threads).
Multicore Classes
Homogeneous Multicore: replication of the same processor type on the die, as a shared-memory multiprocessor.
Examples: AMD and Intel dual- and quad-core processors.
Heterogeneous/Hybrid Multicore: different processor types on a die.
Example: IBM Cell Broadband Engine: one standard core (PowerPC) + 8 specialized cores (SPEs).
Rumors: Intel and AMD to follow suit, combining GPU-like processors with standard multicore cores.
Homogeneous Multicore
The major mainstream chips are all of this form.
Examples: Intel, IBM, Sun, AMD, and so on.
Take a single core and stamp it out many times on a die.
This will work for now, but it likely will not scale to high core counts. Why?
AMD Dual-, Triple-, and Quad-Core
AMD Triple-Core?
Actually a quad-core processor with one core disabled.
Chips with a manufacturing defect that kills one core but leaves the rest functional would otherwise be thrown out as failed quad-cores. Why not resell them as triple-cores?
Not a new idea: lower clock-rate versions of processors are often identical to their higher clock-rate siblings, but came from a batch that failed to meet certain manufacturing tolerances and was otherwise fine.
Multicore Landscape
AMD Rev. H Quad-Core
IBM: Power6
Sun Niagara
8 cores, 8 threads per core = 64 threads.
Multicore Landscape
Intel Nehalem
Dies: Observations
Core replication is obvious.
Big chunks of real estate are dedicated to caches.
Increasing real estate is dedicated to memory and bus logic. Why?
In the past, SMPs were usually 2 or 4 CPUs, so this logic lived on other chips on the motherboard.
Even with big, fat SMPs, the logic was off on a different board (often this was what the $$$ went for in the expensive SMPs).
Wait a Minute! There's More!
Itanium processors are based on the concept of VLIW (Very Long Instruction Word).
Each instruction bundle is a 128-bit word holding three 41-bit instructions (plus a 5-bit template) to be executed concurrently.
The difficulty of identifying parallelism is left to the compiler instead of hardware or the programmer.
Is VLIW dead along with the "Itanic"?
VLIW Lives!
VLIW is not dead.
The Intel Terascale demo processor has 80 VLIW cores instead of 80 superscalar cores like those found in the more common Core2 architecture.
Multicore: Present
Core count is increasing.
Operating systems schedule processes out to the various cores the same way they always have on traditional multiprocessor systems.
Applications get increased performance for free: fewer processes per core means less contention for processor resources.
Resource Contention
Assume each process or thread has its own CPU.
Example: the little clock in the corner of the screen has its own CPU.
Result: massive contention when simultaneous threads all fight for off-chip resources.
Today, a process is preempted when it does something slow (e.g., going to memory), so other useful work can occur.
If a processor exists for each process, everyone can proceed at the same time, yet all will still be fighting for a bus that is not growing at the same rate as the core count. Same with caches.
Blast Back to the Past (70s and 80s)
Network topology will matter again: mesh-based algorithms, hypercubes, tori, etc.
The 80-core Intel Terascale demo processor had a huge component related to interconnection topology.
Systolic algorithms are making a comeback!
Reminder: Reading Assignment
Brian Hayes, "Computing in a Parallel Universe," American Scientist, November-December 2007.
Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software," Dr. Dobb's Journal, 30(3), March 2005.
Why Memory Matters
Yes for traditional multicore; no for emerging multicore.
Memory Architecture in Multicore
As you saw in one of the readings, the cache is still a key performance feature, but interesting ideas exist that will turn the usual memory architecture upside down.
Hence, we look at traditional memory architecture so we can compare and contrast it with the emerging memory architectures of the Cell, GPGPUs, and so on.
Caches: The Multicore Feature of Interest
Introduced in the 1960s as a way to overcome the memory wall: memory speeds did not advance as fast as the speed of functional units.
Consequence: the processor outruns memory, leading to decreased utilization.
Utilization = (t_total - t_idle) / t_total
What happens when you go out to main memory? Idle time.
Decreased utilization = less work per unit time. This costs money: time = $$, so time doing nothing is $$ wasted.
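The utilization formula above is easy to make concrete. A minimal sketch, with hypothetical cycle counts chosen purely for illustration:

```python
def utilization(t_total, t_idle):
    """Fraction of the total time the processor spends doing useful work."""
    return (t_total - t_idle) / t_total

# Hypothetical example: a 100-cycle window in which the CPU stalls
# for 40 cycles waiting on main memory.
print(utilization(100, 40))  # 0.6 -> only 60% of cycles do useful work
```

Even modest memory stall time cuts directly into throughput, which is the economic motivation for caches given here.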
Idle Time
Caches were not the sole fix for idle time. Preemptive multitasking and time sharing were actually the dominant methods in the early days.
But every program inevitably has to go out to memory, and you do not always have enough jobs to swap in while others wait on memory.
Plus, do you really want to be preempted every time you ask for data? Of course not!
So, caches are important (for now :-)
Caches: Quick Overview
Traditional single-level memory vs. multiple memory levels.
Caches: In Action
Access a location in memory.
Copy the location and its neighbors into the faster memories closer to the CPU.
Caches: In Action
The next time you access memory, if a previous access already pulled the address into one of the cache levels, the value is provided from the cache.
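The miss-then-hit behavior described above can be sketched with a toy model. All names here (`ToyCache`, the block size) are illustrative assumptions, not any real hardware's parameters:

```python
BLOCK_SIZE = 4  # words per cache block (assumed for illustration)

class ToyCache:
    """Minimal single-level cache model: on a miss, pull in the whole
    block containing the address (the location and its neighbors)."""
    def __init__(self):
        self.blocks = set()   # block numbers currently cached
        self.hits = self.misses = 0

    def access(self, addr):
        block = addr // BLOCK_SIZE
        if block in self.blocks:
            self.hits += 1          # served from the fast cache
        else:
            self.misses += 1        # go out to slow main memory...
            self.blocks.add(block)  # ...and copy the block in

c = ToyCache()
c.access(8)   # miss: block 2 (addresses 8..11) is loaded
c.access(9)   # hit: a neighbor of address 8, already in the block
c.access(8)   # hit: revisiting the same address
print(c.misses, c.hits)  # 1 2
```

One expensive first access pays for cheap accesses to the same block afterwards, which is exactly the amortization argument the next slide makes.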
Why Do Caches Work? Locality
Spatial and temporal locality: locations near each other in space (address) are highly likely to be accessed near each other in time.
Cost? A high cost for one access, but this is amortized by faster accesses afterwards.
Burden? The machine makes sure that memory is kept consistent. If part of the cache must be reused, the cache system writes the data back to main memory before overwriting it with new data.
Hardware cache design deals with managing the mappings between the different levels and deciding when to write back down the hierarchy.
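The effect of spatial locality can be demonstrated with a small trace-driven sketch. The cache model (fully associative, LRU, with assumed block and capacity sizes) is a simplification chosen for illustration, not a real processor's organization:

```python
from collections import OrderedDict

BLOCK = 8  # words per cache block (illustrative)

def hit_rate(addresses, capacity_blocks=64):
    """Hit rate of a fully associative LRU cache over an address trace."""
    cache = OrderedDict()
    hits = 0
    for a in addresses:
        b = a // BLOCK
        if b in cache:
            hits += 1
            cache.move_to_end(b)       # refresh LRU position
        else:
            cache[b] = True
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)  # evict least-recently-used block
    return hits / len(addresses)

N = 4096
sequential = list(range(N))                   # strong spatial locality
strided = [(i * BLOCK) % N for i in range(N)]  # touches a new block every access
print(hit_rate(sequential) > hit_rate(strided))  # True
```

The sequential trace hits on 7 of every 8 accesses (one miss per block), while the strided trace cycles through more blocks than the cache holds and misses every time.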
Caches: Memory Consistency?
What happens when you modify something in memory?
Writes to memory become cheap; you only go to the slow memories when needed. This is called write-back memory.
Caches: Memory Consistency?
Eventually, written values must make it back to the main store. When?
Typically when a cache block is replaced due to a cache miss, where new data must take the place of old.
The programmer does NOT see this; hardware takes care of it all. But things can go wrong very quickly when you modify this model.
Forecast for emerging chip multiprocessors: Cell, GPGPU, Terascale, and so on.
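The write-back behavior just described can be sketched as follows. `WriteBackCache` is an invented, one-block model for illustration only:

```python
class WriteBackCache:
    """One-block 'cache' illustrating write-back: writes stay in the
    cache (dirty bit set) and reach main memory only on eviction."""
    def __init__(self, memory):
        self.memory = memory     # dict: address -> value (the main store)
        self.addr = None         # address of the cached block
        self.value = None
        self.dirty = False
        self.memory_writes = 0   # traffic to slow main memory

    def write(self, addr, value):
        if self.addr != addr:
            self.evict()         # replacing the block forces a write-back
            self.addr = addr
        self.value = value
        self.dirty = True        # cheap: no memory traffic yet

    def evict(self):
        if self.dirty:
            self.memory[self.addr] = self.value  # written value finally lands
            self.memory_writes += 1
            self.dirty = False

mem = {}
c = WriteBackCache(mem)
c.write(0, 1)
c.write(0, 2)
c.write(0, 3)   # three writes to the same address...
c.evict()       # ...but only one reaches main memory, on eviction
print(c.memory_writes, mem[0])  # 1 3
```

Three stores cost only one trip to the slow main store, which is precisely why writes "become cheap" under this model.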
Common Memory Models
Shared memory architecture vs. distributed memory architecture.
Which is Intel? Which is AMD? Others?
Memory Models: Shared Memory
Before, only one processor had access to modify memory.
How do we avoid problems when multiple cache hierarchies see the same memory?
Caching Issues
Assume two processors load locations that are neighbors, so the data is replicated in the local processor caches.
Now let one processor modify a value. The memory view is now inconsistent: one processor sees one version of memory, the other sees a different version.
How do we resolve this in hardware so that application developers still see the performance advantages of caches while ensuring a consistent (or coherent) view of memory?
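The inconsistency described above takes only a few lines to reproduce. A minimal sketch, modeling each private cache as a plain dict with no coherence mechanism at all:

```python
# Two cores each keep a private copy of location X; no coherence yet.
main_memory = {"X": 0}

cache_p0 = dict(main_memory)  # P0 loads X into its private cache
cache_p1 = dict(main_memory)  # P1 loads X into its private cache

cache_p0["X"] = 42            # P0 writes X in its own cache (write-back:
                              # the new value has not reached memory)

# The memory view is now inconsistent across the system:
print(cache_p0["X"], cache_p1["X"], main_memory["X"])  # 42 0 0
```

P0 sees 42 while P1 and main memory still see 0; this is exactly the divergence the coherence hardware in the following slides must eliminate.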
Caching: Memory Consistency?
The memory consistency problem is easy to see if each cache hierarchy is isolated from the others, sharing only main memory.
Key insight: make this inconsistency go away by making the caches aware of each other.
What is Memory Coherence?
Definition (courtesy: Parallel Computer Architecture by Culler and Singh):
1. Operations issued by any particular process occur in the order in which they were issued to the memory system by that process.
2. The value returned by each read operation is the value written by the last write to that location in the serial order.
Assumption: the above requires a hypothetical ordering of all read/write operations by all processes into a total order that is consistent with the results of the overall execution: Sequential Consistency (SC).
The memory coherence hardware assists in enforcing SC.
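Condition 1 above says the hypothetical total order must respect each process's program order. A small sketch makes this concrete by enumerating exactly those total orders for a hypothetical two-process example (the operation labels are illustrative, not from the source):

```python
from itertools import permutations

# Two processes, two memory operations each (program order within each list).
P0 = ["P0: W(X,1)", "P0: R(Y)"]
P1 = ["P1: W(Y,1)", "P1: R(X)"]

def legal_interleavings(a, b):
    """All total orders of a + b that preserve each process's own
    program order (condition 1 of the coherence definition)."""
    result = []
    for perm in set(permutations(a + b)):
        if [op for op in perm if op in a] == a and \
           [op for op in perm if op in b] == b:
            result.append(perm)
    return result

# 2 ops per process -> C(4,2) = 6 legal total orders out of 4! = 24.
print(len(legal_interleavings(P0, P1)))  # 6
```

Sequential consistency then additionally requires (condition 2) that every read returns the last write to that location in whichever of these total orders the execution corresponds to.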
Implicit Properties of Coherence
The key to solving the cache coherence problem is the hardware implementation of a cache coherence protocol.
A cache coherence protocol takes advantage of two hardware features:
1. State annotations for each cache block (often just a couple of bits per block).
2. Exclusive access to the bus by any accessing process.
Bus Properties
All processors on the bus see the same activity, so every cache controller sees bus activity in the same order.
Serialization at the bus level results from the phases that compose a bus transaction:
Bus arbitration: the bus arbiter grants exclusive access to issue commands onto the bus.
Command/address: the operation to perform (Read, Write) and the address.
Data: the data is then transferred.
Granularity
Cache coherence applies at the block level. Recall that when you access a location in memory, that location and its neighbors are pulled into the cache(s). These are blocks.
Note: to simplify the discussion, we will only consider a single level of cache. The same ideas translate to deeper cache hierarchies.
Cache Coherency via the Bus
Key idea: bus snooping. All CPUs on the bus can see activity on the bus regardless of whether they initiated it.
Invalidation vs. Update
A cache controller snoops and sees a write to a location of which it holds a now-outdated copy. What does it do?
Invalidation: mark the cache block as invalid, so the next CPU access misses and loads the updated data from main memory. Requires one bit per block to implement.
Update: see the write and update the cache with the value observed being written to main memory.
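The invalidation option can be sketched end to end. This is an invented toy model, not a real protocol: the "bus" is a Python list, write-through is used purely for simplicity, and real protocols (e.g., MSI) track richer per-block state:

```python
class SnoopingCache:
    """Invalidation-based snooping (sketch): a write is broadcast on the
    bus, and every other cache drops its now-stale copy of that block."""
    def __init__(self, bus, memory):
        self.bus, self.memory = bus, memory
        self.data = {}                 # addr -> value (valid copies only)
        bus.append(self)               # join the shared bus

    def read(self, addr):
        if addr not in self.data:      # miss (or a previously invalidated copy)
            self.data[addr] = self.memory[addr]  # reload from main memory
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value      # write-through, for simplicity
        for other in self.bus:         # everyone snoops the bus write...
            if other is not self:
                other.data.pop(addr, None)  # ...and invalidates its copy

memory = {0: 5}
bus = []
p0, p1 = SnoopingCache(bus, memory), SnoopingCache(bus, memory)
p0.read(0); p1.read(0)   # both caches now hold address 0
p0.write(0, 7)           # P0's write invalidates P1's copy
print(p1.read(0))        # 7: P1 misses and reloads the new value
```

Because invalidation only needs to mark a block stale, it costs one valid bit per block, as the slide notes; an update protocol would instead push the new value into every sharing cache.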
Write-Back vs. Write-Through Caches
Write-back: on a write miss, the CPU reads the entire block containing the write address from memory, updates the value in cache, and marks the block as modified (aka dirty).
Write-through: whenever the processor writes, even to a block already in cache, a bus write is generated.
Write-back is more efficient with respect to bandwidth usage on the bus and is, hence, ubiquitously adopted.
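The bandwidth difference is stark even in a back-of-the-envelope sketch. Assume, purely for illustration, 100 consecutive stores to one location whose block is evicted once at the end:

```python
stores = [("X", v) for v in range(100)]  # 100 stores to the same location

# Write-through: every single store generates a bus write.
write_through_bus_writes = len(stores)

# Write-back: the stores coalesce in the dirty block; one bus write
# happens when the block is finally evicted.
write_back_bus_writes = 1

print(write_through_bus_writes, write_back_bus_writes)  # 100 1
```

A 100x reduction in bus traffic for this (admittedly favorable) access pattern is why write-back dominates in practice.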
Cache Coherence and Performance
Unlike pipelining details that only concern compiler writers, you, the programmer, really need to be aware that this is going on under the covers: the coherence protocol can impact your performance.
Week 2, Lecture 2. Copyright 2009 by W. Feng. Based on material from Matthew Sottile.
So far: overview of multicore systems; why memory matters; memory architectures; emerging chip multiprocessors (CMP).