Embedded Parallel Computing


1 Embedded Parallel Computing
Lecture 5 - The anatomy of a modern multiprocessor, the multicore processors
Tomas Nordström
Course responsible and examiner: Tomas Nordström, Tomas.Nordstrom@hh.se; Room E313

2 Outline
- Modern Multicore
- Symmetrical Multiprocessing (SMP)
- Multicore
- ARM Multi-core architecture

3 MIMD: Symmetrical Multiprocessing
A multi-core architecture with Symmetrical Multiprocessing (SMP) is defined by the following characteristics:
- The architecture consists of two or more identical CPU cores.
- All cores share a common system memory and are controlled by a single operating system.
- Each CPU is capable of operating independently on different workloads and, whenever possible, is also capable of sharing workloads with the other CPUs.

4 Multicore [Wikipedia]
- A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions.
- A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multiprocessor techniques are no longer efficient, largely because of issues with congestion in supplying instructions and data to the many processors. Several tens of cores!
[Figure: An Intel Core 2 Duo E6750 dual-core processor]

5 Multicore
- Symmetric multiprocessing (SMP) designs using discrete CPUs have existed for a long time.
- Thus the issues regarding implementing a multicore processor architecture and supporting it with software are well known.
- Utilizing a proven processing-core design without architectural changes reduces design risk significantly.

6 Single Core Design is Hitting the f... Wall
Increasing the operating frequency now yields greatly diminished gains in processor performance. This is due to three primary factors:
- The memory wall
- The ILP wall
- The power wall

7 Multicore SMP
- The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip.
- Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (alternative: bus snooping) operations.

8 Example: ARM MPCore

9 Thread-level Parallelism
- For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity in handling multithreading on multiple processors.
- These requirements added inherent complexity in the interrupt handler, scheduler, and context switch.

10 MPCore Semaphores
- Earlier ARM architectures implemented semaphores with the swap instruction, which held the external bus until completion.
- One processor can thus hold the entire bus until completion, locking out all other processors. Unacceptable!
- ARMv6 introduced two new instructions, load-exclusive (LDREX) and store-exclusive (STREX), which take advantage of an exclusive monitor in memory:
  - LDREX loads a value from memory and sets the exclusive monitor to watch that location.
  - STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory. It returns a value that indicates whether the data was written.
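
The LDREX/STREX behaviour described above can be illustrated with a small software model. This is only a sketch: the class `ExclusiveMonitor`, the helper `atomic_increment`, and the address `LOCK` are hypothetical names made up for the example, and the model ignores details of real hardware such as plain stores clearing a reservation. It follows the ARM convention that STREX returns 0 on success and 1 on failure.

```python
class ExclusiveMonitor:
    """Toy model of an exclusive monitor guarding word-sized memory locations."""

    def __init__(self):
        self.memory = {}    # address -> value
        self.watched = {}   # cpu id -> address marked by that CPU's last LDREX

    def ldrex(self, cpu, addr):
        # Load the value and mark the address as watched on behalf of this CPU.
        self.watched[cpu] = addr
        return self.memory.get(addr, 0)

    def strex(self, cpu, addr, value):
        # Store only if this CPU's reservation on addr is still intact.
        if self.watched.get(cpu) != addr:
            return 1  # failure: nothing is written
        self.memory[addr] = value
        # A successful store clears every reservation on this address.
        for other in [c for c, a in self.watched.items() if a == addr]:
            del self.watched[other]
        return 0  # success


def atomic_increment(mon, cpu, addr):
    # Retry until the load/modify/store sequence completes undisturbed.
    while True:
        old = mon.ldrex(cpu, addr)
        if mon.strex(cpu, addr, old + 1) == 0:
            return old + 1


mon = ExclusiveMonitor()
LOCK = 0x1000  # hypothetical address, chosen only for the example

# CPU 0 and CPU 1 both load-exclusive the same location...
v0 = mon.ldrex(0, LOCK)
v1 = mon.ldrex(1, LOCK)
# ...CPU 1 stores first, which invalidates CPU 0's reservation.
first = mon.strex(1, LOCK, v1 + 1)
second = mon.strex(0, LOCK, v0 + 1)
# CPU 0 has to retry the whole sequence to get its update in.
retried = atomic_increment(mon, 0, LOCK)
```

Unlike the swap instruction, neither CPU blocks the other here: the loser of the race simply sees a failed store and retries.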

11 Physically Tagged Caches
- Should the cache use virtual or physical addresses?
- A virtually tagged cache must be flushed every time a context switch takes place, because the cache contains old virtual-to-physical translations.
- In ARM11 with MPCore, the memory management unit logic resides between the level 1 cache and the processor core, so the cache can be tagged with physical addresses.
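
A minimal sketch of why the tagging choice matters across a context switch. The `Cache` class, the page tables, and all addresses below are invented for illustration and do not describe the actual ARM11 cache organisation. Two processes use the same virtual address for different physical data.

```python
class Cache:
    """Toy cache whose lines are tagged with either virtual or physical addresses."""

    def __init__(self, tag_with):
        self.tag_with = tag_with   # "virtual" or "physical"
        self.lines = {}            # tag -> cached data

    def read(self, vaddr, page_table, memory):
        paddr = page_table[vaddr]  # translation done by the MMU
        tag = vaddr if self.tag_with == "virtual" else paddr
        if tag not in self.lines:  # miss: fill the line from memory
            self.lines[tag] = memory[paddr]
        return self.lines[tag]

    def flush(self):
        self.lines.clear()


memory = {0x100: "data of process A", 0x200: "data of process B"}
table_a = {0x4000: 0x100}   # process A maps virtual 0x4000 to physical 0x100
table_b = {0x4000: 0x200}   # process B maps the same virtual address elsewhere

# Virtually tagged: after switching to process B, the old line still hits.
vcache = Cache("virtual")
vcache.read(0x4000, table_a, memory)
stale = vcache.read(0x4000, table_b, memory)
vcache.flush()              # a flush on the context switch repairs this
after_flush = vcache.read(0x4000, table_b, memory)

# Physically tagged: no flush is needed, the tags already differ.
pcache = Cache("physical")
pcache.read(0x4000, table_a, memory)
correct = pcache.read(0x4000, table_b, memory)
```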

12 Atomic Instructions
- Traditionally, swap-based and compare-and-exchange-based semaphores have been used to control access to critical data.
- SMP systems often aim for lock-free synchronization.

13 cmpxchg8b
- Many lock-free routines use the Intel cmpxchg8b instruction because it can compare and exchange 8 bytes of data atomically.
- Typically, this involves 4 bytes for the payload and 4 bytes to distinguish between payload versions that could otherwise have the same value: the so-called A-B-A problem.
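
The payload-plus-version idea can be sketched as follows. `VersionedCell` is a hypothetical helper written for this example; the lock inside it only stands in for the atomicity that a single compare-and-exchange instruction provides in hardware, and wrap-around of a 4-byte version counter is ignored.

```python
import threading


class VersionedCell:
    """A (payload, version) pair that is only ever updated by compare-and-exchange."""

    def __init__(self, payload):
        self._lock = threading.Lock()  # stand-in for hardware atomicity
        self.payload = payload
        self.version = 0

    def read(self):
        with self._lock:
            return self.payload, self.version

    def compare_exchange(self, expected, new_payload):
        # Succeeds only if both payload and version still match the snapshot.
        with self._lock:
            if (self.payload, self.version) != expected:
                return False
            self.payload = new_payload
            self.version += 1
            return True


cell = VersionedCell("A")
snapshot = cell.read()   # a first thread observes ("A", 0)

# Meanwhile another thread changes the payload A -> B -> A.
cell.compare_exchange(cell.read(), "B")
cell.compare_exchange(cell.read(), "A")

# A compare on the payload alone would be fooled: the value is "A" again.
payload_only_match = cell.payload == snapshot[0]
# The versioned compare notices that the cell changed in between.
stale_update = cell.compare_exchange(snapshot, "C")
```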

14
- The ARM exclusives provide atomicity using the data address rather than the data value, so that the routines can atomically exchange data without experiencing the A-B-A problem.
- Exploiting this would, however, require rewriting much of the existing two-word exclusive code.
- Consequently, ARM added instructions for performing load-and-store exclusives using various payload sizes, including 8 bytes, thus ensuring the direct portability of existing multithreaded code.

15 Misc MP improvements
- Improved access to localized data
- Power-conscious spin-locks
- Weakly ordered memory consistency

16 Two Main Enhancements
The ARM11 multiprocessor includes two main SMP enhancements:
- Generic Interrupt Controller (GIC), providing interprocessor communication
- Snoop Control Unit (SCU), an intelligent memory-communication system providing cache coherence

17 Cache Coherency
- The ARM11 MPCore implements a Snoop Control Unit (SCU) between the processors, operating at CPU frequency.
- This configuration also provides a very rapid path for data to move directly between each CPU's cache.

19 MESI
- Modified: The cache line is present only in the current cache and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
- Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
- Shared: Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
- Invalid: Indicates that this cache line is invalid.
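
The state descriptions above can be read as a small state machine for one cache line in one cache. The sketch below is a simplification: the function `mesi_next` and the event names are made up for the example, and the bus transactions that accompany each transition (for instance invalidating other copies before a write to a Shared line) are mentioned only in comments.

```python
def mesi_next(state, event, others_have_copy=False):
    """Return the next MESI state of one cache line as seen by one cache."""
    if event == "local_read":
        if state == "I":
            # Read miss: Shared if another cache holds the line, else Exclusive.
            return "S" if others_have_copy else "E"
        return state                      # M, E and S already satisfy reads
    if event == "local_write":
        return "M"                        # other copies, if any, get invalidated
    if event == "remote_read":
        # A Modified line is written back before the other cache may read it.
        return "S" if state in ("M", "E", "S") else "I"
    if event == "remote_write":
        return "I"                        # another cache takes ownership
    if event == "write_back":
        return "E" if state == "M" else state
    raise ValueError("unknown event: " + event)


# Follow one line through a typical sequence of events.
trace = ["I"]
for event in ["local_read", "local_write", "remote_read", "remote_write"]:
    trace.append(mesi_next(trace[-1], event))
```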

20 MOESI
- The processor maintains cache coherence with an optimized version of the MESI (modified, exclusive, shared, invalid) protocol.
- In addition to the four common MESI protocol states, there is a fifth "Owned" state representing data that is both modified and shared.
- This avoids the need to write modified data back to main memory before sharing it.
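
The benefit of the Owned state shows up when another cache reads a Modified line. A minimal sketch, assuming a hypothetical helper `on_remote_read` that reports both the next state and whether main memory had to be updated:

```python
def on_remote_read(state, protocol):
    """Return (next_state, wrote_back_to_memory) when another cache reads the line."""
    if state == "M":
        if protocol == "MOESI":
            # Supply the dirty line cache-to-cache and remain responsible for it.
            return "O", False
        # Plain MESI: main memory must be brought up to date before sharing.
        return "S", True
    if state == "E":
        return "S", False
    # Owned, Shared and Invalid lines do not change state on a snooped read.
    return state, False


mesi_result = on_remote_read("M", "MESI")
moesi_result = on_remote_read("M", "MOESI")
```

In this model the MESI cache pays for a memory write before the line can be shared, while the MOESI cache defers it until the Owned line is eventually evicted.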

24 Interrupt System
Generic Interrupt Controller (GIC)
- External interrupts
- Internal interrupts
Example: one processor allocates virtual memory -> all others need to update their memory translations -> ARM uses the GIC to quickly signal that between processors.

25 Distributed Interrupt Controller
The Distributed Interrupt Controller handles:
- masking of interrupts
- prioritization of the interrupts
- distribution of the interrupts to the target MP11 CPUs
- tracking the status of interrupts
- generation of interrupts by software
[Figure: Distributed Interrupt Controller connected through an interface to the MP11 CPUs]
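
The five responsibilities listed above can be sketched in a few lines. The class `InterruptDistributor`, its methods, and the interrupt numbers are all hypothetical and are not the GIC register interface; the model also assumes that a lower priority number is more urgent.

```python
class InterruptDistributor:
    """Toy model: masking, prioritization, distribution, status, software interrupts."""

    def __init__(self):
        self.enabled = set()   # unmasked interrupt ids
        self.priority = {}     # irq -> priority (lower number wins in this model)
        self.targets = {}      # irq -> CPU ids the interrupt is routed to
        self.status = {}       # (irq, cpu) -> "pending" or "active"

    def configure(self, irq, priority, targets, enabled=True):
        self.priority[irq] = priority
        self.targets[irq] = list(targets)
        if enabled:
            self.enabled.add(irq)
        else:
            self.enabled.discard(irq)

    def raise_irq(self, irq, targets=None):
        # Masked interrupts are dropped; others become pending on each target CPU.
        # Passing explicit targets models an interrupt generated by software.
        if irq not in self.enabled:
            return
        for cpu in (targets if targets is not None else self.targets[irq]):
            self.status[(irq, cpu)] = "pending"

    def acknowledge(self, cpu):
        # Hand the most urgent pending interrupt to this CPU and mark it active.
        pending = [i for (i, c), s in self.status.items()
                   if c == cpu and s == "pending"]
        if not pending:
            return None
        irq = min(pending, key=lambda i: self.priority[i])
        self.status[(irq, cpu)] = "active"
        return irq

    def end_of_interrupt(self, irq, cpu):
        self.status.pop((irq, cpu), None)


gic = InterruptDistributor()
gic.configure(irq=1, priority=0, targets=[])             # software interrupt
gic.configure(irq=40, priority=5, targets=[0])
gic.configure(irq=41, priority=2, targets=[0, 1])
gic.configure(irq=42, priority=1, targets=[1], enabled=False)  # masked

for irq in (40, 41, 42):
    gic.raise_irq(irq)
gic.raise_irq(1, targets=[1])   # e.g. CPU 0 tells CPU 1 to update its translations

cpu0_first = gic.acknowledge(0)
cpu1_first = gic.acknowledge(1)
gic.end_of_interrupt(cpu1_first, 1)
cpu1_second = gic.acknowledge(1)
```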

27 NVIDIA Tegra 2

28 Applications Using MPCore
- Frostbite is an example of a game engine that employs job-based parallelism. This engine is used by the popular Battlefield: Bad Company series of games.
- It is an engine that is capable of using as many threads as the underlying hardware platform provides.
- The engine performs the primary Game and Render tasks on the GPU and divides up the other system-related work into jobs.
- Each job typically consists of 15K to 200K lines of C++ code, with the average job size being around 25K lines of code.
- Most of these jobs are independent, while some have interdependencies.
- Each frame of the game typically contains two hundred to three hundred jobs, and the engine assigns the jobs to all available hardware cores.
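
Job-based parallelism with interdependencies can be sketched with a thread pool. This is not how Frostbite is implemented: `run_frame`, the job names, and the wave-by-wave scheduling are assumptions made to keep the example short. Jobs whose dependencies are satisfied run in parallel, and the next wave starts once the current one has finished.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_frame(jobs, deps, workers=None):
    """Run one frame's jobs.

    jobs: name -> callable; deps: name -> set of job names that must finish first.
    Returns the order in which jobs were completed.
    """
    workers = workers or os.cpu_count() or 1   # use the cores the platform offers
    done, order = set(), []
    remaining = dict(jobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            ready = [n for n in remaining if deps.get(n, set()) <= done]
            if not ready:
                raise ValueError("cyclic job dependencies")
            # Submit every ready job, then wait for the whole wave to complete.
            futures = {n: pool.submit(remaining.pop(n)) for n in ready}
            for name, future in futures.items():
                future.result()
                done.add(name)
                order.append(name)
    return order


results = []
jobs = {
    "animation": lambda: results.append("animation"),
    "physics": lambda: results.append("physics"),
    "audio": lambda: results.append("audio"),
    "culling": lambda: results.append("culling"),
}
deps = {"culling": {"animation", "physics"}}   # culling needs both to finish first

order = run_frame(jobs, deps, workers=4)
```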

29 Task Level Parallelism on Frostbite Game Engine

30 Questions
Study-support questions

31 Links
- Multi-core
- SMP - Symmetric Multiprocessor System
- ABA problem
- MESI
- MOESI
- Embedded moves to multicore
- Goodacre, J.; Sloss, A.N., "Parallelism and the ARM instruction set architecture," IEEE Computer, vol. 38, no. 7, July 2005
- Goodacre, J., "Details of a New Cortex Processor Revealed, Cortex-A9", presentation at the ARM Developers' Conference, October
- Stevens, A., "Introduction to AMBA 4 ACE", ARM white paper, June 6
