NUMA Programming; OpenCL

Transcription

1 NUMA Programming; OpenCL Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico December 2, 2009 José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

2 Outline Cache Coherent NUMA Distributed Memory Multi-core Processors AMD Opteron IBM Cell Broadband Engine Programming ccnuma Systems OpenCL José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

3 Shared-Memory Systems also known as Uniform Memory Access (UMA) architecture Symmetric Shared-Memory Multiprocessors (SMP) P P P P Main Memory I / O José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

4 Distributed-Memory Systems or Non-Uniform Memory Access (NUMA) architecture Multicomputers P P Cache Cache Main Memory I / O Main Memory I / O Interconnection Network Main Memory I / O Main Memory I / O Cache Cache P P José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

5 Cache-Coherent NUMA Limitations of UMA / SMP: José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

6 Cache-Coherent NUMA Limitations of UMA / SMP: limited scalability (8 to 12 cores), due to contention in accessing memory! Limitations of Multicomputers: José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

7 Cache-Coherent NUMA Limitations of UMA / SMP: limited scalability (8 to 12 cores), due to contention in accessing memory! Limitations of Multicomputers: high communication overhead! José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

8 Cache-Coherent NUMA Limitations of UMA / SMP: limited scalability (8 to 12 cores), due to contention in accessing memory! Limitations of Multicomputers: high communication overhead! Intermediate solution: Cache-Coherent NUMA, or ccnuma Distributed Shared Memory, DSM, typically used in these systems. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

9 NUMA (DSM) Distributed shared memory can be implemented through a distributed virtual memory scheme: logic address space is divided into pages page table keeps state of each page (similar to a directory) not-present pages in remote memories shared page available in local memory, but more copies exist exclusive page only available in local memory access to one of these pages causes a page-fault page request sent to remote processor José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

10 Cache-Coherent NUMA P P Cache Cache Main Memory I / O Main Memory I / O Interconnection Network Main Memory I / O Main Memory I / O Cache Cache P P José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

11 Cache-Coherent NUMA P P Main Memory Cache Main Memory Cache I / O Main Memory Cache P Main Memory Cache P highly scalable memory bandwidth grows with computational power cache coherence possible due to shared global bus José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

12 AMD Opteron each AMD s Opteron chip has its own memory controller, allowing for easy system extension each node may be a single- or multi-core each node has L1 and L2 caches José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

13 IBM Cell Broadband Engine Heterogeneous multiprocessor: Power Processing Element (PPE): Master processor 8 Synergistic Processing Elements (SPE): fully functional RISC processors Local storage size per SPE: 256kB SPEs can only access own local memory José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

14 NUMA Aware Systems For optimal performance on NUMA systems: processes should be located on processors that are as close as possible to the memory that the process accesses allocate all memory for a process in the same processor OS with multi-queue scheduler with a runqueue per processor dispatch all child processes on the same processor through the life of the parent processes Linux and Windows OSs are NUMA ready. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

15 NUMA Aware Systems On Linux, numactl defines scheduling and/or memory placement policy: numactl --interleave=all bigdatabase run bigdatabase with its memory interleaved on all CPUs. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

16 NUMA Aware Systems On Linux, numactl defines scheduling and/or memory placement policy: numactl --interleave=all bigdatabase run bigdatabase with its memory interleaved on all CPUs. numactl --cpubind=0 --membind=0,1 process run process on node 0 with memory allocated on node 0 and 1. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

17 NUMA Aware Systems On Linux, numactl defines scheduling and/or memory placement policy: numactl --interleave=all bigdatabase run bigdatabase with its memory interleaved on all CPUs. numactl --cpubind=0 --membind=0,1 process run process on node 0 with memory allocated on node 0 and 1. numactl --show show the NUMA state José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

18 Programming NUMA Systems gcc provides a library with a simple programming interface of NUMA systems. #include <numa.h> gcc... -lnuma Defines policies for: thread binding memory allocation Before any other routine is used, int numa available() must be called. If it returns -1, all other functions in this library are undefined. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

19 Programming NUMA Systems Querying the system: int numa max node() returns the number of nodes in the systems long numa node size(int node, long *freep) returns the memory size of node node, and the current free memory in freep int numa distance(int node1, int node2) reports the distance in the machine topology between two nodes. The factors are a multiple of 10. It returns 0 when the distance cannot be determined. A node has distance 10 to itself. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

20 Programming NUMA Systems Querying the system: int numa max node() returns the number of nodes in the systems long numa node size(int node, long *freep) returns the memory size of node node, and the current free memory in freep int numa distance(int node1, int node2) reports the distance in the machine topology between two nodes. The factors are a multiple of 10. It returns 0 when the distance cannot be determined. A node has distance 10 to itself. Thread binding: int numa run on node(int node) binds the current thread and its children to node node (for a set of nodes, a nodemask can be specified) José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

21 Programming NUMA Systems Memory allocation: void *numa alloc onnode(size t size, int node) allocates size bytes of memory on a specific node node void *numa alloc local(size t size) allocates size bytes of memory on the local node void *numa alloc interleaved(size t size) allocates size bytes of memory page interleaved on all nodes void *numa alloc(size t size) allocates size bytes of memory with the current NUMA policy void numa free(void *start, size t size) frees size bytes of memory starting at start, allocated by the numa alloc * functions above José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

22 Programming NUMA Systems Memory allocation: void *numa alloc onnode(size t size, int node) allocates size bytes of memory on a specific node node void *numa alloc local(size t size) allocates size bytes of memory on the local node void *numa alloc interleaved(size t size) allocates size bytes of memory page interleaved on all nodes void *numa alloc(size t size) allocates size bytes of memory with the current NUMA policy void numa free(void *start, size t size) frees size bytes of memory starting at start, allocated by the numa alloc * functions above Node masks: Define a subset of nodes to which thread binding and memory allocation apply. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

23 Future Architecture Trends CPUs with increased number of SMP cores Examples: Intel Core 2 Quad; AMD Bulldozer; UltraSPARC T2; Tilera TILE64 José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

24 Future Architecture Trends CPUs with increased number of SMP cores Examples: Intel Core 2 Quad; AMD Bulldozer; UltraSPARC T2; Tilera TILE64 GPUs with increased number of SIMD cores Example: NVIDIA GTX 280 / GTX 260 José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

25 Future Architecture Trends CPUs with increased number of SMP cores Examples: Intel Core 2 Quad; AMD Bulldozer; UltraSPARC T2; Tilera TILE64 GPUs with increased number of SIMD cores Example: NVIDIA GTX 280 / GTX 260 CPU / GPU convergence Examples: Intel Larrabee; AMD / ATI Fusion Future trend is: many simple cores each core with vector (SIMD) capabilities (of growing length) José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

26 Future Architecture Trends José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

27 Parallel Programming Models OpenMP, PThreads for SMP systems libnuma for ccnuma systems CUDA for NVIDIA s GPUs; Stream SDK for ATI s GPUs Message Passing Interface, MPI This disparate set of models creates challenges of targeting algorithms to optimally exploit available computational power. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

28 Parallel Programming Models OpenMP, PThreads for SMP systems libnuma for ccnuma systems CUDA for NVIDIA s GPUs; Stream SDK for ATI s GPUs Message Passing Interface, MPI This disparate set of models creates challenges of targeting algorithms to optimally exploit available computational power. Parallel algorithms need to address combinations of these models! José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

29 OpenCL Many issues are common to all parallel programming models! OpenCL cross-platform language recently proposed for data (and task) parallel programming for both GPUs and CPUs. OpenCL was created by Apple in cooperation with others, and will be an open standard administered by the Khronos Group. José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

30 OpenCL Enable use of all computational resources in a system program GPUs, CPUs, Cell, DSP and other processors as peers support both data- and task- parallel compute models José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

31 OpenCL Enable use of all computational resources in a system program GPUs, CPUs, Cell, DSP and other processors as peers support both data- and task- parallel compute models Low-level, high-performance abstraction but with silicon-portability approachable - but primarily targeted at expert developers ecosystem foundation - no middleware or convenience functions José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

32 OpenCL Enable use of all computational resources in a system program GPUs, CPUs, Cell, DSP and other processors as peers support both data- and task- parallel compute models Low-level, high-performance abstraction but with silicon-portability approachable - but primarily targeted at expert developers ecosystem foundation - no middleware or convenience functions Efficient C-based parallel programming model familiar language for rapid adoption José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

33 OpenCL Enable use of all computational resources in a system program GPUs, CPUs, Cell, DSP and other processors as peers support both data- and task- parallel compute models Low-level, high-performance abstraction but with silicon-portability approachable - but primarily targeted at expert developers ecosystem foundation - no middleware or convenience functions Efficient C-based parallel programming model familiar language for rapid adoption Close integration with OpenGL and other 3D APIs for advanced visualization application and innovation José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

34 OpenCL Enable use of all computational resources in a system program GPUs, CPUs, Cell, DSP and other processors as peers support both data- and task- parallel compute models Low-level, high-performance abstraction but with silicon-portability approachable - but primarily targeted at expert developers ecosystem foundation - no middleware or convenience functions Efficient C-based parallel programming model familiar language for rapid adoption Close integration with OpenGL and other 3D APIs for advanced visualization application and innovation Enable embedded and handheld devices through an embedded profile in the specification José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

35 OpenCL Enable use of all computational resources in a system program GPUs, CPUs, Cell, DSP and other processors as peers support both data- and task- parallel compute models Low-level, high-performance abstraction but with silicon-portability approachable - but primarily targeted at expert developers ecosystem foundation - no middleware or convenience functions Efficient C-based parallel programming model familiar language for rapid adoption Close integration with OpenGL and other 3D APIs for advanced visualization application and innovation Enable embedded and handheld devices through an embedded profile in the specification Drive future hardware requirements e.g. floating point precision requirements José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

36 OpenCL Execution Model Kernel equivalent to a C function, executed on a Compute Device entry point, arguments, no return value Program collection of kernels and functions equivalent to a dynamically loaded library Command Queue enqueues kernel invocations and other OpenCL commands (like memory map/unmap/copy) enqueue in order execute in or out-of order (optionally) Event synchronize execution within and between queues in a context José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

37 OpenCL Execution Model - NDRange Data parallel execution: kernels executed across 1, 2 or 3 dimensional two-level index space called NDRange kernels are instanced as work-items ( threads ) that are grouped in work-groups no synchronization between work-groups, they are independent barriers for synchronizing work-items within work-group choose NDRange appropriate for your problem dimensions José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

38 OpenCL Language OpenCL language is based on C99 with limitations and extensions. Limitations: no recursion, no C99 headers, no bit fields, no function pointers, no variable length arrays, no byte addressable stores Extensions: vector types, work-items and work-groups, synchronization, address space qualifiers, image access functions, conversion and other built in functions José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

39 Simple Example Regular C void vector_add( float *a, float *b, float *c, size_t n) { size_t i; } for(i = 0; i < n; i++) c[i] = a[i] + b[i]; OpenCL kernel void vector_add( global float *a, global float *b, global float *c) { size_t i; } i = get_global_id(0); c[i] = a[i] + b[i]; José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

40 Review Cache Coherent NUMA Distributed Memory Multi-core Processors AMD Opteron IBM Cell Broadband Engine Programming ccnuma Systems OpenCL José Monteiro (DEI / IST) Parallel and Distributed Computing / 25

41 Next Class distributed systems Google s map-reduce databases José Monteiro (DEI / IST) Parallel and Distributed Computing / 25