Introduction to Parallel Programming (w/ JAVA)



Introduction to Parallel Programming (w/ JAVA)
Dec. 21st, 2015
Christian Terboven, IT Center, RWTH Aachen University, terboven@itc.rwth-aachen.de

Moore's Law
"Cramming More Components onto Integrated Circuits", Gordon Moore, Electronics, 1965 (ftp://download.intel.com/museum/moores_law/articles-Press_Releases/Gordon_Moore_1965_Article.pdf)
The number of transistors per cost-effective integrated circuit doubles every N months (12 <= N <= 24).

There is no free lunch anymore
The number of transistors on a chip is still increasing, but the clock speed no longer is! Instead, we see many cores per chip. Parallelization has become a necessity to exploit the performance potential of current microprocessors!

Multi-Core Processor Design
Rule of thumb: reducing voltage and frequency by 1% each reduces the power consumption by about 3% and the performance by about 0.66%.
(Figure: single core vs. dual core at reduced voltage and frequency)
  Single core: Voltage = 1,    Freq = 1,    Area = 1, Power = 1,  Perf = 1
  Dual core:   Voltage = -15%, Freq = -15%, Area = 2, Power = ~1, Perf = ~1.8
(Based on slides from Shekhar Borkar, Intel Corp.)
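A quick check of the dual-core numbers above, assuming dynamic power scales roughly as P ~ C * V^2 * f (the usual model behind this rule of thumb): reducing voltage and frequency by 15% each gives 0.85^2 * 0.85 ~ 0.61 of the original power per core; two such cores (twice the area/capacitance) then consume about 2 * 0.61 ~ 1.2 times the original power (roughly the "~1" in the figure) while delivering about 2 * 0.85 = 1.7 times the performance (roughly the "~1.8").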

Multi-Core Multi-Socket Compute Node Design
A set of processors is organized inside a locality domain with a locally connected memory. The memory of all locality domains is accessible over a shared virtual address space. Other locality domains are accessed over an interconnect; the local domain can be accessed very efficiently without resorting to a network of any kind.
(Figure: several cores, each with an on-chip cache, connected via an interconnect to the memories of the locality domains.)

Multi-Core Multi-Socket Compute Node -> Cluster Design
A system where memory is distributed among nodes: no node other than the local one has direct access to the local memory.
(Figure: nodes CPU 1 ... CPU N, each with socket, cache, local memory MEM and a network interface NET IF, connected by a network.)

What is High Performance Computing (HPC)?
From Wikipedia: "A supercomputer is a computer at the frontline of current processing capacity, particularly speed of calculation."
Historically there were two principles of science: Theory and Experiment. Computational Science (simulation, optimization, virtual reality) extends them as a third.
Theory: models, differential equations, systems of linear equations. Experiment: observation and prototypes, empirical studies/sciences.

Agenda (I hope you are motivated by now)
- Basic Concepts of Threading
- Matrix Multiplication: from Serial to Multi-Core
- Amdahl's Law and Efficiency
- Matrix Multiplication Reviewed
- Basic GPGPU Concepts
- Matrix Multiplication: from Serial to Many-Core
- Summary
- Christmas Exercise

Basic Concepts of Threading

Processes vs. Threads
- Applications have to create a team of threads to run on multiple cores simultaneously.
- Threads share the global (shared) data of the program, typically the data on the heap.
- Every thread has its own stack, which may contain private data only visible to that thread.
- Operating systems and/or programming languages offer facilities to create and manage threads in the application.
(Figure: a process with code segment, data segment, and one stack per thread, including the one running main().)
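A minimal illustration of shared heap data vs. thread-private stack data (this example is not from the slides; all names are illustrative):

public class SharedVsPrivate {
    public static void main(String[] args) throws InterruptedException {
        final double[] shared = new double[2];          // heap: visible to all threads

        Runnable work0 = () -> { double local = 1.0; shared[0] = local; }; // 'local' lives on this thread's stack
        Runnable work1 = () -> { double local = 2.0; shared[1] = local; };

        Thread t0 = new Thread(work0);
        Thread t1 = new Thread(work1);
        t0.start(); t1.start();    // run on (potentially) different cores
        t0.join();  t1.join();     // wait for both threads to finish

        System.out.println(shared[0] + " " + shared[1]); // prints: 1.0 2.0
    }
}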

Shared Memory Parallelization
Memory can be accessed by several threads running on different cores in a multi-socket multi-core system (figure: CPU 1 executes c=3+a while CPU 2 executes a=4 on the shared variable a).
- Decompose data into distinct chunks to be processed independently (data parallelism).
- Look for tasks that can be executed simultaneously (task parallelism).

Parallel Programming in Theory and Practice
Parallelism has to be exploited by the programmer. (Figure: theory vs. practice.)

Example of parallel work
Example: 4 cars are produced in parallel. (Illustration: Prof. Dr. G. Wellein, Dr. G. Hager, Uni Erlangen-Nürnberg)

Limits of scalability
Parts of the manufacturing process cannot be parallelized. Example: delivery of components (all workers have to wait).

Limits of scalability (cont.)
Individual steps may take more or less time; load imbalances lead to unused resources.

How are you doing?

Matrix Multiplication: from Serial to Multi-Core

Illustration of Matrix Multiplication
Simple thing: C = A * B; the naive implementation in O(n^3) results in the following computations (source: Wikipedia).
Independent computations exploitable for parallelization: the rows of the matrix C.

Illustration of Matrix Multiplication (cont.)
Class Matrix.java:

final public class Matrix {
    private final int M;              // number of rows and columns
    private final double[][] data;    // M-by-M array

    // create M-by-M matrix of 0's
    public Matrix(int dim) {
        this.M = dim;
        data = new double[M][M];
    }
    [...]
} // end of class Matrix

Illustration of Matrix Multiplication (cont.)
Class Matrix.java, Matrix Multiplication implementation:

// return C = A * B
public Matrix times(Matrix B) {
    Matrix A = this;
    if (A.M != B.M) throw new RuntimeException("Illegal matrix dimensions.");
    Matrix C = new Matrix(A.M);
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            for (int k = 0; k < M; k++)
                C.data[i][j] += (A.data[i][k] * B.data[k][j]);
    return C;
}

Illustration of Matrix Multiplication (cont.)
Class Matrix.java, Matrix Multiplication implementation:

// return C = A * B
public Matrix times(Matrix B) {
    Matrix A = this;
    if (A.M != B.M) throw new RuntimeException("Illegal matrix dimensions.");
    Matrix C = new Matrix(A.M);
    for (int i = 0; i < M; i++)           // independent for every i: parallelize this loop over the threads
        for (int j = 0; j < M; j++)
            for (int k = 0; k < M; k++)
                C.data[i][j] += (A.data[i][k] * B.data[k][j]);
    return C;
}

Thread-level Parallelization
1. Determine the number of threads to be used: by querying the number of cores, or by user input.
2. Compute iteration chunks for every individual thread: rows per chunk = M / number-of-threads.
3. Create a team of worker threads and start them: the Java class Thread encapsulates all thread management tasks; provide a suitable function (matrix multiplication on a given chunk) and start each thread via its start() method.
4. Wait for all threads to complete their chunk: call the join() method for each thread.
(Figure: an initial thread executes the serial part, worker threads execute the parallel region.)

Thread-level Parallelization (cont.)
Class Matrix.java, threaded Matrix Multiplication implementation:

[...]
Thread threads[] = new Thread[num_threads];

// start num_threads threads with their individual tasks
for (int i = 0; i < num_threads; i++) {
    // compute chunk
    int rowsPerChunk = M / num_threads;
    int srow = i * rowsPerChunk;
    int erow = [...];

    // initialize task, create thread, start thread
    MultiplicationAsExecutor task = new MultiplicationAsExecutor(srow, erow, C.data, A.data, B.data, M);
    threads[i] = new Thread(task);
    threads[i].start();
}

// wait for all threads to finish
for (int i = 0; i < num_threads; i++) {
    threads[i].join();   // may throw InterruptedException, handled in the omitted code
}
[...]

Thread-level Parallelization (cont.)
Class MultiplicationAsExecutor.java:

public class MultiplicationAsExecutor implements Runnable {
    [...]

    // initialization by storing the local chunk
    public MultiplicationAsExecutor(int srow, int erow, double[][] dc,
                                    double[][] da, double[][] db, int dim) {
        this.startRow = srow;
        this.endRow = erow;
        this.c = dc;
        this.a = da;
        this.b = db;
        this.dim = dim;
    }

    // perform the actual computation
    public void run() {
        for (int i = startRow; i < endRow; i++)
            for (int j = 0; j < dim; j++)
                for (int k = 0; k < dim; k++)
                    c[i][j] += (a[i][k] * b[k][j]);
    }

    // execute immediately
    public void execute(Runnable r) {
        r.run();
    }
}

Performance Evaluation (runtime in seconds)
  2D, serial:       20.079
  2D, threaded: 1:  19.951
  2D, threaded: 2:  10.138
  2D, threaded: 3:   7.034
  2D, threaded: 4:   5.44
  2D, threaded: 8:   4.19

Amdahl's Law and Efficiency

Parallelization Overhead
Overhead introduced by the parallelization:
- time to start / end / manage threads,
- time to send / exchange data,
- time spent in synchronization of threads / processes.
With parallelization: the total CPU time increases, the wall time decreases, the system time stays the same.
Efficient parallelization is about minimizing the overhead introduced by the parallelization itself!

Speedup and Efficiency
Time using 1 CPU: T(1). Time using p CPUs: T(p).
Speedup S: S(p) = T(1) / T(p) -- measures how much faster the parallel computation is!
Efficiency E: E(p) = S(p) / p.
Ideal case: T(p) = T(1)/p, so S(p) = p and E(p) = 1.0.
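As a quick worked example with the measured runtimes above, taking the serial 2D runtime as T(1) = 20.079 s and the 8-thread runtime as T(8) = 4.19 s: S(8) = 20.079 / 4.19 ~ 4.8 and E(8) = 4.8 / 8 ~ 0.6.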

Amdahl's Law Illustrated
If 80% (measured in program runtime) of your work can be parallelized and just 20% still runs sequentially, then your speedup will be:
- 1 processor:   time: 100%  speedup: 1
- 2 processors:  time: 60%   speedup: 1.7
- 4 processors:  time: 40%   speedup: 2.5
- infinitely many processors: time: 20%  speedup: 5
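These numbers follow from Amdahl's law: with a parallelizable fraction f = 0.8, the speedup on p processors is
  S(p) = 1 / ((1 - f) + f/p)
so S(2) = 1 / (0.2 + 0.4) ~ 1.7, S(4) = 1 / (0.2 + 0.2) = 2.5, and S(p) -> 1 / 0.2 = 5 as p goes to infinity.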

Matrix Multiplication Reviewed

Performance Considerations
The issue: 2D arrays in Java result in bad performance. Better: a 1D array with an index function. Caches only work well for consecutive memory accesses!
Memory hierarchy: the CPU core is fast; caches (on-chip and off-chip) are fast but expensive, thus small [MBs]; memory is slow but cheap, thus large [GBs].
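A minimal sketch of the 1D layout with an index function (the class below is illustrative and not the code used for the measurements): element (i, j) of an M-by-M matrix is stored at i*M + j, so the innermost k-loop walks consecutively through one row of A.

final public class Matrix1D {
    private final int M;
    private final double[] data;     // M*M elements in row-major order

    public Matrix1D(int dim) {
        this.M = dim;
        data = new double[M * M];
    }

    // return C = A * B, addressing the flat arrays via the index function i*M + j
    public Matrix1D times(Matrix1D B) {
        Matrix1D A = this;
        Matrix1D C = new Matrix1D(A.M);
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
                for (int k = 0; k < M; k++)
                    C.data[i * M + j] += A.data[i * M + k] * B.data[k * M + j];
        return C;
    }
}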

Performance Evaluation (runtime in seconds)
  2D, serial:       20.079
  1D, serial:       14.79
  1D, threaded: 1:  14.822
  1D, threaded: 2:   7.606
  1D, threaded: 3:   5.274
  1D, threaded: 4:   4.331
  1D, threaded: 8:   3.579

Basic GPGPU Concepts

Comparison CPU - GPU
Similar number of transistors, but different design (NVIDIA Corporation 2010):
- CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution.
- GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.

GPGPU architecture: NVIDIA's Fermi (NVIDIA Corporation 2010)
- 3 billion transistors
- 448 cores / streaming processors (SP), each with e.g. a floating-point and an integer unit
- 14 streaming multiprocessors (SM, MP), 32 cores per MP
- memory hierarchy
Processing flow: copy data from host to device, execute the kernel, copy data from device to host.

Comparison CPU - GPU
Weak memory model: host and device memory are separate entities, there is no coherence between host and device, so data transfers are needed (CPU and GPU each have their own memory, connected via the PCI bus).
Host-directed execution model: (1) copy input data from CPU memory to device memory, (2) execute the device program, (3) copy results from device memory to CPU memory.

Programming model (NVIDIA Corporation 2010)
Definitions: host = CPU, executes functions; device = usually the GPU, executes kernels.
The parallel portion of the application is executed on the device as a kernel. A kernel is executed as an array of threads: all threads execute the same code and are identified by IDs, which select input/output data and control decisions:

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Programming model (cont.)
Threads are grouped into blocks, blocks are grouped into a grid: a kernel is executed as a grid of blocks of threads. Blocks and grids can have up to 3 dimensions.
(Figure: the host launches Kernel 1 and Kernel 2 over time; each kernel runs on the device as a grid of blocks, e.g. Block (0,0) ... Block (1,3), and each block contains threads indexed (0,0,0), (1,0,0), (2,0,0), ...)

Putting it all together
  Execution model               | Logical hierarchy       | Memory model
  core                          | thread / vector         | registers
  streaming multiprocessor (SM) | block of threads / gang | shared memory / L1 (hardware/software cache), sync possible within a block
  device: GPU                   | grid (kernel)           | L2 and device global memory
  host: CPU (via PCIe)          |                         | host memory
The vector / worker / gang mapping is compiler dependent.

Matrix Multiplication: from Multi- to Many-Core

GPGPU Parallelization
1. Create a CUDA kernel well-suited for the GPGPU device: simple-enough and data-parallel code.
2. Set up JCuda and compile your program accordingly: the kernel code has to be compiled to a .ptx file with NVIDIA's compiler.
3. Initialize the JCuda environment: load the driver library, initialize the device.
4. Transfer data from host to device: all data needed on the device.
5. Execute the CUDA kernel: launch the kernel on the device.
6. Transfer results from device to host: all data needed after the kernel execution on the host.
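For step 2, the kernel source is typically compiled with NVIDIA's nvcc compiler, e.g. nvcc -ptx matmult.cu -o JCudaMatmulKernel.ptx (the .cu file name here is only illustrative; the resulting JCudaMatmulKernel.ptx file is the one loaded by the code below).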

GPGPU Parallelization (cont.)
CUDA kernel: C is often close enough to Java.

extern "C" __global__ void matmult(int dim, double *c, double *a, double *b)
{
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (row >= dim || col >= dim) return;   // threads outside the matrix do nothing

    double prod = 0;
    int kk;
    for (kk = 0; kk < dim; ++kk) {
        prod += a[row * dim + kk] * b[kk * dim + col];
    }
    c[row * dim + col] = prod;
}

GPGPU Parallelization (cont.)
Class Matrix.java, CUDA Matrix Multiplication implementation:

[...]
// allocate memory on the device
int size = A.M * A.M * Sizeof.DOUBLE;
CUdeviceptr a_dev = new CUdeviceptr();
CUdeviceptr b_dev = new CUdeviceptr();
CUdeviceptr c_dev = new CUdeviceptr();
cudaMalloc(a_dev, size);
cudaMalloc(b_dev, size);
cudaMalloc(c_dev, size);

// load code
CUmodule module = new CUmodule();
cuModuleLoad(module, "JCudaMatmulKernel.ptx");
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "matmult");

GPGPU Parallelization (cont.)
Class Matrix.java, CUDA Matrix Multiplication implementation:

// copy data from host to device
cuMemcpyHtoD(a_dev, Pointer.to(A.data), size);
cuMemcpyHtoD(b_dev, Pointer.to(B.data), size);

// launch kernel
Pointer parameters = Pointer.to(
    Pointer.to(new int[] { A.M }),
    Pointer.to(c_dev),
    Pointer.to(a_dev),
    Pointer.to(b_dev)
);
final int threadsPerDim = 32;
int grids = (int) Math.ceil(((double) A.M) / threadsPerDim);
cuLaunchKernel(function, grids, grids, 1,             // grid dimensions
               threadsPerDim, threadsPerDim, 1,       // block dimensions
               0, null, parameters, null);            // shared memory, stream, arguments
cuCtxSynchronize();
[...] // cleanup code omitted for brevity

Performance Evaluation (runtime in seconds)
  1D, serial:          14.79
  1D, threaded: 1:     14.822
  1D, threaded: 8:      3.579
  1D, cuda:             1.373
  1D, cuda (no copy):   0.185

Performance: Critical Discussion
FLOPS: performance rate (number of floating-point operations per second).
Matrix Multiplication algorithm: n^2 * 2n = 2n^3 floating-point operations, here n = 1536. Result: 7,247,757,312 (about 7248 mega) double-precision floating-point operations performed.
Host: Intel Xeon E5620, 4 cores with Hyper-Threading, 2.4 GHz clock frequency, SSE4.2: 4 floating-point operations per cycle peak. 4 * 2.4 * 4 = 38.4 GFLOPS peak performance.
Our multi-threaded code ran at 2025 MFLOPS -> 5.3 % efficiency.
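As a quick cross-check of these figures: 2 * 1536^3 = 7,247,757,312 operations; divided by the 8-thread 1D runtime of 3.579 s this gives about 2.03 GFLOPS, and 2.03 / 38.4 is roughly 5.3 %.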

Performance: Critical Discussion (cont.)
GPU: NVIDIA Tesla C2050, 448 CUDA cores, 1.15 GHz clock frequency. 448 * 1.15 = 515 GFLOPS peak performance.
Our CUDA kernel runs at 5278 MFLOPS -> 10.1 % efficiency.
Note on the GPU: the data transfer is the most costly part. Kernel execution time incl. data transfer: 1.373 sec.; excl. data transfer: 0.185 sec. The GPU would profit from a larger problem size, or repeated executions of the kernel.

Performance: Critical Discussion (cont.)
Can we do better with the same algorithm? Yes! Matrix Multiplication can profit from blocking, that is, the reuse of data in the caches, for both the CPU and the GPU.
And: Matrix Multiplication is a standard problem, and there are libraries for it: BLAS (dgemm).
GPGPU performance with cuBLAS: kernel execution time incl. data transfer: 1.306 sec.; excl. data transfer: 0.0234 sec. => The CUDA kernel itself runs at 310 GFLOPS.
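A sketch of what the cuBLAS variant could look like from Java, assuming the JCublas binding of the legacy cublasDgemm interface (this call is illustrative and not the code used for the measurements). cuBLAS assumes column-major storage, so for the row-major arrays the operands are swapped, which computes C = A * B via (A * B)^T = B^T * A^T:

// import jcuda.jcublas.JCublas;  (assumed binding)
// a_dev, b_dev, c_dev are the device pointers holding M*M doubles allocated above
JCublas.cublasInit();
JCublas.cublasDgemm('n', 'n', A.M, A.M, A.M,
                    1.0, b_dev, A.M,    // B first ...
                         a_dev, A.M,    // ... then A (operands swapped for row-major data)
                    0.0, c_dev, A.M);
JCublas.cublasShutdown();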

Performance Evaluation (runtime in seconds)
  1D, cuda:               1.373
  1D, cuda (no copy):     0.185
  1D, cublas:             1.305
  1D, cublas (no copy):   0.0234

Summary

Summary: What did we learn?
- Parallelization has become a necessity to exploit the performance potential of modern multi- and many-core architectures!
- Efficient programming is about optimizing memory access; efficient parallelization is about minimizing overhead.
- Shared memory parallelization: work is distributed over threads on separate cores; threads share global data.
- Heterogeneous architectures (here: GPGPUs): separate memories require explicit data management; well-suited problems can benefit from special architectures with a large amount of parallelism.
- Not covered today: cache blocking to achieve even better performance.
- Not covered today: distributed memory parallelization.

Lectures
SS 2016:
- Lecture: Performance & correctness analysis of parallel programs
- Software Lab: Parallel Programming Models for Applications in the Area of High-Performance Computation (HPC)
- Seminar: Current Topics in High-Performance Computing (HPC)
WS 2016/17:
- Lecture: Introduction to High-Performance Computing
- Seminar: Current Topics in High-Performance Computing (HPC)
www.hpc.rwth-aachen.de, contact@hpc.rwth-aachen.de

Christmas Exercise

Game of Life
The Game of Life is a zero-player game: the evolution is determined by the initial state and the game rules.
Rules: a 2D orthogonal grid of cells; every cell has only two possible states, alive (black) or dead (white); every cell interacts with its neighbours, and at each time step:
- any live cell with fewer than two live neighbours dies,
- any live cell with two or three live neighbours lives on,
- any live cell with more than three live neighbours dies,
- any dead cell with exactly three live neighbours becomes a live cell.
(A minimal sketch of the per-cell update rule follows below.)
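As an illustration only (not part of the exercise material), a straightforward serial formulation of one update step, assuming a boolean grid where cells outside the borders count as dead:

// Compute the next generation of the grid; names and data layout are illustrative.
static boolean[][] step(boolean[][] grid) {
    int n = grid.length, m = grid[0].length;
    boolean[][] next = new boolean[n][m];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            int live = 0;
            // count live neighbours in the 3x3 neighbourhood, excluding the cell itself
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++) {
                    if (di == 0 && dj == 0) continue;
                    int ni = i + di, nj = j + dj;
                    if (ni >= 0 && ni < n && nj >= 0 && nj < m && grid[ni][nj]) live++;
                }
            // apply the rules: survive with 2 or 3 live neighbours, birth with exactly 3
            next[i][j] = grid[i][j] ? (live == 2 || live == 3) : (live == 3);
        }
    }
    return next;
}

As in the matrix multiplication example, the iterations of the outer i loop are independent and can be distributed over threads (or CUDA threads).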

There is something to win!
A really nice book on HPC from really nice people. One book for each member of the group with:
- the highest performance in a multithreaded solution,
- the highest performance in a CUDA-parallel solution,
- a random drawing winner.
Requirement: fill out and hand in the questionnaire. Measurements will be done by us on linuxc8 (RWTH Compute Cluster).

Survey on productivity when developing in the HPC area
General survey (valid once, until 20.1.2016). It helps our research activities, please take part! Fill it out at the end of the Christmas exercise and relate your answers to the Christmas exercise!
- Factors influencing the programming effort?
- Programming effort required?
- Number of lines of code written?
- Performance achieved?
To make filling out the survey easy and meaningful, please already record the above data while developing.
https://app.lamapoll.de/ProductivityHPCStudents/

Questions?