Programming Language Seminar Concurrency I-1: Java and C# Memory Models

Size: px
Start display at page:

Download "Programming Language Seminar Concurrency I-1: Java and C# Memory Models"

Transcription

1 Programming Language Seminar Concurrency I-1: Java and C# Memory Models Peter Sestoft Friday

2 Outline for today, 1 Why parallel programming? Concurrency in Java and C# Problem: shared mutable state (data, fields) Solutions: Locks, synchronized! AtomicInteger, AtomicLong, AtomicReference Concurrency without locks Weird behavior legal in Java and C# for speed Safe publication The double meaning of synchronized! The meaning of volatile! Immutability and visibility The double meaning of final 2

3 Why parallel programming? Until 2003, CPUs became faster every year So sequential software became faster every year Today, CPUs are still 2-4 GHz as in 2003 So sequential software has not become faster Instead, we get Multicore: 2, 4, 8,... CPUs on a chip Vector instructions (4 x MAC, SIMD, SSE) in CPUs Superfast Graphics Processing Units (GPU) 96 simple CUDA codes in this ancient 2009 laptop 3027 simple but fast CUDA cores in Nvidia Tesla K10 Herb Sutter: The free lunch is over (2005) More speed requires parallel programming But parallel programming is difficult and errorprone... with existing means: threads, synchronization,... 3

4 A simple counter, incremented in parallel class BareCounter implements Counter { private int counter = 0; public void inc() { counter++; Simple counter Thread[] ts = new Thread[threads]; for (int j=0; j<threads; j++) ts[j] = new Thread() { public void run() { for (int i=0; i<iterations; i++) counter.inc(); ; for (int j=0; j<threads; j++) ts[j].start(); for (int j=0; j<threads; j++) ts[j].join(); Many threads increment counter in parallel This goes wrong, of course Why? 4

5 Locks: Ensure mutual exclusion class SyncCounter implements Counter { private int counter = 0; public synchronized void inc() { counter++; Synchronized counter class SyncCounter implements Counter { private int counter = 0; public void inc() { synchronized(this) { counter++; File ConcurrentCounters.java Really, abbreviation for this code This works Why? 5

6 Locking/synchronization A lock does not guarantee anything in itself Disciplined use of locks can lead to Exclusive access to shared mutable state And hence consistent update of the state Easy to misuse Forget synchronized one place => anarchy Low performance under high contention Context switches Not compositional Using multiple locks can lead to deadlock Easy to avoid by always locking in the same order But hard to know that libraries, GUI,... do 6

7 Atomic update (Java 5) class AtomicCounter implements Counter { private final AtomicInteger counter = new AtomicInteger(); public void inc() { counter.getandincrement(); Atomic counter This uses an atomic x86 instruction Mono JITted code, from CIL, from C# See file Interlocked.cs 7

8 Java Atomic variables java.util.concurrent.atomic package AtomicInteger, AtomicReference<T>,... C#/.NET System.Threading.Interlocked namespace Add(ref int, int), Exchange<T>(ref T, T),... More efficient than locking/synchronized When applicable Translates directly to x86 instructions We shall look more into these next week In lock-free algorithms 8

9 Strange but legal behavior Java Language Specification, sect 17.4: Run these code fragments in two threads Assume A and B shared fields, initially 0 r2=a; B=1; Thread 1 r1=b; Thread 2 A=2; What are the possible results? Strangely, r1==1 and r2==2 is possible The Java (or C#/.NET) memory model Does not guarantee sequential consistency Not between threads, only within each individual thread Compiler may reorder and share memory accesses 9

10 Why permit such strange behaviors? More comprehensible example from JLS 17.4 Assume p, q shared, p==q and p.x==0 r1 = p;! r2 = r1.x;! r3 = q;! r4 = r3.x;! r5 = r1.x;! Thread 1 r6 = p;! Thread 2 r6.x = 3;! Classic compiler optimization: r1 = p;! r2 = r1.x;! r3 = q;! r4 = r3.x;! r5 = r2;! r6 = p;! r6.x = 3;! (p.x seems to switch from r2=0 to r4=3 and back to r5=0) 10

11 Sequential consistency The volatile field modifier avoids these compiler optimizations offers a number of guarantees (in Java and C#) but loses some performance IntArray.IsSorted example, sequential Files VolatileArray.java, VolatileArray.cs Java, sec non-volatile, sec volatile C# MS sec non-volatile, sec volatile C# Mono sec in both cases In this particular case, Mono does no optimization See machine code, in source file 11

12 Java Java and C# Java Language Specification (JLS), Java 7, 2013: section Volatile Fields (brief) and section 17.4 Memory Model (rather complicated) JVM Specification just refers to JLS C#/.NET C# Language Specification Volatile Fields CLI Ecma-335 standard section I : "volatile read has acquire semantics... the read is guaranteed to occur prior to any references to memory than occur after the read instruction in the CIL instruction sequence" "volatile write has release semantics... the write... occur after any memory references... prior to the write..." 12

13 Thread-unsafe integer holder public class MutableInteger { private int value; public int get() { return value; public void set(int value) { this.value = value; One thread may never see the updates performed by another one 13

14 Thread-safe integer holder public class MutableInteger { private int value; public synchronized int get() { return value; public synchronized void set(int value) { this.value = value; Locking (synchronized) has two effects: Mutual exclusion Visibility of memory updates: all fields visible to thread A before releasing a lock are visible to thread B after acquiring the lock ("synchronizes") 14

15 Visibility by synchronization "release" "acquire" Goetz p

16 Another thread-safe integer holder? public class MutableInteger { private volatile int value; public int get() { return value; public void set(int value) { this.value = value; Not in the book, but should work The volatile modifier has one effect: Visibility of memory updates: all fields visible to thread A before writing the field are visible to thread B after reading the field (it "synchronizes") Stronger guarantee than in C/C++ Affects visibility of all fields, not just the volatile 16

17 C#/.NET CLI Ecma-335 standard section I : "A volatile write has release semantics... the write is guaranteed to happen after any memory references prior to the write instruction in the CIL instruction sequence" "volatile read has acquire semantics... the read is guaranteed to occur prior to any references to memory that occur after the read instruction in the CIL instruction sequence" So same as Java: volatile write+read has the visibility effect of lock release+acquire (but not the mutual exclusion effect, of course) 17

18 Goetz factorization servlet example: Stateless servlet public class StatelessFactorizer... implements Servlet { public void service(servletrequest req, ServletResponse resp) { BigInteger i = extractfromrequest(req); BigInteger[] factors = factor(i); encodeintoresponse(resp, factors); BigInteger extractfromrequest(servletrequest req) {... BigInteger[] factor(biginteger i) {... void encodeintoresponse(servletresponse resp,...) {... No concurrent access to any shared state All state is thread-confined (local variables) 18

19 Goetz factorization servlet example: Count accesses in shared int public class UnsafeCountingFactorizer... { private long count = 0; public void service(servletrequest req, ServletResponse resp) { BigInteger i = extractfromrequest(req); BigInteger[] factors = factor(i); ++count; Unsafe encodeintoresponse(resp, factors); Concurrent access to shared mutable state Unsafe because ++i operation is not atomic Risk of lost updates Shared state 19

20 Goetz factorization servlet example: Count accesses with atomic int public class CountingFactorizer... { private final AtomicLong count = new AtomicLong(0); Shared state public void service(servletrequest req, ServletResponse resp) { BigInteger i = extractfromrequest(req); BigInteger[] factors = factor(i); count.incrementandget(); Safe encodeintoresponse(resp, factors); Concurrent access to shared mutable state Safe because operation is atomic No lost updates Could we use synchronized instead? 20

21 Goetz factorization servlet example: Cache last factorization public class UnsafeCachingFactorizer... { private final AtomicReference<BigInteger> lastnumber =...; private final AtomicReference<BigInteger[]> lastfactors =...; public void service(servletrequest req, ServletResponse resp) { BigInteger i = extractfromrequest(req); if (i.equals(lastnumber.get())) encodeintoresponse(resp, lastfactors.get()); else { BigInteger[] factors = factor(i); lastnumber.set(i); lastfactors.set(factors); encodeintoresponse(resp, factors); Invariant: lastnumber = product of lastfactors Can we use synchronized here? Unsafe, may violate invariant 21

22 Goetz factorization servlet example: Cache last factorization, I public class CachedFactorizer... { private BigInteger lastnumber; private BigInteger[] lastfactors; public void service(servletrequest req, ServletResponse resp) { BigInteger i = extractfromrequest(req); BigInteger[] factors = null; synchronized (this) { if (i.equals(lastnumber)) factors = lastfactors.clone(); if (factors == null) { factors = factor(i); synchronized (this) { lastnumber = i; lastfactors = factors.clone(); encodeintoresponse(resp, factors); Why needed? Preserves invariant 22

23 Immutable factor cache public class OneValueCache { private final BigInteger lastnumber; private final BigInteger[] lastfactors; public OneValueCache(BigInteger i, BigInteger[] factors) { lastnumber = i; lastfactors = Arrays.copyOf(factors, factors.length); public BigInteger[] getfactors(biginteger i) { if (lastnumber == null!lastnumber.equals(i)) return null; else return Arrays.copyOf(lastFactors, lastfactors.length); Final fields, and instance-private copies of arrays, and BigInteger instances are immutable 23

24 Goetz factorization servlet example: Cache last factorization, II public class VolatileCachedFactorizer... { private volatile OneValueCache cache = new OneValueCache(null, null); public void service(servletrequest req, ServletResponse resp) { BigInteger i = extractfromrequest(req); BigInteger[] factors = cache.getfactors(i); if (factors == null) { factors = factor(i); cache = new OneValueCache(i, factors); encodeintoresponse(resp, factors); Volatile field cache ensures visibility NB! Immutable cache object avoids shared mutable state and ensures visibility 24

25 Semantics of final fields Final has two effects field cannot be updated after initialization, and field's value is visible after construction Java Language Specification 17.5: A thread that can only see a reference to an object after [that object's constructor has finished] is guaranteed to see the correctly initialized values for that object's final fields This is similar to volatile fields But the JIT compiler can perform lots of optimizations (caching,...) on final fields that are not possible for volatile fields 25

26 JLS example class FinalFieldExample {! final int x;! int y;! static FinalFieldExample f;! public FinalFieldExample() {! x = 3;! y = 4;!! static void writer() {! f = new FinalFieldExample();!! Thread 1. Writes to f after constructor finished static void reader() {! if (f!= null) {! int i = f.x; // guaranteed to see 3! int j = f.y; // could see 0!!! Thread 2 26

27 What about C#/.NET readonly fields?! No mention found in C# Language Specification (readonly) or Ecma-335 CLI Specification (initonly) In fact, no such guarantee intended, see mails from Microsoft (Carol Eidt and Eric Eilebrecht)

28 Visibility of memory updates Caused by synchronized/lock Caused by volatile Caused by final (visible after construction) Caused by CAS and similar (next week) Caused by synchronized collections, in Java package java.util.concurrent.net namespace System.Collections.Concurrent and older synchronized collections 28

29 Week 1 (this week) Reading Read Goetz et al.: Java Concurrency in Practice, chapters 1, 2, 3, 4, 5 Look at Java Language Specification, section Week 2 Goetz et al.: Java Concurrency in Practice, chapter 15 Michael and Scott: Simple, fast, and practical... Herlihy & Shavit: The Art of Multiprocessor Programming, chapters 3 and 9 29

Built-in Concurrency Primitives in Java Programming Language. by Yourii Martiak and Mahir Atmis

Built-in Concurrency Primitives in Java Programming Language. by Yourii Martiak and Mahir Atmis Built-in Concurrency Primitives in Java Programming Language by Yourii Martiak and Mahir Atmis Overview One of the many strengths of Java is the built into the programming language support for concurrency

More information

aicas Technology Multi Core und Echtzeit Böse Überraschungen vermeiden Dr. Fridtjof Siebert CTO, aicas OOP 2011, 25 th January 2011

aicas Technology Multi Core und Echtzeit Böse Überraschungen vermeiden Dr. Fridtjof Siebert CTO, aicas OOP 2011, 25 th January 2011 aicas Technology Multi Core und Echtzeit Böse Überraschungen vermeiden Dr. Fridtjof Siebert CTO, aicas OOP 2011, 25 th January 2011 2 aicas Group aicas GmbH founded in 2001 in Karlsruhe Focus: Embedded

More information

Computer Science 483/580 Concurrent Programming Midterm Exam February 23, 2009

Computer Science 483/580 Concurrent Programming Midterm Exam February 23, 2009 Computer Science 483/580 Concurrent Programming Midterm Exam February 23, 2009 Your name There are 6 pages to this exam printed front and back. Please make sure that you have all the pages now. The exam

More information

Monitors, Java, Threads and Processes

Monitors, Java, Threads and Processes Monitors, Java, Threads and Processes 185 An object-oriented view of shared memory A semaphore can be seen as a shared object accessible through two methods: wait and signal. The idea behind the concept

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

How To Write A Multi Threaded Software On A Single Core (Or Multi Threaded) System

How To Write A Multi Threaded Software On A Single Core (Or Multi Threaded) System Multicore Systems Challenges for the Real-Time Software Developer Dr. Fridtjof Siebert aicas GmbH Haid-und-Neu-Str. 18 76131 Karlsruhe, Germany siebert@aicas.com Abstract Multicore systems have become

More information

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

Facing the Challenges for Real-Time Software Development on Multi-Cores

Facing the Challenges for Real-Time Software Development on Multi-Cores Facing the Challenges for Real-Time Software Development on Multi-Cores Dr. Fridtjof Siebert aicas GmbH Haid-und-Neu-Str. 18 76131 Karlsruhe, Germany siebert@aicas.com Abstract Multicore systems introduce

More information

An Easier Way for Cross-Platform Data Acquisition Application Development

An Easier Way for Cross-Platform Data Acquisition Application Development An Easier Way for Cross-Platform Data Acquisition Application Development For industrial automation and measurement system developers, software technology continues making rapid progress. Software engineers

More information

Outline of this lecture G52CON: Concepts of Concurrency

Outline of this lecture G52CON: Concepts of Concurrency Outline of this lecture G52CON: Concepts of Concurrency Lecture 10 Synchronisation in Java Natasha Alechina School of Computer Science nza@cs.nott.ac.uk mutual exclusion in Java condition synchronisation

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Java Virtual Machine Locks

Java Virtual Machine Locks Java Virtual Machine Locks SS 2008 Synchronized Gerald SCHARITZER (e0127228) 2008-05-27 Synchronized 1 / 13 Table of Contents 1 Scope...3 1.1 Constraints...3 1.2 In Scope...3 1.3 Out of Scope...3 2 Logical

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

C# and Other Languages

C# and Other Languages C# and Other Languages Rob Miles Department of Computer Science Why do we have lots of Programming Languages? Different developer audiences Different application areas/target platforms Graphics, AI, List

More information

SSC - Concurrency and Multi-threading Java multithreading programming - Synchronisation (I)

SSC - Concurrency and Multi-threading Java multithreading programming - Synchronisation (I) SSC - Concurrency and Multi-threading Java multithreading programming - Synchronisation (I) Shan He School for Computational Science University of Birmingham Module 06-19321: SSC Outline Outline of Topics

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Introduction Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Advanced Topics in Software Engineering 1 Concurrent Programs Characterized by

More information

Computers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer

Computers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer Computers CMPT 125: Lecture 1: Understanding the Computer Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University January 3, 2009 A computer performs 2 basic functions: 1.

More information

Understanding Hardware Transactional Memory

Understanding Hardware Transactional Memory Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence

More information

Software and the Concurrency Revolution

Software and the Concurrency Revolution Software and the Concurrency Revolution A: The world s fastest supercomputer, with up to 4 processors, 128MB RAM, 942 MFLOPS (peak). 2 Q: What is a 1984 Cray X-MP? (Or a fractional 2005 vintage Xbox )

More information

Replication on Virtual Machines

Replication on Virtual Machines Replication on Virtual Machines Siggi Cherem CS 717 November 23rd, 2004 Outline 1 Introduction The Java Virtual Machine 2 Napper, Alvisi, Vin - DSN 2003 Introduction JVM as state machine Addressing non-determinism

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Java Memory Model: Content

Java Memory Model: Content Java Memory Model: Content Memory Models Double Checked Locking Problem Java Memory Model: Happens Before Relation Volatile: in depth 16 March 2012 1 Java Memory Model JMM specifies guarantees given by

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Item 7: Prefer Immutable Atomic Value Types

Item 7: Prefer Immutable Atomic Value Types Item 7: Prefer Immutable Atomic Value Types 1 Item 7: Prefer Immutable Atomic Value Types Immutable types are simple: After they are created, they are constant. If you validate the parameters used to construct

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

Java Interview Questions and Answers

Java Interview Questions and Answers 1. What is the most important feature of Java? Java is a platform independent language. 2. What do you mean by platform independence? Platform independence means that we can write and compile the java

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

CS11 Java. Fall 2014-2015 Lecture 7

CS11 Java. Fall 2014-2015 Lecture 7 CS11 Java Fall 2014-2015 Lecture 7 Today s Topics! All about Java Threads! Some Lab 7 tips Java Threading Recap! A program can use multiple threads to do several things at once " A thread can have local

More information

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

ultra fast SOM using CUDA

ultra fast SOM using CUDA ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best

More information

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

More information

General Introduction

General Introduction Managed Runtime Technology: General Introduction Xiao-Feng Li (xiaofeng.li@gmail.com) 2012-10-10 Agenda Virtual machines Managed runtime systems EE and MM (JIT and GC) Summary 10/10/2012 Managed Runtime

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Topics. Producing Production Quality Software. Concurrent Environments. Why Use Concurrency? Models of concurrency Concurrency in Java

Topics. Producing Production Quality Software. Concurrent Environments. Why Use Concurrency? Models of concurrency Concurrency in Java Topics Producing Production Quality Software Models of concurrency Concurrency in Java Lecture 12: Concurrent and Distributed Programming Prof. Arthur P. Goldberg Fall, 2005 2 Why Use Concurrency? Concurrent

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

C++ INTERVIEW QUESTIONS

C++ INTERVIEW QUESTIONS C++ INTERVIEW QUESTIONS http://www.tutorialspoint.com/cplusplus/cpp_interview_questions.htm Copyright tutorialspoint.com Dear readers, these C++ Interview Questions have been designed specially to get

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency

More information

All ju The State of Software Development Today: A Parallel View. June 2012

All ju The State of Software Development Today: A Parallel View. June 2012 All ju The State of Software Development Today: A Parallel View June 2012 2 What is Parallel Programming? When students study computer programming, the normal approach is to learn to program sequentially.

More information

CPU performance monitoring using the Time-Stamp Counter register

CPU performance monitoring using the Time-Stamp Counter register CPU performance monitoring using the Time-Stamp Counter register This laboratory work introduces basic information on the Time-Stamp Counter CPU register, which is used for performance monitoring. The

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Dynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas

Dynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas CHAPTER Dynamic Load Balancing 35 Using Work-Stealing Daniel Cederman and Philippas Tsigas In this chapter, we present a methodology for efficient load balancing of computational problems that can be easily

More information

Unit 8: Immutability & Actors

Unit 8: Immutability & Actors SPP (Synchro et Prog Parallèle) Unit 8: Immutability & Actors François Taïani Questioning Locks Why do we need locks on data? because concurrent accesses can lead to wrong outcome But not all concurrent

More information

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Nuno A. Carvalho, João Bordalo, Filipe Campos and José Pereira HASLab / INESC TEC Universidade do Minho MW4SOC 11 December

More information

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27 Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

More information

Stack Allocation. Run-Time Data Structures. Static Structures

Stack Allocation. Run-Time Data Structures. Static Structures Run-Time Data Structures Stack Allocation Static Structures For static structures, a fixed address is used throughout execution. This is the oldest and simplest memory organization. In current compilers,

More information

Rootbeer: Seamlessly using GPUs from Java

Rootbeer: Seamlessly using GPUs from Java Rootbeer: Seamlessly using GPUs from Java Phil Pratt-Szeliga. Dr. Jim Fawcett. Dr. Roy Welch. Syracuse University. Rootbeer Overview and Motivation Rootbeer allows a developer to program a GPU in Java

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

Introduction to CUDA C

Introduction to CUDA C Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

SHARED HASH TABLES IN PARALLEL MODEL CHECKING

SHARED HASH TABLES IN PARALLEL MODEL CHECKING SHARED HASH TABLES IN PARALLEL MODEL CHECKING IPA LENTEDAGEN 2010 ALFONS LAARMAN JOINT WORK WITH MICHAEL WEBER AND JACO VAN DE POL 23/4/2010 AGENDA Introduction Goal and motivation What is model checking?

More information

Moving from CS 61A Scheme to CS 61B Java

Moving from CS 61A Scheme to CS 61B Java Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you

More information

Crash Course in Java

Crash Course in Java Crash Course in Java Based on notes from D. Hollinger Based in part on notes from J.J. Johns also: Java in a Nutshell Java Network Programming and Distributed Computing Netprog 2002 Java Intro 1 What is

More information

Lesson 06: Basics of Software Development (W02D2

Lesson 06: Basics of Software Development (W02D2 Lesson 06: Basics of Software Development (W02D2) Balboa High School Michael Ferraro Lesson 06: Basics of Software Development (W02D2 Do Now 1. What is the main reason why flash

More information

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture Chapter 2: Computer-System Structures Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture Operating System Concepts 2.1 Computer-System Architecture

More information

Synchronization in. Distributed Systems. Cooperation and Coordination in. Distributed Systems. Kinds of Synchronization.

Synchronization in. Distributed Systems. Cooperation and Coordination in. Distributed Systems. Kinds of Synchronization. Cooperation and Coordination in Distributed Systems Communication Mechanisms for the communication between processes Naming for searching communication partners Synchronization in Distributed Systems But...

More information

Parrot in a Nutshell. Dan Sugalski dan@sidhe.org. Parrot in a nutshell 1

Parrot in a Nutshell. Dan Sugalski dan@sidhe.org. Parrot in a nutshell 1 Parrot in a Nutshell Dan Sugalski dan@sidhe.org Parrot in a nutshell 1 What is Parrot The interpreter for perl 6 A multi-language virtual machine An April Fools joke gotten out of hand Parrot in a nutshell

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach

High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach Beniamino Di Martino, Antonio Esposito and Andrea Barbato Department of Industrial and Information Engineering Second University of Naples

More information

Java Coding Practices for Improved Application Performance

Java Coding Practices for Improved Application Performance 1 Java Coding Practices for Improved Application Performance Lloyd Hagemo Senior Director Application Infrastructure Management Group Candle Corporation In the beginning, Java became the language of the

More information

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild. Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015 Operating Systems 05. Threads Paul Krzyzanowski Rutgers University Spring 2015 February 9, 2015 2014-2015 Paul Krzyzanowski 1 Thread of execution Single sequence of instructions Pointed to by the program

More information

GPU Hardware Performance. Fall 2015

GPU Hardware Performance. Fall 2015 Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Java Performance. Adrian Dozsa TM-JUG 18.09.2014

Java Performance. Adrian Dozsa TM-JUG 18.09.2014 Java Performance Adrian Dozsa TM-JUG 18.09.2014 Agenda Requirements Performance Testing Micro-benchmarks Concurrency GC Tools Why is performance important? We hate slow web pages/apps We hate timeouts

More information

Le langage OCaml et la programmation des GPU

Le langage OCaml et la programmation des GPU Le langage OCaml et la programmation des GPU GPU programming with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte Le projet OpenGPU : un an plus tard Ecole Polytechnique - 8 juin 2011 Outline

More information

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Please Note IBM s statements regarding its plans, directions, and intent are subject to

More information

on an system with an infinite number of processors. Calculate the speedup of

on an system with an infinite number of processors. Calculate the speedup of 1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

More information

Why Threads Are A Bad Idea (for most purposes)

Why Threads Are A Bad Idea (for most purposes) Why Threads Are A Bad Idea (for most purposes) John Ousterhout Sun Microsystems Laboratories john.ousterhout@eng.sun.com http://www.sunlabs.com/~ouster Introduction Threads: Grew up in OS world (processes).

More information

Design Pattern for the Adaptive Scheduling of Real-Time Tasks with Multiple Versions in RTSJ

Design Pattern for the Adaptive Scheduling of Real-Time Tasks with Multiple Versions in RTSJ Design Pattern for the Adaptive Scheduling of Real-Time Tasks with Multiple Versions in RTSJ Rodrigo Gonçalves, Rômulo Silva de Oliveira, Carlos Montez LCMI Depto. de Automação e Sistemas Univ. Fed. de

More information

Real Time Programming: Concepts

Real Time Programming: Concepts Real Time Programming: Concepts Radek Pelánek Plan at first we will study basic concepts related to real time programming then we will have a look at specific programming languages and study how they realize

More information

Extreme Performance with Java

Extreme Performance with Java Extreme Performance with Java QCon NYC - June 2012 Charlie Hunt Architect, Performance Engineering Salesforce.com sfdc_ppt_corp_template_01_01_2012.ppt In a Nutshell What you need to know about a modern

More information

Lecture 6: Semaphores and Monitors

Lecture 6: Semaphores and Monitors HW 2 Due Tuesday 10/18 Lecture 6: Semaphores and Monitors CSE 120: Principles of Operating Systems Alex C. Snoeren Higher-Level Synchronization We looked at using locks to provide mutual exclusion Locks

More information

Lesson Objectives. To provide a grand tour of the major operating systems components To provide coverage of basic computer system organization

Lesson Objectives. To provide a grand tour of the major operating systems components To provide coverage of basic computer system organization Lesson Objectives To provide a grand tour of the major operating systems components To provide coverage of basic computer system organization AE3B33OSD Lesson 1 / Page 2 What is an Operating System? A

More information

Shared Address Space Computing: Programming

Shared Address Space Computing: Programming Shared Address Space Computing: Programming Alistair Rendell See Chapter 6 or Lin and Synder, Chapter 7 of Grama, Gupta, Karypis and Kumar, and Chapter 8 of Wilkinson and Allen Fork/Join Programming Model

More information

Sources: On the Web: Slides will be available on:

Sources: On the Web: Slides will be available on: C programming Introduction The basics of algorithms Structure of a C code, compilation step Constant, variable type, variable scope Expression and operators: assignment, arithmetic operators, comparison,

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Comparison of Concurrency Frameworks for the Java Virtual Machine

Comparison of Concurrency Frameworks for the Java Virtual Machine Universität Ulm Fakultät für Ingenieurwissenschaften und Informatik Institut für Verteilte Systeme, Bachelorarbeit im Studiengang Informatik Comparison of Concurrency Frameworks for the Java Virtual Machine

More information

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @

More information

1. Properties of Transactions

1. Properties of Transactions Department of Computer Science Software Development Methodology Transactions as First-Class Concepts in Object-Oriented Programming Languages Boydens Jeroen Steegmans Eric 13 march 2008 1. Properties of

More information

Introduction to GPU Computing

Introduction to GPU Computing Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture

More information