Tamás Budavári / The Johns Hopkins University

1 PRACTICAL SCIENTIFIC ANALYSIS OF BIG DATA: RUNNING IN PARALLEL / The Johns Hopkins University

2 Parallelism. Data parallel: the same processing on different pieces of data. Task parallel: simultaneous processing on the same data.
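
The data-parallel idea above can be sketched in C as follows. This is only an illustration under stated assumptions: the names `chunk_t`, `chunk_bounds` and `partial_sum` are hypothetical and do not come from the slides, and the workers are not actually launched here. The sketch shows only how n items might be split so that each worker runs the same code on its own piece.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) owned by one worker (hypothetical type). */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n items among nworkers as evenly as possible.
 * The first (n % nworkers) workers receive one extra item. */
chunk_t chunk_bounds(size_t n, size_t nworkers, size_t rank)
{
    assert(nworkers > 0 && rank < nworkers);
    size_t base  = n / nworkers;
    size_t extra = n % nworkers;
    chunk_t c;
    c.begin = rank * base + (rank < extra ? rank : extra);
    c.end   = c.begin + base + (rank < extra ? 1 : 0);
    return c;
}

/* The "same processing" every worker applies to its own piece: a partial sum. */
double partial_sum(const double *data, chunk_t c)
{
    double s = 0.0;
    for (size_t i = c.begin; i < c.end; ++i)
        s += data[i];
    return s;
}
```

Each worker (thread, core, or machine) would call `chunk_bounds` with its own rank and then `partial_sum` on the result; adding up the partial sums gives the total.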

3 On all levels of the hierarchy: clouds, clusters, machines, cores, threads.

4 Scalability. Scale up (vertically): add resources to a node, e.g. bigger memory, faster processor. Scale out (horizontally): use more of the threads, cores, machines, clusters, clouds.

5 Cluster

6 High-Performance Computing. Traditional HPC clusters: launching jobs on a cluster of machines; use MPI (Message Passing Interface) to communicate among nodes (not covered in this class).

7 Queuing Systems. Used for batch jobs on computer clusters; fair scheduling of user jobs; group policies. Several systems exist: Portable Batch System (PBS), Condor, etc.

8 Portable Batch System. Basic PBS commands: qsub, qdel, qstat, showq.

9 Job Requirements. Which queue? How much memory? How many CPUs (think MPI)? For how long? Send e-mail: where and when?

10 Example Job. Submit this using qsub.

11 Computer

12 Classification of Parallel Computers: Flynn's Taxonomy

13 SISD: Single Instruction, Single Data. Classical von Neumann machines; single-threaded codes. (arstechnica.com)

14 SIMD: Single Instruction, Multiple Data. On x86: MMX (commonly read as Matrix Math eXtension or MultiMedia eXtension), SSE (Streaming SIMD Extensions), and more. GPU programming! (arstechnica.com)
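
A minimal sketch of what SIMD looks like from C, assuming an x86 compiler that defines `__SSE__` and ships `<xmmintrin.h>` (GCC and Clang do on x86-64). The function name `add_floats` is hypothetical. On other platforms the code falls back to the plain scalar loop, so it stays portable.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Single instruction, multiple data: one add handles four floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, or the whole array when SSE is unavailable. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the caller does not need to guarantee 16-byte alignment. Modern compilers can often auto-vectorize the scalar loop on their own.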

15 Amdahl's Laws. Bell, Gray & Szalay (2005), "Petascale Computational Systems: Balanced CyberInfrastructure in a Data-Centric World"

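16 Amdahl's Law of Parallelism. Speed ratio with serial part $S$ and parallelizable part $P$: $T(1) = S + P$ and $T(N) = S + \frac{P}{N}$. With the parallel fraction $p = \frac{P}{S + P}$, the speedup is $\frac{T(1)}{T(N)} = \frac{1}{(1 - p) + \frac{p}{N}}$. Before looking into parallelism, speed up the serial code. To figure out the max speedup, take $N \to \infty$, i.e., $\frac{1}{1 - p}$.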
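
As a small worked sketch of Amdahl's law, the speedup on N workers for a parallel fraction p is 1 / ((1 - p) + p / N), and its limit for large N is 1 / (1 - p). The function names below are hypothetical, chosen only for this illustration.

```c
#include <assert.h>

/* Speedup T(1)/T(N) for parallel fraction p on n workers. */
double amdahl_speedup(double p, double n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Limit of the speedup as n grows without bound: 1/(1 - p). */
double amdahl_max_speedup(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, with half of the run time parallelizable (p = 0.5), two workers give a speedup of 1 / 0.75, about 1.33, and no number of workers can exceed a factor of 2.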

17 Chip

18 Moore's Law

19 New Limitation is Energy! Power to compute the same thing? A CPU is 10× less efficient than a digital signal processor (DSP); a DSP is 10× less efficient than a custom chip. New design: multicores with slower clocks. But the interconnect is expensive; need simpler components. Swinburne University of Technology, 9/1/2011

20 Emerging Architectures. Andrew Chien: 10×10 to replace the 90/10 rule. Custom modules on chip, cf. SoC in cellphones. Statistics on a video codec module?

21 Emerging Architectures. Andrew Chien: 10×10 to replace the 90/10 rule. Custom modules on chip, cf. SoC in cellphones. Scientific analysis on such specialized units?

22 GPUs Evolved to be General Purpose. Virtual world: simulation of real physics. C for CUDA and OpenCL. 512 cores, 25k threads, running 1 billion/sec. Old algorithms are built on the wrong assumption: today processing is free but memory is slow. New programming paradigm!

23 New Moore's Law: in the number of cores. Faster than ever.

24 Data Parallel Techniques. Embarrassingly parallel: decoupled problems, independent processing. MapReduce: Map, then Reduce.
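
The map and reduce steps can be sketched serially in C. This is a simplified illustration, not a MapReduce framework: it leaves out key grouping and the distribution of work across machines, and all names (`map_array`, `reduce_array`, `square`, `add`) are hypothetical.

```c
#include <stddef.h>

typedef double (*map_fn)(double);
typedef double (*reduce_fn)(double, double);

/* Map: apply f to each input independently. No iteration depends on
 * another, which is what makes this step embarrassingly parallel. */
void map_array(map_fn f, const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = f(in[i]);
}

/* Reduce: fold the mapped values into one result with a binary operator. */
double reduce_array(reduce_fn op, double init, const double *in, size_t n)
{
    double acc = init;
    for (size_t i = 0; i < n; ++i)
        acc = op(acc, in[i]);
    return acc;
}

/* Example operators for a sum of squares. */
double square(double x) { return x * x; }
double add(double a, double b) { return a + b; }
```

Mapping `square` over {1, 2, 3, 4} and reducing with `add` from 0 gives 30. Because each map call is independent, the loop in `map_array` could be split across workers without changing the result.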

25 Programming

26 Programming Languages. No one language to rule them all, and many to choose from.

27 Assembly. Low-level, (almost) machine code; different for each computer.

28 The C Language. Higher level but still close to hardware. Pointers! Many things are written in C: operating systems, other languages, ...

29 Java. Pros: memory management with garbage collection; Just-In-Time compilation from bytecode. Cons: not so great performance; hard to include legacy codes; new language features were an afterthought.

30 Python. Scripting to glue things together; easy to wrap legacy codes; lots of scientific modules and plotting; good for prototyping.

31 Etc.: Perl, Matlab, Mathematica, IDL, R, Lisp, Haskell, OCaml, Erlang, your favorite here.

32 Programming in C

33 Programming in C. Skeleton of an application.

34 Programming in C. Files: headers (*.h), source (*.c). Building an application: compile source, link object files.

35 Using Pointers

36 Arrays. Dynamic arrays: memory allocation, freeing memory, pointer arithmetic.
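
A minimal sketch of a dynamic array in C, covering allocation, pointer arithmetic and freeing. The function names `make_ramp` and `sum_array` are hypothetical and used only for illustration.

```c
#include <stdlib.h>

/* Allocate n doubles and fill them with 0, 1, ..., n-1.
 * Returns NULL if allocation fails; the caller must free() the result. */
double *make_ramp(size_t n)
{
    double *a = malloc(n * sizeof *a);   /* memory allocation */
    if (a == NULL)
        return NULL;

    /* Pointer arithmetic: p walks the array, and p - a is the index. */
    for (double *p = a; p < a + n; ++p)
        *p = (double)(p - a);
    return a;
}

/* Sum the array; *(a + i) refers to the same element as a[i]. */
double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += *(a + i);
    return s;
}
```

A caller would write `double *a = make_ramp(5);`, use the array, and release it with `free(a);` when done. Every successful `malloc` needs exactly one matching `free`.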

37 Matrix, etc. Pointers to pointers: data allocated in v, pointers in A, for 2D indexing. One can have matrices, tensors, jagged arrays, ...
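
The layout described above, with the data in v and row pointers in A, might look like the sketch below. The names `matrix_alloc` and `matrix_free` are hypothetical, and the choice of one contiguous block for v is an assumption about what the slide showed.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a rows x cols matrix: one contiguous data block v plus an
 * array A of row pointers into v, so that A[i][j] works for 2D indexing.
 * Returns NULL if either allocation fails. */
double **matrix_alloc(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double *v = malloc(rows * cols * sizeof *v);   /* the data */
    double **A = malloc(rows * sizeof *A);         /* the row pointers */
    if (v == NULL || A == NULL) {
        free(v);
        free(A);
        return NULL;
    }
    for (size_t i = 0; i < rows; ++i)
        A[i] = v + i * cols;   /* row i starts i*cols elements into v */
    return A;
}

/* Release both allocations; A[0] points at the start of v. */
void matrix_free(double **A)
{
    if (A != NULL) {
        free(A[0]);
        free(A);
    }
}
```

Keeping v contiguous means the whole matrix can also be handed to routines that expect a flat array. A jagged array would instead give each row its own allocation, with a possibly different length per row.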
