CSE 6040 Computing for Data Analytics: Methods and Tools

Lecture 12: Computer Architecture Overview and Why It Matters
Da Kuang, Polo Chau
Georgia Tech, Fall 2014

Outline
- Inside a CPU
  - Why it matters: SIMD operations, multithreading
- Memory hierarchy
  - Why it matters: memory access patterns
- We will not get into the nasty details of machine-level code.

Typical computer architecture
[Figure: Intel CPU with multiple cores and shared cache]
Source: https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems

Inside a CPU (Where is my data?)
- Intel 80386 (1985): first x86 processor supporting 32-bit computing, also called i386.
- Intel Pentium (1997): support for packed data types.
- Intel Pentium III (1999): first to introduce Streaming SIMD Extensions (SSE).
- Latest: Advanced Vector Extensions (AVX) with 256-bit wide registers (announced 2008, first shipped 2011).
Source: https://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs

SIMD
Single Instruction, Multiple Data: parallelism within a single instruction
Supported by the latest instruction sets:
- MMX: MultiMedia eXtensions (Intel Pentium, 1997)
- SSE: Streaming SIMD Extensions (Intel Pentium-III, 1999)
- SSE2: Streaming SIMD Extensions 2 (Intel Pentium 4, 2001)
- SSE3: Streaming SIMD Extensions 3 (Intel Pentium 4 Prescott, 2004)
- SSSE3: Supplemental Streaming SIMD Extensions 3 (Intel Core Woodcrest, 2006)
- SSE4.1: Streaming SIMD Extensions 4.1 (Intel Core Penryn, 2006)
- SSE4.2: Streaming SIMD Extensions 4.2 (Intel Core i7 Nehalem, 2008)
- AVX: Advanced Vector Extensions (Intel Sandy Bridge, 2011)
(Note the difference between brand name and microarchitecture code name.)

SIMD
[Figure: a scalar instruction compared with a vector instruction]
Source: https://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs

SIMD
In C/C++ types, one register holds:
- 128-bit register (SSE): 4 x float, 2 x double, 16 x char, 8 x short, 4 x int, 2 x long long
- 256-bit register (AVX): 8 x float, 4 x double
Source: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
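The element counts above follow from dividing the register width by the size of each C type. A minimal sketch of that arithmetic, assuming the usual type sizes reported by `ctypes` on mainstream platforms (the helper name `lanes` is hypothetical):

```python
import ctypes

SSE_BITS = 128  # width of an SSE register
AVX_BITS = 256  # width of an AVX register

def lanes(register_bits, ctype):
    """Number of elements of a C type that fit in one SIMD register."""
    return register_bits // (8 * ctypes.sizeof(ctype))

c_types = [
    ("float", ctypes.c_float),
    ("double", ctypes.c_double),
    ("char", ctypes.c_char),
    ("short", ctypes.c_short),
    ("int", ctypes.c_int),
    ("long long", ctypes.c_longlong),
]

for name, ctype in c_types:
    print(f"SSE: {lanes(SSE_BITS, ctype)} x {name}")
for name, ctype in c_types[:2]:
    print(f"AVX: {lanes(AVX_BITS, ctype)} x {name}")
```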

SIMD
Example: multiplying 8 pairs of single-precision floats
- Scalar operations: 16 loads, 8 multiplications, 8 stores
- SSE operations: 4 loads, 2 vmuls, 2 stores
- AVX operations: 2 loads, 1 large vmul, 1 store
(vmul: vector(ized) multiplication)
Source: https://software.intel.com/en-us/articles/improving-the-compute-performance-of-videoprocessing-software-using-avx-advanced-vector-extensions-instructions
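The operation counts above can be reproduced with a little arithmetic: each multiplication needs two loads (one per operand) and one store, and a vector instruction handles several elements at once. A small sketch, with the hypothetical helper `op_counts` and the assumption that 8 element pairs are multiplied:

```python
import math

def op_counts(n_elements, lanes):
    """Loads, multiplies and stores needed to multiply two arrays elementwise.

    lanes is the number of elements processed by one instruction
    (1 for scalar code, 4 for SSE floats, 8 for AVX floats).
    """
    vectors = math.ceil(n_elements / lanes)
    return {"loads": 2 * vectors, "multiplies": vectors, "stores": vectors}

for label, width in [("scalar", 1), ("SSE", 4), ("AVX", 8)]:
    print(label, op_counts(8, width))
```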

Vectorization is not possible for Python's lists
> import time
>
> def func(x):
>     return x**4
>
> n = 1048576
> arr = range(n)
>
> # explicit for loop
> t0 = time.time()
> arr2 = [None] * n
> for i in arr:
>     arr2[i] = i ** 4
> print("Time using for loop:", time.time() - t0)
>
> # map() still calls func once per element; list() forces the evaluation
> t0 = time.time()
> arr3 = list(map(func, arr))
> print("Time using map:", time.time() - t0)

vectorize() in NumPy
> import time
> import numpy as np
>
> def func(x):
>     return x**4
>
> arr = np.arange(0, 1048576, 1, dtype=np.float64)
> arr2 = np.zeros(1048576)
>
> t0 = time.time()
> for i in range(len(arr)):    # index with integers, not with the float values in arr
>     arr2[i] = arr[i] ** 4
> print("Time using for loop:", time.time() - t0)
>
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print("Time using vectorize:", time.time() - t0)
>
> t0 = time.time()
> arr4 = np.power(arr, 4)
> print("Time using numpy power:", time.time() - t0)
vectorize() returns a function object. It is a convenience wrapper that still calls func once per element, whereas np.power runs a compiled loop over the whole array.

Vectorizing conditional statements
> import time
> import numpy as np
>
> def func(x):
>     if x <= 0:
>         return np.exp(x)
>     else:
>         return np.log(x)
>
> arr = np.random.randn(1048576)
>
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print("Time using vectorize:", time.time() - t0)
>
> t0 = time.time()
> # np.where evaluates both branches on the whole array, then selects elementwise
> arr4 = np.where(arr <= 0, np.exp(arr), np.log(arr))
> print("Time using numpy where:", time.time() - t0)
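One caveat with the np.where call above: both np.exp(arr) and np.log(arr) are computed for every element before the selection, so np.log also sees the non-positive values and NumPy emits a warning for them. A possible alternative, sketched here with boolean masks (the function name `exp_or_log` is hypothetical), applies each function only where it is needed:

```python
import numpy as np

def exp_or_log(arr):
    """exp(x) where x <= 0, log(x) where x > 0, without evaluating log on x <= 0."""
    arr = np.asarray(arr, dtype=np.float64)
    out = np.empty_like(arr)
    nonpositive = arr <= 0
    out[nonpositive] = np.exp(arr[nonpositive])
    out[~nonpositive] = np.log(arr[~nonpositive])
    return out

values = np.random.randn(8)
print(exp_or_log(values))
```

Whether this is faster than np.where depends on the data and the machine; the point is only that the masked version avoids the invalid log evaluations.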

Vectorization in R
> a = 1:1e6
> c = 0
> # compute sum of squares using a for loop
> system.time(for (e in a) c = c + e^2)
> ##    user  system elapsed
> ##   0.832   0.001   0.833
> system.time(sum(a^2))
> ##    user  system elapsed
> ##   0.006   0.002   0.008
Summary: Avoid using for-loops to manipulate vectors and matrices.
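The same sum-of-squares comparison can be sketched in Python with NumPy. This is an illustrative translation, not part of the original slides; the function names are hypothetical, and int64 is requested explicitly so the result does not overflow on platforms where the default integer is 32 bits wide:

```python
import numpy as np

def sum_squares_loop(values):
    """Sum of squares with an explicit Python loop (slow for large arrays)."""
    total = 0
    for e in values:
        total += int(e) ** 2
    return total

def sum_squares_vectorized(values):
    """Sum of squares with whole-array operations (fast)."""
    return int(np.sum(values ** 2))

n = 1_000_000
a = np.arange(1, n + 1, dtype=np.int64)
print(sum_squares_vectorized(a))
```

Timing the two functions with time.time(), as in the earlier slides, should show the same kind of gap as the R example, though the exact ratio depends on the machine.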

Typical computer architecture: what else?
[Figure: Intel CPU with multiple cores and shared cache]
Source: https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems

Multi-threading
Fork-join model: thread-level parallelism

Multi-threading
> import numpy as np
> import threading
> import time
>
> def test(mat_dim, count):
>     # multiply `count` pairs of random mat_dim x mat_dim matrices
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
>
> mat_dim = 512
> n_mats = 256
> num_threads = 4
>
> t0 = time.time()
> threads = [threading.Thread(target=test, args=(mat_dim, n_mats // num_threads))
>            for i in range(num_threads)]
> for t in threads:
>     t.start()    # fork
> for t in threads:
>     t.join()     # join
> print(time.time() - t0)

When won't threading work in Python?
Compute-intensive native Python code cannot run concurrently because of the Global Interpreter Lock (GIL).
> import threading
> import time
>
> def test(start, count):
>     # pure-Python loop: holds the GIL while it runs
>     s = 0
>     for i in range(start, start + count):
>         s += i
>
> n_number = 1048576 * 16
> num_threads = 8
> chunk = n_number // num_threads
>
> t0 = time.time()
> threads = [threading.Thread(target=test, args=(chunk * i, chunk))
>            for i in range(num_threads)]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()
> print(time.time() - t0)
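The test() function above computes a partial sum but throws it away. Because threads share ordinary Python objects, each thread can write its result into a shared list. The following is a minimal sketch of that idea; the names `partial_sum` and `threaded_sum` are hypothetical, and it assumes n_number is divisible by num_threads. The work is still serialized by the GIL, so no speedup should be expected from it.

```python
import threading

def partial_sum(start, count, results, slot):
    """Sum the integers in [start, start + count) and store the result."""
    s = 0
    for i in range(start, start + count):
        s += i
    results[slot] = s  # each thread writes to its own slot of the shared list

def threaded_sum(n_number, num_threads):
    chunk = n_number // num_threads
    results = [0] * num_threads
    threads = [
        threading.Thread(target=partial_sum, args=(chunk * i, chunk, results, i))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

print(threaded_sum(1048576, 8))
```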

Multi-processing
A pool with a fixed number of worker processes:
[Figure: the master process and its pool of workers]

Multi-processing
> import numpy as np
> import multiprocessing
> import time
>
> def test(mat_dim, count):
>     # multiply `count` pairs of random mat_dim x mat_dim matrices
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
>
> if __name__ == '__main__':
>     mat_dim = 512
>     n_mats = 256
>     num_processes = 4
>     pool = multiprocessing.Pool(processes=num_processes)
>     t0 = time.time()
>     for i in range(num_processes):
>         pool.apply_async(test, (mat_dim, n_mats // num_processes))
>     pool.close()
>     pool.join()  # synchronization / blocking
>     print(time.time() - t0)

Multi-threading vs. multi-processing?
Multi-threading (threading):
- Only one thread at a time can run native Python code, so effectively only one physical core is used
- Different threads can share data through Python data types
- Still useful for hiding I/O and network latency
Multi-processing (multiprocessing):
- Implements the "thread pool" idea with worker processes
- Can use multiple cores
- Child processes cannot easily share data with each other
- Avoid loading data in the master process (it will be copied to EVERY child process!)
- Needs more error-handling code for debugging (no error information is returned automatically)
Recommended use:
- Use multiprocessing
- Each child process needs to know which files to read and write before it starts working
- The number of threads or processes is often set to the number of physical cores in the system.

Typical workflow
> import multiprocessing
>
> def train_logistic(input_folder, output_file):
>     # read data from input_folder
>     # train logistic model
>     # write model into output_file
>     pass  # placeholder body
>
> if __name__ == '__main__':
>     pool = multiprocessing.Pool(processes=4)
>     for i in range(16):
>         pool.apply_async(train_logistic, ('data' + str(i), 'model' + str(i) + '.txt'))
>     pool.close()
>     pool.join()  # synchronization / blocking
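In the workflow above, the objects returned by apply_async are discarded, so an exception raised inside a worker goes unnoticed. One way to surface such errors is to keep each result handle and call get() on it, which re-raises the worker's exception in the caller. The sketch below uses multiprocessing.pool.ThreadPool as a stand-in because it exposes the same apply_async/get interface and runs without a `__main__` guard; the function `train` and its deliberate failure are hypothetical.

```python
from multiprocessing.pool import ThreadPool

def train(task_id):
    # Hypothetical stand-in for train_logistic: task 3 fails on purpose.
    if task_id == 3:
        raise ValueError(f"bad input for task {task_id}")
    return f"model{task_id}.txt"

def run_all(n_tasks, n_workers=4):
    """Run all tasks and separate the successful results from the failures."""
    done, failed = {}, {}
    with ThreadPool(processes=n_workers) as pool:
        handles = {i: pool.apply_async(train, (i,)) for i in range(n_tasks)}
        for i, handle in handles.items():
            try:
                done[i] = handle.get()  # re-raises any exception from the worker
            except Exception as exc:
                failed[i] = exc
    return done, failed

done, failed = run_all(6)
print("finished:", sorted(done))
print("failed:", {i: repr(exc) for i, exc in failed.items()})
```

With multiprocessing.Pool the pattern is the same, except that the pool must be created under `if __name__ == '__main__':` and the task function must be importable by the child processes.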

Typical computer architecture: what else?
[Figure: Intel CPU with multiple cores and shared cache]
Source: https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems

Memory hierarchy
- Register: 1~2 KB
- L1 cache: 64~128 KB, ~500 GB/s
- L2 cache: 1~2 MB, ~200 GB/s
- L3 cache: 6~12 MB, ~100 GB/s
- Main memory: ~20 GB/s
- USB disk: 480~4800 Mb/s (megabits per second, i.e. 60~600 MB/s)
- SSD hard drive: ~600 MB/s
Source: https://software.intel.com/en-us/articles/who-moved-the-goal-posts-the-rapidly-changing-world-of-cpus
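To get a feel for these numbers, the sketch below estimates how long it would take to move 1 GB at each of the listed bandwidths. It is a back-of-the-envelope calculation that takes the approximate figures above at face value and ignores latency; the helper name `seconds_to_transfer` is hypothetical. Note the conversion of the USB figure from megabits to bytes.

```python
GB = 1e9  # bytes

def seconds_to_transfer(n_bytes, bytes_per_second):
    """Idealized transfer time: size divided by bandwidth, ignoring latency."""
    return n_bytes / bytes_per_second

bandwidth = {
    "L1 cache": 500 * GB,
    "L2 cache": 200 * GB,
    "L3 cache": 100 * GB,
    "Main memory": 20 * GB,
    "SSD hard drive": 600e6,
    "USB disk (480 Mb/s)": 480e6 / 8,  # megabits per second -> bytes per second
}

for level, bw in bandwidth.items():
    print(f"{level:>20}: {seconds_to_transfer(1 * GB, bw):.4f} s per GB")
```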

Cache
- A memory read operation first looks up the cache to see whether the needed data have been loaded recently and are still in the cache.
- If they are not in the cache, the data are loaded from main memory into the cache and then consumed in registers.
- Data are loaded from main memory into the cache in cache lines (typically 64~128 bytes). Therefore a program reading memory contiguously is faster than one reading memory randomly.
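The effect of cache lines can be illustrated by counting how many distinct lines a sequence of reads touches. This is a simplified model, assuming 8-byte elements, 64-byte cache lines and an array that starts on a line boundary; the function name is hypothetical.

```python
def cache_lines_touched(n_accesses, stride, itemsize=8, line_bytes=64):
    """Number of distinct cache lines touched by n_accesses strided reads."""
    lines = {(i * stride * itemsize) // line_bytes for i in range(n_accesses)}
    return len(lines)

for stride in (1, 2, 4, 8):
    print(f"stride {stride}: {cache_lines_touched(1024, stride)} cache lines for 1024 reads")
```

With stride 1, eight consecutive elements share one line, so 1024 reads need only 128 line loads. With stride 8, every read lands on a new line.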

Recall from "Dense and sparse matrices"
- For a matrix stored in column-major order, accessing it column by column is faster.
- For a matrix stored in row-major order, accessing it row by row is faster.
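In NumPy the storage order is visible through an array's strides, the number of bytes to step in memory to move by one index along each axis. A short sketch for a 4 x 3 matrix of 8-byte floats:

```python
import numpy as np

row_major = np.zeros((4, 3), dtype=np.float64, order="C")  # C order: rows are contiguous
col_major = np.zeros((4, 3), dtype=np.float64, order="F")  # Fortran order: columns are contiguous

# Row-major: the next element in a row is 8 bytes away, the next row is 3 * 8 = 24 bytes away.
print(row_major.strides)  # (24, 8)
# Column-major: the next element in a column is 8 bytes away, the next column is 4 * 8 = 32 bytes away.
print(col_major.strides)  # (8, 32)
```

The axis with the 8-byte stride is the one that is contiguous in memory, and therefore the one to iterate over in the innermost loop.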

Experiments
Reading a fixed-size array in strides of 1, 2, 4, 8, etc., and then randomly
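A rough way to try this kind of experiment in NumPy is sketched below. It is not the code behind the original measurements; the function names are hypothetical, and the timings include NumPy's own overhead, so only the trend across strides is meaningful. Every element is read exactly once in each case, so the totals agree and only the access order changes.

```python
import time
import numpy as np

def read_in_strides(arr, stride):
    """Read every element once, visiting memory in steps of `stride` elements."""
    t0 = time.perf_counter()
    total = 0.0
    for offset in range(stride):
        total += arr[offset::stride].sum()
    return total, time.perf_counter() - t0

def read_randomly(arr, rng):
    """Read every element once, in a random order."""
    order = rng.permutation(arr.size)
    t0 = time.perf_counter()
    total = arr[order].sum()
    return total, time.perf_counter() - t0

data = np.arange(1 << 20, dtype=np.float64)
for stride in (1, 2, 4, 8, 16):
    _, elapsed = read_in_strides(data, stride)
    print(f"stride {stride:2d}: {elapsed:.4f} s")
_, elapsed = read_randomly(data, np.random.default_rng(0))
print(f"random:    {elapsed:.4f} s")
```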

Processor-DRAM performance gap
Even if you access memory optimally, you still can't catch up with the processor.
Calculating the peak flops (floating-point operations per second):
(Clock rate) x (number of cores) x (number of SIMD flops per cycle for single/double precision)
For example, Intel Core i7 4700MQ (with AVX):
2.4 GHz x 4 x 8 = 76.8 Gflop/s
(Compare with up to 20 GB/s peak memory bandwidth.)
We will see compute-bound vs. memory-bound algorithms in "Numerical software stacks".
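The peak-flops formula above is simple enough to check in a few lines. The sketch below reproduces the slide's example and, as an added illustration, divides the quoted memory bandwidth by the peak rate to show how few bytes per floating-point operation main memory can deliver. The function name is hypothetical and the figures are the ones quoted above.

```python
def peak_gflops(clock_ghz, n_cores, flops_per_cycle):
    """Peak rate in Gflop/s: clock rate x cores x SIMD flops per cycle."""
    return clock_ghz * n_cores * flops_per_cycle

peak = peak_gflops(2.4, 4, 8)        # Intel Core i7 4700MQ example from the slide
memory_gb_per_s = 20.0               # peak memory bandwidth quoted on the slide
bytes_per_flop = memory_gb_per_s / peak

print(f"peak: {peak:.1f} Gflop/s")
print(f"memory can supply about {bytes_per_flop:.2f} bytes per flop")
```

A single-precision multiplication reads 8 bytes of operands, so an algorithm that fetches fresh operands from main memory for every operation cannot come close to the peak rate.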

Take-home messages
- Today's CPUs and programming are inherently parallel.
- Contiguous memory access is faster than random memory access.
- There is a large gap between the processing speeds of the CPU and main memory.