CSE 6040 Computing for Data Analytics: Methods and Tools

CSE 6040 Computing for Data Analytics: Methods and Tools
Lecture 12: Computer Architecture Overview and Why It Matters
Da Kuang, Polo Chau, Georgia Tech, Fall 2014

Outline
- Inside a CPU; why it matters: SIMD operations, multithreading
- Memory hierarchy; why it matters: memory access patterns
We will not get into the nasty details of machine-level code.

Typical computer architecture
Intel CPU with multiple cores and a shared cache.
Source: https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems

Inside a CPU (Where is my data?)
- Intel 80386 (1985): Intel's first 32-bit x86 processor, also called i386
- Intel Pentium (1997): supports packed data types (MMX)
- Intel Pentium III (1999): first to introduce Streaming SIMD Extensions (SSE)
- Latest: Advanced Vector Extensions (AVX) with 256-bit wide registers (announced 2008, first shipped 2011)
Source: https://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs

SIMD
Single Instruction, Multiple Data: instruction-level parallelism, supported by successive x86 instruction-set extensions:
- MMX: MultiMedia extensions (Intel Pentium, 1997)
- SSE: Streaming SIMD Extensions (Intel Pentium III, 1999)
- SSE2: Streaming SIMD Extensions 2 (Intel Pentium 4, 2001)
- SSE3: Streaming SIMD Extensions 3 (Intel Pentium 4 Prescott, 2004)
- SSSE3: Supplemental Streaming SIMD Extensions 3 (Intel Core Woodcrest, 2006)
- SSE4.1: Streaming SIMD Extensions 4.1 (Intel Core Penryn, 2007)
- SSE4.2: Streaming SIMD Extensions 4.2 (Intel Core i7 Nehalem, 2008)
- AVX: Advanced Vector Extensions (Intel Sandy Bridge, 2011)
(Note the difference between brand name and microarchitecture code name.)

SIMD
Scalar instruction vs. vector instruction (figure).
Source: https://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs

SIMD in C/C++
A 128-bit SSE register holds: 4 x float, 2 x double, 16 x char, 8 x short, 4 x int, or 2 x long long.
A 256-bit AVX register holds: 8 x float or 4 x double.
Source: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions

SIMD
For the same computation (vmul: vector(ized) multiplication):
- Scalar operations: 16 loads, 8 multiplications, 8 stores
- SSE operations: 4 loads, 2 vmuls, 2 stores
- AVX operations: 2 loads, 1 large vmul, 1 store
Source: https://software.intel.com/en-us/articles/improving-the-compute-performance-of-videoprocessing-software-using-avx-advanced-vector-extensions-instructions
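From Python, these SIMD instructions are reached through NumPy's compiled element-wise loops; whether AVX is actually used depends on the CPU and the NumPy build, so this is only an illustrative sketch (Python 3 syntax):

```python
import numpy as np

# Two 8-element single-precision vectors: one 256-bit AVX register's
# worth of floats each.
a = np.arange(8, dtype=np.float32)      # [0, 1, ..., 7]
b = np.full(8, 2.0, dtype=np.float32)

# One vectorized multiply across all 8 pairs, instead of 8 scalar
# multiplications in an interpreted loop.
c = a * b
print(c.tolist())   # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
```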

Vectorization not possible for Python's lists

> import time
> def func(x):
>     return x**4
> arr = range(1048576)
> t0 = time.time()
> arr2 = [None] * 1048576
> for i in arr:
>     arr2[i] = i ** 4
> print "Time using for loop:", time.time() - t0
> t0 = time.time()
> arr3 = map(func, arr)
> print "Time using map:", time.time() - t0

vectorize() in NumPy

> import time
> import numpy as np
> def func(x):
>     return x**4
> arr = np.arange(0, 1048576, 1, dtype=np.float64)
> arr2 = np.zeros(1048576)
> t0 = time.time()
> for i in range(1048576):
>     arr2[i] = arr[i] ** 4
> print "Time using for loop:", time.time() - t0
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print "Time using vectorize:", time.time() - t0
> t0 = time.time()
> arr4 = np.power(arr, 4)
> print "Time using numpy power:", time.time() - t0

vectorize() returns a function object. (It is a convenience wrapper, essentially a loop; np.power is the truly vectorized call.)

Vectorizing conditional statements

> def func(x):
>     if x <= 0:
>         return np.exp(x)
>     else:
>         return np.log(x)
> arr = np.random.randn(1048576)
> arr2 = np.zeros(1048576)
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print "Time using vectorize:", time.time() - t0
> t0 = time.time()
> # np.where evaluates both branches on the whole array, then selects
> arr4 = np.where(arr <= 0, np.exp(arr), np.log(arr))
> print "Time using numpy where:", time.time() - t0

Vectorization in R

> a = 1:1e6
> c = 0
> # compute sum of squares using a for loop
> system.time(for (e in a) c = c + e^2)
> ##  user  system elapsed
> ## 0.832   0.001   0.833
> system.time(sum(a^2))
> ##  user  system elapsed
> ## 0.006   0.002   0.008

Summary: avoid using for-loops to manipulate vectors and matrices.
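The same comparison in Python (a sketch in Python 3 syntax; the array length mirrors the R example): a plain interpreted loop versus one vectorized NumPy expression.

```python
import time
import numpy as np

a = np.arange(1, 1000001, dtype=np.int64)

# Loop version: interpreted, one element at a time.
t0 = time.time()
c = 0
for e in a:
    c += int(e) ** 2
loop_time = time.time() - t0

# Vectorized version: one compiled pass over the array.
t0 = time.time()
s = int((a ** 2).sum())
vec_time = time.time() - t0

print(loop_time, vec_time)   # the vectorized version is orders of magnitude faster
```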

Typical computer architecture
What else?
Source: https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems

Multi-threading
Fork-join model: thread-level parallelism.
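A minimal fork-join sketch using Python 3's threading module (the worker function and chunk size are made up for illustration): the master forks the workers, then joins, i.e., blocks until all of them finish.

```python
import threading

NUM_THREADS = 4
CHUNK = 1000
results = [0] * NUM_THREADS   # one slot per worker, so no locking is needed

def worker(i):
    # Each worker handles its own chunk of the range.
    results[i] = sum(range(i * CHUNK, (i + 1) * CHUNK))

# Fork: create and start the workers.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
# Join: wait until every worker has finished.
for t in threads:
    t.join()

total = sum(results)
print(total)   # 7998000, i.e., sum(range(4000))
```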

Multi-threading

> import numpy as np
> import threading
> import time
> def test(mat_dim, count):
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
> mat_dim = 512
> n_mats = 256
> num_threads = 4
> t0 = time.time()
> threads = [threading.Thread(target=test, args=(mat_dim, n_mats/num_threads)) for i in range(num_threads)]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()
> print time.time() - t0

When threading won't work in Python

> import threading
> import time
> def test(start, count):
>     s = 0
>     for i in range(start, start+count):
>         s += i
> n_number = 1048576*16
> num_threads = 8
> t0 = time.time()
> threads = [threading.Thread(target=test, args=(n_number/num_threads*i, n_number/num_threads)) for i in range(num_threads)]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()
> print time.time() - t0

Compute-intensive native Python code cannot run concurrently due to the Global Interpreter Lock (GIL).

Multi-processing
A pool with a fixed number of workers, coordinated by a master process (animated figure).

Multi-processing

> import numpy as np
> import multiprocessing
> import time
> def test(mat_dim, count):
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
> if __name__ == '__main__':
>     mat_dim = 512
>     n_mats = 256
>     num_processes = 4
>     pool = multiprocessing.Pool(processes=num_processes)
>     t0 = time.time()
>     for i in range(num_processes):
>         pool.apply_async(test, (mat_dim, n_mats/num_processes))
>     pool.close()
>     pool.join()  # synchronization / blocking
>     print time.time() - t0

Multi-threading vs. multi-processing?

Multi-threading (threading):
- Only one thread at a time can run native Python code, regardless of core count
- Threads can share data directly through Python data types
- Still useful for covering I/O and network latency, though

Multi-processing (multiprocessing):
- Implements the "thread pool" idea, but with processes
- Can use multiple cores
- Child processes cannot easily share data with each other
- Avoid loading data in the master process (it would be copied to EVERY child process!)
- Needs more error-handling code for debugging (errors in workers are not reported automatically)

Recommended use:
- Use multiprocessing
- Each child process should know which files to read and write before it starts working
- The number of threads or processes is often set to the number of physical cores in the system

Typical workflow

> import multiprocessing
> def train_logistic(input_folder, output_file):
>     # read data from input_folder
>     # train logistic model
>     # write model into output_file
>     pass
> if __name__ == '__main__':
>     pool = multiprocessing.Pool(processes=4)
>     for i in range(16):
>         pool.apply_async(train_logistic, ('data'+str(i), 'model'+str(i)+'.txt'))
>     pool.close()
>     pool.join()  # synchronization / blocking

Typical computer architecture
What else?
Source: https://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems

Memory hierarchy
- Registers: 1~2 KB
- L1 cache: 64~128 KB, ~500 GB/s
- L2 cache: 1~2 MB, ~200 GB/s
- L3 cache: 6~12 MB, ~100 GB/s
- Main memory: ~20 GB/s
- USB disk: 480~4800 Mb/s
- SSD hard drive: ~600 MB/s
Source: https://software.intel.com/en-us/articles/who-moved-the-goal-posts-the-rapidly-changing-world-of-cpus

Cache
A memory read first looks in the cache to see whether the needed data were loaded recently and still reside there. If not, the data are loaded from main memory into the cache and then consumed in registers.
Data are loaded from main memory into the cache in units of cache lines (typically 64~128 bytes). Therefore a program that reads memory contiguously is faster than one that reads memory randomly.
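The contiguous-vs-random effect can be observed from Python; a sketch in Python 3 (the array size and the use of NumPy are my choices; exact timings vary by machine):

```python
import time
import numpy as np

arr = np.arange(1 << 22, dtype=np.float64)   # 32 MB, larger than typical caches

t0 = time.time()
s_seq = arr.sum()                  # contiguous read: every cache line fully used
t_seq = time.time() - t0

idx = np.random.permutation(arr.size)
t0 = time.time()
s_rand = arr[idx].sum()            # random gather: each element may touch a new cache line
t_rand = time.time() - t0

print(t_seq, t_rand)   # the random-order pass is typically several times slower
```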

Recall from "Dense and sparse matrices"
- For a matrix stored in column-major order, accessing it column by column is faster.
- For a matrix stored in row-major order, accessing it row by row is faster.
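In NumPy terms (a Python 3 sketch; the matrix size is arbitrary): arrays are row-major ('C' order) by default, and np.asfortranarray produces a column-major copy, so the fast traversal direction differs between the two:

```python
import numpy as np

n = 1000
C = np.arange(n * n, dtype=np.float64).reshape(n, n)   # row-major: rows are contiguous
F = np.asfortranarray(C)                               # column-major copy: columns are contiguous
assert C.flags['C_CONTIGUOUS'] and F.flags['F_CONTIGUOUS']

# Fast direction for each layout: walk memory contiguously.
row_sums = C.sum(axis=1)    # each row of C is contiguous in memory
col_sums = F.sum(axis=0)    # each column of F is contiguous in memory
```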

Experiments
Read a fixed-size array in strides of 1, 2, 4, 8, etc., and then in random order.
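A sketch of that experiment in Python 3 (the array size and stride values are my choices; arr[::k] touches every k-th element):

```python
import time
import numpy as np

arr = np.ones(1 << 22, dtype=np.float64)   # fixed-size array (32 MB)

timings = {}
for stride in (1, 2, 4, 8, 16, 32):
    t0 = time.time()
    s = arr[::stride].sum()                # touch every `stride`-th element
    timings[stride] = time.time() - t0

# Random order for comparison: worst cache-line utilization.
idx = np.random.permutation(arr.size)
t0 = time.time()
s_rand = arr[idx].sum()
timings['random'] = time.time() - t0

print(timings)
```

Larger strides read fewer elements but waste more of each cache line, so time per element grows until one line serves only one element; random order is worst of all.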

Processor-DRAM performance gap
Even if you access memory optimally, you still can't catch up with the processor.
Peak flops (floating-point operations per second) = (clock rate) x (number of cores) x (SIMD flops per cycle, single or double precision).
For example, Intel Core i7-4700MQ (with AVX): 2.4 GHz x 4 cores x 8 = 76.8 Gflop/s. Compare with up to 20 GB/s peak memory bandwidth.
We will see compute-bound vs. memory-bound algorithms in "Numerical software stacks".
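The slide's example as a worked calculation (Python 3; the figure of 8 flops per cycle assumes AVX on double precision, as in the slide):

```python
clock_hz = 2.4e9        # 2.4 GHz
cores = 4
flops_per_cycle = 8     # with AVX, per the slide's example

peak_gflops = clock_hz * cores * flops_per_cycle / 1e9
print(peak_gflops)      # 76.8

# Compare with ~20 GB/s memory bandwidth: only 2.5 billion doubles per
# second can arrive from DRAM, far fewer than the 76.8 billion flops
# the cores could retire.
doubles_per_sec = 20e9 / 8
```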

Take-home messages
- Today's CPUs and programming are inherently parallel.
- Contiguous memory access is faster than random memory access.
- There is a large gap between the processing speeds of the CPU and main memory.