CSE 6040 Computing for Data Analytics: Methods and Tools
Lecture 12: Computer Architecture Overview and Why It Matters
Da Kuang, Polo Chau, Georgia Tech
Fall 2014, CSE 6040 Computing for Data Analysis
Outline
- Inside a CPU. Why it matters: SIMD operations, multithreading
- Memory hierarchy. Why it matters: memory access pattern matters
We will not get into the nasty details of machine-level code.
Typical computer architecture
An Intel CPU with multiple cores and a shared cache.
Inside a CPU (Where is my data?)
- Intel 80386 (1985): first Intel processor supporting 32-bit computing, also called i386.
- Intel Pentium (1997): supports packed data types.
- Intel Pentium III (1999): first to introduce Streaming SIMD Extensions (SSE).
- Latest: Advanced Vector Extensions (AVX) with 256-bit wide registers (announced 2008, first shipped 2011).
SIMD
Single Instruction, Multiple Data: instruction-level parallelism, supported by the recent instruction sets:
- MMX: MultiMedia eXtensions (Intel Pentium, 1997)
- SSE: Streaming SIMD Extensions (Intel Pentium III, 1999)
- SSE2: Streaming SIMD Extensions 2 (Intel Pentium 4, 2001)
- SSE3: Streaming SIMD Extensions 3 (Intel Pentium 4 Prescott, 2004)
- SSSE3: Supplemental Streaming SIMD Extensions 3 (Intel Core Woodcrest, 2006)
- SSE4.1: Streaming SIMD Extensions 4.1 (Intel Core Penryn, 2007)
- SSE4.2: Streaming SIMD Extensions 4.2 (Intel Core i7 Nehalem, 2008)
- AVX: Advanced Vector eXtensions (Intel Sandy Bridge, 2011)
(Note the difference between a brand name and a microarchitecture code name.)
SIMD
A scalar instruction operates on one data element at a time; a vector instruction operates on several elements at once.
SIMD in C/C++ types
A 128-bit SSE register holds: 4 x float, 2 x double, 16 x char, 8 x short, 4 x int, or 2 x long long.
A 256-bit AVX register holds: 8 x float or 4 x double.
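Those lane counts follow directly from the register width divided by the element size; a quick sanity check in Python (the dtype list is chosen for illustration):

```python
import numpy as np

# How many elements of each C type fit in a 128-bit SSE or 256-bit AVX register.
for dtype in (np.int8, np.int16, np.int32, np.int64, np.float32, np.float64):
    bits = np.dtype(dtype).itemsize * 8          # element width in bits
    print("%-8s: %2d per SSE register, %2d per AVX register"
          % (np.dtype(dtype).name, 128 // bits, 256 // bits))
```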
SIMD: multiplying two arrays of 8 floats elementwise
- Scalar operations: 16 loads, 8 multiplications, 8 stores
- SSE operations: 4 loads, 2 vmuls, 2 stores
- AVX operations: 2 loads, 1 large vmul, 1 store
(vmul: vector(ized) multiplication)
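The payoff is visible even from Python, where a single NumPy call dispatches to a compiled (and typically vectorized) kernel instead of one interpreted operation per element. A minimal sketch; the array length is chosen for illustration:

```python
import time
import numpy as np

n = 1_000_000                        # illustrative size, not from the slides
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar path: one load/multiply/store per Python-level iteration.
t0 = time.time()
c_loop = np.empty(n)
for i in range(n):
    c_loop[i] = a[i] * b[i]
t_loop = time.time() - t0

# Vector path: one call into a compiled elementwise-multiply kernel.
t0 = time.time()
c_vec = a * b
t_vec = time.time() - t0

print("loop: %.4fs   vectorized: %.4fs" % (t_loop, t_vec))
```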
Vectorization is not possible for Python's lists

> import time
> def func(x):
>     return x ** 4
> n = 10 ** 7   # array length; the original value was lost in transcription
> arr = range(n)
> t0 = time.time()
> arr2 = [None] * n
> for i in arr:
>     arr2[i] = i ** 4
> print "Time using for loop:", time.time() - t0
> t0 = time.time()
> arr3 = map(func, arr)
> print "Time using map:", time.time() - t0
vectorize() in NumPy

> import time
> import numpy as np
> def func(x):
>     return x ** 4
> n = 10 ** 7   # array length; the original value was lost in transcription
> arr = np.arange(0, n, 1, dtype=np.float64)
> arr2 = np.zeros(n)
> t0 = time.time()
> for i in range(n):
>     arr2[i] = arr[i] ** 4
> print "Time using for loop:", time.time() - t0
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print "Time using vectorize:", time.time() - t0
> t0 = time.time()
> arr4 = np.power(arr, 4)
> print "Time using numpy power:", time.time() - t0

Note: vectorize() returns a function object.
Vectorizing conditional statements

> def func(x):
>     if x <= 0:
>         return np.exp(x)
>     else:
>         return np.log(x)
> n = 10 ** 7   # array length; the original value was lost in transcription
> arr = np.random.randn(n)
> arr2 = np.zeros(n)
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print "Time using vectorize:", time.time() - t0
> t0 = time.time()
> arr4 = np.where(arr <= 0, np.exp(arr), np.log(arr))
> print "Time using numpy where:", time.time() - t0
Vectorization in R

> a = 1:1e6
> c = 0
> # compute the sum of squares using a for loop
> system.time(for (e in a) c = c + e^2)
> ##  user  system elapsed
> ##  ...
> system.time(sum(a^2))
> ##  user  system elapsed
> ##  ...

Summary: avoid using for loops to manipulate vectors and matrices.
Typical computer architecture: what else?
Multi-threading
Fork-join model: thread-level parallelism.
Multi-threading

> import numpy as np
> import threading
> import time
> def test(mat_dim, count):
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
> mat_dim = 512
> n_mats = 256
> num_threads = 4
> t0 = time.time()
> threads = [threading.Thread(target=test, args=(mat_dim, n_mats / num_threads))
>            for i in range(num_threads)]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()
> print time.time() - t0
When won't threading work in Python?

Compute-intensive native Python code cannot run concurrently because of the Global Interpreter Lock:

> import threading
> import time
> def test(start, count):
>     s = 0
>     for i in range(start, start + count):
>         s += i
> n_number = 10 ** 6 * 16   # illustrative; the original value was lost in transcription
> num_threads = 8
> t0 = time.time()
> threads = [threading.Thread(target=test,
>                             args=(n_number / num_threads * i, n_number / num_threads))
>            for i in range(num_threads)]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()
> print time.time() - t0
Multi-processing
A master process dispatches tasks to a pool with a fixed number of worker processes.
Multi-processing

> import numpy as np
> import multiprocessing
> import time
> def test(mat_dim, count):
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
> if __name__ == '__main__':
>     mat_dim = 512
>     n_mats = 256
>     num_processes = 4
>     pool = multiprocessing.Pool(processes=num_processes)
>     t0 = time.time()
>     for i in range(num_processes):
>         pool.apply_async(test, (mat_dim, n_mats / num_processes))
>     pool.close()
>     pool.join()   # synchronization / blocking
>     print time.time() - t0
Multi-threading vs. multi-processing?

Multi-threading (threading):
- Only one physical core can run native Python code at a time
- Different threads can share data through Python data types
- Useful for covering I/O and network latency, though

Multi-processing (multiprocessing):
- Implements the "thread pool" idea with processes
- Can use multiple cores
- Child processes cannot share data with each other easily
- Avoid loading data in the master process (it will be copied to EVERY child process!)
- Needs more error-handling code for debugging (no error information is returned)

Recommended use:
- Use multiprocessing
- Each child process should know which files to access for reading and writing data before it starts working
- The number of threads or processes is often set to the number of physical cores in the system
Typical workflow

> import multiprocessing
> def train_logistic(input_folder, output_file):
>     # read data from input_folder
>     # train a logistic model
>     # write the model into output_file
>     pass
> if __name__ == '__main__':
>     pool = multiprocessing.Pool(processes=4)
>     for i in range(16):
>         pool.apply_async(train_logistic,
>                          ('data' + str(i), 'model' + str(i) + '.txt'))
>     pool.close()
>     pool.join()   # synchronization / blocking
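Because apply_async silently discards exceptions raised in a child unless you ask for the result, one way to surface them is to keep the AsyncResult handles and call get(), which re-raises the child's exception in the master. A sketch; risky_task is a hypothetical stand-in for real work:

```python
import multiprocessing

def risky_task(x):
    # Hypothetical workload; fails on negative input to demonstrate error propagation.
    if x < 0:
        raise ValueError("negative input")
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    results = [pool.apply_async(risky_task, (x,)) for x in (3, -1)]
    pool.close()
    pool.join()
    for r in results:
        try:
            print("result:", r.get())      # get() re-raises the child's exception here
        except ValueError as e:
            print("task failed:", e)
```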
Typical computer architecture: what else?
Memory hierarchy
Memory hierarchy
- Registers: 1~2 KB
- L1 cache: 64~128 KB, ~500 GB/s
- L2 cache: 1~2 MB, ~200 GB/s
- L3 cache: 6~12 MB, ~100 GB/s
- Main memory: ~20 GB/s
- USB disk: 480~4800 Mb/s
- SSD hard drive: ~600 MB/s
Cache
A memory read first looks in the cache to see whether the needed data were loaded recently and still reside there; on a miss, the data are loaded from main memory into the cache and then consumed in registers.
Data are loaded from main memory into the cache in units of cache lines (typically 64~128 bytes). Therefore a program that reads memory contiguously is faster than one that reads memory randomly.
Recall from "Dense and sparse matrices"
- For a matrix stored in column-major order, accessing it column by column is faster.
- For a matrix stored in row-major order, accessing it row by row is faster.
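NumPy arrays are row-major (C order) by default, so this is easy to observe from Python; a sketch with an illustrative matrix size:

```python
import time
import numpy as np

n = 2000
A = np.random.rand(n, n)    # row-major (C order) by default

t0 = time.time()
for i in range(n):          # contiguous: each row is one block of memory
    A[i, :].sum()
t_rows = time.time() - t0

t0 = time.time()
for j in range(n):          # strided: each column jumps n*8 bytes per element
    A[:, j].sum()
t_cols = time.time() - t0

print("row-wise: %.4fs   column-wise: %.4fs" % (t_rows, t_cols))
```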
Experiments
Read a fixed-size array in strides of 1, 2, 4, 8, etc., and then in random order.
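A rough Python version of that experiment (array size and strides chosen for illustration; a C version would show the cache effects more cleanly):

```python
import time
import numpy as np

arr = np.random.rand(2 ** 22)      # fixed-size array, ~32 MB of float64

for stride in (1, 2, 4, 8, 16):
    t0 = time.time()
    arr[::stride].sum()            # touches every stride-th element
    print("stride %2d: %.5fs" % (stride, time.time() - t0))

# Random order defeats cache-line reuse and hardware prefetching.
rng = np.random.default_rng(0)
perm = rng.permutation(arr.size)
t0 = time.time()
arr[perm].sum()
print("random   : %.5fs" % (time.time() - t0))
```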
Processor-DRAM performance gap
Even if you access memory optimally, you still can't catch up with the processor.
Peak flops (floating-point operations per second):
(clock rate) x (number of cores) x (number of SIMD flops per cycle for single/double precision)
For example, an Intel Core i7-4700MQ (with AVX): 2.4 GHz x 4 x 8 = 76.8 Gflop/s.
(Compare with up to 20 GB/s peak memory bandwidth.)
We will see compute-bound vs. memory-bound algorithms in "Numerical software stacks".
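That peak-rate arithmetic, written out as a quick check (the i7-4700MQ numbers are from the slide; 8 flops per cycle assumes AVX on doubles):

```python
# Peak flop/s = clock rate x cores x SIMD flops per cycle per core.
clock_hz = 2.4e9
cores = 4
flops_per_cycle = 8              # AVX, double precision (per the slide)

peak_flops = clock_hz * cores * flops_per_cycle
print("peak: %.1f Gflop/s" % (peak_flops / 1e9))       # 76.8 Gflop/s

# Compare with ~20 GB/s memory bandwidth: at 8 bytes per double,
# memory can feed only ~2.5 billion doubles per second.
doubles_per_sec = 20e9 / 8
print("memory supplies: %.1f Gdoubles/s" % (doubles_per_sec / 1e9))
```

The gap between 76.8 Gflop/s of compute and ~2.5 Gdoubles/s of memory traffic is why memory-bound algorithms leave most of the CPU idle.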
Take-home messages
- Today's CPUs and programming are inherently parallel.
- Contiguous memory access is faster than random memory access.
- There is a large gap between the processing speeds of the CPU and main memory.