CSE 6040 Computing for Data Analytics: Methods and Tools
Lecture 12: Computer Architecture Overview and Why It Matters
Da Kuang, Polo Chau, Georgia Tech, Fall 2014
Outline
- Inside a CPU. Why it matters: SIMD operations, multithreading.
- Memory hierarchy. Why it matters: memory access patterns.
We will not get into the nasty details of machine-level code.
Typical computer architecture
[Figure: an Intel CPU with multiple cores and a shared cache]
Inside a CPU (Where is my data?)
- Intel 80386 (1985): first Intel processor supporting 32-bit computing, also called i386.
- Intel Pentium (1997): supports packed data types.
- Intel Pentium III (1999): first to introduce Streaming SIMD Extensions (SSE).
- Latest: Advanced Vector Extensions (AVX), with 256-bit wide registers (announced 2008, shipped 2011).
SIMD: Single Instruction, Multiple Data (instruction-level parallelism)
Supported by successive instruction-set extensions:
- MMX: MultiMedia eXtensions (Intel Pentium, 1997)
- SSE: Streaming SIMD Extensions (Intel Pentium III, 1999)
- SSE2: Streaming SIMD Extensions 2 (Intel Pentium 4, 2001)
- SSE3: Streaming SIMD Extensions 3 (Intel Pentium 4 Prescott, 2004)
- SSSE3: Supplemental Streaming SIMD Extensions 3 (Intel Core Woodcrest, 2006)
- SSE4.1: Streaming SIMD Extensions 4.1 (Intel Core Penryn, 2006)
- SSE4.2: Streaming SIMD Extensions 4.2 (Intel Core i7 Nehalem, 2008)
- AVX: Advanced Vector eXtensions (Intel Sandy Bridge, 2011)
(Note the difference between a brand name and a microarchitecture code name.)
SIMD
[Figure: a scalar instruction operates on one data element at a time; a vector instruction operates on several at once]
SIMD vector types (in C/C++ terms):
- A 128-bit (SSE) register holds 16 x char, 8 x short, 4 x int, 2 x long long, 4 x float, or 2 x double.
- A 256-bit (AVX) register holds 8 x float or 4 x double.
Operation counts for the same elementwise multiply (vmul: vector(ized) multiplication):
- Scalar operations: 16 loads, 8 multiplications, 8 stores.
- SSE operations: 4 loads, 2 vmuls, 2 stores.
- AVX operations: 2 loads, 1 large vmul, 1 store.
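The payoff of these vector instructions is easiest to see from NumPy, whose compiled elementwise loops over contiguous arrays are exactly the kind of code a compiler can map onto SSE or AVX. A minimal sketch (Python 3 syntax, unlike the slides' Python 2; the array contents are an illustrative choice, not from the slides):

```python
import numpy as np

# Two contiguous float32 arrays: SSE can multiply 4 of these per
# instruction and AVX 8, instead of one scalar multiply at a time.
a = np.arange(8, dtype=np.float32)
b = np.full(8, 2.0, dtype=np.float32)

c = a * b  # one vectorized elementwise multiply, not 8 scalar ones

print(c.tolist())
```

The same `*` on a Python list would instead dispatch one interpreted multiplication per element, which is why the list-based loop on the next slide cannot be vectorized.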
Vectorization is not possible for Python's lists:
> import time
> def func(x):
>     return x**4
> arr = range( )
> t0 = time.time()
> arr2 = [None] * ( )
> for i in arr:
>     arr2[i] = i ** 4
> print "Time using for loop:", time.time() - t0
> t0 = time.time()
> arr3 = map(func, arr)
> print "Time using map:", time.time() - t0
vectorize() in NumPy:
> import time
> import numpy as np
> def func(x):
>     return x**4
> arr = np.arange(0, , 1, dtype=np.float64)
> arr2 = np.zeros( )
> t0 = time.time()
> for i in range(arr.size):
>     arr2[i] = arr[i] ** 4
> print "Time using for loop:", time.time() - t0
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print "Time using vectorize:", time.time() - t0
> t0 = time.time()
> arr4 = np.power(arr, 4)
> print "Time using numpy power:", time.time() - t0
Note: vectorize() returns a function object.
Vectorizing conditional statements:
> def func(x):
>     if x <= 0:
>         return np.exp(x)
>     else:
>         return np.log(x)
> arr = np.random.randn( )
> arr2 = np.zeros( )
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print "Time using vectorize:", time.time() - t0
> t0 = time.time()
> arr4 = np.where(arr <= 0, np.exp(arr), np.log(arr))
> print "Time using numpy where:", time.time() - t0
Vectorization in R:
> a = 1:1e6
> c = 0
> # compute the sum of squares using a for loop
> system.time(for (e in a) c = c + e^2)
> ##   user  system elapsed
> system.time(sum(a^2))
> ##   user  system elapsed
Summary: avoid using for-loops to manipulate vectors and matrices.
Typical computer architecture: what else?
Multi-threading
Fork-join model: thread-level parallelism.
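The fork-join model can be sketched with Python's threading module (Python 3 syntax; the partial-sum workload and the names `worker` and `results` are made up for illustration):

```python
import threading

NUM_THREADS = 4
results = [0] * NUM_THREADS

def worker(idx):
    # fork: each thread computes a partial sum over its own slice
    lo, hi = idx * 100, (idx + 1) * 100
    results[idx] = sum(range(lo, hi))

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()     # fork phase: launch all workers
for t in threads:
    t.join()      # join phase: the master waits for every worker

total = sum(results)  # combine the partial results
print(total)
```

The master thread forks the workers, blocks in `join()`, and only then combines their results; the GIL caveat on the next slides explains why this pattern speeds up NumPy calls but not pure-Python loops.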
Multi-threading in Python:
> import numpy as np
> import threading
> import time
> def test(mat_dim, count):
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
> mat_dim = 512
> n_mats = 256
> num_threads = 4
> t0 = time.time()
> threads = [threading.Thread(target=test, args=(mat_dim, n_mats/num_threads)) for i in range(num_threads)]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()
> print time.time() - t0
When threading won't work in Python
Compute-intensive native Python code cannot run concurrently, due to the Global Interpreter Lock (GIL):
> import threading
> import time
> def test(start, count):
>     s = 0
>     for i in range(start, start+count):
>         s += i
> n_number = *16
> num_threads = 8
> t0 = time.time()
> threads = [threading.Thread(target=test, args=(n_number/num_threads*i, n_number/num_threads)) for i in range(num_threads)]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()
> print time.time() - t0
Multi-processing
A pool with a fixed number of workers: the master process hands tasks to the pool, and each worker picks up the next task when it finishes the current one.
Multi-processing in Python:
> import numpy as np
> import multiprocessing
> import time
> def test(mat_dim, count):
>     for i in range(count):
>         A = np.random.rand(mat_dim, mat_dim)
>         B = np.random.rand(mat_dim, mat_dim)
>         tmp = A.dot(B)
> if __name__ == '__main__':
>     mat_dim = 512
>     n_mats = 256
>     num_processes = 4
>     pool = multiprocessing.Pool(processes=num_processes)
>     t0 = time.time()
>     for i in range(num_processes):
>         pool.apply_async(test, (mat_dim, n_mats/num_processes))
>     pool.close()
>     pool.join()  # synchronization / blocking
>     print time.time() - t0
Multi-threading vs. multi-processing
Multi-threading (threading):
- Only one physical core at a time can run native Python code.
- Different threads can share data through Python data types.
- Still useful for covering I/O and network latency.
Multi-processing (multiprocessing):
- Implements the "thread pool" idea with processes.
- Can use multiple cores.
- Child processes cannot easily share data with each other.
- Avoid loading data in the master process: it will be copied to EVERY child process!
- Needs more error-handling code for debugging (no error information is returned).
Recommended use:
- Use multiprocessing.
- Each child process should know which files to read and write before it starts working.
- The number of threads or processes is often set to the number of physical cores in the system.
Typical workflow:
> import multiprocessing
> def train_logistic(input_folder, output_file):
>     # read data from input_folder
>     # train a logistic model
>     # write the model into output_file
>     pass
> if __name__ == '__main__':
>     pool = multiprocessing.Pool(processes=4)
>     for i in range(16):
>         pool.apply_async(train_logistic, ('data' + str(i), 'model' + str(i) + '.txt'))
>     pool.close()
>     pool.join()  # synchronization / blocking
Typical computer architecture: what else?
Memory hierarchy
Memory hierarchy (typical sizes and bandwidths):
- Registers: 1~2 KB
- L1 cache: 64~128 KB, ~500 GB/s
- L2 cache: 1~2 MB, ~200 GB/s
- L3 cache: 6~12 MB, ~100 GB/s
- Main memory: ~20 GB/s
- USB disk: 480~4800 Mb/s
- SSD hard drive: ~600 MB/s
Cache
A memory read first checks the cache to see whether the needed data were loaded recently and still reside there; if not, the data are loaded from main memory into the cache and then consumed in registers. Data move from main memory into the cache in units of cache lines (typically 64~128 bytes). Therefore, a program that reads memory contiguously is faster than one that reads memory randomly.
Recall from "Dense and sparse matrices":
- For a matrix stored in column-major order, accessing it column by column is faster.
- For a matrix stored in row-major order, accessing it row by row is faster.
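NumPy arrays are row-major (C order) by default, which you can see directly from the `strides` attribute: the number of bytes between consecutive elements along each axis. A small sketch (the 2x3 array is an illustrative choice, not from the slides):

```python
import numpy as np

A = np.arange(6, dtype=np.float64).reshape(2, 3)  # row-major by default

# Moving to the next element within a row steps 8 bytes (one float64);
# moving down a column steps a whole row, 3 * 8 = 24 bytes.
print(A.strides)               # (24, 8)
print(A.flags['C_CONTIGUOUS'])  # True
```

The small inner stride is why row-by-row traversal of a C-ordered array stays within one cache line for many consecutive elements, while column-by-column traversal pays a jump of a full row per element.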
Experiments: read a fixed-size array in strides of 1, 2, 4, 8, etc., and then in random order.
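One way to run this experiment in NumPy (the array size, stride list, and the helper name `read_time` are arbitrary choices; on a real machine the per-element cost grows with the stride until every access touches a fresh cache line, with the random order slowest):

```python
import time
import numpy as np

n = 1 << 20                 # a fixed-size array of float64 (~8 MB)
arr = np.random.rand(n)

def read_time(indices):
    # Time a gather of arr at the given index pattern.
    t0 = time.time()
    _ = arr[indices].sum()
    return time.time() - t0

times = {}
for stride in (1, 2, 4, 8, 16):
    times[stride] = read_time(np.arange(0, n, stride))

# Random order touches a new cache line on almost every access.
times['random'] = read_time(np.random.permutation(n))

for pattern, t in times.items():
    print(pattern, t)
```

Note that the fancy-indexed gather itself adds overhead; the slide's original C experiment would walk the raw array, but the relative ordering of the access patterns is the point.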
Processor-DRAM performance gap
Even if you access memory optimally, you still can't catch up with the processor. Calculate the peak flops (floating-point operations per second) as:
(clock rate) x (number of cores) x (number of SIMD flops per cycle, for single/double precision)
For example, an Intel Core i7 4700MQ (with AVX): 2.4 GHz x 4 x 8 = 76.8 Gflop/s. Compare this with a peak memory bandwidth of up to 20 GB/s. We will see compute-bound vs. memory-bound algorithms in "Numerical software stacks".
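The slide's arithmetic, spelled out (the factor 8 is AVX's 8 single-precision operations per cycle per core, as stated above):

```python
# Peak flops = clock rate x number of cores x SIMD flops per cycle.
# Numbers are the slide's example: Intel Core i7 4700MQ with AVX.
clock_ghz = 2.4
num_cores = 4
simd_flops_per_cycle = 8    # AVX 256-bit register = 8 x float32 per op

peak_gflops = clock_ghz * num_cores * simd_flops_per_cycle
print(peak_gflops)   # 76.8 Gflop/s, vs. ~20 GB/s peak memory bandwidth
```

At 20 GB/s the memory system delivers only about 5 billion float32 values per second, an order of magnitude fewer than the 76.8 billion operations the cores could retire, which is why memory-bound algorithms leave most of the peak unused.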
Take-home messages
- Today's CPUs, and programming for them, are inherently parallel.
- Contiguous memory access is faster than random memory access.
- There is a large gap between the processing speed of the CPU and that of main memory.
CSE 160 Lecture 5 The Memory Hierarchy False Sharing Cache Coherence and Consistency Scott B. Baden Using Bang coming down the home stretch Do not use Bang s front end for running mergesort Use batch,
More informationOBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING
OBJECTIVE ANALYSIS WHITE PAPER MATCH ATCHING FLASH TO THE PROCESSOR Why Multithreading Requires Parallelized Flash T he computing community is at an important juncture: flash memory is now generally accepted
More informationHigh Performance Computing Lab Exercises
High Performance Computing Lab Exercises (Make sense of the theory!) Rubin H Landau With Sally Haerer and Scott Clark 6 GB/s CPU cache RAM cache Main Store 32 KB 2GB 2MB 32 TB@ 111Mb/s Computational Physics
More informationSPEEDUP - optimization and porting of path integral MC Code to new computing architectures
SPEEDUP - optimization and porting of path integral MC Code to new computing architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić, A. Bogojević Scientific Computing Laboratory, Institute of Physics
More informationIntroduction to Microprocessors
Introduction to Microprocessors Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?
More informationSWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri
SWARM: A Parallel Programming Framework for Multicore Processors David A. Bader, Varun N. Kanade and Kamesh Madduri Our Contributions SWARM: SoftWare and Algorithms for Running on Multicore, a portable
More informationVector Architectures
EE482C: Advanced Computer Organization Lecture #11 Stream Processor Architecture Stanford University Thursday, 9 May 2002 Lecture #11: Thursday, 9 May 2002 Lecturer: Prof. Bill Dally Scribe: James Bonanno
More informationOpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware
OpenMP & MPI CISC 879 Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction OpenMP MPI Model Language extension: directives-based
More informationHigh-speed image processing algorithms using MMX hardware
High-speed image processing algorithms using MMX hardware J. W. V. Miller and J. Wood The University of Michigan-Dearborn ABSTRACT Low-cost PC-based machine vision systems have become more common due to
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationComputer Architecture. Secure communication and encryption.
Computer Architecture. Secure communication and encryption. Eugeniy E. Mikhailov The College of William & Mary Lecture 28 Eugeniy Mikhailov (W&M) Practical Computing Lecture 28 1 / 13 Computer architecture
More informationLBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
More informationPrice/performance Modern Memory Hierarchy
Lecture 21: Storage Administration Take QUIZ 15 over P&H 6.1-4, 6.8-9 before 11:59pm today Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Last Time Exam discussion
More informationComputer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
More informationPower System Probabilistic and Security Analysis on Commodity High Performance Computing Systems
Power System Probabilistic and Security Analysis on Commodity High Performance Computing Systems Tao Cui Carnegie Mellon University 5 Forbes Ave. Pittsburgh, PA 15213 tao.cui@ieee.org ABSTRACT Large scale
More informationCapacity Planning Process Estimating the load Initial configuration
Capacity Planning Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting
More informationSchool of Computing and Information Sciences. Course Title: Computer Programming III Date: April 9, 2014
Course Title: Computer Date: April 9, 2014 Course Number: Number of Credits: 3 Subject Area: Programming Subject Area Coordinator: Tim Downey email: downeyt@cis.fiu.edu Catalog Description: Programming
More informationMulti-Core Programming
Multi-Core Programming Increasing Performance through Software Multi-threading Shameem Akhter Jason Roberts Intel PRESS Copyright 2006 Intel Corporation. All rights reserved. ISBN 0-9764832-4-6 No part
More informationGPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
More informationBenchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
More informationUsing Synology SSD Technology to Enhance System Performance Synology Inc.
Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...
More informationGenerations of the computer. processors.
. Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations
More information