An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors
|
|
|
- Kory Hill
- 9 years ago
- Views:
Transcription
1 An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors Matthew Curtis-Maury, Xiaoning Ding, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos The College of William & Mary
2 Content Motivation of this Evaluation Overview of Multithreaded/Multicore Processors Experimental Methodology OpenMP Evaluation Adaptive Multithreading Degree Selection Implications for OpenMP Conclusions
3 Motivation CMPs and SMTs are gaining popularity SMTs in high-end and mainstream computers Intel Xeon HT CMPs beginning to see same trend Intel Pentium-D Combined approach showing promise IBM Power5 and Intel Pentium-D Extreme Edition Given this popularity, evaluation of codes parallelized with OpenMP timely and necessary
4 Three Goals Compare Multiprocessors of CMPs and SMTs Low-level comparison (hardware counters) High-level comparison (execution time) Locate architectural bottlenecks on each Find ways to improve OpenMP for these architectures without modifying interface Awareness of underlying architecture
5 Content Motivation of this Evaluation Overview of Multithreaded/Multicore Processors Experimental Methodology OpenMP Evaluation Adaptive Multithreading Degree Selection Implications for OpenMP Conclusions
6 Multithreaded and Multicore Processors Execute multiple threads on single chip Resource replication within processor Improved cost/performance ratio Minimal increases in architectural complexity provide significant increases in performance
7 Simultaneous Multithreading Minimal resource replication Provides instructions to overlap memory latency Separate threads exploit idle resources Context1 Functional Units Context2 L1 Cache L2 Cache Main Memory
8 Chip Multiprocessing Much larger degree of resource replication Two complete processing cores on each chip Outer levels of cache and external interface are shared Greatly reduced resource contention compared to SMT Context1 Functional Units Context2 Functional Units L1 Cache L1 Cache L2 Cache Main Memory
9 Content Motivation of this Evaluation Overview of Multithreaded/Multicore Processors Experimental Methodology OpenMP Evaluation Adaptive Multithreading Degree Selection Implications for OpenMP Conclusions
10 Experimental Methodology Real 4-way server based on Intel s HT processors Representative of SMT class of architectures 2 execution contexts per chip Shared execution units, cache hierarchy, and DTLB Simulated 4-way CMP-based multiprocessor Used the Simics simulation environment (full system) 2 execution cores per chip Configured to be similar to SMT machine (cache configuration) 8K data L1, 256K L2, 512K L2, 64 entry TLB, 1GB main memory Private L1 and DTLB per core doubles effective space Shared L2 and L3 caches
11 Benchmarks We used the NAS Parallel Benchmark Suite OpenMP version Class A Ran 1, 2, 4, and 8 threads Bound to 1, 2, and 4 processors 1 and 2 contexts per processor
12 Benchmarks We used the NAS Parallel Benchmark Suite OpenMP version Class A Ran 1, 2, 4, and 8 threads Bound to 1, 2, and 4 processors 1 and 2 contexts per processor T 0
13 Benchmarks We used the NAS Parallel Benchmark Suite OpenMP version Class A Ran 1, 2, 4, and 8 threads Bound to 1, 2, and 4 processors 1 and 2 contexts per processor T 0 T 1
14 Benchmarks We used the NAS Parallel Benchmark Suite OpenMP version Class A Ran 1, 2, 4, and 8 threads Bound to 1, 2, and 4 processors 1 and 2 contexts per processor T 0 T 1
15 Benchmarks We used the NAS Parallel Benchmark Suite OpenMP version Class A Ran 1, 2, 4, and 8 threads Bound to 1, 2, and 4 processors 1 and 2 contexts per processor T 0 T 1 T 2 T 3
16 Benchmarks We used the NAS Parallel Benchmark Suite OpenMP version Class A Ran 1, 2, 4, and 8 threads Bound to 1, 2, and 4 processors 1 and 2 contexts per processor T 0 T 1 T 2 T 3
17 Benchmarks We used the NAS Parallel Benchmark Suite OpenMP version Class A Ran 1, 2, 4, and 8 threads Bound to 1, 2, and 4 processors 1 and 2 contexts per processor T 0 T 1 T 2 T 3 T 4 T 5 T 6 T 7
18 Benchmarks, cont. On SMT machine, ran benchmarks to completion Collected HW statistics with VTune Simulator introduces average of 7000-fold slowdown on execution for CMP Ran same data set as on SMT Ran only 3 iterations of outermost loop, discarding first for cache warm-up Simics simulator directly provides HW statistics
19 Content Motivation of this Evaluation Overview of Multithreaded/Multicore Processors Experimental Methodology OpenMP Evaluation Adaptive Multithreading Degree Selection Implications for OpenMP Conclusions
20 Hardware Statistics Collected Monitored direct metrics Wall clock time, number of instructions, number of L2 and L3 references and misses, number of stall cycles, number of data TLB misses, and number of bus transactions and derived metrics Cycles per instruction and L2 and L3 miss rates Due to time and space limitations, we present: L2 references, L2 miss rates, DTLB misses, stall cycles, and execution time Most impact on performance Provide insight into performance
21 L2 References On SMT, two threads executing causes L2 references to go up by 42% On CMP, running two threads causes L2 references to go down by 37%
22 L2 Miss Rate SMT L2 miss rate highly dependent upon application characteristics
23 L2 Miss Rate SMT If working sets of both threads do not fit into shared cache, L2 miss rate increases
24 L2 Miss Rate SMT On the other hand, applications can benefit from data sharing in the shared cache
25 L2 Miss Rate SMT CG has a high degree of data sharing which is good with one processor but has negative consequences with more processors - Inter-processor data sharing results in cache line invalidations
26 L2 Miss Rate SMT Tradeoffs between sharing in the L2 of one processor and increased cumulative L2 space from multiple processors
27 L2 Miss Rate CMP L2 miss rate much more stable on the CMP processors
28 L2 Miss Rate CMP L2 miss rate generally uncorrelated to number of threads per processor
29 L2 Miss Rate CMP The large working set of FT is still a problem for 1 and 2 processors
30 L2 Miss Rate CMP CG retains the property observed on SMT as well
31 L2 Miss Rate Comparison More potential for L2 data sharing on SMT, with shared L1 Private L1s can reduce L2 sharing, less L2 accesses On CMP, L2 not as affected by executing two threads per processor
32 Data TLB Misses SMT The number of DTLB misses increases dramatically with use of second execution context
33 Data TLB Misses SMT DTLB misses suffer up to a 32-fold increase
34 Data TLB Misses SMT 6 executions suffer a 20 or more fold increase
35 Data TLB Misses SMT Intel s HT processor has surprisingly small DTLB -> poor coverage of the virtual address space
36 Data TLB Misses CMP CMP provides private DTLB to each core, which results in much more stable DTLB performance
37 Data TLB Misses CMP The majority of the executions experience normalized DTLB misses quite close to 1
38 Data TLB Misses CMP DTLB misses may decrease with 2 threads due to the cumulatively larger DTLB size from the DTLB duplication
39 Data TLB Misses CMP But if entries are duplicated between threads, then benefits of replicated DTLBs are reduced
40 Data TLB Misses Comparison Privatizing the DTLB significantly reduces misses SMT average 10.8-fold increase CMP average 0% increase Not very affected by multiple threads on a processor
41 Stall Cycles SMT On SMT, stall cycles represent cumulative effects of waiting for memory accesses and resource contention between co-executing threads
42 Stall Cycles SMT Stall cycles for all executions increase with use of second execution context
43 Stall Cycles SMT In the best case, MG, stall cycles still increase by about a factor of 2
44 Stall Cycles CMP CMP only shares outer levels of cache and interface to external devices, which greatly reduces possible sources of stall cycles
45 Stall Cycles CMP Once again, CMP s resource replication results in a stabilized number of stall cycles, close to 1
46 Stall Cycles CMP FT has a relatively large increase in stall cycles As we have already seen, it suffers from contention in the L2 and DTLB, even on the CMP architecture
47 Stall Cycles Comparison Increase of 310% for SMT vs. only 3% for CMP Signifies that vast majority of stalls on SMT result from contention for internal processor resources
48 Execution Time SMT Two ways to evaluate the data: Fixed number of CPUs, different number of threads Fixed number of threads, different number of CPUs
49 Execution Time SMT Running two threads on single CPU is not always beneficial for execution time compared to using a single thread
50 Execution Time SMT Running two threads on single CPU is not always beneficial for execution time Good in some cases
51 Execution Time SMT Running two threads on single CPU is not always beneficial for execution time Bad in others
52 Execution Time SMT Even for a given application, neither one thread nor two threads per processor is always optimal
53 Execution Time SMT For a fixed number of threads, it is always better to execute them on as many different physical processors as possible
54 Execution Time CMP CMP, on the other hand, utilizes two threads per CPU very well
55 Execution Time CMP The activation of the second execution context was always beneficial
56 Execution Time CMP For a given number of threads, it was often better to run them on as few processors as possible
57 Execution Time Comparison CMP handles using two threads per processor much better than SMT Due to greater resource replication in CMP, which reduces contention CMP is a cost-effective means of improving performance
58 Content Motivation of this Evaluation Overview of Multithreaded/Multicore Processors Experimental Methodology OpenMP Evaluation Adaptive Multithreading Degree Selection Implications for OpenMP Conclusions
59 Adaptive Approach Description Neither 1 or 2 threads per CPU is always better Based on work by Zhang, et al from U. Toronto (PDCS 04) we try both and use whichever performs better Selection is performed at the granularity of a parallel region Function calls before and after each region, could be inserted by preprocessor We only consider number of threads, rather than scheduling policy However, no manual changes to source code And no modifications to compiler or OpenMP runtime
60 Description, cont. Outermost Loop { }!$OMP PARALLEL{ } // Parallel Region 1!$OMP PARALLEL{ } // Parallel Region 2!$OMP PARALLEL { } // Parallel Region N Since NPB are iterative, we record execution time of 2 nd and 3 rd iterations with 1 and 2 threads Ignore 1 st iteration as cache warm-up Whichever number of threads performs better is used when the region is encountered in the future
61 Adaptive Experiments Used the same 7 NPB benchmarks along with two other OpenMP codes MM5: a mesoscale weather prediction model Cobra: a matrix pseudospectrum code Ran on 1, 2, 3, and 4 processors Compared adaptive execution times to both 1 and 2 threads per processor
62 Results from Adaptation Graph shows relative performance of each approach for 1, 2, 3, and 4 processors 1 thread per processor 2 threads per processor Adaptive approach
63 Results from Adaptation Graph shows relative performance of each approach for 1, 2, 3, and 4 processors 1 thread per processor 2 threads per processor Adaptive approach
64 Results from Adaptation Graph shows relative performance of each approach for 1, 2, 3, and 4 processors 1 thread per processor 2 threads per processor Adaptive approach
65 Results from Adaptation
66 Results from Adaptation Adaptation does not perform well for MG MG has only 4 iterations and our approach takes 3
67 Results from Adaptation Adaptation does not perform well for MG MG has only 4 iterations and our approach takes 3 CG, however, performs well with only 15 iterations So it does not require many iterations to be profitable
68 Results from Adaptation In 17 of the 36 experiments, adaptation did better than either static number of threads
69 Results from Adaptation In 17 of the 36 experiments, adaptation did better than either static number of threads In Cobra, adaptation was the best for all numbers of processors
70 Results from Adaptation Compared to optimal static number of threads, adaptation was only 3.0% slower It was, however, 10.7% faster than the worse static number of threads The average overall speedup was 3.9% This shows that adaptation provides a good approximation of the optimal number of threads Requires no a priori knowledge However, does not overcome inherent architectural bottlenecks
71 Content Motivation of this Evaluation Overview of Multithreaded/Multicore Processors Experimental Methodology OpenMP Evaluation Adaptive Multithreading Degree Selection Implications for OpenMP Conclusions
72 Implications for OpenMP Our study indicates that OpenMP scales effortlessly on CMPs It is important to consider optimizations of OpenMP for SMT processors Viable technology for improving performance on a single core These optimizations could come from: Additional runtime environment support Extensions to the programming interface
73 OpenMP Optimizations for SMT Co-executing thread identification is most important optimization New SCHEDULE clause may be used Can assign iterations to SMTs These iterations can then be split between co-executing threads using SMT-aware policy OpenMP thread groups extensions may be used Co-executing threads go to same group Use SMT-aware scheduling and local synchronization Not necessarily nested parallelism
74 OpenMP Optimizations for SMT Necessity of thread binding SMT-aware optimizations require threads to remain on the same processor Some applications may benefit from running 2 threads on the same processor Use of proposed mechanisms, like ONTO clause However, exposing architecture internals in the programming interface is undesirable in OpenMP New mechanisms for improving execution on SMT processors in an autonomic manner
75 Content Motivation of this Evaluation Overview of Multithreaded/Multicore Processors Experimental Methodology OpenMP Evaluation Adaptive Multithreading Degree Selection Implications for OpenMP Conclusions
76 Conclusions Evaluated the performance of OpenMP applications on SMT/CMP-based multiprocessors SMTs suffer from contention on shared resources CMPs more efficient due to greater resource replication CMPs appear to be more cost effective Adaptively selecting the optimal number of threads helps SMT performance However, inherent architectural bottlenecks hinder the efficient exploitation of these architectures Identified OpenMP functionality that could be used to boost performance on SMTs
Operating System Impact on SMT Architecture
Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th
Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007
Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer
An examination of the dual-core capability of the new HP xw4300 Workstation
An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings
Operatin g Systems: Internals and Design Principle s Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Bear in mind,
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Multicore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Thread level parallelism
Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra
Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup
Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Parallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
Energy-Efficient, High-Performance Heterogeneous Core Design
Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,
MAGENTO HOSTING Progressive Server Performance Improvements
MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 [email protected] 1.866.963.0424 www.simplehelix.com 2 Table of Contents
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
x64 Servers: Do you want 64 or 32 bit apps with that server?
TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called
Multi-core Curriculum Development at Georgia Tech: Experience and Future Steps
Multi-core Curriculum Development at Georgia Tech: Experience and Future Steps Ada Gavrilovska, Hsien-Hsin-Lee, Karsten Schwan, Sudha Yalamanchili, Matt Wolf CERCS Georgia Institute of Technology Background
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
A Pattern-Based Approach to. Automated Application Performance Analysis
A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,
Testing Database Performance with HelperCore on Multi-Core Processors
Project Report on Testing Database Performance with HelperCore on Multi-Core Processors Submitted by Mayuresh P. Kunjir M.E. (CSA) Mahesh R. Bale M.E. (CSA) Under Guidance of Dr. T. Matthew Jacob Problem
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?
Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and
VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5
Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.
<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization
An Experimental Model to Analyze OpenMP Applications for System Utilization Mark Woodyard Principal Software Engineer 1 The following is an overview of a research project. It is intended
PARALLELS CLOUD STORAGE
PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are
Disk Storage Shortfall
Understanding the root cause of the I/O bottleneck November 2010 2 Introduction Many data centers have performance bottlenecks that impact application performance and service delivery to users. These bottlenecks
Optimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
Multicore Programming with LabVIEW Technical Resource Guide
Multicore Programming with LabVIEW Technical Resource Guide 2 INTRODUCTORY TOPICS UNDERSTANDING PARALLEL HARDWARE: MULTIPROCESSORS, HYPERTHREADING, DUAL- CORE, MULTICORE AND FPGAS... 5 DIFFERENCES BETWEEN
Scheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
Symmetric Multiprocessing
Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called
A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
Advanced Core Operating System (ACOS): Experience the Performance
WHITE PAPER Advanced Core Operating System (ACOS): Experience the Performance Table of Contents Trends Affecting Application Networking...3 The Era of Multicore...3 Multicore System Design Challenges...3
Workshare Process of Thread Programming and MPI Model on Multicore Architecture
Vol., No. 7, 011 Workshare Process of Thread Programming and MPI Model on Multicore Architecture R. Refianti 1, A.B. Mutiara, D.T Hasta 3 Faculty of Computer Science and Information Technology, Gunadarma
1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015
Operating Systems 05. Threads Paul Krzyzanowski Rutgers University Spring 2015 February 9, 2015 2014-2015 Paul Krzyzanowski 1 Thread of execution Single sequence of instructions Pointed to by the program
Multi-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
Enabling Technologies for Distributed and Cloud Computing
Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading
Understanding Hardware Transactional Memory
Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence
Course Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 [email protected]
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
Sequential Performance Analysis with Callgrind and KCachegrind
Sequential Performance Analysis with Callgrind and KCachegrind 4 th Parallel Tools Workshop, HLRS, Stuttgart, September 7/8, 2010 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation
OpenACC Parallelization and Optimization of NAS Parallel Benchmarks
OpenACC Parallelization and Optimization of NAS Parallel Benchmarks Presented by Rengan Xu GTC 2014, S4340 03/26/2014 Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman HPC Tools
LSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
Design and Implementation of the Heterogeneous Multikernel Operating System
223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,
LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.
LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability
Low Power AMD Athlon 64 and AMD Opteron Processors
Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD
Delivering Quality in Software Performance and Scalability Testing
Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,
MS EXCHANGE SERVER ACCELERATION IN VMWARE ENVIRONMENTS WITH SANRAD VXL
MS EXCHANGE SERVER ACCELERATION IN VMWARE ENVIRONMENTS WITH SANRAD VXL Dr. Allon Cohen Eli Ben Namer [email protected] 1 EXECUTIVE SUMMARY SANRAD VXL provides enterprise class acceleration for virtualized
Sequential Performance Analysis with Callgrind and KCachegrind
Sequential Performance Analysis with Callgrind and KCachegrind 2 nd Parallel Tools Workshop, HLRS, Stuttgart, July 7/8, 2008 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut
HyperThreading Support in VMware ESX Server 2.1
HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect
Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:
Tools Page 1 of 13 ON PROGRAM TRANSLATION A priori, we have two translation mechanisms available: Interpretation Compilation On interpretation: Statements are translated one at a time and executed immediately.
Scalability evaluation of barrier algorithms for OpenMP
Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science
GeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
Parallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller [email protected] Rechen- und Kommunikationszentrum
End-user Tools for Application Performance Analysis Using Hardware Counters
1 End-user Tools for Application Performance Analysis Using Hardware Counters K. London, J. Dongarra, S. Moore, P. Mucci, K. Seymour, T. Spencer Abstract One purpose of the end-user tools described in
What is Multi Core Architecture?
What is Multi Core Architecture? When a processor has more than one core to execute all the necessary functions of a computer, it s processor is known to be a multi core architecture. In other words, a
Enabling Technologies for Distributed Computing
Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies
The Truth Behind IBM AIX LPAR Performance
The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 [email protected] www.orsyp.com
Contributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
PARALLELS CLOUD SERVER
PARALLELS CLOUD SERVER Performance and Scalability 1 Table of Contents Executive Summary... Error! Bookmark not defined. LAMP Stack Performance Evaluation... Error! Bookmark not defined. Background...
Inside the Erlang VM
Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for
Quiz for Chapter 1 Computer Abstractions and Technology 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
Embedded Parallel Computing
Embedded Parallel Computing Lecture 5 - The anatomy of a modern multiprocessor, the multicore processors Tomas Nordström Course webpage:: Course responsible and examiner: Tomas
Multicore Processor and GPU. Jia Rao Assistant Professor in CS http://cs.uccs.edu/~jrao/
Multicore Processor and GPU Jia Rao Assistant Professor in CS http://cs.uccs.edu/~jrao/ Moore s Law The number of transistors on integrated circuits doubles approximately every two years! CPU performance
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. [email protected] Andrew Munzer Director, Training and Customer
Exploiting GPU Hardware Saturation for Fast Compiler Optimization
Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom [email protected] Christophe Dubach School of Informatics
Windows Server Performance Monitoring
Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
OpenMP and Performance
Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group [email protected] IT Center der RWTH Aachen University Tuning Cycle Performance Tuning aims to improve the runtime of an
Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55%
openbench Labs Executive Briefing: April 19, 2013 Condusiv s Server Boosts Performance of SQL Server 2012 by 55% Optimizing I/O for Increased Throughput and Reduced Latency on Physical Servers 01 Executive
Performance evaluation
Performance evaluation Arquitecturas Avanzadas de Computadores - 2547021 Departamento de Ingeniería Electrónica y de Telecomunicaciones Facultad de Ingeniería 2015-1 Bibliography and evaluation Bibliography
Proactive, Resource-Aware, Tunable Real-time Fault-tolerant Middleware
Proactive, Resource-Aware, Tunable Real-time Fault-tolerant Middleware Priya Narasimhan T. Dumitraş, A. Paulos, S. Pertet, C. Reverte, J. Slember, D. Srivastava Carnegie Mellon University Problem Description
64-Bit versus 32-Bit CPUs in Scientific Computing
64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples
Seeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
Virtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
JBoss Data Grid Performance Study Comparing Java HotSpot to Azul Zing
JBoss Data Grid Performance Study Comparing Java HotSpot to Azul Zing January 2014 Legal Notices JBoss, Red Hat and their respective logos are trademarks or registered trademarks of Red Hat, Inc. Azul
Multi-Core Programming
Multi-Core Programming Increasing Performance through Software Multi-threading Shameem Akhter Jason Roberts Intel PRESS Copyright 2006 Intel Corporation. All rights reserved. ISBN 0-9764832-4-6 No part
Using Power to Improve C Programming Education
Using Power to Improve C Programming Education Jonas Skeppstedt Department of Computer Science Lund University Lund, Sweden [email protected] jonasskeppstedt.net jonasskeppstedt.net [email protected]
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
LLamasoft K2 Enterprise 8.1 System Requirements
Overview... 3 RAM... 3 Cores and CPU Speed... 3 Local System for Operating Supply Chain Guru... 4 Applying Supply Chain Guru Hardware in K2 Enterprise... 5 Example... 6 Determining the Correct Number of
Multi-core and Linux* Kernel
Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores
