Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources

1 Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack Davidson, Mary Lou Soffa
Department of Computer Science, University of Virginia
ISPASS 2012
This research is supported in part by NSF grant number CCF

2 Motivation
- Chip-multiprocessors offer a large number of cores and ample resources
- The number of simultaneously executing applications is increasing
- Careful resource management is critical
- Thread mapping is a powerful technique for resource management

3 Challenges for Thread Mapping
- Multiple resources are affected
- Threads demonstrate various run-time characteristics
- Multi-threaded workloads are emerging

4 Goal of this Research
Analyze why a particular thread mapping is better than another mapping:
- What are the resources that cause the performance differences?
- What are the thread characteristics that cause the resource utilization differences?
- What is the relative importance of various resources?

5 Contributions
- In-depth performance analyses of various thread mappings using multi-threaded applications on real hardware
- Identify the key hardware resources
- Determine the impact on key resource utilization
- Introduce a new metric, L2MP, to analyze the performance of the combined memory resources
- Provide a ranking of the resources

6 Outline
- Motivation
- Challenges
- Contributions
- Overview: resources, metrics, mappings
- Analysis: prefetchers, processor cores
- Key findings for thread mapping
- Conclusion

7 Overview
A comprehensive analysis considering various factors:
- Application's performance
- Application's characteristics
- Hardware resources shared by applications
- Utilization of the resources

8 Resources and Metrics
Resources
- Memory resources: L1 I/D, I/D TLB, L2, prefetchers, memory interconnect
- Processor resources: memory disambiguation units, branch predictors, processor core
Metrics
- Cache misses, mis-predictions, memory latency (with hardware performance counters (HPCs))
- Processor utilization (from OS)
- Execution cycles and execution time

9 Thread Characteristics of Multithreaded Applications
Single thread characteristics
- Cache demand
- Memory bandwidth demand
- I/O frequency
- Prefetcher effectiveness
- Prefetcher excessiveness
Multiple thread characteristics (sibling threads)
- Data and instruction sharing
- Frequency of synchronization

10 Four Thread Mappings
Cores 0 and 1 share LLC0; cores 2 and 3 share LLC1. a1 and a2 denote threads of App 1 and App 2.
Mapping | Core 0     | Core 1     | Core 2     | Core 3
OSMap   | Any thread | Any thread | Any thread | Any thread
IsoMap  | a1, a1     | a1, a1     | a2, a2     | a2, a2
IntMap  | a1, a1     | a2, a2     | a1, a1     | a2, a2
SprMap  | a1, a2     | a1, a2     | a1, a2     | a1, a2
[Figure: App 1 and App 2 threads placed on Cores 0-3. Each core has an L1 cache and TLB; each pair of cores shares an L2 cache and hardware prefetchers; all cores share the off-chip memory interconnect.]
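
The three fixed mappings in the table can be sketched as plain data. This is a minimal illustration, not the authors' tooling: the thread labels (`a1.t0`, ...), the function names, and the choice of which specific threads of an application land on which core are assumptions; only the per-core application pattern comes from the slide. OSMap is left out because the OS scheduler decides placement.

```python
from typing import Dict, List, Set

# From the slide: cores 0-1 share LLC0 and cores 2-3 share LLC1.
LLC_OF_CORE = {0: "LLC0", 1: "LLC0", 2: "LLC1", 3: "LLC1"}


def build_mapping(name: str) -> Dict[int, List[str]]:
    """Return {core: [thread labels]} for IsoMap, IntMap, or SprMap.

    Assumes two applications (a1, a2) with four threads each, so every
    core hosts exactly two threads. Thread labels are hypothetical.
    """
    a1 = [f"a1.t{i}" for i in range(4)]
    a2 = [f"a2.t{i}" for i in range(4)]
    if name == "IsoMap":   # each application isolated on its own LLC
        return {0: a1[0:2], 1: a1[2:4], 2: a2[0:2], 3: a2[2:4]}
    if name == "IntMap":   # applications interleaved core by core
        return {0: a1[0:2], 1: a2[0:2], 2: a1[2:4], 3: a2[2:4]}
    if name == "SprMap":   # each application spread over all cores
        return {core: [a1[core], a2[core]] for core in range(4)}
    raise ValueError(f"unknown mapping: {name}")


def apps_per_llc(mapping: Dict[int, List[str]]) -> Dict[str, Set[str]]:
    """Report which applications end up sharing each last-level cache."""
    result: Dict[str, Set[str]] = {"LLC0": set(), "LLC1": set()}
    for core, threads in mapping.items():
        for thread in threads:
            result[LLC_OF_CORE[core]].add(thread.split(".")[0])
    return result


if __name__ == "__main__":
    for name in ("IsoMap", "IntMap", "SprMap"):
        print(name, build_mapping(name), apps_per_llc(build_mapping(name)))
```

On Linux, the resulting core assignments could be applied with a call such as `os.sched_setaffinity`; that step is omitted here because it depends on how the threads are created.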

11 Experimental Setup
Platform & Workloads
- Intel Core 2 Q9550 processor
- PARSEC benchmark suite
- All possible pairs (36) using the 9 benchmarks
- 4 worker threads per benchmark
[Figure: Cores 0 and 1 (each with L1 and TLB) share one L2 cache and hardware prefetchers; Cores 2 and 3 share a second L2 cache and hardware prefetchers; both sides connect to the memory controller and memory.]
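
The pair count can be checked directly: choosing 2 distinct benchmarks out of 9 gives 36 combinations. The sketch below assumes the nine benchmarks are the ones named on the "Findings for Multiple Resources" slide later in the deck.

```python
from itertools import combinations

# Assumption: these are the nine PARSEC benchmarks used, as listed on the
# "Findings for Multiple Resources" slide.
BENCHMARKS = [
    "streamcluster", "canneal", "facesim", "fluidanimate",
    "swaptions", "blackscholes", "vips", "x264", "bodytrack",
]

# Unordered pairs of distinct benchmarks (no benchmark paired with itself).
PAIRS = list(combinations(BENCHMARKS, 2))

if __name__ == "__main__":
    print(len(PAIRS))  # 36
```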

12 Key Resources
A resource is identified as key when:
- Utilization of the resource varies considerably
- The utilization variation results in a difference in the application's performance
Identification technique
- Direct approach: use HPCs
- Indirect approach: use the application's performance in different mappings
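
One way to read the two criteria above is as a pair of checks over per-mapping measurements. The sketch below is only an illustration: the spread measure, the 10% threshold, and the function names are assumptions, not the procedure used in the study.

```python
from typing import Dict, Iterable


def relative_spread(values: Iterable[float]) -> float:
    """(max - min) / max over the measurements; 0.0 if max is zero."""
    vals = list(values)
    hi, lo = max(vals), min(vals)
    return (hi - lo) / hi if hi else 0.0


def is_key_resource(
    metric_by_mapping: Dict[str, float],
    cycles_by_mapping: Dict[str, float],
    threshold: float = 0.10,  # hypothetical cut-off for "varies considerably"
) -> bool:
    """True when both the resource metric and the execution cycles vary
    across mappings by more than the threshold."""
    return (
        relative_spread(metric_by_mapping.values()) > threshold
        and relative_spread(cycles_by_mapping.values()) > threshold
    )
```

A call would pass, for example, L2 miss counts and execution cycles keyed by mapping name (`"IsoMap"`, `"IntMap"`, `"SprMap"`), with the values read from hardware performance counters.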

13 Key Resources
More important resources
- Memory resources: L1D-cache, L2-cache, hardware prefetchers, memory interconnect
- Processor resources: branch predictor, processor core
Less important resources
- Memory resources: L1I-cache, I/D TLB
- Processor resources: memory disambiguation unit

14 Analysis: Hardware Prefetchers
Experimental results: streamcluster (with blackscholes)

15 Key Findings for Hardware Prefetchers
Case 1: Threads that share a large amount of data
- Sharing the same cache improves performance

16 Key Findings for Hardware Prefetchers
Case 2: Threads that have low or no data sharing but high prefetcher excessiveness
- Sharing the same prefetchers improves performance

17 Key Findings for Hardware Prefetchers
Case 3: Threads that have low data sharing and low prefetcher excessiveness
- Fewer cache misses and prefetch operations improve performance

18 Analysis: Processor Cores
Processor utilization

19 Analysis: Processor Cores
Performance impact

20 Key Findings for Processor Cores
Case 1: Sibling threads have frequent synchronization

21 Key Findings for Processor Cores
Case 2: Sibling threads have frequent I/O operations

22 Managing Multiple Resources: Example
- L2 caches, prefetchers, and memory bandwidth are closely related resources
- A single metric is used to evaluate their aggregated performance impact
- L2MP: L2-cache-misses-memory-latency product
- L2MP = L2_cache_misses × Memory_latency
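
The L2MP definition above is a single multiplication, sketched here for concreteness. The function name and the choice of units (miss count and average memory latency in cycles) are assumptions; the slide gives only the product itself.

```python
def l2mp(l2_cache_misses: float, memory_latency: float) -> float:
    """L2MP = L2_cache_misses x Memory_latency.

    Assumption: misses are a raw count and latency is an average per-access
    latency in cycles, both read from hardware performance counters.
    """
    return l2_cache_misses * memory_latency
```

Computing `l2mp` once per mapping from the same pair of counters gives one number per mapping, which can then be compared alongside execution time.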

23 L2MP
L2MP is a good indicator of performance

24 Managing Multiple Resources
Thread mapping algorithms should:
- Consider all the key resources together
- Improve the utilization of the resources that provide the maximum benefit
- Consider the co-running application's characteristics

25 Findings for Multiple Resources
For memory-intensive applications (streamcluster, canneal, facesim, fluidanimate)
- Maximize the L2MP metric
For I/O- or CPU-intensive applications (swaptions, blackscholes, vips, x264, bodytrack)
- Maximize processor utilization

26 Conclusion
- Identified six key resources
- Analyzed how to map threads with particular characteristics to improve resource utilization
- Introduced a new metric, L2MP, for managing key memory resources
- Determined the relative importance of the key resources

27 Related Work
Shared-cache-aware thread mapping
- Jiang et al., PACT 2008
- Chandra et al., HPCA 2005
- Xie et al., CMP-MSI 2008
- Knauerhase et al., IEEE Micro 2008
Cache-Prefetcher-FSB-aware thread mapping
- Zhuravlev et al., ASPLOS 2010

28 Thank you & Questions?


More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

Host Power Management in VMware vsphere 5

Host Power Management in VMware vsphere 5 in VMware vsphere 5 Performance Study TECHNICAL WHITE PAPER Table of Contents Introduction.... 3 Power Management BIOS Settings.... 3 Host Power Management in ESXi 5.... 4 HPM Power Policy Options in ESXi

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

An OS-oriented performance monitoring tool for multicore systems

An OS-oriented performance monitoring tool for multicore systems An OS-oriented performance monitoring tool for multicore systems J.C. Sáez, J. Casas, A. Serrano, R. Rodríguez-Rodríguez, F. Castro, D. Chaver, M. Prieto-Matias Department of Computer Architecture Complutense

More information

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

TRACE PERFORMANCE TESTING APPROACH. Overview. Approach. Flow. Attributes

TRACE PERFORMANCE TESTING APPROACH. Overview. Approach. Flow. Attributes TRACE PERFORMANCE TESTING APPROACH Overview Approach Flow Attributes INTRODUCTION Software Testing Testing is not just finding out the defects. Testing is not just seeing the requirements are satisfied.

More information

Analyse de performances pour les systèmes intégrés multi-cœurs

Analyse de performances pour les systèmes intégrés multi-cœurs Ben Salma SANA Laboratoire TIMA SLS Analyse de performances pour les systèmes intégrés multi-cœurs Encadrants: Frederic Petrot (Frederic.Petrot@imag.fr) Nicolas Fournel (Nicolas.Fournel@imag.fr) Page 1

More information

Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems

Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems Data Sharing or Resource Contention: Toward Performance Transparency on Multicore Systems Sharanyan Srikanthan, Sandhya Dwarkadas, and Kai Shen, University of Rochester https://www.usenix.org/conference/atc5/technical-session/presentation/srikanthan

More information

Identifying the Optimal Energy-Efficient Operating Points of Parallel Workloads

Identifying the Optimal Energy-Efficient Operating Points of Parallel Workloads Identifying the Optimal Energy-Efficient Operating Points of Parallel Workloads Ryan Cochran School of Engineering Brown University Providence, RI 2912 ryan_cochran@brown.edu Can Hankendi ECE Department

More information

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu. Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.tw Review Computers in mid 50 s Hardware was expensive

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors

Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors Accurate Characterization of the Variability in Power Consumption in Modern Mobile Processors Bharathan Balaji, John McCullough, Rajesh K. Gupta, Yuvraj Agarwal University of California, San Diego {bbalaji,

More information

MONITORING power consumption of a microprocessor

MONITORING power consumption of a microprocessor IEEE TRANSACTIONS ON CIRCUIT AND SYSTEMS-II, VOL. X, NO. Y, JANUARY XXXX 1 A Study on the use of Performance Counters to Estimate Power in Microprocessors Rance Rodrigues, Member, IEEE, Arunachalam Annamalai,

More information

Post-compiler Software Optimization for Reducing Energy

Post-compiler Software Optimization for Reducing Energy Post-compiler Software Optimization for Reducing Energy Eric Schulte Jonathan Dorn Stephen Harding Stephanie Forrest Westley Weimer Department of Computer Science Department of Computer Science University

More information

Run-time Resource Management in SOA Virtualized Environments. Danilo Ardagna, Raffaela Mirandola, Marco Trubian, Li Zhang

Run-time Resource Management in SOA Virtualized Environments. Danilo Ardagna, Raffaela Mirandola, Marco Trubian, Li Zhang Run-time Resource Management in SOA Virtualized Environments Danilo Ardagna, Raffaela Mirandola, Marco Trubian, Li Zhang Amsterdam, August 25 2009 SOI Run-time Management 2 SOI=SOA + virtualization Goal:

More information

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle? Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

More information

Probabilistic Modeling for Job Symbiosis Scheduling on SMT Processors

Probabilistic Modeling for Job Symbiosis Scheduling on SMT Processors 7 Probabilistic Modeling for Job Symbiosis Scheduling on SMT Processors STIJN EYERMAN and LIEVEN EECKHOUT, Ghent University, Belgium Symbiotic job scheduling improves simultaneous multithreading (SMT)

More information

Achieving QoS in Server Virtualization

Achieving QoS in Server Virtualization Achieving QoS in Server Virtualization Intel Platform Shared Resource Monitoring/Control in Xen Chao Peng (chao.p.peng@intel.com) 1 Increasing QoS demand in Server Virtualization Data center & Cloud infrastructure

More information

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor Travis Lanier Senior Product Manager 1 Cortex-A15: Next Generation Leadership Cortex-A class multi-processor

More information

LOOKING FOR AN AMAZING PROCESSOR. Product Brief 6th Gen Intel Core Processors for Desktops: S-series

LOOKING FOR AN AMAZING PROCESSOR. Product Brief 6th Gen Intel Core Processors for Desktops: S-series Product Brief 6th Gen Intel Core Processors for Desktops: Sseries LOOKING FOR AN AMAZING PROCESSOR for your next desktop PC? Look no further than 6th Gen Intel Core processors. With amazing performance

More information

Virtualization Performance Insights from TPC-VMS

Virtualization Performance Insights from TPC-VMS Virtualization Performance Insights from TPC-VMS Wayne D. Smith, Shiny Sebastian Intel Corporation wayne.smith@intel.com shiny.sebastian@intel.com Abstract. This paper describes the TPC-VMS (Virtual Measurement

More information

CloudCache: Expanding and Shrinking Private Caches

CloudCache: Expanding and Shrinking Private Caches Credits CloudCache: Expanding and Shrinking Private Caches Parts of the work presented in this talk are from the results obtained in collaboration with students and faculty at the : Mohammad Hammoud Lei

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors

Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors Journal of Instruction-Level Parallelism 7 (25) 1-28 Submitted 2/25; published 6/25 Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors Joshua Kihm Alex Settle Andrew

More information

FAST, ACCURATE, AND VALIDATED FULL-SYSTEM SOFTWARE SIMULATION

FAST, ACCURATE, AND VALIDATED FULL-SYSTEM SOFTWARE SIMULATION ... FAST, ACCURATE, AND VALIDATED FULL-SYSTEM SOFTWARE SIMULATION OF X86 HARDWARE... THIS ARTICLE PRESENTS A FAST AND ACCURATE INTERVAL-BASED CPU TIMING MODEL THAT IS EASILY IMPLEMENTED AND INTEGRATED

More information