Operating System Impact on SMT Architecture

The work published in "An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture" (Josh Redstone et al., Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000) represents the first study of OS execution on a simulated SMT processor.

The SimOS environment was adapted for SMT:
- An Alpha-based SMT CPU core was added.
- Digital Unix 4.0d was modified to support SMT.

Study goals:
- Compare SMT performance results that include the OS with previous SMT performance results that do not account for OS behavior and impact.
- Contrast OS impact between OS-intensive and non-OS-intensive workloads.

Two types of workloads were selected for the study:
- Non-OS-intensive workload: a multiprogrammed mix of the 8 SPECInt95 benchmarks.
- OS-intensive workload: the multi-threaded Apache web server (64 server processes), driven by the SPECWeb benchmark (128 clients).

No SMT-specific OS optimizations were investigated in this study.

#1 Lec # 4 Fall 2002 9-18-2002

OS Code vs. User Code

- Operating systems are usually huge programs that can overwhelm the cache and TLB due to their code and data size.
- Operating systems may hurt branch prediction performance, because of frequent branches and infrequent loops.
- OS execution is often brief and intermittent, invoked by interrupts, exceptions, or system calls, and can cause the replacement of useful cache, TLB, and branch prediction state for little or no benefit.
- The OS may perform spin-waiting, explicit cache/TLB invalidation, and other operations not common in user-mode code.
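Spin-waiting is especially costly on an SMT core, because a waiting thread occupies fetch and issue slots that a sibling hardware context could otherwise use. A minimal C11 sketch (the `_mm_pause` spin-wait hint is an x86-era mechanism used here for illustration; the Alpha of the original study has no direct equivalent):

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* x86 spin-wait hint */
#else
#define cpu_relax() ((void)0)     /* no-op on other ISAs */
#endif

/* Naive spin lock: while waiting, the loop continuously consumes issue
 * slots that a sibling SMT context could have used. The cpu_relax()
 * hint throttles this context so the sibling makes progress. */
static void spin_lock(atomic_flag *lock) {
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        cpu_relax();
}

static void spin_unlock(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Without the pause hint, the tight test-and-set loop issues at full rate and starves the other thread contexts sharing the core's functional units, which is one reason kernel spin-waiting behaves differently under SMT than user-mode code does.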

SimOS

- SimOS is a complete machine simulation environment developed at Stanford (http://simos.stanford.edu/).
- It is designed for the efficient and accurate study of both uniprocessor and multiprocessor computer systems.
- It simulates computer hardware in enough detail to boot and run commercial operating systems.
- SimOS currently provides CPU models of the MIPS R4000 and R10000 and the Digital Alpha processor families.
- In addition to the CPU, SimOS also models caches, multiprocessor memory buses, disk drives, Ethernet, consoles, and other system devices.
- SimOS has been ported to IRIX versions 5.3 (32-bit) and 6.4 (64-bit) and to Digital UNIX; a port of Linux for the Alpha is being developed.

SimOS System Diagram

A Base SMT Hardware Architecture

Source: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Alpha-based SMT Processor Parameters

- Duplicate the register file, program counter, subroutine stack, and internal processor registers of a superscalar CPU to hold the state of multiple threads.
- Add per-context mechanisms for pipeline flushing, instruction retirement, subroutine return prediction, and trapping.
- The fetch unit, functional units, L1 data cache, L2 cache, and TLB are shared among contexts.
- ~10% chip-area increase over a superscalar (compared to ~5% for Intel's hyperthreaded Xeon).

OS Modifications for SMT

Only the minimal OS modifications required to support SMT were considered (no OS optimizations for SMT are considered here):

- The OS task scheduler must support multiple threads in running status. A shared-memory multiprocessor (SMP) aware OS (including Digital Unix) already has this ability, but in an SMP system each thread runs on a different CPU; an SMT processor appears to such an OS as multiple shared-memory CPUs (logical processors).
- TLB-related code must be modified:
  - Mutual exclusion is needed for access to the address space number (ASN) tags of the TLB by multiple threads simultaneously.
  - ASN assignment is modified to account for the presence of multiple threads.
  - The internal CPU registers used to modify TLB entries are replicated per context.
- No OS changes are required to account for the shared L1 cache of SMT vs. the non-shared L1 caches of an SMP.
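A sketch of the first point: because each hardware context is exposed as a logical processor, an unmodified SMP-aware OS simply enumerates more CPUs. On a POSIX system the count can be queried with `sysconf` (an illustrative call, not part of the study's Digital Unix modifications):

```c
#include <unistd.h>

/* Returns the number of logical processors the OS currently sees online.
 * On an SMT machine this counts hardware contexts, not physical cores,
 * so a 1-core CPU with 2 SMT contexts reports 2 here. */
long online_logical_cpus(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

This transparency is exactly why only the scheduler and TLB code needed changes: from the OS's point of view, the SMT contexts are just additional CPUs.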

SPECInt Workload Execution Cycle Breakdown

Percentage of execution cycles spent in OS kernel instructions:
- During program startup: 18%, mostly due to data TLB misses and, to a lesser extent, system calls.
- Steady state: 5%, still dominated by TLB misses.

Breakdown of Kernel Time for SPECInt95

Startup: 18%, mostly due to data TLB misses and system calls. Steady state: 5%, dominated by TLB misses.

SPEC System Calls Percentage

System calls as a percentage of total execution cycles.

SPECInt95 Dynamic Instruction Mix

Percentage of dynamic instructions in the SPECInt workload by instruction type. The percentages in parentheses for memory operations represent the proportion of loads and stores that are to physical addresses. A percentage breakdown of branch instructions is also included; for conditional branches, the number in parentheses represents the percentage of conditional branches that are taken.

SPECInt95 Total Miss Rates & Distribution of Misses

The miss categories are percentages of all user and kernel misses. Bold entries signify kernel-induced interference. User-kernel conflicts are misses in which the user thread conflicted with some type of kernel activity (the kernel executing on behalf of this user thread, some other user thread, a kernel thread, or an interrupt).

Metrics for SPECInt95 With and Without the Operating System, for Both SMT and Superscalar

The maximum issue rate for integer programs is 6 instructions per cycle on the 8-wide SMT, because there are only 6 integer units.

Apache Workload Execution Cycle Breakdown

- Apache experiences little start-up period, since Apache's start-up consists simply of receiving the first incoming requests and waking up the server threads.
- Once requests arrive, Apache spends over 75% of its time in the OS.
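To see why a static-content web server is so OS-intensive, consider the kernel work behind a single request. A minimal, hypothetical serving routine (not Apache's actual code) touches the kernel at nearly every step, leaving parsing as almost the only user-mode work:

```c
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>

/* Hypothetical per-request routine for an already-accepted connection:
 * every step except request parsing is a system call, so kernel
 * networking and file-system code dominate the executed cycles. */
void serve_request(int conn, const char *path) {
    char buf[4096];
    ssize_t n = read(conn, buf, sizeof buf);    /* request:  kernel networking  */
    (void)n;                                    /* parsing:  user-mode code     */
    int file = open(path, O_RDONLY);            /* lookup:   kernel file system */
    if (file < 0) { close(conn); return; }
    while ((n = read(file, buf, sizeof buf)) > 0)   /* kernel file system */
        (void)write(conn, buf, (size_t)n);          /* kernel networking  */
    close(file);                                /* kernel */
    close(conn);                                /* kernel */
}
```

With the real server additionally calling `accept`, logging, and doing per-connection bookkeeping, the measured result that over 75% of Apache's cycles land in the OS is unsurprising.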

Breakdown of Kernel Time for Apache vs. SPECInt95 on SMT

Apache System Calls By Name

Apache System Calls By Function

Apache Dynamic Instruction Mix

The percentages in parentheses for memory operations represent the proportion of loads and stores that are to physical addresses. A percentage breakdown of branch instructions is also included; for conditional branches, the number in parentheses represents the percentage of conditional branches that are taken.

Metrics for SMT SPEC, SMT Apache & Superscalar Apache

All applications are executing with the operating system.

Apache+OS Total Miss Rates & Distribution of Misses

The miss categories are percentages of all user and kernel misses. Bold entries signify kernel-induced interference. User-kernel conflicts are misses in which the user thread conflicted with some type of kernel activity (the kernel executing on behalf of this user thread, some other user thread, a kernel thread, or an interrupt).

Percentage of Misses Avoided Due to Interthread Cooperation on Apache

Percentage of misses avoided due to interthread cooperation on Apache, shown by execution mode. The number in a table entry shows the percentage of overall misses for the given resource that threads executing in the mode indicated in the leftmost column would have encountered, if not for prefetching by other threads executing in the mode shown at the top of the column.

OS Impact on Hardware Structures Performance

OS Impact on SMT Study Summary

- Results show that for SMT, omission of the operating system did not lead to a serious misprediction of performance for SPECInt, although the effects were more significant for a superscalar executing the same workload.
- On the Apache workload, however, the operating system is responsible for the majority of instructions executed: Apache spends a significant amount of time responding to system service calls in the file system and kernel networking code.
- The result of the heavy execution of OS code is increased pressure on various low-level resources, including the caches and the BTB. Kernel threads also cause more conflicts in those resources, both with other kernel threads and with user threads; on the other hand, there is a positive interthread sharing effect as well.

Possible SMT-specific OS Optimizations

- A smart, SMT-optimized OS task scheduler for better SMT-core performance:
  - Schedule cooperating threads that benefit from SMT's resource and data sharing to run simultaneously.
  - To aid SMT's latency hiding, avoid scheduling too many threads that conflict over the same specific CPU resource (TLB, cache, FP units, etc.).
  - In an SMP built from SMT CPUs, tightly-coupled threads should be scheduled onto logical processors in the same physical SMT CPU (processor affinity).
- Introduce a lightweight dedicated kernel context, kept cached in the SMT core, to handle process management and speed up system calls.
- Prevent the idle-loop thread from consuming execution resources (Intel's Hyper-Threading solution: use the HALT instruction).
- Allow thread caching in the CPU to further reduce context-switching overheads.
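The processor-affinity idea can be sketched with the Linux `sched_setaffinity` interface (a later API used here purely for illustration, not something the study's Digital Unix kernel provided). The CPU numbers 0 and 1 are an assumption; the actual SMT-sibling mapping is system-specific, exposed on Linux under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Build an affinity mask naming two logical CPUs assumed to be SMT
 * contexts of the same physical core, then pin the calling thread to
 * them. Returns the number of CPUs placed in the mask; the
 * sched_setaffinity call itself may fail on a single-CPU machine, in
 * which case the thread simply keeps its previous affinity. */
int pin_to_assumed_siblings(cpu_set_t *set) {
    CPU_ZERO(set);
    CPU_SET(0, set);   /* logical CPU 0: first context of core 0 (assumed) */
    CPU_SET(1, set);   /* logical CPU 1: its SMT sibling (assumed) */
    (void)sched_setaffinity(0, sizeof(*set), set);  /* 0 = calling thread */
    return CPU_COUNT(set);
}
```

Pinning two tightly-coupled threads this way lets them share L1 and TLB state constructively, which is the positive interthread sharing effect the study observed on Apache.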