Parallel Processing and Software Performance
Lukáš Marek
Distributed Systems Research Group, http://dsrg.mff.cuni.cz
Charles University Prague, Faculty of Mathematics and Physics
Benchmarking in a parallel environment
- Shared hardware resources
  - CPU: cache coherency, cache bandwidth, cache sharing
  - Main memory bandwidth
  - (HDD)
- Many topics still left to explore: hyper-threading, NUMA, HDD (SSD), networking, ...
Motivation
- Explore scenarios where parallel execution influences performance
  - Cover all possible cases
- Design various benchmarks with the possibility to run them in parallel in different configurations
- Multi-platform (hardware) as much as possible
How we did it - framework
- Framework RIP (C++, R)
- Parallel execution of workloads (benchmarks)
- Designing new benchmarks kept as simple as possible (example)
- Automated benchmarking with various parameters
- Simple composition of benchmarks
- Automated graph creation
How we did it - benchmarks
- Artificial workloads
  - Designed to create the highest (worst-case) overhead
  - Effect isolation
- One observing workload (benchmark)
  - Does all the measurement
- One or more load-generating (interfering) workloads (benchmarks)
  - Can also measure, to confirm the results
How we did it - experiments
- Data collected over 10 runs
- RDTSC: 10 samples
  - Each sample averaged over 100 repetitions, because of the measurement overhead
- Performance counters: 3 samples
  - Each sample averaged over 1000 repetitions, because of the measurement overhead
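A minimal sketch of the RDTSC sampling described above, assuming GCC/Clang on x86; the measured operation here is a placeholder, not one of the actual benchmarks. A single counter read is itself expensive, so the operation is repeated 100 times between two reads and the difference is averaged:

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h> // __rdtsc(), GCC/Clang on x86

    static volatile std::uint64_t sink; // keeps the compiler from removing the work

    int main() {
        const int reps = 100; // average over 100 repetitions, as on the slide
        std::uint64_t start = __rdtsc();
        for (int i = 0; i < reps; ++i)
            sink += 1;                // stand-in for one measured operation
        std::uint64_t end = __rdtsc();
        std::printf("avg cycles per operation: %.2f\n",
                    (double)(end - start) / reps);
    }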
Tested platform
- Intel Core 2 Quad, dual CPU
  - 8 GB unified-access main memory
- Differences between CPU architectures
  - Core i7: CPU interconnect, integrated memory controller, NUMA, hyper-threading, ...
- Differences between vendors
  - AMD uses a NUMA architecture
Experiments
- CPU
  - Cache coherency: ping-pong, producer-consumer
  - Cache sharing: bandwidth, thrashing (capacity competition)
- Memory
  - Bandwidth: single-pointer, multi-pointer
Producer-consumer
- Cache coherency
- Transport data from one core to another
  - Write the data on the first core
  - Read the data on the second core
- Ex.: producer-consumer pattern (see the sketch below)
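A minimal sketch of this transport under assumed structure, with Linux-specific core pinning; the buffer size and core numbers are illustrative. The producer writes a block on one core, the consumer then reads it on another, so the cache lines must travel between the two cores' caches:

    #include <atomic>
    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // Pin the calling thread to one core (Linux-specific).
    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        std::vector<int> buffer(1 << 16);  // block transported between cores
        std::atomic<bool> ready{false};

        std::thread producer([&] {
            pin_to_core(0);                // write the data on the first core
            for (int &x : buffer) x = 1;
            ready.store(true, std::memory_order_release);
        });
        std::thread consumer([&] {
            pin_to_core(1);                // read the data on the second core
            while (!ready.load(std::memory_order_acquire)) { /* spin */ }
            long sum = 0;
            for (int x : buffer) sum += x; // each miss pulls a line across
            static volatile long sink;
            sink = sum;                    // keep the reads alive
        });
        producer.join();
        consumer.join();
    }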
Producer-consumer (clean data): results graph
Producer-consumer (modified data): results graph
Producer-consumer, memory transfers: results graph
Producer-consumer - summary
If we transfer data from one core to another:
- data transfer is relatively cheap between cores that share the L2 cache
- data transfer to a core that does not share the L2 cache with the source is 5 times slower than to a core that does
- modified data can in some cases be transported more than two times faster than clean data
Cache bandwidth
- Shared cache: competition for bandwidth to the shared cache
- Two threads accessing memory
  - Interfering workload inserts NOP instructions between accesses to regulate access intensity
  - Interfering workload either hits or misses the cache
  - Linear reads
- Ex.: two running processes (threads) access memory (see the sketch below)
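A sketch of the NOP-based intensity regulation described above, assuming GCC/Clang inline assembly; the NOP count and the stride are illustrative, not the values used in the experiments. One read per 64 B cache line; sizing the buffer below or above the L2 capacity selects the hit or miss variant:

    #include <cstddef>
    #include <vector>

    // Execute N NOPs between accesses (GCC/Clang inline asm).
    template <int N>
    static void nop_pad() {
        for (int i = 0; i < N; ++i)
            asm volatile("nop");
    }

    // Linear read with one access per 64 B cache line; a buffer smaller than
    // the L2 keeps hitting the cache, a larger one keeps missing it.
    static long linear_read(const std::vector<long> &buf) {
        long sum = 0;
        for (std::size_t i = 0; i < buf.size(); i += 8) { // 8 longs = 64 B
            sum += buf[i];
            nop_pad<16>(); // more NOPs = lower access intensity
        }
        return sum;
    }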
Cache bandwidth, interfering workload hitting the cache: results graph
Cache bandwidth, interfering workload missing the cache: results graph
Cache bandwidth - summary
When two cores compete for the bandwidth to the shared L2 cache:
- if both cores are hitting the L2 cache, the overhead is only about 12%
- if one of the two cores starts missing the L2 cache, the access time for the core that is still hitting the L2 cache can be twice as high as with single-core access
- multi-pointer access may be an unrealistically heavy stress
Cache thrashing
- Shared cache: competition for space in the shared cache
- Two threads accessing memory
  - Observing workload has a fixed-size memory block
  - Interfering workload increases its memory size and uses NOPs to regulate access intensity
  - Interfering workload modifies its cache lines, to distinguish whose cache line was evicted
- Ex.: two running processes (threads) access memory (see the sketch below)
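A minimal sketch of the interfering (thrashing) workload as described above; the sizes and repetition counts are assumptions. It writes a marker into every cache line of a growing buffer, so that evicted lines can be told apart by their modified state (e.g. via write-back events in the performance counters):

    #include <cstddef>
    #include <vector>

    // One store per 64 B cache line: modifying the line marks it as the
    // thrasher's, so its evictions show up as write-backs of modified lines.
    static void thrash(std::vector<char> &buf) {
        for (std::size_t off = 0; off < buf.size(); off += 64)
            buf[off] = 1;
    }

    int main() {
        // The block grows across runs; competition starts once the observer's
        // fixed block plus this one no longer fit in the shared L2.
        for (std::size_t mb = 1; mb <= 8; mb *= 2) {
            std::vector<char> buf(mb << 20);
            for (int rep = 0; rep < 100; ++rep)
                thrash(buf);
        }
    }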
Cache thrashing, thrasher with a 4 MB block: results graph
Cache thrashing, thrasher with a 1 MB block: results graph
Cache thrashing - summary
When two cores compete for the space in the shared L2 cache:
- the competition for the cache space starts when the sum of the allocated memory blocks exceeds 2 MiB
- a core with a smaller data block can also have its data evicted if it does not access the data fast enough
Eviction behavior in a physically mapped cache: diagram (virtual vs. physical addresses)
Eviction behavior in a physically mapped cache II: diagram
Memory bandwidth
- Sharing a memory bus and a memory controller
- Memory blocks of the same size accessed from two cores (see the sketch below)
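The single-pointer (SP) access pattern is not spelled out on the slides; a plausible sketch is dependent-load pointer chasing, where the buffer is a chain of indices and each load depends on the previous one. A linear chain gives the linear pattern, a shuffled chain the random one:

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    // Build a chain of indices: next[i] is the element visited after i.
    // Identity order gives linear access; shuffled order gives random access.
    static std::vector<std::size_t> make_chain(std::size_t n, bool random) {
        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        if (random)
            std::shuffle(order.begin(), order.end(), std::mt19937{42});
        std::vector<std::size_t> next(n);
        for (std::size_t i = 0; i < n; ++i)
            next[order[i]] = order[(i + 1) % n];
        return next;
    }

    // Single pointer: every load depends on the previous one, so at most one
    // cache miss is outstanding at a time.
    static std::size_t chase(const std::vector<std::size_t> &next,
                             std::size_t steps) {
        std::size_t p = 0;
        for (std::size_t i = 0; i < steps; ++i)
            p = next[p];
        return p;
    }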
Memory bandwidth, single-pointer, linear read: results graph
Memory bandwidth, single-pointer, linear write: results graph
Memory bandwidth, single-pointer, random read: results graph
Memory bandwidth, page walks: results graph
Memory bandwidth, single-pointer - summary
When two cores access the memory:
- the more components are shared (such as the memory bus or the shared L2 cache), the slower an access becomes
- with linear access, the overhead can be almost 80% compared with single-core access
- linear reads are almost two times faster than linear writes
- linear access can be more than two times faster than random access
Memory bandwidth, multi-pointer, linear read: results graph
Memory bandwidth, multi-pointer, random read: results graph
Memory bandwidth, multi-pointer, different cores: results graph
Memory bandwidth, multi-pointer, L2 prefetching: results graph
Memory bandwidth, multi-pointer - summary
When two cores access the memory with high intensity using multiple pointers:
- multiple pointers can improve the throughput of random access by more than 300%
- intensive access into the shared L2 cache can disable prefetching and thereby improve the throughput
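A sketch of the multi-pointer (MP) idea behind that improvement, reusing make_chain and the includes from the single-pointer sketch earlier; the count of four chains is illustrative. Chasing several independent chains in one loop lets the core overlap multiple outstanding cache misses instead of serializing them:

    // Four independent chains (built with make_chain above) chased in
    // lockstep: the four loads per iteration can miss the cache in parallel.
    static std::size_t chase4(const std::vector<std::size_t> &a,
                              const std::vector<std::size_t> &b,
                              const std::vector<std::size_t> &c,
                              const std::vector<std::size_t> &d,
                              std::size_t steps) {
        std::size_t pa = 0, pb = 0, pc = 0, pd = 0;
        for (std::size_t i = 0; i < steps; ++i) {
            pa = a[pa];
            pb = b[pb];
            pc = c[pc];
            pd = d[pd];
        }
        return pa + pb + pc + pd;
    }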
Other experiments
- Memory saturation
  - Small number of tested configurations
  - Two cores reach 90% of the throughput of four cores
- HDD
  - Too many configurations: HDD 7.2K, HDD 15K, SSD, RAID, SATA, PATA, SCSI, ...
  - Too many layers: OS, file system, user libraries, ...
- Some measurements done with little (or no) success: AMD NUMA, ...
Ping-pong
- Cache coherency
- Read or write to a shared variable
  - Write is done using LOCK INC
- Ex.: synchronization primitives (see the sketch below)
(Diagram: Core 1 and Core 2, each with its own L1 cache)
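A minimal sketch of the ping-pong workload: two threads hammer one shared variable, the writer using an atomic increment (std::atomic fetch_add compiles to a LOCK-prefixed add/inc on x86), so the cache line bounces between the cores' L1 caches. Pinning the threads to specific cores (as in the producer-consumer sketch) is omitted here, and the iteration count is illustrative:

    #include <atomic>
    #include <thread>

    int main() {
        std::atomic<long> shared{0};

        std::thread writer([&] {
            for (int i = 0; i < 1000000; ++i)
                shared.fetch_add(1);   // LOCK-prefixed add: takes the line exclusive
        });
        std::thread reader([&] {
            long v = 0;
            for (int i = 0; i < 1000000; ++i)
                v += shared.load();    // each read pulls the line back as shared
            static volatile long sink;
            sink = v;                  // keep the reads alive
        });
        writer.join();
        reader.join();
    }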
Ping-pong, read vs. write: results graph
Ping-pong, write vs. read: results graph
Ping-pong - summary
When we have a shared variable which is often modified:
- the average read access time can be 20 times higher than for a non-shared variable
- the write access can be on average 5 times slower than a non-shared access
- cores further apart from each other can perform more accesses in a row before the other core requests the cache line again
- partially also valid for false sharing (LOCK instruction)
The end
Questions?