Pitfalls of Object Oriented Programming




Sony Computer Entertainment Europe, Research & Development Division. Pitfalls of Object Oriented Programming. Tony Albrecht, Technical Consultant, Developer Services.

What I will be covering: a quick look at Object Oriented (OO) programming; a common example; optimisation of that example; summary. Slide 2

Object Oriented (OO) Programming. What is OO programming? A programming paradigm that uses "objects", data structures consisting of data fields and methods together with their interactions, to design applications and computer programs (Wikipedia). Includes features such as data abstraction, encapsulation, polymorphism and inheritance. Slide 3
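As a tiny illustration of the features the slide lists (the example itself is mine, not from the talk), a minimal C++ sketch:

```cpp
// Hypothetical example showing the OO features named on the slide.
class Shape {                        // data abstraction: interface only
public:
    virtual float Area() const = 0;  // polymorphism via virtual dispatch
    virtual ~Shape() {}
};

class Circle : public Shape {        // inheritance
    float m_radius;                  // encapsulation: data hidden behind methods
public:
    explicit Circle(float r) : m_radius(r) {}
    float Area() const override { return 3.14159265f * m_radius * m_radius; }
};
```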

What's OOP for? OO programming allows you to think about problems in terms of objects and their interactions. Each object is (ideally) self-contained: it contains its own code and data, and defines an interface to them. Each object can be perceived as a black box. Slide 4

Objects. If objects are self-contained then they can be: reused; maintained without side effects; used without understanding their internal implementation/representation. This is good, yes? Slide 5

Are Objects Good? Well, yes... and no. First, some history. Slide 6

A Brief History of C++. Development started in 1979. Slide 7

A Brief History of C++. Named C++ in 1983. Slide 8

A Brief History of C++. First commercial release in 1985. Slide 9

A Brief History of C++. Release of v2.0 in 1989. Slide 10

A Brief History of C++. v2.0 added multiple inheritance, abstract classes, static member functions, const member functions and protected members. Slide 11

A Brief History of C++. Standardised in 1998. Slide 12

A Brief History of C++. Standard updated in 2003. Slide 13

A Brief History of C++. C++0x expected in 2009? Slide 14

So what has changed since 1979? Many more features have been added to C++. CPUs have become much faster, with a transition to multiple cores. Memory has become faster. http://www.vintagecomputing.com Slide 15

CPU performance Slide 16 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau

CPU/Memory performance Slide 17 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau

What has changed since 1979? One of the biggest changes is that memory access speeds are (relatively) far slower. 1980: RAM latency ~1 cycle. 2009: RAM latency ~400+ cycles. What can you do in 400 cycles? Slide 18

What has this to do with OO? OO classes encapsulate code and data. So, an instantiated object will generally contain all data associated with it. Slide 19

My Claim With modern HW (particularly consoles), excessive encapsulation is BAD. Data flow should be fundamental to your design (Data Oriented Design) Slide 20

Consider a simple OO Scene Tree. Base Object class: contains general data. Node: container class. Modifier: updates transforms. Drawable/Cube: renders objects. Slide 21

Each Object: maintains a bounding sphere for culling; has a transform (local and world); a dirty flag (optimisation); a pointer to its parent. Slide 22

Objects: Class Definition and Memory Layout. Each square is 4 bytes. Slide 23

Nodes. Each Node is an Object, plus: a container of other Objects; a visibility flag. Slide 24

Nodes: Class Definition and Memory Layout. Slide 25
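The class definitions shown on the slides can be sketched roughly as below. Member names follow the slides (m_transform, m_dirty, ...) but the exact types and layout are my assumptions, not the talk's actual code:

```cpp
#include <vector>

struct Matrix4 { float m[16]; };           // 64 bytes: 16 squares of 4 bytes
struct Sphere  { float x, y, z, radius; }; // 16 bytes: 4 squares of 4 bytes

struct Object {
    Matrix4 m_transform;            // local transform
    Matrix4 m_worldTransform;       // world transform
    Sphere  m_boundingSphere;       // local-space bounds, used for culling
    Sphere  m_worldBoundingSphere;
    bool    m_dirty;                // the "optimisation" flag
    Object* m_parent;               // pointer to parent
    virtual ~Object() {}            // implies a vtbl pointer in the layout
};

// Each Node is an Object, plus a container of other Objects and a visibility flag.
struct Node : Object {
    std::vector<Object*> m_objects; // children live wherever the heap put them
    bool m_visible;
};
```

Note that the std::vector's element storage is a separate heap allocation, which is exactly the extra pointer chase the cache walkthrough below pays for.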

Consider the following code Update the world transform and world space bounding sphere for each object. Slide 26

Consider the following code Leaf nodes (objects) return transformed bounding spheres Slide 27

Consider the following code: What's wrong with this code? Slide 28

Consider the following code: If m_dirty == false then we get a branch misprediction, which costs 23 or 24 cycles. Slide 29

Consider the following code: Calculation of the world bounding sphere takes only 12 cycles. Slide 30

Consider the following code: So using a dirty flag here is actually slower than not using one (in the case where it is false). Slide 31
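The code the slides analyse is not reproduced in this transcript; a hedged sketch of the kind of dirty-flag accessor being discussed (names follow the slides, bodies are my assumptions) looks like:

```cpp
struct Sphere  { float x, y, z, radius; };
struct Matrix4 { float m[16]; };

// Stand-in for transforming a sphere into world space
// (translation-only here, for brevity).
Sphere Transform(const Matrix4& m, const Sphere& s) {
    return { s.x + m.m[12], s.y + m.m[13], s.z + m.m[14], s.radius };
}

struct Object {
    Matrix4 m_worldTransform {};
    Sphere  m_boundingSphere {0.0f, 0.0f, 0.0f, 1.0f};
    Sphere  m_worldBoundingSphere {};
    bool    m_dirty = true;

    virtual const Sphere& GetWorldBoundingSphere() {
        if (m_dirty) {
            // The slides' point: when m_dirty is usually false, this branch
            // mispredicts (~23 cycles), dearer than the ~12-cycle transform
            // it was meant to avoid.
            m_worldBoundingSphere = Transform(m_worldTransform, m_boundingSphere);
            m_dirty = false;
        }
        return m_worldBoundingSphere;
    }
    virtual ~Object() {}
};
```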

Let's illustrate cache usage. Main Memory. L2 Cache. Each cache line is 128 bytes. Slide 32

Cache usage. Main Memory. L2 Cache. parentTransform is already in the cache (somewhere). Slide 33

Cache usage. Main Memory. L2 Cache. Assume this is a 128-byte boundary (start of a cache line). Slide 34

Cache usage Main Memory L2 Cache Load m_transform into cache Slide 35

Cache usage Main Memory L2 Cache m_worldtransform is stored via cache (write-back) Slide 36

Cache usage Main Memory L2 Cache Next it loads m_objects Slide 37

Cache usage Main Memory L2 Cache Then a pointer is pulled from somewhere else (Memory managed by std::vector) Slide 38

Cache usage Main Memory L2 Cache vtbl ptr loaded into Cache Slide 39

Cache usage Main Memory L2 Cache Look up virtual function Slide 40

Cache usage Main Memory L2 Cache Then branch to that code (load in instructions) Slide 41

Cache usage Main Memory L2 Cache New code checks dirty flag then sets world bounding sphere Slide 42

Cache usage. Main Memory. L2 Cache. The Node's World Bounding Sphere is then expanded. Slide 43

Cache usage Main Memory L2 Cache Then the next Object is processed Slide 44

Cache usage Main Memory L2 Cache First object costs at least 7 cache misses Slide 45

Cache usage Main Memory L2 Cache Subsequent objects cost at least 2 cache misses each Slide 46

The Test: 11,111 nodes/objects in a tree, 5 levels deep. Every node being transformed. Hierarchical culling of the tree. Render method is empty. Slide 47
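The slide's figures are consistent with a branching factor of 10 over 5 levels, since 1 + 10 + 100 + 1,000 + 10,000 = 11,111 (this is my inference; the slide does not state the branching factor):

```cpp
// Count nodes in a full tree with the given branching factor and depth.
int CountNodes(int branching, int levels) {
    int total = 0;
    int perLevel = 1;
    for (int l = 0; l < levels; ++l) {
        total += perLevel;      // nodes at this depth
        perLevel *= branching;  // next level fans out
    }
    return total;
}
```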

Performance This is the time taken just to traverse the tree! Slide 48

Why is it so slow? ~22ms Slide 49

Look at GetWorldBoundingSphere() Slide 50

Samples can be a little misleading at the source code level Slide 51

if(!m_dirty) comparison Slide 52

Stalls due to the load 2 instructions earlier. Slide 53

Similarly with the matrix multiply. Slide 54

Some rough calculations Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms Slide 55

Some rough calculations Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms L2 Cache misses: 36,345 @ 400 cycles each ~= 4.54ms Slide 56
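These rough calculations can be reproduced directly. The 3.2 GHz clock is an assumption on my part (the PS3's PPU runs at 3.2 GHz; the slide does not state the frequency):

```cpp
// Convert a cycle count to milliseconds at a given clock frequency.
double MsForCycles(double cycles, double hz = 3.2e9) {
    return cycles / hz * 1000.0; // cycles -> seconds -> milliseconds
}
// 50,421 mispredictions * 23 cycles  ~= 0.36 ms
// 36,345 L2 misses      * 400 cycles ~= 4.54 ms
```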

From Tuner, ~3 L2 cache misses per object. These cache misses are mostly sequential (more than one fetch from main memory can happen at once): code / cache miss / code / cache miss / code. Slide 57

Slow memory is the problem here How can we fix it? And still keep the same functionality and interface? Slide 58

The first step: use homogeneous, sequential sets of data. Slide 59

Homogeneous Sequential Data Slide 60

Generating Contiguous Data. Use custom allocators, with minimal impact on existing code. Allocate contiguous Nodes, Matrices and Bounding spheres. Slide 61
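The talk does not show its allocator; a minimal sketch of the idea, assuming a simple fixed-capacity pool per data type, so that traversal touches neighbouring cache lines instead of scattered heap blocks:

```cpp
#include <vector>
#include <cstddef>

struct Matrix4 { float m[16]; };

// All matrices are carved from one contiguous block.
class MatrixPool {
    std::vector<Matrix4> m_storage; // single contiguous allocation
public:
    explicit MatrixPool(std::size_t capacity) { m_storage.reserve(capacity); }
    Matrix4* Allocate() {
        // NOTE: capacity is fixed up front -- growing the vector would
        // reallocate and move earlier matrices, invalidating pointers.
        m_storage.push_back(Matrix4{});
        return &m_storage.back();
    }
    std::size_t Size() const { return m_storage.size(); }
};
```

Successive Allocate() calls return adjacent addresses, which is what makes the existing code faster without changing its interface.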

Performance 19.6ms -> 12.9ms 35% faster just by moving things around in memory! Slide 62

What next? Process data in order Use implicit structure for hierarchy Minimise to and fro from nodes. Group logic to optimally use what is already in cache. Remove regularly called virtuals. Slide 63

Hierarchy. We start with a parent Node. (tree diagram) Slide 64

Hierarchy. Which has children nodes. (tree diagram) Slide 65

Hierarchy. And they have a parent. (tree diagram) Slide 66

Hierarchy. And they have children. (tree diagram) Slide 67

Hierarchy. And they all have parents. (tree diagram) Slide 68

Hierarchy. A lot of this information can be inferred. (tree diagram) Slide 69

Hierarchy. Use a set of arrays, one per hierarchy level. (tree diagram) Slide 70

Hierarchy. The parent has 2 children (Node:2); the children have 4 children each (Node:4). (tree diagram) Slide 71

Hierarchy. Ensure nodes and their data are contiguous in memory. (tree diagram) Slide 72

Make the processing global rather than local. Pull the updates out of the objects. No more virtuals. Easier to understand too: all code is in one place. Slide 73

The OO version needs some changes: update transforms top down, and expand world bounding spheres (WBS) bottom up. Slide 74

Update transform. (tree diagram) Slide 75

Update transform. (tree diagram) Slide 76

Update transform and world bounding sphere. (tree diagram) Slide 77

Add bounding sphere of child. (tree diagram) Slide 78

Update transform and world bounding sphere. (tree diagram) Slide 79

Add bounding sphere of child. (tree diagram) Slide 80

Update transform and world bounding sphere. (tree diagram) Slide 81

Add bounding sphere of child. (tree diagram) Slide 82

Hierarchical bounding spheres pass info up; transforms cascade down. Data use and code are striped; processing alternates. Slide 83

Conversion to linear. To do this with a flat hierarchy, break it into 2 passes: update the transforms and bounding spheres (top down); expand bounding spheres (bottom up). Slide 84

Transform and BS updates. (level arrays: Node:2, Node:4, Node:4) For each node at each level (top down) { multiply world transform by parent's transform; transform wbs by world transform }. Slide 85

Update bounding sphere hierarchies. (level arrays: Node:2, Node:4, Node:4) For each node at each level (bottom up) { add wbs to parent's wbs; cull wbs against frustum }. Slide 86
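The two passes above can be sketched over flattened per-level arrays. This is a simplified stand-in, not the talk's code: scalars stand in for the matrices and spheres, '+' for the matrix multiply, and max() for sphere expansion; each node stores its parent's index in the level above:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

struct Node {
    float local;   // stand-in for the local transform
    float world;   // stand-in for the world transform
    float radius;  // stand-in for the world bounding sphere
    int   parent;  // index into the previous level's array (-1 for the root)
};

using Level = std::vector<Node>;

// Pass 1, top down: cascade transforms from each level into the next.
void UpdateTransforms(std::vector<Level>& levels) {
    for (std::size_t l = 1; l < levels.size(); ++l)
        for (Node& n : levels[l])
            n.world = levels[l - 1][n.parent].world + n.local;
}

// Pass 2, bottom up: expand each parent's bounding sphere by its children's.
void ExpandBoundingSpheres(std::vector<Level>& levels) {
    if (levels.size() < 2) return;
    for (std::size_t l = levels.size() - 1; l >= 1; --l)
        for (const Node& n : levels[l]) {
            Node& p = levels[l - 1][n.parent];
            p.radius = std::max(p.radius, n.radius);
        }
}
```

Each inner loop walks one contiguous array in order, which is what makes the access pattern predictable and prefetchable.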

Update Transform and Bounding Sphere How many children nodes to process Slide 87

Update Transform and Bounding Sphere For each child, update transform and bounding sphere Slide 88

Update Transform and Bounding Sphere Note the contiguous arrays Slide 89

So, what's happening in the cache? The children's node data is not needed yet. (diagram: Unified L2 Cache; Parent Data; Children's Data) Slide 90

Load the parent and its transform. (cache diagram) Slide 91

Load the child transform and set the world transform. (cache diagram) Slide 92

Load the child BS and set the WBS. (cache diagram) Slide 93

Load the child BS and set the WBS. The next child is calculated with no extra cache misses! (cache diagram) Slide 94

Load the child BS and set the WBS. The next 2 children incur 2 cache misses in total. (cache diagram) Slide 95

Prefetching. Because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch. (cache diagram) Slide 96

Tuner scans show about 1.7 cache misses per node. But these misses are much more frequent: code / cache miss / cache miss / code. Less stalling. Slide 97

Performance 19.6 -> 12.9 -> 4.8ms Slide 98

Prefetching. Data accesses are now predictable, so we can use prefetch (dcbt) to warm the cache. Data streams can be tricky: there are many reasons for stream termination, so it is easier to just use dcbt blindly (look ahead x number of iterations). Slide 99

Prefetching example Prefetch a predetermined number of iterations ahead Ignore incorrect prefetches Slide 100
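A sketch of the blind look-ahead prefetch described above, using GCC/Clang's __builtin_prefetch as a portable stand-in for the PPU's dcbt instruction (the look-ahead distance of 8 is an assumption to be tuned per platform):

```cpp
#include <cstddef>

float SumWithPrefetch(const float* data, std::size_t count) {
    const std::size_t lookAhead = 8; // aim to cover ~400 cycles of latency
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        // Prefetch blindly: the hint never faults, so an address past the
        // end of the array is simply a wasted (ignored) prefetch.
        __builtin_prefetch(&data[i + lookAhead]);
        sum += data[i];
    }
    return sum;
}
```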

Performance 19.6 -> 12.9 -> 4.8 -> 3.3ms Slide 101

A Warning on Prefetching. This example makes very heavy use of the cache. This can affect other threads' use of the cache. Multiple threads with heavy cache use may thrash the cache. Slide 102

The old scan ~22ms Slide 103

The new scan ~16.6ms Slide 104

Up close ~16.6ms Slide 105

Looking at the code (samples) Slide 106

Performance counters. Branch mispredictions: 2,867 (cf. 47,000). L2 cache misses: 16,064 (cf. 36,000). Slide 107

In Summary. Just reorganising data locations was a win. Data + code reorganisation = dramatic improvement. Adding prefetching = even more WIN. Slide 108

OO is not necessarily EVIL. Be careful not to design yourself into a corner. Consider data in your design: can you decouple data from objects? Code from objects? Be aware of what the compiler and HW are doing. Slide 109

It's all about the memory. Optimise for data first, then code. Memory access is probably going to be your biggest bottleneck. Simplify systems (KISS): easier to optimise, easier to parallelise. Slide 110

Homogeneity. Keep code and data homogeneous. Avoid introducing variations. Don't test for exceptions: sort by them. Not everything needs to be an object. If you must have a pattern, then consider using Managers. Slide 111
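The "sort by exceptions" advice can be sketched as follows (the example data and names are mine, not from the talk): instead of branching per element on a flag, partition the data once, then run a tight loop over the homogeneous range:

```cpp
#include <algorithm>
#include <vector>

struct Particle { bool alive; float energy; };

float TotalLiveEnergy(std::vector<Particle>& ps) {
    // One pass moves all live particles to the front...
    auto mid = std::partition(ps.begin(), ps.end(),
                              [](const Particle& p) { return p.alive; });
    // ...so the hot loop is branch-free and cache-friendly.
    float sum = 0.0f;
    for (auto it = ps.begin(); it != mid; ++it)
        sum += it->energy;
    return sum;
}
```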

Remember: you are writing a GAME. You have control over the input data, so don't be afraid to preformat it, drastically if need be. Design for specifics, not generics (generally). Slide 112

Data Oriented Design Delivers: better performance; better realisation of code optimisations; often simpler code; more parallelisable code. Slide 113

Questions? The END Slide 114