gcc link time optimization and the Linux kernel
Andi Kleen, Intel OTC, Apr 2013
andi@firstfloor.org


Acknowledgments
- Lots of people helped/contributed
- Ralf Baechle, Richard Biener, Tim Bird, Honza Hubicka, H.J. Lu, Joe Mario, Markus Trippelsdorf, Changlong Xie, others

Why LTO?
- Optimize over the whole binary
  - Not just function (3.0) or file (<4.5)
- Avoid inline dependency hell in header files
- Without changing Makefiles significantly

gcc 4.7+ LTO WHOPR crash course
- Compiler parses files, writes GIMPLE to object files (LGEN)
  - Function optimization summaries are computed
- Linker calls lto1 with all files for sequential whole program analysis (WPA)
  - Merges types, generates global callgraph, IPA optimization summaries, writes partitions
- Generate code per partition in parallel (LTRANS)
  - Inline inside partition, run IPA optimizations for real, run per function optimizations, generate object code

LTO IPA optimizations (4.8)
- Inlining between files
- Function cloning for specific arguments
- Remove unused code (-fwhole-program)
- Increase alignment
- Discover pure/const
- Keep globals/statics alive over calls
- De-virtualization
- Change ABI (SSE)
- Constant propagation
- Scalar replacement of aggregates
- Constructor / destructor merging
Green: does not benefit kernel today
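To make "function cloning for specific arguments" concrete, here is a hand-written C sketch of the transformation. The names sum_strided and sum_strided_stride1 are hypothetical and not taken from the kernel. In a real build GCC creates the clone itself and emits it under a suffixed symbol name of the form .constprop.N; nothing like this is written by hand.

```c
#include <stddef.h>

/* Generic routine: the stride is only known at run time. */
long sum_strided(const int *v, size_t n, size_t stride)
{
    long total = 0;
    for (size_t i = 0; i < n; i += stride)
        total += v[i];
    return total;
}

/*
 * Hand-written equivalent of a clone specialised for stride == 1.
 * This is roughly what IPA constant propagation may produce when the
 * callers visible in the merged program pass the constant 1.
 */
long sum_strided_stride1(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

For an array {1, 2, 3, 4, 5}, both sum_strided(v, 5, 1) and sum_strided_stride1(v, 5) return 15. The point of LTO here is visibility: callers in other files that pass a constant can only be seen once the whole program is analysed together.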

Build time
[Figure: four bar charts for the small config, each comparing gcc 4.7, gcc 4.8, 4.7 -lto and 4.8 -lto: build time, user time (s), user time / real time (parallelism), and faults (M)]
- LTO is slow and not parallel enough

Parallelism small build 4.7
[Figure: runnable processes over time (s), LTO kernel build, small config, modules without job server; annotated phases: object file generation (parsing / LGEN), WPA + real linker (type merging, 4 times), LTRANS code generation]
[Figure: small config parallelism, user time / runtime, for gcc 4.7 and gcc 4.8 with and without LTO]
- WHOPR still has poor parallelism

Multilink
- vmlinux links 2-4x: runs LTO that often
- Generates integrated symbol table (kallsyms)
- So far not fixed
- KALLSYMS can be disabled
- One of those unexpected quirks of real build systems
[Figure: bar chart comparing gcc 4.8 no LTO, gcc 4.8 w/o kallsyms, and gcc 4.7 with kallsyms]

Memory usage 4.7 small build
[Figure: active memory (GB) over time (s), kernel LTO build, small config]
[Figure: faults (M), small config, for gcc 4.7, gcc 4.8, 4.7 -lto and 4.8 -lto]
- Even small build swaps in 4GB system
- Memory peaks all in WPA

Memory consumption
- Temporary data can be a problem, together with WPA
  - Early on, many swap storms with /tmp on tmpfs
  - Partitioning algorithm was improved
  - Use TMPDIR=objdir
- With modules, need to avoid too large -j* for parallel WPA
  - Jobserver has to be disabled, which makes it worse

Large build 4.8 (allyes)
[Figure: runnable processes over time (s), Linux allyes, no kallsyms, gcc 4.8 -j16]
[Figure: active memory (GB) over time (s), Linux allyes, no kallsyms, gcc 4.8]
- Kallsyms disabled (disables various features)
- ~42 mins
- ~1h 11min with kallsyms
- ~15GB peak

WPA cycle profile 4.8

    Samples: 257K of event 'cycles', Event count (approx.): 183142453583
       9.71%  lto1-wpa  lto1          [.] htab_find_slot_with_hash
              htab_find_slot_with_hash
              + 61.79% gimple_type_hash(void const*)
              + 26.54% gimple_register_type_1(tree_node*, bool)
              + 4.79% visit(tree_node*, sccs*, unsigned int, vec<tree_node*, va_heap, v
              + 1.53% iterative_hash_canonical_type(tree_node*, unsigned int)
              + 0.98% iterative_hash_gimple_type(tree_node*, unsigned int, vec<tree_nod
              + 0.92% symtab_get_node(tree_node const*)
              + 0.77% gimple_type_eq(void const*, void const*)
              + 0.53% build_int_cst_wide(tree_node*, unsigned long, long)
              + 0.53% streamer_string_index(output_block*, char const*, unsigned int, b
    +  9.00%  lto1-wpa  libc-2.17.so  [.] _int_malloc
    +  7.34%  lto1-wpa  lto1          [.] iterative_hash_hashval_t(unsigned i
    +  7.33%  lto1-wpa  lto1          [.] gimple_type_eq(void const*, void co
    +  6.66%  lto1-wpa  libc-2.17.so  [.] memset_sse2
    +  5.35%  lto1-wpa  libc-2.17.so  [.] malloc_consolidate
    +  4.11%  lto1-wpa  libc-2.17.so  [.] _int_free
    +  4.10%  lto1-wpa  lto1          [.] tree_map_base_eq(void const*, void
    +  2.85%  lto1-wpa  lto1          [.] pointer_map_insert(pointer_map_t*,
    +  2.76%  lto1-wpa  lto1          [.] inflate_fast
    +  2.17%  lto1-wpa  lto1          [.] lto_output_tree(output_block*, tree

- Most time spent in hashing types for type merging

Type merging
- Complicated iterative hash consisting of 4 hash tables (including two hash cache hashes)
- Each LGEN has its own types, need to be unified
[Diagram: struct foo ** as a chain of nodes: pointer_type -> pointer_type -> struct foo (record_type) -> members]
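The unification problem can be sketched with a toy model in C. Everything below is hypothetical (toy_type, toy_type_hash, toy_type_equal) and far simpler than GCC's real tree nodes and iterative hash: each "file" builds its own copy of struct foo **, and a structural hash plus a structural comparison let a merger recognise the copies as the same type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_kind { TOY_RECORD, TOY_POINTER };

/* Toy type node: a pointer refers to its pointee, a record has a tag. */
struct toy_type {
    enum toy_kind kind;
    const char *name;               /* record tag, NULL for pointers */
    const struct toy_type *target;  /* pointee, NULL for records */
};

/* FNV-1a style mixing step (xor, then multiply by the FNV prime). */
static uint32_t mix(uint32_t h, uint32_t v)
{
    return (h ^ v) * 16777619u;
}

/* Hash the whole chain by structure, never by node address. */
uint32_t toy_type_hash(const struct toy_type *t)
{
    uint32_t h = 2166136261u;
    for (; t != NULL; t = t->target) {
        h = mix(h, (uint32_t)t->kind);
        if (t->name)
            for (const char *p = t->name; *p; p++)
                h = mix(h, (unsigned char)*p);
    }
    return h;
}

/* Structural equality, needed because equal hashes may still collide. */
int toy_type_equal(const struct toy_type *a, const struct toy_type *b)
{
    while (a && b) {
        if (a->kind != b->kind)
            return 0;
        if ((a->name == NULL) != (b->name == NULL))
            return 0;
        if (a->name && strcmp(a->name, b->name) != 0)
            return 0;
        a = a->target;
        b = b->target;
    }
    return a == NULL && b == NULL;
}
```

Two independently allocated pointer -> pointer -> record "foo" chains hash to the same value and compare equal, so a merger could keep one copy and redirect references to it. Real types also carry members, qualifiers and cycles, which is where the complexity mentioned on the slide comes from.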

Hashing improvements
- Use tcmalloc instead of glibc malloc
  - build time -9%; peak mem +35%
- Start with large type hash tables / caches
- Use murmur for hashing / fix iterative / fix pointer hash to handle all bits (collisions 90%->70%->64%)
- Splay trees for type merging (bad idea)
[Figure: build time improvement from hash changes, gcc 4.8, small config; build time delta (%) for tcmalloc, murmur, fullhash]
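Why a pointer hash should "handle all bits" can be shown with a small C sketch. This is an illustration only, not the GCC patch: the helper names are hypothetical, and the mixer is the 64-bit finalizer from MurmurHash3. Heap pointers are aligned, so their low bits are always zero; a hash that just masks the low bits sends every pointer to the same bucket.

```c
#include <stdint.h>

/* 64-bit finalizer from MurmurHash3: spreads every input bit. */
uint64_t mix64(uint64_t h)
{
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}

/* Naive bucket choice: uses only the low bits of the address. */
unsigned naive_bucket(uintptr_t p, unsigned mask)
{
    return (unsigned)(p & mask);
}

/* Mixed bucket choice: all address bits influence the result. */
unsigned mixed_bucket(uintptr_t p, unsigned mask)
{
    return (unsigned)(mix64((uint64_t)p) & mask);
}

/*
 * Count the distinct buckets hit by n fake addresses spaced `align`
 * bytes apart. The bitmap limits this helper to mask < 64.
 */
unsigned buckets_used(unsigned (*bucket)(uintptr_t, unsigned),
                      unsigned n, unsigned align, unsigned mask)
{
    uint64_t seen = 0;
    unsigned count = 0;
    for (unsigned i = 1; i <= n; i++) {
        unsigned b = bucket((uintptr_t)i * align, mask);
        if (!(seen & (1ULL << b))) {
            seen |= 1ULL << b;
            count++;
        }
    }
    return count;
}
```

With 64 addresses aligned to 16 bytes and a 16-bucket table (mask 15), buckets_used(naive_bucket, 64, 16, 15) is 1 because every address lands in bucket 0, while the mixed variant spreads the same addresses across the table.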

Possible future hash improvements
- Precompute hash in LGEN
  - And stream types out in hash order
  - Requires even larger hash tables to avoid collisions (or a good tree)
- Only merge per partition?
  - And do that in parallel
- Parallelize WPA further?

Incremental linking
- Kernel build system uses ld -r / ar extensively
- Linker merges all sections with the same name
  - Added random postfixes to LTO sections
- Still problem with pure assembler files
  - .s assembler code dropped during LTO
  - Originally attempted to fix with slim LTO
  - Then fixed in Linux binutils
  - Unfortunately changes not merged into mainline binutils
  - Use Linux binutils for kernel

Kernel modifications (excerpt)
- Section attributes const
  - __initconst char * vs __initconst char * const
- Toplevel inline assembler
  - Need globals and visible for any symbols
  - Lots of externally_visible
- Various changes for dot NUMBER symbols
- Disable LTO for some files
- Workarounds for 4.7 partitioning bugs (visible)
- Gcc-ld
- A few fixes for optimizations (rare)
- Kernel changes: 108 files changed, 740 insertions(+), 313 deletions(-) (v3.8; without most section attributes)
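The const section attribute item can be illustrated with a small user-space C sketch. It assumes GCC or Clang on an ELF target; TOY_INITCONST is a hypothetical stand-in for the kernel's own macro, and the array and accessors are invented for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a read-only init data section marker. */
#define TOY_INITCONST __attribute__((section(".init.rodata")))

/*
 * Both the strings and the array of pointers are const, so the whole
 * object is read-only and matches a read-only section.
 *
 * Writing "static const char *toy_names[] TOY_INITCONST" instead would
 * declare a writable array of pointers to const chars. Mixing such
 * objects with truly const ones in the same section can make GCC
 * report a section type conflict once it sees them together.
 */
static const char * const toy_names[] TOY_INITCONST = { "alpha", "beta" };

/* Number of entries in the table. */
size_t toy_count(void)
{
    return sizeof(toy_names) / sizeof(toy_names[0]);
}

/* Bounds-checked lookup; returns NULL for an out-of-range index. */
const char *toy_name(size_t i)
{
    return i < toy_count() ? toy_names[i] : NULL;
}
```

Here toy_count() is 2 and toy_name(1) is "beta". The second const is the part that matters: it applies to the array elements themselves, which is what decides whether the object is read-only.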

Debugging
- Kernel debugging: more difficult with more inlining
  - Especially with initialization functions
  - May need inline aware backtracer (dwarf kallsyms)
- Compiler
  - No simple test cases. Lots of regressions; complicated delta.
  - Complex multi process
  - Hard to find right process to attach/profile
  - -dH (core dumps) are most useful
- LTO has challenges for debugging

Global Optimizations (small config)
- Inlining
  - 21k new inlines between files
  - 13k unique
- Scalar replacement
  - No change
- Partial Inline +16%
- Constant Propagation +40%
- Cloning for constant propagation (CONFIG_LTO_CLONE)
  - 266->1052 const clones
  - +2% text (~200k)

Static instruction changes
[Figure: generated code, gcc 4.8, small config with LTO; delta (%) in static instruction counts in vmlinux (more = better) for calls, functions, fsize avg, push, pop, mov]

Runtime
- Mixed results

    LKP           dbench  tbench  AIM7  OLTP
    Sandy Bridge  0%      0%      +3%   0%
    Nehalem       0%      0%      +4%   +6%

- Likely need some more optimizations
- Open: Atom testing (more sensitive to bad code)

Text size
- Slight increase for large configs, but lower for small configs
  - small config +3%
  - allno shrinks -50k
- The smaller the kernel the less infrastructure is used. But EXPORT_SYMBOL defeats it again.

ARM embedded improvements (Tim Bird)

    vmlinux.non-lto => vmlinux.lto
            baseline    other   change  percent
            --------    -----   ------  -------
    text:    5242000  4869340  -372660      -7%
    data:     490184   476200   -13984      -2%
    bss:      121824   122112      288       0%
    total:   5854008  5467652  -386356      -6%

    Kernel:                       non-lto    lto
    ---------------------------   -------    --------
    time for full build:          1m5.87s    3m22s
    time for incremental build:   0m5.24s    2m12s
    kernel image size:            5854008    5467652
    compressed uimage size:       2849920    2694224
    system meminfo MemTotal:      17804 kb   18188 kb
    system meminfo MemFree:       11020 kb   11260 kb
    boot time:                    2.466369   2.414428
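The percent column in the size table can be reproduced with integer arithmetic. The helper below is a hypothetical sketch, not the script that produced the table. It assumes the percentages were truncated toward zero rather than rounded, which is consistent with the data row (about -2.85%, listed as -2%) and the total row (about -6.6%, listed as -6%).

```c
/*
 * Percent change from baseline to other, truncated toward zero
 * (C integer division discards the fractional part).
 */
long pct_change(long baseline, long other)
{
    return (other - baseline) * 100 / baseline;
}
```

pct_change(5242000, 4869340) gives -7 for text, pct_change(490184, 476200) gives -2 for data, pct_change(121824, 122112) gives 0 for bss, and pct_change(5854008, 5467652) gives -6 for the total.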

gcc LTO improvements wish list
- Per file options (partial patch)
- Fix passing through of options
- Standard gcc-ld
- Don't error for section mismatches
- Better procedure for LTO error reporting
- Better messages for type mismatches
- No .XXXX postfixes
- -fno-toplevel-reorder by symbol?
- Automatic procedure for externally_visible discovery?
- Better syntax for symbol visible to toplevel assembler

Status
- Tested on x86, ARM, MIPS
- Up to date with Linux 3.8
- Some options disabled (function-trace, gcov, modversions)

References
- LTO kernel tree: http://github.com/andikleen/linux-misc
  - Branch lto-3.x (x increasing)
- Linux binutils: http://www.kernel.org/pub/linux/devel/binutils/
- Tcmalloc: http://code.google.com/p/gperftools/
- andi@firstfloor.org
- http://halobates.de

Backup

Future kernel improvements
- Avoid multi-link
- Use constructors to allow disabling -fno-toplevel-reorder
- Support new option to disable instrument_function

Quick HOWTO for LTO kernel
- Get git://github.com/andikleen/linux-misc lto-3.x tree
- Need gcc 4.7+ set up for LTO / Linux binutils
- Enable CONFIG_LTO
- Make sure FUNCTION_PROFILER et al. / modversions are disabled
- make CC=compiler LD=linux-ld AR=gcc-ld -jx
- Check LTO does not get disabled
- See Documentation/lto-build for more details