Link Time Optimizations: New Way to Do Compiler Optimizations

Posted on May 27, 2020 by Ivica Bogosavljevic. Posted in Memory Footprint, Performance, Toolchain and Performance.

Early in their career, every C/C++ developer has a eureka moment: the discovery of the optimization options in their compiler. Discovering that GCC offers the -O0 optimization level for regular debugging and -O3 for fast release code is a moment I will never forget. I just needed to compile my program with -O3 and everything would run faster, without any effort on my side.

Many years have passed since, and I believe everyone compiles their code in a similar way: we compile our .c or .cpp files using a compiler, with a certain optimization option (-O0 or -O3), to get object files (.o). Then the linker takes all our object files and merges them into one big executable or a library. In this story the linker is the last piece of the puzzle, and compared to what the compiler does, its job is much simpler. But by doing things this way, we are missing out on some important optimization opportunities. So, ladies and gentlemen, allow me to explain.

Table of Contents

1. Motivation
2. Link Time Optimizations
   * Introduction
   * Under the Hood
   * How to Enable Link Time Optimizations?
3. Experiments
   * ProjectX
   * ffmpeg
4. Final Words
5. Further Read

Motivation

As I said earlier, the compiler performs many optimizations on our code. Allow me to present two of them: inlining and loop merging. Have a look at the following code:

```c
int find_min(const int* array, const int len) {
    int min = array[0];
    for (int i = 1; i < len; i++) {
        if (array[i] < min) {
            min = array[i];
        }
    }
    return min;
}

int find_max(const int* array, const int len) {
    int max = array[0];
    for (int i = 1; i < len; i++) {
        if (array[i] > max) {
            max = array[i];
        }
    }
    return max;
}

int main() {
    int* array;
    int len, min, max;
    initialize_array(&array, &len);
    min = find_min(array, len);
    max = find_max(array, len);
    ...
}
```

This is a very simple program that initializes an array and then finds its minimum and maximum. Please pay attention to the functions find_min and find_max, and notice that they are very similar. When we compile an optimized version of our program, the compiler might inline find_min and find_max. Inlining simply means removing the calls to those functions and copying their bodies into the place of the call. After having done the inlining, the compiler sees that both find_min and find_max loop over the same range and that there are no data dependencies between them, so it decides to do loop merging: it merges the loops from find_min and find_max into a single loop.
This decreases the amount of work the CPU needs to do and also improves data cache utilization. So the optimized version might look like this:

```c
int main() {
    int* array;
    int len, min, max;
    initialize_array(&array, &len);
    min = array[0];  // compiler inlined find_min and find_max
    max = array[0];
    for (int i = 1; i < len; i++) {  // compiler merged the two loops into one
        if (array[i] < min) {
            min = array[i];
        }
        if (array[i] > max) {
            max = array[i];
        }
    }
    ...
}
```

In the optimized version, the calls to find_min and find_max have completely disappeared, which saves some function call time. Furthermore, the compiler merged the loops from find_min and find_max, so we now loop over the array only once. These two optimizations alone will benefit performance.

But everything we talked about here can happen only if main, find_min and find_max are in the same compilation unit (in C and C++, a compilation unit is a single source file, since compilers work file by file). If find_min and find_max are defined in find.c and main is defined in main.c, the compiler will not be able to perform these optimizations. While the compiler is compiling main.c, it is aware of find_min and find_max only through their signatures: it knows these two functions exist, but it doesn't know what they look like internally or what their addresses are. It will leave placeholders for the calls to these two functions, and later the linker will resolve those calls with the actual addresses. A minimal sketch of this two-file split follows below.
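To make the cross-file situation concrete, here is a minimal sketch of the split described above. The layout is illustrative: the header find.h and its contents are my assumption, not something shown in the article.

```c
/* find.h (hypothetical header) - declarations visible to other
   compilation units */
int find_min(const int* array, const int len);
int find_max(const int* array, const int len);

/* find.c - compiled on its own; the compiler emits find_min and
   find_max as ordinary out-of-line functions here */
#include "find.h"
/* ... definitions of find_min and find_max as shown earlier ... */

/* main.c - compiled separately; the compiler sees only the
   declarations from find.h, so it cannot inline the calls or merge
   the loops, and instead leaves placeholder calls for the linker */
#include "find.h"

int main() {
    int* array;
    int len;
    initialize_array(&array, &len);  /* helper from the running example */
    int min = find_min(array, len);
    int max = find_max(array, len);
    /* ... use min and max ... */
    return 0;
}
```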
If the compiler and linker work in this way, they miss out on important optimization potential. You have probably guessed it already: why doesn't the linker do some optimizations? The linker is aware of multiple compilation units and certainly could do them. Well, ladies and gentlemen, allow me to introduce:

Link Time Optimizations

Introduction

Link time optimizations (abbr. LTO) are just that: optimizations done during linking. The linker is aware of all the compilation units and can optimize more than a conventional compiler can. The most important optimizations the linker does are inlining and code locality improvements.

As already said, inlining just means copying the body of a function into the place of the call instead of calling the function directly. The linker inlines functions from other compilation units when that makes sense. Once a function is inlined, the linker has the opportunity to do other optimizations, similar to what compilers do when they compile a single file.

Regular compilers try to figure out which functions execute often and which execute rarely. Based on that, they move the functions that execute often close to one another in memory in order to improve code locality. Additionally, if a function A() calls a function B(), it makes sense to put them close to one another in memory for the same reason. Better locality means fewer page faults^1 and faster code. The linker knows much better how the functions from different compilation units call one another, and can therefore lay them out in memory for better code locality.

The result of LTO is a binary that is typically a few percent faster and a few percent smaller than the original. This is not much, but for performance-sensitive programs it can be a big deal. There is a downside to LTO, however: compiling and linking with LTO is much slower and uses more memory, and this doesn't scale well. Large projects, e.g. the Chrome or Firefox browsers, require a dedicated machine with a lot of memory to link. The reason is quite simple: the linker now has to keep all the compilation units in memory. Also, the total compilation time with LTO enabled is longer, and I think this is one of the reasons we haven't seen its wide adoption (developers are not patient people). It is very simple to enable LTO, and later we will enable it on some projects and measure the results.

Under the Hood

You can skip this part and move to the next section, as it is not needed in order to work with LTO. So how does LTO work? Before LTO, linkers were in charge only of linking: a linker would take all the generated object files, resolve each file's calls to functions in other object files, create a table of exported functions, and finally write the executable to disk. With LTO, the linker takes over a lot of the responsibilities compilers used to have, namely optimization.

Ordinary object files contain binary code, but it is impossible to do LTO on binary code. So the compiler needs to be run with LTO enabled in order to produce object files that can be used for LTO. When the compiler is started in LTO mode, it writes the intermediate representation of the code into special LTO sections of the object file. The intermediate representation is what compilers run their optimizations on; now that the optimization step has moved to the linking phase, this information is necessary for the linker as well. The line between compilers and linkers is getting blurry, and I wouldn't be surprised if one day it disappears and we are left with only one tool.

How to Enable Link Time Optimizations?

LTO has been supported by compilers for a long time: GCC 4.7, which is eight years old, supports it, and I assume all the other popular compilers have supported LTO for a similar amount of time. To enable LTO, follow these simple steps:

* Add the option -flto to the invocation of the compiler.
* Add the option -flto to the invocation of the linker. Additionally, you need to pass all the options from the compiler invocations to the linker invocation. So if you called your compiler with "-march=i486 -O3 -fno-stack-protector", you will need to pass the same options to the linker.

Now you compile your program as usual. Unless you are using a very old version of the compiler, you shouldn't expect any problems here. A command-line sketch follows below.
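As a minimal sketch, assuming the GCC toolchain and the two files from the running example (the file names are illustrative), an LTO build looks like this:

```bash
# Compile each file with -flto: the object files now carry the
# compiler's intermediate representation in special LTO sections.
gcc -O3 -flto -c find.c -o find.o
gcc -O3 -flto -c main.c -o main.o

# Optional: the LTO sections are visible in the object file.
readelf -S find.o | grep -i lto

# Link with -flto, repeating the compiler options. Cross-module
# inlining and the other link time optimizations happen at this step.
gcc -O3 -flto find.o main.o -o program
```

Note that the final step is invoked through the compiler driver (gcc), which passes the appropriate options through to the linker.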
Experiments

Now let's get our hands dirty. I have already run some tests on ProjectX^2, the commercial project I am working on, so I am going to recount my experiences with it. The second thing we are going to investigate is the impact of LTO on ffmpeg, open-source software used for manipulating media files.

ProjectX

ProjectX is a medium-sized project written in C++. There is no special team in charge of tools or performance^3, which corresponds to what we typically see in most medium-sized commercial projects. It is compiled using GCC 4.9.4, without LTO. One component of the system is performance critical, and I measured how the performance of that component behaves with respect to LTO. I picked a single test, since testing is slow, and repeated it 10 times. Here are the results:

| Phase | Without LTO | With LTO | Difference |
| --- | --- | --- | --- |
| Test runtime | 151.7 s | 138.7 s | +9.2% |
| Binary size: libA | 362 kB | 292 kB | +24% |
| Binary size: libB | 1698 kB | 1394 kB | +21.8% |
| Binary size: libC | 31.6 MB | 23.8 MB | +32.7% |
| Binary size: libD | 1204 kB | 1144 kB | +5.2% |
| Compile time for a single .cpp file | 20.2 s | 214.2 s | 10.6x slower |
| Linking time for libC (18 .o files) | 1.6 s | 63.9 s | 39.9x slower |
| Linking memory consumption for libC | 119 MB | 709 MB | 5.95x more |

Comparison between LTO and non-LTO compilation

As the table shows, the test runtime decreased by 9.2%, and binary sizes decreased by roughly 20% on average. But the resources needed for compilation increased dramatically: compiling a .cpp file is 10 times slower, linking is 40 times slower, and the memory needed for linking is almost 6 times higher. Please note that GCC 4.9 is really old, and it was only around that version that LTO first became correctly implemented. Let's try the next experiment; maybe we'll have more luck there.

ffmpeg

ffmpeg is a famous library for processing audio and video. We used ffmpeg 4.2.3, available from its website, and compiled it with GCC 8 and CLANG 9. The configure script of ffmpeg has an option to enable LTO, so that's just what we used. ffmpeg contains some handwritten assembly; we disabled it in order to measure the impact of LTO on the C code alone. The configure line we used is:

./configure --enable-lto --disable-inline-asm --disable-x86asm

Enabling LTO on GCC went easily. On CLANG, however, we needed to tweak the configure command line to force it to use the LLD linker^4. We also wanted to try out ThinLTO, a feature of CLANG that promises the same performance gains as regular LTO with compile times comparable to a regular, non-LTO build. The relevant CLANG flags are sketched below.
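As a hedged sketch (this is the general CLANG mechanism, not the exact configure invocation we used for ffmpeg), enabling ThinLTO means passing -flto=thin to both the compile and link steps, with LLD as the linker since it understands LLVM bitcode natively:

```bash
# Compile with ThinLTO: the object file contains LLVM bitcode plus a
# summary that lets the link-time step run in parallel and incrementally.
clang -O3 -flto=thin -c main.c -o main.o

# Link through the compiler driver, forcing the LLD linker.
clang -O3 -flto=thin -fuse-ld=lld main.o -o program
```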
To test the performance, we asked ffmpeg to convert an mpegts container with h264 video and aac audio to a matroska container with mpeg4 video and ac3 audio. The input file was 708 MB and the output was 349 MB. Here are the results of our measurements:

| Phase | GCC 8 no-LTO | GCC 8 LTO | CLANG 9 no-LTO | CLANG 9 LTO |
| --- | --- | --- | --- | --- |
| Conversion time | 1198 s | 1188 s | 884 s | 891 s |
| ffmpeg binary size | 17.1 MB | 17.7 MB | 18.7 MB | 20.5 MB |
| Compilation time (make -j4) | 626 s | 1369 s | 646 s | 1112 s |
| Compiling ffmpeg.c | 5.5 s | 1.1 s | 4.9 s | 2.6 s |
| Linking ffmpeg: time | 13.1 s | 1132 s | 7.7 s | 368 s |
| Linking ffmpeg: memory | 287 MB | 381 MB | 203 MB | 988 MB |

LTO to no-LTO comparison on various aspects

I was expecting something else, but facts are stubborn things. The first thing we see from the table is that CLANG generates a much faster binary than GCC, both without and with LTO. The CLANG version is one year younger than the GCC version, but the difference is still remarkable. GCC generates smaller binaries, and for both compilers the LTO binaries are bigger than the non-LTO binaries. Compilation takes twice as long with LTO enabled for both compilers. Linking time with LTO is huge for both compilers: 86 times slower on GCC and 48 times slower on CLANG. This experiment has shown that, for ffmpeg at least, enabling LTO doesn't bring the expected speedups.

Final Words

The original assumption was that turning on LTO would be followed by a speed improvement of a few percent and a binary size decrease of a few percent, at the price of more time and memory spent on compilation and linking. On ProjectX, with its several-year-old toolchain, that is exactly what happened. ffmpeg tells a completely different story. What I assume happened is that ffmpeg has already been highly optimized for speed: the developers moved all the small functions into headers so they can easily be inlined by the compiler. Also, ffmpeg is written in C, which is much easier to keep low-overhead than C++. Most of the optimization potential has been used up already, and the linker can't do much more.

I think that for most projects that haven't been aggressively optimized for speed, it makes sense to give LTO a try, and I would expect the promises LTO makes about speed and binary size to be fulfilled most of the time. But in the end, the ultimate indication of whether LTO should be enabled or not is the performance numbers of your specific project.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us, or follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available.

Further Read

* Optimizing real-world applications with GCC Link Time Optimization
* LLVM Blog: ThinLTO: Scalable and Incremental LTO
* Honza Hubicka's Blog: GCC 9: Link-time and inter-procedural optimization improvements

Footnotes

1. Page faults happen when your program calls another function or tries to access another part of memory, but that part of memory hasn't been loaded from disk yet. The operating system then needs to load the missing part from disk, which slows the program down.
2. ProjectX is a made-up name, since I am not allowed to disclose the details of the project.
3. Large software projects, for example Chromium or Firefox, have teams whose only job is tweaking system performance.
4. The LLD linker is a linker provided by LLVM, the project that also provides CLANG. It understands CLANG's intermediate representation out of the box and can link LTO-enabled object files.

Comments

Charles Van Noland, November 5, 2021:
I've seen link-time optimizations provide significant speed-ups to code (with GCC at least) but it's not universal. It depends on what the code is like, what it does and how it does it. I've not yet seen code run slower as a result of LTO, so there's that.

Mladen, April 13, 2022:
Are there any drawbacks? Can this optimization make something go wrong, and how can we be aware of that?

Ivica Bogosavljevic, April 13, 2022:
In principle, these optimizations should not make code slower. Of course, there can be some bug in the compiler/linker that causes a performance regression, but in that case you should file a bug.

tlanyan, March 5, 2024:
I wonder if there is also a performance gap versus GCC when compiling ProjectX with Clang?