Link Time Optimizations: New Way to Do Compiler Optimizations

Posted on May 27, 2020 by Ivica Bogosavljevic. Posted in Memory Footprint, Performance, Toolchain and Performance.

Early in their career, every C/C++ developer has a eureka moment: the discovery of the optimization options in their compiler. Discovering that GCC offers the -O0 optimization level for regular debugging and -O3 for fast release code is a moment I will never forget. I just needed to compile my program with -O3 and everything would run faster, without any effort on my side.

Many years have passed since, and I believe everyone compiles their code in a similar way: we compile our .c or .cpp files using a compiler, with a certain optimization option (-O0 or -O3), to get object files (.o). Then the linker takes all our object files and merges them into one big executable or a library. In this story the linker is the last piece of the puzzle, and compared to what the compiler does, its job is much simpler. But by doing things this way, we are missing out on some important optimization opportunities. So, ladies and gentlemen, allow me to explain.

Table of Contents

1. Motivation
2. Link Time Optimizations
   * Introduction
   * Under the Hood
   * How to Enable Link Time Optimizations?
3. Experiments
   * ProjectX
   * ffmpeg
4. Final Words
5. Further Read

Motivation

As I said earlier, the compiler performs many optimizations on our code. Allow me to present two of them: inlining and loop merging. Have a look at the following code:

```c
int find_min(const int* array, const int len) {
    int min = array[0];
    for (int i = 1; i < len; i++) {
        if (array[i] < min) {
            min = array[i];
        }
    }
    return min;
}

int find_max(const int* array, const int len) {
    int max = array[0];
    for (int i = 1; i < len; i++) {
        if (array[i] > max) {
            max = array[i];
        }
    }
    return max;
}

int main() {
    int* array;
    int len, min, max;
    initialize_array(&array, &len);
    min = find_min(array, len);
    max = find_max(array, len);
    ...
}
```

This is a very simple program that initializes an array and then finds its minimum and maximum. Please pay attention to the functions find_min and find_max, and notice that they are very similar. When we compile an optimized version of our program, the compiler might inline find_min and find_max. Inlining simply means removing the calls to those functions and copying their bodies into the place of the call. After having done the inlining, the compiler sees that both find_min and find_max loop over the same range and that there are no data dependencies between them, so it decides to do loop merging: it merges the loops from find_min and find_max into a single loop.
This decreases the amount of work the CPU needs to do and also improves data cache utilization. So the optimized version might look like this:

```c
int main() {
    int* array;
    int len, min, max;
    initialize_array(&array, &len);
    min = array[0];  // compiler inlined find_min and find_max
    max = array[0];
    for (int i = 1; i < len; i++) {  // compiler merged the two loops into one
        if (array[i] < min) {
            min = array[i];
        }
        if (array[i] > max) {
            max = array[i];
        }
    }
    ...
}
```

In the optimized version, the calls to find_min and find_max have completely disappeared, which saves some function call time. Furthermore, the compiler merged the loops from find_min and find_max, so we now loop over the array only once. These two optimizations alone will benefit performance.

But everything we talked about here can happen only if main, find_min and find_max are in the same compilation unit (in C and C++, a compilation unit is a single source file, since compilers work file by file). If find_min and find_max are defined in find.c and main is defined in main.c, the compiler will not be able to perform these optimizations. While the compiler is compiling main.c, it is aware of find_min and find_max only through their signatures: it knows these two functions exist, but it doesn't know what they look like internally or what their addresses are. It will leave placeholders for the calls to these two functions, and later the linker will resolve those calls with the actual addresses. A minimal sketch of this two-file split follows below.
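To make the cross-file situation concrete, here is a minimal sketch of the split described above. The layout is illustrative: the header find.h and its contents are my assumption, not something shown in the article.

```c
/* find.h (hypothetical header) - declarations visible to other
   compilation units */
int find_min(const int* array, const int len);
int find_max(const int* array, const int len);

/* find.c - compiled on its own; the compiler emits find_min and
   find_max as ordinary out-of-line functions here */
#include "find.h"
/* ... definitions of find_min and find_max as shown earlier ... */

/* main.c - compiled separately; the compiler sees only the
   declarations from find.h, so it cannot inline the calls or merge
   the loops, and instead leaves placeholder calls for the linker */
#include "find.h"

int main() {
    int* array;
    int len;
    initialize_array(&array, &len);  /* helper from the running example */
    int min = find_min(array, len);
    int max = find_max(array, len);
    /* ... use min and max ... */
    return 0;
}
```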
If the compiler and linker work in this way, they miss out on important optimization potential. You have probably guessed it already: why doesn't the linker do some optimizations? The linker is aware of multiple compilation units and certainly could do them. Well, ladies and gentlemen, allow me to introduce:

Link Time Optimizations

Introduction

Link time optimizations (abbr. LTO) are just that: optimizations done during linking. The linker is aware of all the compilation units and can optimize more than a conventional compiler can. The most important optimizations the linker does are inlining and code locality improvements.

As already said, inlining just means copying the body of a function into the place of the call instead of calling the function directly. The linker inlines functions from other compilation units when that makes sense. Once a function is inlined, the linker has the opportunity to do other optimizations, similar to what compilers do when they compile a single file.

Regular compilers try to figure out which functions execute often and which execute rarely. Based on that, they move the functions that execute often close to one another in memory in order to improve code locality. Additionally, if a function A() calls a function B(), it makes sense to put them close to one another in memory for the same reason. Better locality means fewer page faults^1 and faster code. The linker knows much better how the functions from different compilation units call one another, and can therefore lay them out in memory for better code locality.

The result of LTO is a binary that is typically a few percent faster and a few percent smaller than the original. This is not much, but for performance-sensitive programs it can be a big deal. There is a downside to LTO, however: compiling and linking with LTO is much slower and uses more memory, and this doesn't scale well. Large projects, e.g. the Chrome or Firefox browsers, require a dedicated machine with a lot of memory to link. The reason is quite simple: the linker now has to keep all the compilation units in memory. Also, the total compilation time with LTO enabled is longer, and I think this is one of the reasons we haven't seen its wide adoption (developers are not patient people). It is very simple to enable LTO, and later we will enable it on some projects and measure the results.

Under the Hood

You can skip this part and move to the next section, as it is not needed in order to work with LTO. So how does LTO work? Before LTO, linkers were in charge only of linking: a linker would take all the generated object files, resolve each file's calls to functions in other object files, create a table of exported functions, and finally write the executable to disk. With LTO, the linker takes over a lot of the responsibilities compilers used to have, namely optimization.

Ordinary object files contain binary code, but it is impossible to do LTO on binary code. So the compiler needs to be run with LTO enabled in order to produce object files that can be used for LTO. When the compiler is started in LTO mode, it writes the intermediate representation of the code into special LTO sections of the object file. The intermediate representation is what compilers run their optimizations on; now that the optimization step has moved to the linking phase, this information is necessary for the linker as well. The line between compilers and linkers is getting blurry, and I wouldn't be surprised if one day it disappears and we are left with only one tool.

How to Enable Link Time Optimizations?

LTO has been supported by compilers for a long time: GCC 4.7, which is eight years old, supports it, and I assume all the other popular compilers have supported LTO for a similar amount of time. To enable LTO, follow these simple steps:

* Add the option -flto to the invocation of the compiler.
* Add the option -flto to the invocation of the linker. Additionally, you need to pass all the options from the compiler invocations to the linker invocation. So if you called your compiler with "-march=i486 -O3 -fno-stack-protector", you will need to pass the same options to the linker.

Now you compile your program as usual. Unless you are using a very old version of the compiler, you shouldn't expect any problems here. A command-line sketch follows below.
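As a minimal sketch, assuming the GCC toolchain and the two files from the running example (the file names are illustrative), an LTO build looks like this:

```bash
# Compile each file with -flto: the object files now carry the
# compiler's intermediate representation in special LTO sections.
gcc -O3 -flto -c find.c -o find.o
gcc -O3 -flto -c main.c -o main.o

# Optional: the LTO sections are visible in the object file.
readelf -S find.o | grep -i lto

# Link with -flto, repeating the compiler options. Cross-module
# inlining and the other link time optimizations happen at this step.
gcc -O3 -flto find.o main.o -o program
```

Note that the final step is invoked through the compiler driver (gcc), which passes the appropriate options through to the linker.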
Experiments

Now let's get our hands dirty. I have already run some tests on ProjectX^2, the commercial project I am working on, so I am going to recount my experiences with it. The second thing we are going to investigate is the impact of LTO on ffmpeg, open-source software used for manipulating media files.

ProjectX

ProjectX is a medium-sized project written in C++. There is no special team in charge of tools or performance^3, which corresponds to what we typically see in most medium-sized commercial projects. It is compiled using GCC 4.9.4, without LTO. One component of the system is performance critical, and I measured how the performance of that component behaves with respect to LTO. I picked a single test, since testing is slow, and repeated it 10 times. Here are the results:

| Phase | Without LTO | With LTO | Difference |
| --- | --- | --- | --- |
| Test runtime | 151.7 s | 138.7 s | +9.2% |
| Binary size: libA | 362 kB | 292 kB | +24% |
| Binary size: libB | 1698 kB | 1394 kB | +21.8% |
| Binary size: libC | 31.6 MB | 23.8 MB | +32.7% |
| Binary size: libD | 1204 kB | 1144 kB | +5.2% |
| Compile time for a single .cpp file | 20.2 s | 214.2 s | 10.6x slower |
| Linking time for libC (18 .o files) | 1.6 s | 63.9 s | 39.9x slower |
| Linking memory consumption for libC | 119 MB | 709 MB | 5.95x more |

Comparison between LTO and non-LTO compilation

As the table shows, the test runtime decreased by 9.2%, and binary sizes decreased by roughly 20% on average. But the resources needed for compilation increased dramatically: compiling a .cpp file is 10 times slower, linking is 40 times slower, and the memory needed for linking is almost 6 times higher. Please note that GCC 4.9 is really old, and it was only around that version that LTO first became correctly implemented. Let's try the next experiment; maybe we'll have more luck there.

ffmpeg

ffmpeg is a famous library for processing audio and video. We used ffmpeg 4.2.3, available from its website, and compiled it with GCC 8 and CLANG 9. The configure script of ffmpeg has an option to enable LTO, so that's just what we used. ffmpeg contains some handwritten assembly; we disabled it in order to measure the impact of LTO on the C code alone. The configure line we used is:

./configure --enable-lto --disable-inline-asm --disable-x86asm

Enabling LTO on GCC went easily. On CLANG, however, we needed to tweak the configure command line to force it to use the LLD linker^4. We also wanted to try out ThinLTO, a feature of CLANG that promises the same performance gains as regular LTO with compile times comparable to a regular, non-LTO build. The relevant CLANG flags are sketched below.
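As a hedged sketch (this is the general CLANG mechanism, not the exact configure invocation we used for ffmpeg), enabling ThinLTO means passing -flto=thin to both the compile and link steps, with LLD as the linker since it understands LLVM bitcode natively:

```bash
# Compile with ThinLTO: the object file contains LLVM bitcode plus a
# summary that lets the link-time step run in parallel and incrementally.
clang -O3 -flto=thin -c main.c -o main.o

# Link through the compiler driver, forcing the LLD linker.
clang -O3 -flto=thin -fuse-ld=lld main.o -o program
```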
To test the performance, we asked ffmpeg to convert an mpegts container with h264 video and aac audio to a matroska container with mpeg4 video and ac3 audio. The input file was 708 MB and the output was 349 MB. Here are the results of our measurements:

| Phase | GCC 8 no-LTO | GCC 8 LTO | CLANG 9 no-LTO | CLANG 9 LTO |
| --- | --- | --- | --- | --- |
| Conversion time | 1198 s | 1188 s | 884 s | 891 s |
| ffmpeg binary size | 17.1 MB | 17.7 MB | 18.7 MB | 20.5 MB |
| Compilation time (make -j4) | 626 s | 1369 s | 646 s | 1112 s |
| Compiling ffmpeg.c | 5.5 s | 1.1 s | 4.9 s | 2.6 s |
| Linking ffmpeg: time | 13.1 s | 1132 s | 7.7 s | 368 s |
| Linking ffmpeg: memory | 287 MB | 381 MB | 203 MB | 988 MB |

LTO to no-LTO comparison on various aspects

I was expecting something else, but facts are stubborn things. The first thing we see from the table is that CLANG generates a much faster binary than GCC, both without and with LTO. The CLANG version is one year younger than the GCC version, but the difference is still remarkable. GCC generates smaller binaries, and for both compilers the LTO binaries are bigger than the non-LTO binaries. Compilation takes twice as long with LTO enabled for both compilers. Linking time with LTO is huge for both compilers: 86 times slower on GCC and 48 times slower on CLANG. This experiment has shown that, for ffmpeg at least, enabling LTO doesn't bring the expected speedups.

Final Words

The original assumption was that turning on LTO would be followed by a speed improvement of a few percent and a binary size decrease of a few percent, at the price of more time and memory spent on compilation and linking. On ProjectX, with its several-year-old toolchain, that is exactly what happened. ffmpeg tells a completely different story. What I assume happened is that ffmpeg has already been highly optimized for speed: the developers moved all the small functions into headers so they can easily be inlined by the compiler. Also, ffmpeg is written in C, which is much easier to keep low-overhead than C++. Most of the optimization potential has been used up already, and the linker can't do much more.

I think that for most projects that haven't been aggressively optimized for speed, it makes sense to give LTO a try, and I would expect the promises LTO makes about speed and binary size to be fulfilled most of the time. But in the end, the ultimate indication of whether LTO should be enabled or not is the performance numbers of your specific project.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us, or follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available.

Further Read

* Optimizing real-world applications with GCC Link Time Optimization
* LLVM Blog: ThinLTO: Scalable and Incremental LTO
* Honza Hubicka's Blog: GCC 9: Link-time and inter-procedural optimization improvements

Footnotes

1. Page faults happen when your program calls another function or tries to access another part of memory, but that part of memory hasn't been loaded from disk yet. The operating system then needs to load the missing part from disk, which slows the program down.
2. ProjectX is a made-up name, since I am not allowed to disclose the details of the project.
3. Large software projects, for example Chromium or Firefox, have teams whose only job is tweaking system performance.
4. The LLD linker is a linker provided by LLVM, the project that also provides CLANG. It understands CLANG's intermediate representation out of the box and can link LTO-enabled object files.

Comments

Charles Van Noland, November 5, 2021:
I've seen link-time optimizations provide significant speed-ups to code (with GCC at least) but it's not universal. It depends on what the code is like, what it does and how it does it. I've not yet seen code run slower as a result of LTO, so there's that.

Mladen, April 13, 2022:
Are there any drawbacks? Can this optimization make something go wrong, and how can we be aware of that?

Ivica Bogosavljevic, April 13, 2022:
In principle, these optimizations should not make code slower. Of course, there can be some bug in the compiler/linker that causes a performance regression, but in that case you should file a bug.

tlanyan, March 5, 2024:
I wonder if there is also a performance gap versus GCC when compiling ProjectX with Clang?