[HN Gopher] The Performance Impact of C++'s `final` Keyword
       ___________________________________________________________________
        
       The Performance Impact of C++'s `final` Keyword
        
       Author : hasheddan
       Score  : 233 points
       Date   : 2024-04-22 17:32 UTC (1 day ago)
        
 (HTM) web link (16bpp.net)
 (TXT) w3m dump (16bpp.net)
        
       | mgaunard wrote:
       | What final enables is devirtualization in certain cases. The main
       | advantage of devirtualization is that it is necessary for
       | inlining.
       | 
       | Inlining has other requirements as well -- LTO pretty much covers
       | it.
       | 
       | The article doesn't have sufficient data to tell whether the
       | testcase is built in such a way that any of these optimizations
       | can happen or is beneficial.
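        | 
        | To make that concrete, a minimal sketch (type and function names
        | made up for illustration):
        | 
        |     struct Shape {
        |       virtual ~Shape() = default;
        |       virtual double area() const = 0;
        |     };
        | 
        |     struct Circle final : Shape {
        |       double r = 1.0;
        |       double area() const override { return 3.14159 * r * r; }
        |     };
        | 
        |     double f(const Circle& c) {
        |       // Circle is final, so c's dynamic type can only be Circle;
        |       // the compiler may call (and then inline) Circle::area
        |       // directly instead of going through the vtable.
        |       return c.area();
        |     }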
        
         | i80and wrote:
         | If you already have LTO, can't the compiler determine this
         | information for devirtualization purposes on its own?
        
           | nickwanninger wrote:
           | At the level that LLVM's LTO operates, no information about
           | classes or objects is left, so LLVM itself can't really
           | devirtualize C++ methods in most cases
        
             | nwallin wrote:
             | You appear to be correct. Clang does not devirtualize in
             | LTO, but GCC does. Personally I consider this very strange.
              |     $ cat animal.h cat.cpp main.cpp
              |     // animal.h
              |     #pragma once
              | 
              |     class animal {
              |     public:
              |       virtual ~animal() {}
              |       virtual void speak() = 0;
              |     };
              | 
              |     animal& get_mystery_animal();
              | 
              |     // cat.cpp
              |     #include "animal.h"
              |     #include <cstdio>
              |     class cat final : public animal {
              |     public:
              |       ~cat() override{}
              |       void speak() override{
              |         puts("meow");
              |       }
              |     };
              |     static cat garfield{};
              | 
              |     animal& get_mystery_animal() {
              |       return garfield;
              |     }
              | 
              |     // main.cpp
              |     #include "animal.h"
              |     int main() {
              |       animal& a = get_mystery_animal();
              |       a.speak();
              |     }
              | 
              |     $ make clean && CXX=clang++ make -j && \
              |         objdump --disassemble=main -C lto_test
              |     rm -f *.o lto_test
              |     clang++ -c -flto -O3 -g cat.cpp -o cat.o
              |     clang++ -c -flto -O3 -g main.cpp -o main.o
              |     clang++ -flto -O3 -g cat.o main.o -o lto_test
              | 
              |     lto_test:     file format elf64-x86-64
              | 
              |     Disassembly of section .init:
              |     Disassembly of section .plt:
              |     Disassembly of section .plt.got:
              |     Disassembly of section .text:
              | 
              |     00000000000011b0 <main>:
              |       11b0: 50                    push   %rax
              |       11b1: 48 8b 05 58 2e 00 00  mov    0x2e58(%rip),%rax  # 4010 <garfield>
              |       11b8: 48 8d 3d 51 2e 00 00  lea    0x2e51(%rip),%rdi  # 4010 <garfield>
              |       11bf: ff 50 10              call   *0x10(%rax)
              |       11c2: 31 c0                 xor    %eax,%eax
              |       11c4: 59                    pop    %rcx
              |       11c5: c3                    ret
              | 
              |     Disassembly of section .fini:
              | 
              |     $ make clean && CXX=g++ make -j && \
              |         objdump --disassemble=main -C lto_test | sed -e 's,^,    ,'
              |     rm -f *.o lto_test
              |     g++ -c -flto -O3 -g cat.cpp -o cat.o
              |     g++ -c -flto -O3 -g main.cpp -o main.o
              |     g++ -flto -O3 -g cat.o main.o -o lto_test
              | 
              |     lto_test:     file format elf64-x86-64
              | 
              |     Disassembly of section .init:
              |     Disassembly of section .plt:
              |     Disassembly of section .plt.got:
              |     Disassembly of section .text:
              | 
              |     0000000000001090 <main>:
              |       1090: 48 83 ec 08           sub    $0x8,%rsp
              |       1094: 48 8d 3d 75 2f 00 00  lea    0x2f75(%rip),%rdi  # 4010 <garfield>
              |       109b: e8 50 01 00 00        call   11f0 <cat::speak()>
              |       10a0: 31 c0                 xor    %eax,%eax
              |       10a2: 48 83 c4 08           add    $0x8,%rsp
              |       10a6: c3                    ret
              | 
              |     Disassembly of section .fini:
        
               | ranger_danger wrote:
               | What if you add -fwhole-program-vtables on clang?
        
           | wiml wrote:
           | If your runtime environment has dynamic linking, then the LTO
           | pass can't always be sure that a subclass won't be introduced
           | later that overrides the method.
        
             | i80and wrote:
             | Aha! That makes sense. I wasn't thinking of that case.
             | Thanks!
        
             | gpderetta wrote:
             | You can tell the compiler it is indeed compiling the whole
             | program.
        
           | adzm wrote:
           | MSVC with LTO and PGO will inline virtual calls in some
           | situations along with a check for the expected vtable,
           | bypassing the inlined code and calling the virtual function
           | normally if it is an unexpected value.
        
           | bluGill wrote:
            | Not if there is a shared library or other plugin. Then you
            | cannot determine until runtime whether there is an override.
        
           | ot wrote:
           | In general the compiler/linker cannot assume that derived
           | classes won't arrive later through a shared object.
           | 
           | You can tell it "I won't do that" though with additional
           | flags, like Clang's -fwhole-program-vtables, and even then
           | it's not that simple. There was an effort in Clang to better
           | support whole program devirtualization, but I haven't been
           | following what kind of progress has been made:
           | https://groups.google.com/g/llvm-dev/c/6LfIiAo9g68?pli=1
        
             | Slix wrote:
             | This optimization option isn't on by default? That sounds
             | like a lot of missed optimization. Most programs aren't
             | going to be loading from shared libraries.
             | 
             | Maybe I can set this option at work. Though it's scary
             | because I'd have to be certain.
        
               | soooosn wrote:
               | I think you have answered your own question: If turning
               | on the setting is scary for you in a very localized
               | project at your company, imagine how scary it would be to
               | turn on by default for everybody :-P
        
               | Thiez wrote:
               | The JVM can actually perform this optimization
               | optimistically and can undo it if the assumption is
               | violated at runtime. So Java's 'everything is virtual by
                | default' approach doesn't hurt. Of course, relying on a
                | sufficiently smart JIT comes with its own trade-offs.
        
           | samus wrote:
           | This is one of the cases where JIT compiling can shine. You
           | can use a bazillion interfaces to decouple application code,
           | and the JIT will optimize the calls after it found out which
            | implementation is used. This works as long as only one or two
            | of them are actually active at runtime.
        
             | account42 wrote:
             | You don't need a JIT to do whole program optimization.
        
               | samus wrote:
               | AOT whole program optimization has two limits:
               | 
               | * It is possible with `dlopen()` to load code objects
               | that violate the assumptions made during compilation.
               | 
               | * The presence of runtime configuration mechanisms and
               | application input can make it impossible to anticipate
               | things like the choice of implementations of an
               | interface.
               | 
               | One can always strive to reduce such situations, but it
               | might simply not be necessary if a JIT is present.
        
         | Negitivefrags wrote:
         | See this is why I find this odd.
         | 
         | Is there a theory as to how devirtualisation could hurt
         | performance?
        
           | samus wrote:
            | Devirtualization itself, maybe not, but the inlining it
            | enables might make code fail to fit into instruction caches.
        
           | hansvm wrote:
           | There's a cost to loading more instructions, especially if
           | you have more types of instructions.
           | 
           | The main advantages to inlining are (1) avoiding a jump and
           | other function call overhead, (2) the ability to push down
           | optimizations.
           | 
           | If you execute the "same" code (same instructions, different
           | location) in many places that can cause cache evictions and
           | other slowdowns. It's worse if some minor optimizations were
           | applied by the inlining, so you have more types of
           | instructions to unpack.
           | 
           | The question, roughly, is whether the gains exceed the costs.
           | This can be a bit hard to determine because it can depend on
           | the size of the whole program and other non-local parameters,
           | leading to performance cliffs at various stages of
           | complexity. Microbenchmarks will tend to suggest inlining is
            | better in more cases than it actually is.
           | 
           | Over time you get a feel for which functions should be
           | inlined. E.g., very often you'll have guard clauses or
           | whatnot around a trivial amount of work when the caller is
           | expected to be able to prove the guarded information at
           | compile-time. A function call takes space in the generated
           | assembly too, and if you're only guarding a few instructions
           | it's usually worth forcing an inline (even in places where
           | the compiler's heuristics would choose not to because the
           | guard clauses take up too much space), regardless of the
           | potential cache costs.
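            | 
            | For instance, the kind of guard-clause helper meant here (an
            | illustrative sketch, names made up):
            | 
            |     // The call itself would cost about as much as the body,
            |     // so forcing inlining is usually a win even when the
            |     // compiler's size heuristics would decline.
            |     __attribute__((always_inline)) inline
            |     int clamped_index(int i, int n) {
            |       if (i < 0)  return 0;      // guards the caller can often
            |       if (i >= n) return n - 1;  // prove away at compile time
            |       return i;
            |     }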
        
           | masklinn wrote:
           | Code bloat causing icache evictions?
        
           | cogman10 wrote:
           | Through inlining.
           | 
           | If you have something like a `while` loop and that while
           | loop's instructions fit neatly on the cache line, then
            | executing that loop can be quite fast even if you have to
           | jump to different code locations to do the internals.
           | However, if you pump in more instructions in that loop you
           | can exceed the length of the cache line which causes you to
           | need more memory loads to do the same work.
           | 
           | It can also create more code. A method that took a
           | `foo(NotFinal& bar)` could be duplicated by the compiler for
           | the specialized cases which would be bad if there's a lot of
           | implementations of `NotFinal` that end up being marshalled
           | into foo. You could end up loading multiple implementations
           | of the same function which may be slower than just keeping
           | the virtual dispatch tables warm.
        
           | phire wrote:
            | Jumps/calls are actually pretty cheap with modern branch
            | predictors. Even indirect calls through vtables, which is the
            | opposite of most programmers' intuition.
            | 
            | And if the devirtualisation leads to inlining, that results
            | in code bloat which can lower performance through more
            | instruction cache misses, which are not cheap.
           | 
           | Inlining is actually pretty evil. It almost always speeds
           | things up for microbenchmarks, as such benchmarks easily fit
           | in icache. So programmers and modern compilers often go out
           | of their way to do more inlining. But when you apply too much
           | inlining to a whole program, things start to slow down.
           | 
            | But it's not like inlining is universally bad in larger
            | programs; inlining can enable further optimisations, mostly
           | because it allows constant propagation to travel across
           | function boundaries.
           | 
           | Basically, compilers need better heuristics about when they
           | should be inlining. If it's just saving the overhead of a
           | lightweight call, then they shouldn't be inlining.
        
             | qsdf38100 wrote:
             | "Inlining is actually pretty evil".
             | 
             | No it's not. Except if you __force_inline__ everything, of
             | course.
             | 
              | Inlining reduces the number of instructions in a lot of
              | cases, especially when things are abstracted and factored
              | with lots of indirection into small functions that call
              | other small functions and so on. Consider an 'isEmpty'
              | function, which dissolves to one CPU instruction once
              | inlined, compared with a call/save reg/compare/return.
              | Highly dynamic code (with most functions being virtual)
              | tends to result in a fest of chained calls, jumping into
              | functions doing very little work. Yes, the stack is usually
             | hot and fast, but spending 80% of the instructions doing
             | stack management is still a big waste.
             | 
             | Compilers already have good heuristics about when they
             | should be inlining, chances are they are a lot better at it
             | than you. They don't always inline, and that's not possible
             | anyway.
             | 
              | My experience is that compilers do marvels with inlining
             | decisions when there are lots of small functions they _can_
             | inline if they want to. It gives the compiler a lot of
             | freedom. Lambdas are great for that as well.
             | 
             | Make sure you make the most possible compile-time
             | information available to the compiler, factor your code,
             | don't have huge functions, and let the compiler do its
             | magic. As a plus, you can have high level abstractions,
              | deep hierarchies, and still get excellent performance.
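              | 
              | A sketch of the 'isEmpty' case (illustrative only):
              | 
              |     #include <vector>
              | 
              |     struct queue_view {
              |       const std::vector<int>* v;
              |       // dissolves to a load and a compare once inlined
              |       bool isEmpty() const { return v->empty(); }
              |     };
              | 
              |     int first_or_zero(const queue_view& q) {
              |       // no call/save-regs/compare/return sequence left here
              |       return q.isEmpty() ? 0 : q.v->front();
              |     }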
        
               | grdbjydcv wrote:
               | The "evilness" is just that sometimes if you inline
               | aggressively in a microbenchmark things get faster but in
               | real programs things get slower.
               | 
               | As you say: "chances are they are a lot better at it than
               | you". Infrequently they are not.
        
               | EasyMark wrote:
               | doesn't the compiler usually do well enough that you
               | really only need to worry about time critical sections of
               | code? Even then you could go in and look at the assembler
               | and see if it's being inlined, no?
        
               | usefulcat wrote:
               | I find that gcc and clang are so aggressive about
               | inlining that it's usually more effective to tell them
               | what _not_ to inline.
               | 
               | In a moderately-sized codebase I regularly work on, I use
               | __attribute__((noinline)) nearly ten times as often as
               | __attribute__((always_inline)). And I use
               | __attribute__((cold)) even more than noinline.
               | 
               | So yeah, I can kind of see why someone would say inlining
               | is 'evil', though I think it's more accurate to say that
               | it's just not possible for compilers to figure out these
               | kinds of details without copious hints (like PGO).
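                | 
                | Roughly the pattern (GCC/Clang attribute spellings; the
                | functions are made up):
                | 
                |     #include <cstddef>
                |     #include <cstdio>
                |     #include <vector>
                | 
                |     // Error path: keep it out of the hot code and away
                |     // from the inliner.
                |     __attribute__((cold, noinline))
                |     static void report_overflow(std::size_t n) {
                |       std::fprintf(stderr, "overflow at %zu\n", n);
                |     }
                | 
                |     inline bool push_checked(std::vector<int>& v, int x,
                |                              std::size_t limit) {
                |       if (v.size() >= limit) {   // rarely taken
                |         report_overflow(v.size());
                |         return false;
                |       }
                |       v.push_back(x);
                |       return true;
                |     }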
        
               | jandrewrogers wrote:
               | +1 on the __attribute__((cold)). Compilers so
               | aggressively optimize based on their heuristics that you
               | spend more time telling them that an apparent
               | optimization opportunity is not actually an optimization.
               | 
               | When writing ultra-robust code that has to survive every
               | vaguely plausible contingency in a graceful way, the code
               | is littered with code paths that only exist for
               | astronomically improbable situations. The branch
               | predictor can figure this out but the compiler frequently
               | cannot without explicit instructions to not pollute the
               | i-cache.
        
               | somenameforme wrote:
               | I find the Unreal Engine source to be a reasonable
               | reference for C++ discussions, because it runs just
               | unbelievably well for what it does, and on a huge array
               | of hardware (and software). And it's explicit with
               | inlining, other hints, and even a million things that
               | could be easily called micro-optimizations, to a somewhat
               | absurd degree. So I'd take away two conclusions from
               | this.
               | 
               | The first is that when building a code base you don't
               | necessarily know what it's being compiled with. And so
                | even _if_ there were a super-amazing compiler, there's
               | no guarantee that's what will be compiling your code.
               | Making it explicit, so long as you have a reasonably good
               | idea of what you're doing, is generally just a good idea.
               | It also conveys intent to some degree, especially things
               | like final.
               | 
                | The second is that I think the saying _'premature
                | optimization is the root of all evil'_ is the root of all
                | evil. Because that mindset has gradually transitioned to
                | being against optimization in general, outside of the most
                | primitive things like not running critical sections in
                | O(N^2) when they could be O(N). And I think it's this
                | mindset that has gradually brought us to where we are
                | today, where we need what would have been a literal
                | supercomputer not that long ago to run a word processor.
                | It's like death by a thousand cuts, and quite ridiculous.
        
             | a_e_k wrote:
             | Another for the pro side: inlining can allow for better
             | branch prediction if the different call sites would tend to
             | drive different code paths in the function.
        
               | phire wrote:
               | This was true 15 years ago, but not so much today.
               | 
               | The branch predictors actually hash the history of the
               | last few branches taken into the branch prediction query.
                | So the exact same branch within a child function will map
                | to different branch predictor entries depending on which
                | parent function it was called from, and there is no
                | benefit to inlining.
                | 
                | It also means that the branch predictor can learn
                | correlations between branches within a function, like
                | when branches at the top and bottom of a function share
                | conditions, or have inverted conditions.
        
           | neonsunset wrote:
           | Practically - it never does. It is always cheaper to perform
           | a direct, possibly inlined, call (devirtualization !=
           | inlining) than a virtual one.
           | 
            | Guarded devirtualization is also cheaper than virtual calls,
            | even when it has to do
            | 
            |     if (instance is SpecificType st) { st.Call() }
            |     else { instance.Call() }
           | 
           | or even chain multiple checks at once (with either regular
           | ifs or emitting a jump table)
           | 
           | This technique is heavily used in various forms by .NET, JVM
           | and JavaScript JIT implementations (other platforms also do
           | that, but these are the major ones)
           | 
           | The first two devirtualize virtual and interface calls
           | (important in Java because all calls default to virtual,
           | important in C# because people like to abuse interfaces and
           | occasionally inheritance, C# delegates are also
            | devirtualized/inlined now). JS JITs (like V8) perform
            | "inline caching", which is similar: for known object shapes,
            | a property access becomes a shape identifier comparison plus
            | a direct property read, instead of a keyed lookup which is
            | way more expensive.
        
             | ynik wrote:
             | Caution! If you compare across languages like that, not all
             | virtual calls are implemented equally. A C++ virtual call
             | is just a load from a fixed offset in the vtbl followed by
             | an indirect call. This is fairly cheap, on modern CPUs
             | pretty much the same as a non-virtual non-inlined call. A
             | Java/C# interface call involves a lot more stuff, because
             | there's no single fixed vtbl offset that's valid for all
             | classes implementing the interface.
        
               | neonsunset wrote:
                | Yes, it is true that there is a difference. I'm not sure
                | about JVM implementation details, but the reason the
                | comment says "virtual _and_ interface" calls is to
                | highlight it. Virtual calls in .NET are sufficiently
               | close[0] to virtual calls in C++. Interface calls,
               | however, are coded differently[1].
               | 
               | Also you are correct - virtual calls are not terribly
               | expensive, but they encroach on ever limited* CPU
               | resources like indirect jump and load predictors and, as
               | noted in parent comments, block inlining, which is highly
               | undesirable.
               | 
               | [0] https://github.com/dotnet/runtime/blob/5111fdc0dc464f
               | 01647d6...
               | 
               | [1] https://github.com/dotnet/runtime/blob/main/docs/desi
               | gn/core... (mind you, the text was initially written 18
               | years ago, wow)
               | 
               | * through great effort of our industry to take back
               | whatever performance wins each generation brings with
               | even more abstractions that fail to improve our
               | productivity
        
           | variadix wrote:
           | It basically never should unless the inliner made a terrible
           | judgement. Devirtualizing in C++ can remove 3 levels of
           | pointer chasing, all of which could be cache misses. Many
           | optimizations in modern compilers require the context of the
           | function to be inlined to make major optimizations, which
           | requires devirtualization. The only downside is I$ pressure,
           | but this is generally not a problem because hot loops are
           | usually tight.
        
           | bandrami wrote:
           | If it's done badly, the same code that runs N times also gets
           | cached N times because it's in N different locations in
           | memory rather than one location that gets jumped to. Modern
           | compilers and schedulers will eliminate a lot of that (but
           | probably not for anything much smaller than a page), but in
           | general there's always a tradeoff.
        
         | chipdart wrote:
         | > What final enables is devirtualization in certain cases. The
         | main advantage of devirtualization is that it is necessary for
         | inlining.
         | 
         | I think that enabling inlining is just one of the indirect
         | consequences of devirtualization, and perhaps one that is
         | largely irrelevant for performance improvements.
         | 
         | The whole point of devirtualization is eliminating the need to
         | resort to pointer dereferencing when calling virtual members.
          | The main trait of a virtual class is its use of a vtable that
          | requires a pointer dereference to access each and every virtual
          | member.
          | 
          | In classes with larger inheritance chains, you can easily have
          | more than one pointer dereference taking place before you call
          | a virtual member function.
          | 
          | Once a class is final, none of that is required anymore. When a
          | member is called, no dereferencing needs to take place.
         | 
          | Devirtualization helps performance because you are able to
          | benefit from inheritance and not have to pay a performance
          | penalty for it. Without the final keyword, a performance-
          | oriented project would need to be architected to not use
          | inheritance at all, or at the very least not in code on the hot
          | path, because it sneaks gratuitous pointer dereferences all
          | over the place, which requires running extra operations and has
          | a negative impact on caching.
         | 
          | The whole purpose of the final keyword is that compilers can
          | easily eliminate all pointer dereferencing used by virtual
          | members. What otherwise stops them from applying this
          | optimization is that they have no information on whether that
          | class will be inherited from, and whether a derived class will
          | override any of its members or invoke any member function
          | implemented by one of its parent classes.
          | 
          | With the introduction of the final keyword, you are now able to
          | tell the compiler "from here on, this is exactly what you get"
          | and the compiler can trim out anything loose.
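          | 
          | Concretely, what the compiler can and can't do with final looks
          | something like this (illustrative sketch, names made up):
          | 
          |     #include <cstdio>
          | 
          |     struct Animal {
          |       virtual ~Animal() = default;
          |       virtual void speak() const { std::puts("..."); }
          |     };
          | 
          |     struct Cat final : Animal {
          |       void speak() const override { std::puts("meow"); }
          |     };
          | 
          |     void pet(const Cat& c) {
          |       // Cat is final: c's dynamic type is exactly Cat, so this
          |       // may become a direct (and inlinable) call to Cat::speak.
          |       c.speak();
          |     }
          | 
          |     void greet(const Animal& a) {
          |       // Through an Animal reference the call generally stays
          |       // virtual; final on Cat doesn't help here.
          |       a.speak();
          |     }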
        
           | simonask wrote:
           | An extra indirection (indirect call versus direct call) is
           | practically nothing on modern hardware. Branch predictors are
           | insanely good, and this isn't something you generally have to
           | worry about.
           | 
           | Inlining is by far the most impactful optimization here,
           | because it can eliminate the call altogether, and thus
           | specialize the called function to the callsite, lifting
           | constants, hoisting loop variables, etc.
        
             | silvestrov wrote:
             | "is practically nothing on modern hardware" _if the data is
             | already present in the L2 cache._ Random RAM access that
             | stalls execution is expensive.
             | 
             | My guess is this is why he didn't see any speedup: all the
             | code could fit inside the L2 cache, so he did not have to
              | pay for RAM access for the dereference.
             | 
             | The number of different classes is important, not the
             | number of objects as they have the same small number of
             | vtable pointers.
             | 
             | It might be different for large codebases like Chrome and
             | Firefox.
        
               | dblohm7 wrote:
               | Firefox has done a lot of work on devirtualization over
               | the years. There is a cost.
        
             | ot1138 wrote:
             | I had a section of code which incurred ~20 clock cycles to
             | make a function call to a virtual function in a critical
             | loop. That's over and above potential delays resulting from
             | cache misses and the need to place multiple parameters on
             | the stack.
             | 
             | I was going to eliminate polymorphism altogether for this
             | object but later figured out how to refactor so that this
             | particular call could be called once a millisecond. Then if
             | more work was needed, it would dispatch a task to a
             | dedicated CPU.
             | 
              | This was an incredible performance improvement which made a
              | significant difference to my P&L.
        
               | mgaunard wrote:
               | Could just be inefficient spilling caused by ABI
               | requirements due to the inability to inline.
               | 
               | In general if you're manipulating values that fit into
                | registers and work on a platform with a shitty ABI, you
               | need to be very careful of what your function call
               | boundaries look like.
               | 
               | The most obvious example is SIMD programming on Windows
               | x86 32-bit.
        
             | pixelpoet wrote:
             | Vfuncs are only fast when they can be predicted:
             | https://forwardscattering.org/post/28
        
               | mgaunard wrote:
               | Same as any other branch. They're fast if predicted
               | correctly and slow if not.
               | 
               | If they cannot be predicted, write your code accordingly.
        
           | account42 wrote:
           | > Devirtualization helps performance because you are able to
           | benefit from inheritance and not have to pay a performance
           | penalty for that. Without the final keyword, a performance
           | oriented project would need to be architected to not use
           | inheritance at all, or in the very least in code in the hot
           | path, because that sneaks gratuitous pointer dereferences all
           | over the place, which require running extra operations and
           | has a negative impact on caching.
           | 
           |  _virtual_ inheritance. Regular old inheritance does not need
           | or benefit from devirtualization. This is why the CRTP
           | exists.
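            | 
            | i.e. the usual CRTP shape (sketch):
            | 
            |     template <class Derived>
            |     struct ShapeBase {
            |       double area() const {
            |         // resolved statically, no vtable and no indirect call
            |         return static_cast<const Derived&>(*this).area_impl();
            |       }
            |     };
            | 
            |     struct Circle : ShapeBase<Circle> {
            |       double r = 1.0;
            |       double area_impl() const { return 3.14159 * r * r; }
            |     };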
        
             | chipdart wrote:
             | > This is why the CRTP exists.
             | 
             | CRTP does not exist for that. CRTP was one of the many
             | happy accidents in template metaprogramming that happened
             | to be discovered when doing recursive templates.
             | 
             | Also, you've missed the whole point. CRTP is a way to
             | rearchitect your code to avoid dereferencing pointers to
             | virtual members in inheritance. The whole point is that
             | with final you do not need to pull tricks: just tell the
             | compiler that you don't want the class to be inherited, and
             | the compiler picks up from there and does everything for
             | you.
        
               | account42 wrote:
               | If that's your point then it is simply wrong. Final does
               | not allow the compiler to devirtualize calls through a
               | base pointer, it only eliminates the virtualness for
               | calls through pointers to the (final) derived type. The
               | compiler can devirtualize calls through base pointers in
               | others ways (by deducing the possible derived types via
               | whole program optimization or PGO) but final does not
               | help with that.
        
               | chipdart wrote:
               | > If that's your point then it is simply wrong. Final
               | does not allow the compiler to devirtualize calls through
               | a base pointer, it only eliminates the virtualness for
               | calls through pointers to the (final) derived type.
               | 
               | Please read my post. That's not my claim. I think I was
               | very clear.
        
             | scaredginger wrote:
             | Maybe a nitpick, but virtual inheritance is a term used for
             | something else entirely.
             | 
             | What you're talking about is dynamic dispatch
        
           | oasisaimlessly wrote:
           | > In classes with larger inheritance chains, you can easily
           | have more than one pointer dereferencing taking place before
           | you call a virtual members function.
           | 
           | This is not a thing in C++; vtables are flat, not nested.
           | Function pointers are always 1 dereference away.
        
         | bdjsiqoocwk wrote:
          | What's devirtualization in C++?
         | 
         | Funny how things work. From working with Julia I've built a
         | good intuition for guessing when functions would be inlined.
         | And yet, I've never heard the word devirtualization until now.
        
           | saagarjha wrote:
           | In C++ virtual functions are polymorphic and indirected, with
           | the target not known to the compiler. Devirtualization gives
           | the compiler this information (in this case a final method
           | cannot be overridden and branch to something else).
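            | 
            | A minimal example of the final-method case (names made up):
            | 
            |     struct Decoder {
            |       virtual ~Decoder() = default;
            |       virtual int read() { return 0; }
            |     };
            | 
            |     struct FileDecoder : Decoder {
            |       // final: nothing below FileDecoder can override read()
            |       int read() final { return 42; }
            |     };
            | 
            |     int consume(FileDecoder& d) {
            |       // read() is virtual, but final at this level, so the
            |       // call may be devirtualized (and inlined) here.
            |       return d.read();
            |     }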
        
       | andrewla wrote:
       | I'm surprised that it has any impact on performance at all, and
       | I'd love to see the codegen differences between the applications.
       | 
       | Mostly the `final` keyword serves as a compile-time assertion.
       | The compiler (sometimes linker) is perfectly capable of seeing
       | that a class has no derived classes, but what `final` assures is
       | that if you attempt to derive from such a class, you will raise a
       | compile-time error.
       | 
       | This is similar to how `inline` works in practice -- rather than
       | providing a useful hint to the compiler (though the compiler is
       | free to treat it that way) it provides an assertion that if you
       | do non-inlinable operations (e.g. non-tail recursion) then the
       | compiler can flag that.
       | 
       | All of this is to say that `final` can speed up runtimes -- but
       | it does so by forcing you to organize your code such that the
       | guarantees apply. By using `final` classes, in places where
       | dynamic dispatch can be reduced to static dispatch, you force the
       | developer to not introduce patterns that would prevent static
       | dispatch.
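        | 
        | e.g. the assertion side of it (illustrative):
        | 
        |     struct Connection final {
        |       void send();
        |     };
        | 
        |     // error: cannot derive from 'final' base 'Connection'
        |     // struct LoggingConnection : Connection { };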
        
         | bgirard wrote:
         | > The compiler (sometimes linker) is perfectly capable of
         | seeing that a class has no derived classes
         | 
         | How? The compiler doesn't see the full program.
         | 
         | The linker I'm less sure about. If the class isn't guaranteed
         | to be fully private wouldn't an optimizing linker have to be
         | conservative in case you inject a derived class?
        
         | GuB-42 wrote:
         | "inline" is confusing in C++, as it is not really about
         | inlining. Its purpose is to allow multiple definitions of the
         | same function. It is useful when you have a function defined in
         | a header file, because if included in several source files, it
         | will be present in multiple object files, and without "inline"
         | the linker will complain of multiple definitions.
         | 
          | It is also an optimization hint, but AFAIK, modern compilers
          | ignore it.
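          | 
          | The linkage use looks like this (sketch):
          | 
          |     // util.h -- included from several .cpp files
          |     #pragma once
          | 
          |     // Without 'inline', each translation unit that includes
          |     // this header would emit its own external definition and
          |     // the link would fail with a multiple-definition error.
          |     inline double clamp01(double x) {
          |       return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x);
          |     }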
        
           | wredue wrote:
           | I believe the wording I've seen is that compilers may not
           | respect the inline keyword, not that it is ignored.
        
           | fweimer wrote:
           | GCC does not ignore inline for inlining purposes:
           | 
           | Need a way to make inlining heuristics ignore whether a
           | function is inline
           | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008
           | 
           | (Bug saw a few updates recently, that's how I remembered.)
           | 
           | As a workaround, if you need the linkage aspect of the inline
           | keyword, you currently have to write fake templates instead.
           | Not great.
        
           | lqr wrote:
           | 10 years ago it was already folklore that compilers ignore
           | the "inline" keyword when optimizing, but that was false for
           | clang/llvm: https://stackoverflow.com/questions/27042935/are-
           | the-inline-...
        
           | jacoblambda wrote:
           | The thing with `inline` as an optimisation is that it's not
           | about optimising by inlining directly. It's a promise about
           | how you intend to use the function.
           | 
           | It's not just "you can have multiple definitions of the same
           | function" but rather a promise that the function doesn't need
           | to be address/pointer equivalent between translation units.
           | This is arguably more important than inlining directly
           | because it means the compiler can fully deduce how the
           | function may be used without any LTO or other cross
           | translation unit optimisation techniques.
           | 
           | Of course you could still technically expose a pointer to the
           | function outside a TU but doing so would be obvious to the
           | compiler and it can fall back to generating a strictly
           | conformant version of the function. Otherwise however it can
           | potentially deduce that some branches in said function are
           | unreachable and eliminate them or otherwise specialise the
           | code for the specific use cases in that TU. So it potentially
           | opens up alternative optimisations even if there's still a
           | function call and it's not inlined directly.
        
           | ack_complete wrote:
           | > "inline" is confusing in C++, as it is not really about
           | inlining. Its purpose is to allow multiple definitions of the
           | same function.
           | 
           | No, its purpose was and is still to specify a preference for
           | inlining. The C++ standard itself says this:
           | 
           | > The inline specifier indicates to the implementation that
           | inline substitution of the function body at the point of call
           | is to be preferred to the usual function call mechanism.
           | 
           | https://eel.is/c++draft/dcl.inline
        
           | lelanthran wrote:
           | > It is useful when you have a function defined in a header
           | file, because if included in several source files, it will be
           | present in multiple object files, and without "inline" the
           | linker will complain of multiple definitions.
           | 
           | Traditionally you'd use `static` for that use case, wouldn't
           | you?
           | 
           | After all, `inline` can be ignored, `static` can't.
        
             | pjmlp wrote:
             | No, because that would make it internal to each object
             | file, while what you want is for all object files to see
             | the same memory location.
        
               | lelanthran wrote:
               | > No, because that would make it internal to each object
               | file, while what you want is for all object files to see
               | the same memory location.
               | 
               | I can see exactly one use for an effect like that: static
               | variables within the function.
               | 
               | Are there any other uses?
        
               | pjmlp wrote:
               | Global variables and the magic of a build system based on
               | C semantics.
        
         | wheybags wrote:
         | What if I dlopen a shared object that contains a derived class,
         | then instantiate it. You cannot statically verify that I won't.
         | Or you could swap out a normally linked shared object for one
         | that creates a subclass. Etc etc. This kind of stuff is why I
         | think shared object boundaries should be limited to the lowest
         | common denominator (basically c abi). Dynamic linking high
         | level languages was a mistake. The only winning move is not to
         | play.
        
         | lanza wrote:
         | > Mostly the `final` keyword serves as a compile-time
         | assertion. The compiler (sometimes linker) is perfectly capable
         | of seeing that a class has no derived classes
         | 
         | That's incorrect. The optimizer has to assume everything
         | escapes the current optimization unit unless explicitly told
         | otherwise. It needs explicit guarantees about the visibility to
         | figure out the extent of the derivations allowed.
        
         | sixthDot wrote:
         | > I'd love to see the codegen differences between the
         | applications
         | 
         | There are two applications, dynamic calls and dynamic casts.
         | 
          | Dynamic casts to final classes don't require checking the whole
          | inheritance chain. I recently did this in styx [0]. The gain may
          | appear marginal, e.g. 3 or 4 dereferences saved, but in programs
          | based on OOP that saving can easily apply to *billions* of
          | dynamic casts.
         | 
         | [0]: https://gitlab.com/styx-
         | lang/styx/-/commit/62c48e004d5485d4f....
        
       | bluGill wrote:
        | I use final more for communication: don't look for deeper derived
        | classes, as there are none. That it results in slower code is an
        | annoying surprise.
        
       | p0w3n3d wrote:
        | I would say the biggest performance impact would come from
        | `constexpr`, followed by `const`. I wouldn't bet any money on
        | `final`, which in C++ is a guard against inheritance; a C++
        | virtual function's invocation address is resolved through the
        | `vtable`, hence `final` wouldn't change anything. Maybe the author
        | was confusing it with the `final` keyword in Java.
        
         | adrianN wrote:
         | In my experience the compiler is pretty good at figuring out
         | what is constant so adding const is more documentation for
         | humans, especially in C++, where const is more of a hint than a
         | hard boundary. Devirtualization, as can happen when you add a
         | final, or the optimizations enabled by adding a restrict to a
         | pointer, are on the other hand often essential for performance
         | in hot code.
        
           | bayindirh wrote:
           | Since "const" makes things read-only, being const correct
           | makes sure that you don't do funny things with the data you
           | shouldn't mutate, which in turn eliminates tons of data bugs
           | out of the gate.
           | 
           | So, it's an opt-in security feature first, and a compiler
           | hint second.
        
             | Lockal wrote:
                | How does const affect code generation in C/C++? Last time
             | I checked, const was purely informational. Compilers can't
             | eliminate reads for const pointer data, because const_cast
             | exists. Compilers can't eliminate double calls to const
             | methods, because inside function definition such functions
             | can still legally modify mutable variables (and have many
             | side effects).
             | 
             | What actually may help is __attribute__((pure)) and
             | __attribute__((const)), but I don't see them often in real
             | code (unfortunately).
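                | 
                | e.g. (GCC/Clang spelling, hypothetical functions):
                | 
                |     // 'const': result depends only on the arguments.
                |     __attribute__((const)) int square(int x) {
                |       return x * x;
                |     }
                | 
                |     // 'pure': may read memory but has no side effects.
                |     static const int table[256] = {1, 2, 3};
                |     __attribute__((pure)) int lookup(int i) {
                |       return table[i & 0xff];
                |     }
                | 
                |     int f(int x) {
                |       // the compiler is allowed to evaluate each call
                |       // once and reuse the result
                |       return square(x) + square(x)
                |            + lookup(x) + lookup(x);
                |     }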
        
               | account42 wrote:
               | Const affects code generation when used on _variables_.
               | If you have a `const int i` then the compiler can assume
               | that i never changes.
               | 
               | But you're right that this does not hold true for const
               | pointers or references.
               | 
               | > What actually may help is __attribute__((pure)) and
               | __attribute__((const)), but I don't see them often in
               | real code (unfortunately).
               | 
                | It's disappointing that these haven't been standardized.
               | I'd prefer different semantics though, e.g. something
               | that allows things like memoization or other forms of
               | caching that are technically side effects but where you
               | still are ok with allowing the compiler to remove /
               | reorder / eliminate calls.
        
           | lelanthran wrote:
           | > In my experience the compiler is pretty good at figuring
           | out what is constant so adding const is more documentation
           | for humans,
           | 
           | In the same TU, sure. But across TU boundaries the compiler
           | really can't figure out what should be const and what should
           | not, so `const` in parameter or return values allows the
           | compiler to tell the human _" You are attempting to make a
           | modification to a value that some other TU put into RO
           | memory."_, or issue similar diagnostics.
        
         | account42 wrote:
         | > followed by `const`
         | 
         | Const can only ever possibly have a performance impact when
         | used directly on variables. const pointers / references are
         | purely for the benefit of the programmer - the compiler can
         | assume nothing because the variable could be modified elsewhere
         | or through another pointer/reference and const_cast is legal
         | anyway unless the original variable was const.
        
       | ein0p wrote:
       | You should use final to express design intent. In fact I'd rather
       | it were the default in C++, and there was some sort of an
       | opposite ('derivable'?) keyword instead, but that ship has sailed
        | a long time ago. Any measurable negative perf impact should be
       | filed as a bug and fixed.
        
         | cesarb wrote:
         | > In fact I'd rather it were the default in C++, and there was
         | some sort of an opposite ('derivable'?) keyword instead
         | 
         | Kotlin (which uses the equivalent of the Java "final" keyword
         | by default) uses the "open" keyword for that purpose.
        
         | josefx wrote:
         | Intent is nice and all that, but I would like a
         | "nonwithstanding" keyword instead that just lets me bypass that
         | kind of "intent" without having to copy paste the entire
         | implementation just to remove a pointless keyword or make a
         | destructor public when I need it.
        
         | jbverschoor wrote:
         | In general, I think things should be strict by default. Way
         | easier to optimize and less error prone.
        
         | leni536 wrote:
          | C++ doesn't have the fragile base class problem, as members
          | aren't virtual by default. The only concern with unintended
          | inheritance is polymorphic deletion. "final" on a class
          | definition disables some tricks that you can do with private
          | inheritance.
         | 
         | Having said that "final" on member functions is great, and I
         | like to see that instead of "override".
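          | 
          | i.e. (sketch):
          | 
          |     struct Base {
          |       virtual ~Base() = default;
          |       virtual void run() {}
          |     };
          | 
          |     struct Worker : Base {
          |       // overrides Base::run and says "this is the last word":
          |       // further derived classes may exist, but none of them
          |       // can override run() again.
          |       void run() final {}
          |     };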
        
           | pjmlp wrote:
            | All OOP languages have it, the issue is related to changing
            | the behaviour of the base class, and the change introducing
            | unforeseen consequences on the inheritance tree.
            | 
            | Changing how an existing method is called (regular, virtual,
            | static), changing visibility, overloading, introducing a name
            | that clashes downstream, introducing a virtual destructor,
            | making a data member non-copyable,...
        
             | leni536 wrote:
              | > All OOP languages have it, the issue is related to
              | changing the behaviour of the base class, and the change
              | introducing unforeseen consequences on the inheritance tree.
             | 
             | C++ largely solves it by having tight encapsulation. As
             | long as you don't change anything that breaks your existing
             | interface, you should be good. And your interface is opt-
             | in, including public members and virtual functions.
        
               | pjmlp wrote:
               | Not when you change the contents of the class itself for
               | public and protected inheritance members, which is
               | exactly the whole issue of fragile base class.
               | 
               | It doesn't go away just because private members exist as
               | possible language feature.
        
               | leni536 wrote:
               | That's not a fragile base, that's just a fragile class.
               | You can break APIs for all kinds of users, including
               | derived classes.
               | 
               | Some APIs are aimed towards derived classes, like
               | protected members and virtual functions, but that doesn't
               | make the issue fundamentally different. It's just
               | breaking APIs.
               | 
               | Point is, in C++ you have to opt-in to make these API
               | surfaces, they are not the default.
        
               | pjmlp wrote:
               | I give up, word games to avoid acknowledging the same
               | happens.
        
           | jstimpfle wrote:
           | Now try a regular function, you will be blown away. No need
           | to type "final"...
        
       | jey wrote:
       | I wonder if LTO was turned on when using Clang? Might lead to a
       | performance improvement.
        
       | pineapple_sauce wrote:
        | What should be evaluated is removing indirection and tightly
        | packing your data. I'm sure you'd gain a bigger performance
        | improvement that way. Virtual calls and shared_ptr are littered
        | throughout the codebase.
       | 
        | That way you can avoid the need for the `final` keyword and still
        | get the optimization the keyword enables (devirtualized calls).
       | 
       | >Yes, it is very hacky and I am disgusted by this myself. I would
       | never do this in an actual product
       | 
       | Why? What's with the C++ community and their disgust for macros
       | without any underlying reasoning? It reminds me of everyone
       | blindly saying "Don't use goto; it creates spaghetti code".
       | 
        | Sure, if macros are overused, code can become hard to read and
        | maintain. But for something simple like this, you shouldn't be
        | thinking "I would never do this in an actual product".
        
         | sfink wrote:
         | Macros that are giving you some value can be ok. In this case,
         | once the performance conclusion is reached, the only reason to
         | continue using a macro is if you really need the `final`ity to
         | vary between builds. Otherwise, just delete it or use the
         | actual keyword.
         | 
         | (But I'm worse than the author; if I'm just comparing
         | performance, I'd probably put `final` everywhere applicable and
         | then do separate compiles with `-Dfinal=` and
         | `-Dfinal=final`... I'd be making the assumption that it's
         | something I either always or never want eventually, though.)
        
         | bluGill wrote:
          | Macros in C are a text replacement, so it is hard to see from a
          | debugger how the code got to be like that.
        
           | pineapple_sauce wrote:
           | Yes, I'm well aware of the definition of a macro in C and
           | C++. Macros are simpler than templates. You can expand them
           | with a compiler flag.
        
             | bluGill wrote:
              | When things get complex, template error messages are easier
              | to follow. Nobody makes complex macros, but if you tried
              | you'd see why: template error messages are legendary for a
              | reason, and nested macros are worse.
        
               | account42 wrote:
               | > nobody makes complex macros
               | 
               | http://boost.org/libs/preprocessor
        
         | jandrewrogers wrote:
          | In modern C++, macros are viewed as a code smell because they
          | are strictly worse than alternatives in almost all situations.
         | It is a cultural norm; it is a bit like using "unsafe" in Rust
         | if not strictly required for some trivial case. The C++
         | language has made a concerted effort to eliminate virtually all
         | use cases for macros since C++11 and replace them with type-
         | safe first-class features in the language. It is a bit of a
         | legacy thing at this point, there are large modern C++
         | codebases with no macros at all, not even for things like
         | logging. While macros aren't going away, especially in older
         | code, the cultural norm in modern C++ has tended toward macros
         | being a legacy foot-gun and best avoided if at all possible.
         | 
         | The main remaining use case for the old C macro facility I
         | still see in new code is to support conditional compilation of
         | architecture-specific code e.g. ARM vs x86 assembly routines or
         | intrinsics.
        
           | sgerenser wrote:
           | But how would one conditionally enable or disable the "final"
           | keyword on class members without a preprocessor macro, even
           | in C++23?
        
             | jandrewrogers wrote:
             | Macros are still useful for conditional compilation, as in
             | this case. They've been sunsetted for anything that looks
             | like code generation, which this isn't. I was more
             | commenting on the reflexive "ick" reaction of the author to
             | the use of macros (even when appropriate) because avoiding
             | them has become so engrained in C++ culture. I'm a macro
             | minimalist but I would use them here.
             | 
             | Many people have a similar reaction to the use of "goto",
             | even though it is absolutely the right choice in some
             | contexts.
        
       | gpderetta wrote:
        | 1% is nothing to scoff at. But I suspect that the variability of
       | compilation (specifically quirks of instruction selection,
       | register allocation and function alignment) more than mask any
       | gains.
       | 
       | The clang regression might be explainable by final allowing some
        | additional inlining and clang making a hash of it.
        
       | jcalvinowens wrote:
       | That's interesting. Maybe final enabled more inlining, and clang
       | is being too aggressive about it for the icache sizes in play
       | here? I'd love to see a comparison of the generated code.
       | 
       | I'm disappointed the author's conclusion is "don't use final",
       | not "something is wrong with clang".
        
         | ot wrote:
         | Or "something is wrong with my benchmark setup", which is also
         | a possibility :)
         | 
         | Without a comparison of generated code, it could be anything.
        
       | indigoabstract wrote:
       | If it does have a noticeable impact, that would be surprising, a
       | bit like going back to the days when 'inline' was supposed to
       | tell the compiler to inline the designated functions (no longer
       | its main use case nowadays).
        
       | sfink wrote:
       | tldr: sprinkled a keyword around in the hopes that it "does
       | something" to speed things up, tested it, got noisy results but
       | no miraculous speedup.
       | 
       | I started skimming this article after a while, because it seemed
       | to be going into the weeds of performance comparison without ever
       | backing up to look at what the change might be doing. Which meant
       | that I couldn't tell if I was going to be looking at the usual
       | random noise of performance testing or something real.
       | 
       | For `final`, I'd want to at least see if it's changing the
       | generated code by replacing indirect vtable calls with direct or
       | inlined calls. It might be that the compiler is already figuring
       | it out and the keyword isn't doing anything. It might be that the
       | compiler _is_ changing code, but the target address was already
       | well-predicted and it's perturbing code layout enough that it
       | gets slower (or faster). There could be something interesting
       | here, but I can't tell without at least a little assembly output
       | (or perhaps a relevant portion of some intermediate
       | representation, not that I would know which one to look at).
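       | 
       | (For the curious, the effect being hoped for is roughly the
       | following -- a sketch with made-up names, not the article's
       | code. With `final` on the derived class, a call through a
       | reference to that class can skip the vtable entirely:)
       | 
       |     struct Shape {
       |         virtual ~Shape() = default;
       |         virtual double area() const = 0;
       |     };
       | 
       |     struct Square final : Shape {
       |         double s = 1.0;
       |         double area() const override { return s * s; }
       |     };
       | 
       |     // Because Square is final, sq can't refer to anything that
       |     // overrides area(), so the call can be made direct (and
       |     // inlined) instead of indirect through the vtable.
       |     double f(const Square& sq) { return sq.area(); }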
       | 
       | If it's not changing anything, then perhaps there could be an
       | interesting investigation into the variance of performance
       | testing in this scenario. If it's changing something, then there
       | could be an interesting investigation into when that makes things
       | faster vs slower. As it is, I can't tell what I should be looking
       | for.
        
         | sgerenser wrote:
         | This is what I was waiting for too. Especially with the large
         | regression on Clang/Ubuntu. Maybe he uncovered a Clang/LLVM
         | codegen bug, but you'd need to compare the generated assembly
         | to know.
        
         | akoboldfrying wrote:
         | >changing the generated code by replacing indirect vtable calls
         | with direct or inlined calls
         | 
         | It can't possibly be doing this, if the raytracing code is like
         | any other raytracer I've ever seen -- since it must be looping
         | through a list of concrete objects that implement some shared
         | interface, calling intersectRay() on each one, and the
         | existence of those derived concrete object types means that
         | that shared interface _can't_ be made final. That's the only
         | thing that would enable devirtualisation -- it makes no
         | difference whether the concrete derived types themselves are
         | final or not.
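         | 
         | (A sketch of what I mean, with made-up names: every call in
         | the hot loop below is made through a Hittable reference, so
         | it stays an indirect vtable call no matter how the concrete
         | types are marked:)
         | 
         |     #include <memory>
         |     #include <vector>
         | 
         |     struct Ray {};
         |     struct Hittable {
         |         virtual ~Hittable() = default;
         |         virtual bool intersectRay(const Ray&) const = 0;
         |     };
         |     struct Sphere final : Hittable {
         |         bool intersectRay(const Ray&) const override
         |         { return true; }
         |     };
         | 
         |     int hits(const std::vector<std::unique_ptr<Hittable>>& s,
         |              const Ray& r) {
         |         int n = 0;
         |         for (const auto& obj : s)
         |             n += obj->intersectRay(r);  // still virtual
         |         return n;
         |     }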
        
         | drivebycomment wrote:
         | +1. On modern hardware and software systems, performance is
         | effectively stochastic to some degree, as small random
         | perturbations to the input (code, data, environments, etc) can
         | have arbitrary effects on performance. This means you
         | can't draw a direct causal chain / mechanism from what you
         | changed to the performance change - when it matters, you do
         | need to do a deeper analysis and investigation to find the
         | actual and full causal chain. I.e. correlation is not
         | causation, especially so on modern hardware and software
         | systems.
        
       | jeffbee wrote:
       | It's difficult to discuss this stuff because the impact can be
       | negligible or negative for one person, but large and consistently
       | positive for another. You can only usefully discuss it on a given
       | baseline, and for something like final I would hope that baseline
       | would be a project that already enjoys PGO, LTO, and BOLT.
        
       | tombert wrote:
       | I don't do much C++, but I have definitely found that engineers
       | will just assert that something is "faster" without any evidence
       | to back that up.
       | 
       | Quick example, I got in an argument with someone a few years ago
       | that claimed in C# that a `switch` was better than an `if(x==1)
       | elseif(x==2)...` because switch was "faster" and rejected my PR.
       | I mentioned that that doesn't appear to be true, we went back and
       | forth until I did a compile-then-decompile of a minimal test with
       | equality-based-ifs, and showed that the compiler actually
       | converts equality-based-ifs to `switch` behind the scenes. The
       | guy accepted my PR after that.
       | 
       | But there's tons of stuff like this in CS, and I kind of
       | blame professors for a lot of it [1]. A large part of becoming a
       | decent engineer [2] for me was learning to stop trusting what
       | professors taught me in college. Most of what they said was fine,
       | but you can't _assume_ that; what they tell you could be out of
       | date, or simply never correct to begin with, and as far as I can
       | tell you have to _always_ test these things.
       | 
       | It doesn't help that a lot of these "it's faster" arguments are
       | often reductive because they're only faster in extremely minimal
       | tests. Sometimes a microbenchmark will show that something is
       | faster, and there's value in that, but I think it's important to
       | note that it can also be a small percentage of the total program;
       | compilers are obscenely good at optimizing nowadays, it can be
       | difficult to determine _when_ something will be optimized, and
       | your assertion that something is  "faster" might not actually be
       | true in a non-trivial program.
       | 
       | This is why I don't really like doing any kind of major
       | optimizations before the program actually works. I try to keep
       | the program in a reasonable Big-O and I try and minimize network
       | calls cuz of latency, but I don't bother with any kind of micro-
       | optimizations in the first draft. I don't mess with bitwise, I
       | don't concern myself on which version of a particular data
       | structure is a millisecond faster, I don't focus too much on
       | whether I can get away with a smaller sized float, etc. Once I
       | know that the program is correct, _then_ I benchmark to see if
       | any kind of micro-optimizations will actually matter, and often
       | they really don't.
       | 
       | [1] That includes me up to about a year ago.
       | 
       | [2] At least I like to pretend I am.
        
         | BurningFrog wrote:
         | Even if one of these constructs is faster _it doesn't matter_
         | 99% of the time.
         | 
         | Writing well structured readable code is typically far more
         | important than making it twice as fast. And those times can
         | rarely be predicted beforehand, so you should mostly not worry
         | about it until you see real performance problems.
        
           | tombert wrote:
           | I mostly focus on "using stuff that won't break", and yeah
           | "if it actually matters".
           | 
           | For example, much to the annoyance of a lot of people, I
           | don't typically use floating point numbers when I start out.
           | I will use the "decimal" or "money" types of the language, or
           | GMP if I'm using C. When I do that, I can be sure that I
           | won't have to worry about any kind of funky overflow issues
           | or bizarre rounding problems. There _might_ be a performance
           | overhead associated with it, but then I have to ask myself
           | "how often is this actually called?"
           | 
           | If the answer is "a billion times" or "once in every
           | iteration of the event loop" or something, then I will
           | probably eventually go back and figure out if I can use a
           | float or convert it to an integer-based thing, but in a lot
           | of cases the answer is "like ten or twenty times", and at
           | that point I'm not even 100% sure it would be even measurable
           | to change to the "faster" implementations.
           | 
           | What annoys me is that people will act like they really care
           | about speed, do all these annoying micro-optimizations, and
           | then forget that pretty much all of them get wiped out
           | immediately upon hitting the network, since the latency
           | associated with that is obscene.
        
           | apantel wrote:
           | The counter-argument to this is if you are building something
           | that is in the critical path of an application (for example,
           | parsing HTTP in a web server), you need to be performance-
           | minded from the beginning because design decisions lead to
           | further design decisions. If you are building something in the
           | critical path of the application, the best thing to do is
           | build it from the ground up measuring the performance of what
           | you have as you go. This way, each time you add something you
           | will see the performance impact and usually there's a more
           | performant way of doing something that isn't more obscure. If
           | you do this as you build, early choices become constraints,
           | but because you chose the most performant thing at every
           | stage, the whole process takes you in the direction of a
           | highly-performant implementation.
           | 
           | Why should you care about performance?
           | 
           | I can give you my personal experience: I've been working on a
           | Java web/application server for the past 15 years and a
           | typical request (only reading, not writing to the db) would
           | take maybe 4-5 ms to execute. That includes HTTP request
           | parsing, JSON parsing, session validation, method execution,
           | JSON serialization, and HTTP response dispatch. Over the past
           | 9 months I have refactored the entire application for
           | performance and a typical request now takes about 0.25 ms or
           | 250 microseconds. The computer is doing so much less work to
           | accomplish the same tasks, it's almost silly how much work it
           | was doing before. And the result is the machine can handle
           | 20x more requests in the same amount of time. If it could
           | handle 200 requests per second per core before, now it can
           | handle 4000. That means the need to scale is felt 20x less
           | intensely, which means less complexity around scaling.
           | 
           | High performance means reduced scaling requirements.
        
             | tombert wrote:
             | But even that sort of depends, right? Hardware is often
             | pretty cheap in comparison to dev-time. It really depends on
             | the project, what kind of servers you're using, the nature
             | of the application etc, but I think a lot of the time it
             | might be cheaper to just pay for 20x the servers than it
             | would be to pay a human to go find a critical path.
             | 
             | I'm not saying you completely throw caution to the wind,
             | I'm just saying that there's a finite amount of human
             | resources and it can really vary how you want to allocate
             | them. Sometimes the better path is to just throw money at
             | the problem.
             | 
             | It really depends.
        
               | apantel wrote:
               | I think it depends on what you're building and who's
               | building it. We're all benefitting from the fact that the
               | designers of NGINX made performance a priority. We like
               | using things that were designed to be performant. We like
               | high-FPS games. We like fast internet.
               | 
               | I personally don't like the idea of throwing compute at a
               | slow solution. I like when the extra effort has been put
               | into something. The good feeling I get from interacting
               | with something that is optimal or excellent is an end in
               | itself and one of the things I live for.
        
               | tombert wrote:
               | Sure, though I've mentioned a few times in this thread
               | now that the thing that bothers me more than CPU
               | optimizations is not taking into account latency,
               | particularly when hitting the network, and I think
               | focusing on that will generally pay higher dividends than
               | trying to optimize for processing.
               | 
               | CPUs are ridiculously fast now, and compilers are really
               | really good now too. I'm not going to say that processing
               | speed is a "solved" problem, but I am going to say that
               | in a lot of performance-related cases the CPU processing
               | is probably not your problem. I will admit that this kind
               | of pokes holes in my previous response, because
               | introducing more machines into the mix will almost
               | certainly increase latency, but I think it more or less
               | holds depending on context.
               | 
               | But I think it really is a matter of nuance, which you
               | hinted at. If I'm making an admin screen that's going to
               | have like a dozen users max, then a slow, crappy solution
               | is probably fine; the requests will be served fast enough
               | to where no one will notice anyway, and you can probably
               | even get away with the cheapest machine/VM. If I'm making
               | an FPS game that has 100,000 concurrent users, then it
               | almost certainly will be beneficial to squeeze out as
               | much performance out of the machine as possible, both CPU
               | _and_ latency-wise.
               | 
               | But as I keep repeating everywhere, you have to measure.
               | You cannot assume that your intuition is going to be
               | right, particularly at-scale.
        
               | apantel wrote:
               | I absolutely agree that latency is the real thing to
               | optimize for. In my case, I only leave the application to
               | access the db, and my applications tend not to be write-
               | heavy. So in my case latency-per-request == how much work
               | the computer has to do, which is constrained to one core
               | because the overhead of parallelizing any part of the
               | pipeline is greater than the work required. See, in that
               | sense, we're already close to the performance ceiling for
               | per-request processing because clock speeds aren't going
               | up. You can't make the processing of a given request
               | faster by throwing more hardware at it. You can only make
               | it faster by creating less work for the hardware to do.
               | 
               | (Ironically, HN is buckling under load right now, or some
               | other issue.)
        
               | oivey wrote:
               | It almost certainly would require more than 20x servers
               | because setting up horizontal scaling will have some sort
               | of overhead. Not only that, there is the significant
               | engineering effort to develop and maintain the code to
               | scale.
               | 
               | If your problem can fit on one server, it can massively
               | reduce engineering and infrastructure costs.
        
             | neonsunset wrote:
             | Please accept a high five from a fellow "it does so little
             | work it must have sub-millisecond request latency"
             | aficionado (though I must admit I'm guilty of abusing
             | memory caches to achieve this).
        
               | apantel wrote:
               | Caches, precomputed values, lookup tables -- it's all
               | good as long as it's well-organized and maintainable.
        
           | neonsunset wrote:
           | This attitude is part of the problem. Another part of the
           | problem is having no idea which things actually end up
           | costing performance and how much.
           | 
           | It is why many language ecosystems suffered from performance
           | issues for a really long time even if completely unwarranted.
           | 
           | Is changing ifs to switch or vice versa, as outlined in the
           | post above, a waste of time? Yes, unless you are writing some
           | encoding algorithm or a parser, it will not matter. The
           | compiler will lower trivial statements to the same codegen
           | and it will not impact the resulting performance anyway,
           | even if there was a difference, given the problem the code
           | was solving.
           | 
           | However, there are things that _do_ cost, like interface spam,
           | abusing lambdas to write needlessly complex workflow-style
           | patterns (which are also less readable and worse in 8 out of
           | 10 instances), not caching objects that always have the same
           | value, etc.
           | 
           | These kinds of issues, for example, plagued .NET ecosystem
           | until more recent culture shift where it started to be cool
           | once again to focus on performance. It wasn't being helped by
           | the notion of "well-structured code" being just idiotic
           | "clean architecture" and "GoF patterns" style dogma applied
           | to smallest applications and simplest of business domains.
           | 
           | (it is also the reason why picking slow languages in general
           | is a really bad idea - _everything_ costs more and you have
           | way less leeway for no productivity win - Ruby and Python,
           | and JS with Node.js are less productive to write in than
           | C#/F#, Kotlin/Java or Go (under some conditions))
        
             | tombert wrote:
             | I mean, that's kind of why I tried to emphasize measuring
             | things yourself instead of depending on tribal knowledge.
             | 
             | There are plenty of cases where even the "slow"
             | implementation is more than fast enough, and there are also
             | plenty of cases where the "correct" solution (from a big-O
             | or intuition perspective) is actually slower than the dumb
             | case. Intuition _helps_, but you _have_ to measure and/or
             | look at the compiled results if you want to ensure correct
             | numbers.
             | 
             | An example that really annoys me is how every whiteboard
             | interview ends up being "interesting ways to use a
             | hashmap", which isn't inherently an issue, but they will
             | usually be so small-scoped that an iterative "array of
             | pairs" might actually be cheaper than paying the up-front
             | cost of hashing and potentially dealing with collisions.
             | Interviews almost always ignore constant factors, and
             | that's fair enough, but in reality constant factors _can_
             | matter, and we're training future employees to ignore
             | that.
             | 
             | I'll say it again: as far as I can tell, you _have_ to
             | measure if you want to know if your result is "faster".
             | "Measuring" might involve memory profilers, or dumb timers,
             | or a mixture of both. Gut instincts are often wrong.
        
         | leetcrew wrote:
         | agreed, especially in cases like this. final is primarily a way
         | to prohibit overriding methods and extending classes, and it
         | indicates to the reader that they should not be doing this. use
         | it when it makes conceptual sense.
         | 
         | that said, c++ is usually a language you use when you care
         | about performance, at least to an extent. it's worth
         | understanding features like nrvo and rewriting functions to
         | allow the compiler to pick the optimization if it doesn't hurt
         | readability too much.
        
         | wvenable wrote:
         | In my opinion, the only things that really matter are
         | algorithmic complexity and readability. And even algorithmic
         | complexity is usually only an issue at certain scales. Whether
         | or not an 'if' is faster than a 'switch' is the micro of micro
         | optimizations -- you'd better have a good reason to care. The
         | question I would have for you is whether your bunch of ifs was
         | more readable than a switch would be.
        
           | doctor_phil wrote:
           | But a switch and an if-else *is* a matter of algorithmic
           | complexity. (Well, at least could be for a naive compiler). A
           | switch could be converted to a constant time jump, but the
           | if-else would be trying each case linearly.
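           | 
           | (The textbook shape is something like the sketch below, with
           | made-up functions; for dense case values compilers often
           | lower the switch to an indirect jump through a table rather
           | than a chain of compares:)
           | 
           |     void a(); void b(); void c(); void d();
           | 
           |     void dispatch(int k) {
           |         switch (k) {
           |         case 0: a(); break;
           |         case 1: b(); break;
           |         case 2: c(); break;
           |         case 3: d(); break;
           |         }
           |     }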
        
             | cogman10 wrote:
             | Yup.
             | 
             | That said, the linear test is often faster due to CPU
             | caches, which is why JITs will often convert switches to
             | if/elses.
             | 
             | IMO, switch is clearer in general and potentially faster
             | (at the very least the same speed), so it should be preferred
             | when dealing with 3+ if/elseif statements.
        
               | tombert wrote:
               | Hard disagree that it's "clearer". I have had to deal
               | with a ton of bugs with people trying to be clever with
               | the `break` logic, or forgetting to put `break` in there
               | at all.
               | 
               | if statements are dumber, and maybe arguably uglier, but
               | I feel like they're also more clear, and people don't try
               | and be clever with them.
        
               | cogman10 wrote:
               | Updates to languages (don't know where C# is on this)
               | have different types of switch statements that eliminate
               | the `break` problem.
               | 
                | For example, with java there's an enhanced switch that
                | looks like this:
                | 
                |     var val = switch (foo) {
                |         case 1, 2, 3 -> bar;
                |         case 4 -> baz;
                |         default -> {
                |             yield bat();
                |         }
                |     };
               | 
               | The C style switch break stuff is definitely a language
               | mistake.
        
               | wvenable wrote:
               | C# has both switch expressions like this, and break
               | statements are not optional in traditional switch
               | statements, so it actually solves both problems. You can't
               | get too clever with switch statements in C#.
               | 
               | However most languages have pretty permissive switch
               | statements just like C.
        
               | tombert wrote:
               | Yeah, fair, it's been a while since I've done any C#, so
               | my memory is a bit hazy on the details. I've been
               | burned by C switch statements, so I have a pretty strong
               | distaste for them.
        
               | smaudet wrote:
               | I think using C as your language with which to judge
               | language constructs is hardly fair - one of its main
               | strengths has been as a fairly stable, unchanging code-
               | to-compiler contract, i.e. little to no syntax change
               | or improvement.
               | 
               | So no offense, but I would revisit the wider world of
               | language constructs before claiming that switch
               | statements are "all bad". There are plenty of bad
               | languages or languages with poor implementations of
               | syntax, that do not make the fundamental language
               | construct bad.
        
               | neonsunset wrote:
               | C# has switch statements which are C/C++ style switches
               | and switch expressions which are like Rust's match except
               | no control flow statements inside:                   var
               | len = slice switch         {             null => 0,
               | "Hello" or "World" => 1,             ['@', ..var tags] =>
               | tags.Length,             ['{', ..var body, '}'] =>
               | body.Length,             _ => slice.Length,         };
               | 
               | (it supports a lot more patterns but that wouldn't fit)
        
               | gloryjulio wrote:
               | This is just forcing a return value. You either have to
               | break or return at the branches. To me they all look
               | equivalent.
        
               | SAI_Peregrinus wrote:
               | I always set -Werror=implicit-fallthrough, among others.
               | That prevents fallthrough unless explicitly annotated.
               | Sadly these will forever remain optional warnings
               | requiring specific compiler flags, since requiring them
               | could break compiling broken legacy code.
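               | 
               | A sketch of what that looks like in C++17 (helper names
               | made up); the attribute marks the fallthrough as
               | intentional so -Wimplicit-fallthrough stays quiet:
               | 
               |     void start();
               |     void run();
               | 
               |     void step(int state) {
               |         switch (state) {
               |         case 0:
               |             start();
               |             [[fallthrough]];  // explicit, no warning
               |         case 1:
               |             run();
               |             break;
               |         default:
               |             break;
               |         }
               |     }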
        
               | neonsunset wrote:
               | Any sufficiently advanced compiler will rewrite those
               | arbitrarily depending on its heuristics. What authors
               | usually forget is that there is defined behavior and
               | specification which the compiler abides by, but it is
               | otherwise free to produce any codegen that preserves the
               | defined program order. Branch reordering, generating jump
               | tables, optimizing away or coalescing checks into
               | branchless forms are all very common. When someone says
               | "oh I write C because it lets you tell the CPU exactly how
               | to execute the code", it is simply a sign that the person
               | never actually looked at disassembly and has little to no
               | idea how the tool they use works.
        
               | cogman10 wrote:
               | A compiler will definitely try this, but it's important
               | to note that if/else blocks tell the compiler that "you
               | will run these evaluations in order". Now, if the
               | compiler can detect that the evaluations have no side
               | effects (which, in this simple example with just integer
               | checks, is fairly likely) then yeah I can see a jump
               | table getting shoved in as an optimization.
               | 
               | However, the moment you add a side effect or something
               | more complicated like a method call, it becomes really
               | hard for the compiler to know if that sort of
               | optimization is safe to do.
               | 
               | The benefit of the switch statement is that it's already
               | well positioned for the compiler to optimize as it does
               | not have the "you must run these evaluations in order"
               | requirement. It forces you to write code that is fairly
               | compiler friendly.
               | 
               | All that said, probably a waste of time debating :D.
               | Ideally you have profiled your code and the profiler has
               | told you "this is the slow block" before you get to the
               | point of worrying about how to make it faster.
        
               | tombert wrote:
               | I agree with what you said but in this particular case,
               | it actually was a direct integer equality check, there
               | was zero risk of hitting side effects and that was
               | plainly obvious to me, the checker, and compiler.
        
               | cogman10 wrote:
               | And to your original comment, I think the reviewer was
               | wrong to reject the PR over that. Performance has to be
               | measured before you can use it to reject (or create...) a
               | PR. If someone hasn't done that then unless it's
               | something obvious like "You are making a ton of tiny heap
               | allocations in a tight loop" then I think nitpicking
               | these sorts of things is just wrong.
        
             | saurik wrote:
             | While I personally find the if statements harder to
             | immediately mentally parse/grok--as I have to prove to
             | myself that they are all using the same variable and are
             | all chained correctly in a way that is visually obvious for
             | the switch statement--I don't find "but what if we use a
             | naive compiler" at all a useful argument to make as, well,
             | we aren't using a naive compiler, and, if we were, there
             | are a ton of other things we are going to be sad about the
             | performance of leading us down a path of re-implementing a
             | number of other optimizations. The goal of the compiler is
             | to shift computational complexity from runtime to compile
             | time, and figuring out whether the switch table or the
             | comparisons are the right approach seems like a legitimate
             | use case (which maybe we have to sometimes disable, but
             | probably only very rarely).
        
               | smaudet wrote:
               | Per my sibling comment, I think the argument is not about
               | speed, but simplicity.
               | 
               | Awkward switch syntax aside, the switch is simpler to
               | reason about. Fundamentally we should strive to keep our
               | code simple to understand and verify, not worry about
               | compiler optimizations (on the first pass).
        
               | saurik wrote:
               | Right, and there I would say we even agree, per my first
               | sentence; however, I wanted to reply not to you, but to
               | doctor_phil, who was explicitly disagreeing about speed.
        
             | bregma wrote:
             | But what if, and stick with me here, a compiler is capable
             | of reading and processing your code and through simple
             | scalar evolution of the conditionals and phi-reduction, it
             | can't tell the difference between a switch statement and a
             | sequence of if statements by the time it finishes its
             | single static analysis phase?
             | 
             | It turns out the algorithmic complexity of a switch
             | statement and the equivalent series of if-statements is
             | identical. The bijective mapping between them is close to
             | the identity function. Does a naive compiler exist that
             | doesn't emit the same instructions for both, at least
             | outside of toy hobby project compilers written by amateurs
             | with no experience?
        
               | smaudet wrote:
               | The issue with if statements (for compiled languages) is
               | not one of "speed" but of correctness.
               | 
               | If statements are unbounded, unconstrained logic
               | constructs, whereas switch statements are type-checkable.
               | The concern about missing break statements here is
               | irrelevant, where your linter/compiler can warn about
               | missing switch cases they can easily warn about non-
               | terminated (non-explicitly marked as fall-through) cases.
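               | 
               | (In C++ the relevant check is -Wswitch, part of -Wall in
               | gcc/clang: a switch over an enum with no default label
               | warns about unhandled enumerators, a check no equivalent
               | if/else chain gets. A minimal sketch:)
               | 
               |     enum class Color { Red, Green, Blue };
               | 
               |     int code(Color c) {
               |         switch (c) {  // warning: 'Blue' not handled
               |         case Color::Red:   return 0;
               |         case Color::Green: return 1;
               |         }
               |         return -1;
               |     }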
               | 
               | For non-compiled languages (so branch prediction is not
               | possible because the code is not even loaded), switch
               | statements also provide a speed-up, i.e. the parser can
               | immediately evaluate the branch to execute vs being
               | forced to evaluate intermediate steps (and the conditions
               | to each if statement can produce side-effects e.g.
               | if(checkAndDo()) { ... } else if (checkAndDoB()) { ... }
               | else if (checkAndDoC()) { ... }).
               | 
               | Which, of course, is a potential use of if statements
               | that switches cannot use (although side-effects are
               | usually bad, if you listened to your CS profs)... And
               | again a sort of "static analysis" guarantee that switches
               | can provide that if statements cannot.
        
             | adrianN wrote:
             | Both the switch and the if have O(1) instructions, so both
             | are the same from an algorithmic complexity perspective.
        
             | yau8edq12i wrote:
             | Unless the number of "else if" statements somehow grows
             | e.g. linearly with the size of your input, which isn't
             | plausible, the "else if" statements also execute in O(1)
             | time.
        
             | Gazoche wrote:
             | It's linear with respect to the number of cases, not the
             | size of inputs. It's still O(1) in the sense of algorithmic
             | complexity.
        
           | tombert wrote:
           | Yeah, and it's not like I didn't know how to do the stuff I
           | was doing with a switch, I just don't like switches because
           | I've forgotten to add break statements and had code that
           | appeared correct but actually broke a month down the line.
           | I've also seen other people make the same mistakes. ifs, in my
           | opinion at least, are a bit harder to screw up, so I will
           | always prefer them.
           | 
           | But I agree, algorithmic complexity is generally the only
           | thing I focus on, and even then it's almost always a case of
           | "will that actually matter?" If I know that `n` is never
           | going to be more than like `10`, I might not bother trying to
           | optimize an O(n^2) operation.
           | 
           | What I feel often gets ignored in these conversations is
           | latency; people obsess over some "optimization" they learned
           | in college a decade ago, and ignore the 200 HTTP or Redis
           | calls being made ten lines below, despite the fact that the
           | latter will have a substantially higher impact on
           | performance.
        
             | dllthomas wrote:
             | > in my opinion at least, are a bit harder to screw up, so
             | I will always prefer them
             | 
             | My experience is the opposite - a sizeable chain of ifs has
             | more that can go wrong precisely because it is more
             | flexible. If I'm looking at a switch, I immediately know,
             | for instance, that none of the tests modifies anything.
             | 
             | Meanwhile, while a missing break can be a brutal error in a
             | language that allows it, it's usually trivial to set up
             | linting to require either an explicit break or a comment
             | indicating fallthrough.
        
           | jpc0 wrote:
           | ... really matter are algorithmic complexity ...
           | 
           | This is not entirely true either... Measure. There are many
           | cases where the optimiser will vectorise a certain algorithm
           | but not another... In many cases O(n^2) vectorised may be
           | significantly faster than O(n) or O(n log n) even for very
           | large datasets, depending on your data...
           | 
           | Make your algorithms generic and it won't matter which one
           | you use, if you find that one is slower swap it for the
           | quicker one. Depending on CPU arch and compiler optimisations
           | the fastest algorithm may actually change multiple times in a
           | codebases lifetime even if the usage pattern doesn't change
           | at all.
        
             | bluGill wrote:
             | While you are not wrong, if you have a decent language you
             | will discover all the useful algorithms are already in your
             | standard library and so it isn't a worry. Your code should
             | mostly look like applying an existing algorithm to some new
             | data structure.
        
               | jpc0 wrote:
               | I don't disagree with you at all on this. However you may
               | need to combine several to get to an end result. And if
               | that happens a few times in a codebase, well makes sense
               | to factor that into a library.
        
         | saghm wrote:
         | > But there's tons of stuff like this in CS
         | 
         | Reminds me of the classic
         | https://stackoverflow.com/questions/24848359/which-is-faster...
        
           | sgerenser wrote:
           | Never saw that before, that is indeed a classic.
        
         | jollyllama wrote:
         | I've encountered similar situations before. It's insane to me
         | when people hold up PRs over that kind of thing.
        
         | dosshell wrote:
         | > I can get away with a smaller sized float
         | 
         | When talking about not assuming optimizations...
         | 
         | 32bit float is slower than 64bit float on reasonably modern
         | x86-64.
         | 
         | The reason is that 32bit float is emulated by using 64bit.
         | 
         | Of course if you have several floats you need to optimize
         | against cache.
        
           | tombert wrote:
           | Sure, I clarified this in a sibling comment, but I kind of
           | meant that I will use the slower "money" or "decimal" types
           | by default. Usually those are more accurate and less error-
           | prone, and then if it actually matters I might go back to a
           | floating point or integer-based solution.
        
           | sgerenser wrote:
           | I think this is only true if using x87 floating point, which
           | anything computationally intensive is generally avoiding
           | these days in favor of SSE/AVX floats. In the latter case,
           | for a given vector width, the cpu can process twice as many
           | 32 bit floats as 64 bit floats per clock cycle.
        
             | dosshell wrote:
             | Yes, as I wrote, it is only true for one float value.
             | 
             | SIMD/MIMD will benefit from working on smaller widths. This
             | is not only true because they do more work per clock but
             | because memory is slow. Super slow compared to the cpu.
             | Optimization is a lot about avoiding cache misses.
             | 
             | (But remember that the cache line is 64 bytes, so reading a
             | single value smaller than that will take the same time. So
             | it does not matter in theory when comparing one f32 against
             | one f64)
        
           | jcranmer wrote:
           | Um... no. This is 100% completely and totally wrong.
           | 
           | x86-64 requires the hardware to support SSE2, which has
           | native single-precision and double-precision instructions for
           | floating-point (e.g., scalar multiply is MULSS and MULSD,
           | respectively). Both the single precision and the double
           | precision instructions will take the same time, except for
           | DIVSS/DIVSD, where the 32-bit float version is slightly
           | faster (about 2 cycles latency faster, and reciprocal
           | throughput of 3 versus 5 per Agner's tables).
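           | 
           | (Easy to confirm on a compiler explorer: with any x86-64
           | target, both of these one-liners come out as a single
           | scalar SSE multiply, MULSS and MULSD respectively:)
           | 
           |     float  mulf(float a, float b)   { return a * b; }
           |     double muld(double a, double b) { return a * b; }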
           | 
           | You might be thinking of x87 floating-point units, where all
           | arithmetic is done internally using 80-bit floating-point
           | types. But all x86 chips in like the last 20 years have had
           | SSE units--which are faster anyways. Even in the days when it
           | was the major floating-point units, it wasn't any slower,
           | since all floating-point operations took the same time
           | independent of format. It might be slower if you insisted
           | that code compilation strictly follow IEEE 754 rules, but the
           | solution everybody did was to _not_ do that and that's why
           | things like Java's strictfp or C's FLT_EVAL_METHOD were born.
           | Even in _that_ case, however, 32-bit floats would likely be
           | faster than 64-bit for the simple fact that 32-bit floats can
           | safely be emulated in 80-bit without fear of double rounding
           | but 64-bit floats cannot.
        
             | dosshell wrote:
             | I agree with you. It should take the same time when
             | thinking more about it. I remember learning this in ~2016
             | and I did a performance test on Skylake which confirmed it
             | (Windows VS2015). I think I remember that I only tested
             | with addsd/addss. Definitely not x87. But as always, if the
             | result can not be reproduced... I stand corrected until
             | then.
        
               | dosshell wrote:
               | I tried to reproduce it on Ivybridge (Windows VS20122)
               | and failed (mulss and mulsd) [0]. Single and double
               | precision take the same time. I also found a behavior
               | where the first batch of iterations takes more time
               | regardless of precision. It is possible that this tricked
               | me last time.
               | 
               | [0] https://gist.github.com/dosshell/495680f0f768ae84a106
               | eb054f2...
               | 
               | Sorry for the confusion and spreading false information.
        
         | jandrewrogers wrote:
         | A significant part of it is that what engineers believe was
         | effectively true at one time. They simply haven't revisited
         | those beliefs or verified their relevance in a long time. It
         | isn't a terrible heuristic for life in general to assume that
         | what worked ten years ago will work today. But when designing
         | for system performance, the rate at which the equilibriums
         | shift due to changes in hardware and software environments is
         | so rapid that you need to make a continuous habit of checking
         | that your understanding of how the world works maps to reality.
         | 
         | I've solved a lot of arguments with godbolt and simple
         | performance tests. Some topics are recurring themes among
         | software engineers e.g.:
         | 
         | - compilers are almost always better at micro-optimizations
         | than you are
         | 
         | - disk I/O is almost never a bottleneck in competent designs
         | 
         | - brute-force sequential scans are often optimal algorithms
         | 
         | - memory is best treated as a block device
         | 
         | - vectorization can offer large performance gains
         | 
         | - etc...
         | 
         | No one is immune to this. I am sometimes surprised at the
         | extent to which assumptions are no longer true when I revisit
         | optimization work I did 10+ years ago.
         | 
         | Most performance these days is architectural, so getting the
         | initial design right often has a bigger impact than micro-
         | optimizations and localized Big-O tweaks. You can always go
         | back and tweak algorithms or codegen later but architecture is
         | permanent.
        
           | neonsunset wrote:
           | .NET is a particularly bad case for this because it was a
           | decade of few performance improvements, which caused a
           | certain intuition to develop within the industry, then 6-8
           | years of significant changes each year (with most wins
           | compressed to the last 4 years or so). Companies moving from
           | .NET Framework 4.6/7/8 to .NET 8 experience a 10x _average_
           | performance improvement, which naturally comes with rendering
           | a lot of performance know-how obsolete overnight.
           | 
           | (the techniques that used to work were similar to earlier
           | Java versions and overall very dynamic languages with some
           | exceptions, the techniques that still work and now are
           | required today are the same as in C++ or Rust)
        
             | throwaway2037 wrote:
             | .NET 4.6 to .NET 8 is a 10x "average" performance
             | improvement. I find this hard to believe. In what
             | scenarios? I tried to Google for it and found very little
             | hard evidence.
        
               | neonsunset wrote:
               | In general purpose scenarios, particularly in codebases
               | which have high amount of abstractions, use ASP.NET Core
               | and EF Core, parse and de/serialize text with the use of
               | JSON, Regex and other options, have network and file IO,
               | and are deployed on many-core hosts/container images.
               | 
               | There are a few articles on msft devblogs that cover
               | from-netframework migration to older versions (Core 3.1,
               | 5/6/7):
               | 
               | - https://devblogs.microsoft.com/dotnet/bing-ads-
               | campaign-plat...
               | 
               | - https://devblogs.microsoft.com/dotnet/microsoft-graph-
               | dotnet...
               | 
               | - https://devblogs.microsoft.com/dotnet/the-azure-cosmos-
               | db-jo...
               | 
               | - https://devblogs.microsoft.com/dotnet/one-service-
               | journey-to...
               | 
               | - https://devblogs.microsoft.com/dotnet/microsoft-
               | commerce-dot...
               | 
               | The tl;dr is that, depending on the codebase, the latency
               | reduction was anywhere from 2x to 6x, varying per
               | percentile, or the RPS was maintained with CPU usage
               | dropping by ~2-6x.
               | 
               | Now, these are codebases of likely above average quality.
               | 
               | If you consider that moving 6 -> 8 yields another up to
               | 15-30% on average through improved and enabled by default
               | DynamicPGO, and if you also consider that the average
               | codebase is of worse quality than whatever msft has,
               | meaning that DPGO-reliant optimizations scale way better,
               | it is not difficult to see the 10x number.
               | 
               | Keep in mind that while a particular regular piece of
               | enterprise code could have improved within bounds of
               | "poor netfx codegen" -> "not far from LLVM with FLTO and
               | PGO", the bottlenecks have changed significantly where
               | previously they could have been in lock contention
               | (within GC or user code), object allocation, object
               | memory copying, e.g. for financial domains - anything
               | including possibly complex Regex queries on imported
               | payment reports (these alone now differ by anywhere
               | between 2x and >1000x [0]), and for pretty much every code
               | base also in interface/virtual dispatch for layers upon
               | layers of "clean architecture" solutions.
               | 
               | The vast majority of performance improvements (both
               | compiler+gc and CoreLib+frameworks), which is difficult
               | to think about, given it was 8 years, address the above
               | first and foremost. At my previous employer the migration
               | from NETFX 4.6 to .NET Core 3.1, while also deploying to
               | much more constrained container images compared to beefy
               | Windows Server hosts, reduced latency of most requests by
               | the same factor of >5x (certain request type went from 2s
               | to 350ms). It was my first wow moment when I decided to
               | stay with .NET rather than move over to Go back then (was
               | never a fan of syntax though, and other issues, which
               | subsequently got fixed in .NET, that Go still has, are
               | not tolerable for me).
               | 
               | [0] Cumulative of
               | 
               | https://devblogs.microsoft.com/dotnet/regex-performance-
               | impr...
               | 
               | https://devblogs.microsoft.com/dotnet/regular-expression-
               | imp...
               | 
               | https://devblogs.microsoft.com/dotnet/performance-
               | improvemen...
        
               | rerdavies wrote:
               | Cheating.
               | 
               | All of the 6x performance improvement cases seem to be
               | related to using the .net based Kestrel web server
               | instead of IIS web server, which requires marshalling and
               | interprocess communication. Several of the 2x gains
               | appear to be related to using a different database
               | backend. Claims that regex performance has improved a
               | thousand-fold.... seem more troubling than cause for
               | celebration. Were you not precompiling your regex's in
               | the older code? That would be a bug.
               | 
               | Somewhere in there, there might be 30% improvements in
               | .net codegen (it's hard to tell). Profile Guided
               | Optimization (PGO) seems to provide a 35% performance
               | improvement over older versions of .net with PGO
               | disabled. But that's dishonest. PGO was around long
               | before .net Core. And claiming that PGO will provide 10x
               | performance because our code is worse than Microsoft's
               | code insults both our code and our intelligence.
        
               | ygra wrote:
               | Not sure about the 10x, either, and if true it would
               | involve more than just the JIT changes. But changing
               | ASP.NET to ASP.NET Core at the same time and the web
               | server as well as other libraries may make it plausible.
               | For certain applications moving from .NET Framework to
               | .NET isn't so simple when they have dependencies and
               | those have changed their API significantly. And in that
               | case most of the newer stuff seems to be built with
               | performance in mind. So you gain 30 % from the JIT, 2x
               | from Kestrel, and so on. Perhaps.
               | 
               | With a Roslyn-based compiler at work I saw 20 % perf
               | improvement just by switching from .NET Core 3.1 to .NET
               | 6. No idea how slow .NET Framework was, though. I
               | probably can't target the code to that anymore.
               | 
               | But for regex even with precompilation, the compiler got
               | a lot better at transforming the regex into an equivalent
               | regex that performs better (automatic atomic grouping to
               | reduce unnecessary backtracking when it's statically
               | known that backtracking won't create more matches for
               | example) and it also benefits a lot from the various
               | vectorized implementations of Index of, etc. Typically
               | with each improvement of one of those core methods for
               | searching stuff in memory there's a corresponding change
               | that uses them in regex.
               | 
               | So where in .NET Framework a regex might walk through a
               | whole string character by character multiple times with
               | backtracking it might be replaced with effectively an
               | EndsWith and LastIndexOfAny call in newer versions.
        
               | neonsunset wrote:
               | Roslyn didn't have many changes in terms of
               | optimizations - it compiles C# to IL so does very little
               | of that, save for switches and certain new or otherwise
               | features like collection literals. You are probably
               | talking about RyuJIT, also called just JIT nowadays :D
               | 
               | (the distinction becomes important for targets serviced
               | by Mono, so to outline the difference Mono is usually
               | specified, while CoreCLR and RyuJIT may not be, it also
               | doesn't help that JIT, that is, the IL to machine code
               | compiler, also services NativeAOT, so it gets more
               | annoying to be accurate in a conversation without saying
               | the generic ".net compiler", some people refer to it as
               | JIT/ILC)
        
               | ygra wrote:
               | No, I meant that we've written a compiler, based on
               | Roslyn, whose runtime for compiling the code has improved
               | by 20 % when switching to .NET 6.
               | 
               | And indeed, on the C# -> IL side there's little that's
               | being actually optimized. Besides collection literals
               | there's also switch statements/expressions over strings,
               | along with certain pattern matching constructs that get
               | improved on that side.
        
               | neonsunset wrote:
               | Interesting! (I was way off the mark, not reading
               | carefully, ha)
               | 
               | Is it a public project?
        
               | ygra wrote:
               | Nope, completely internal and part of how we offer
               | essentially the same product on multiple platforms with
               | minimal integration work. And existing C# - anything
               | compilers are typically too focused on compiling a whole
               | application instead of offering a library with a stable
               | and usable API on the other end, so we had to roll our
               | own.
        
               | neonsunset wrote:
               | No. _Dynamic_ PGO was first introduced in .NET 6 but was
               | not mature and needed two releases worth of work to
               | become enabled by default. It needs no user input and is
               | similar to what OpenJDK Hotspot has been doing for some
               | time and then a little more. It also is required for
               | major features that were strictly not available
               | previously: guarded devirtualization of virtual and
               | interface calls and delegate inlining.
               | 
               | Also, IIS hosting through Http.sys is still an option
               | that sees separate set of improvements, but that's not
               | relevant in most situations given the move to .NET 8 from
               | Framework usually also involves replacing Windows Server
               | host with a Linux container (though it works perfectly
               | fine on Windows as well).
               | 
               | On Regex, the compiled and now source-generated automata
               | have seen _a lot_ of work in all recent releases; it is
               | night and day compared to what it was before - just read
               | the articles.
               | Previously linear scans against heavy internal data
               | structures (matching by hashset) and heavy transient
               | allocations got replaced with bloom-filter style SIMD
               | search and other state of the art text search
               | algorithms[0], on a completely opposite end of a
               | performance spectrum.
               | 
               | So when you have compiler improvements multiplied by
               | changes to CoreLib internals multiplied by changes to
               | frameworks built on top - it's achievable with relative
               | ease. .NET Framework, while performing adequately, was
               | still _that_ slow compared to what we got today.
               | 
               | [0] https://github.com/dotnet/runtime/tree/main/src/libra
               | ries/Sy...
        
               | rerdavies wrote:
               | Sure. But static PGO was introduced in .Net Framework
               | 4.7.0. And we're talking about apps in production, so
               | there's no excuse NOT to use static PGO on the .net
               | framework 4.7.0 version.
               | 
               | And you have misrepresented the contents of the blogs.
               | The projects discussed in the blogs are typically
               | claiming ~30% improvements (perhaps because they weren't
               | using static PGO in their 4.7.0 incarnation), with two
               | dramatic outliers that seem to be related to migrating
               | from IIS to Kestrel.
        
               | neonsunset wrote:
                | It's a moot point. Almost no one used static PGO, and its
                | feature set was way more limited - it did not have
                | devirtualization, which provides the biggest wins. Though
                | you are welcome to disagree, it won't change the reality
                | of the impact the .NET 8 release had on real-world code.
                | 
                | It's also convenient to ignore the rest of the content at
                | the links, but it seems you're more interested in proving
                | your argument, so the data I provided doesn't matter.
        
               | andyayers wrote:
               | Something closer to a "pure codegen/runtime" example
               | perhaps: I have data showing Roslyn (the C# compiler,
               | itself written in C#) speeds up between ~2x and ~3x
               | running on .NET 8 vs .NET 4.7.1. Roslyn is built so that
               | it can run either against full framework or core, so it's
               | largely the same application IL.
        
           | tombert wrote:
           | Yep, completely agree with you on this. Intuition is often
           | wrong, or at least outdated.
           | 
           | When I'm building stuff I try my best to focus on
           | "correctness", and try to come up with an algorithm/design
           | that will encompass all realistic use cases. If I focus on
           | that, it's relatively easy to go back and convert my
           | `decimal` type to a float64, or even convert an if statement
           | into a switch if it's actually faster.
        
         | klyrs wrote:
         | > A large part of becoming a decent engineer [2] for me was
         | learning to stop trusting what professors taught me in college
         | 
         | When I was taught about performance, it was all about
         | benchmarking and profiling. I never needed to trust what my
         | professors taught, because they taught me to dig in and find
         | the truth for myself. This was taught alongside the big-O
         | stuff, with several examples where "fast" algorithms are slower
         | on small inputs.
        
           | TylerE wrote:
           | How do you even get meaningful profiling out of most modern
           | langs? It seems the vast majority of time and calls gets
           | spent inside tiny anonymous functions, GC allocations, and
           | stuff like that.
        
             | klyrs wrote:
             | I don't use most modern langs! And especially if I'm doing
             | work where performance is critical, I won't kneecap myself
             | by using a language that I can't reasonably profile.
        
             | neonsunset wrote:
             | This is easy in most modern programming languages.
             | 
              | The JVM ecosystem has the IntelliJ IDEA profiler and
              | similar advanced tools (AFAIK).
             | 
             | .NET has VS/Rider/dotnet-trace profilers (they are very
             | detailed) to produce flamegraphs.
             | 
             | Then there are native profilers which can work with any AOT
             | compiled language that produces canonically symbolicated
             | binaries: Rust, C#/F#(AOT mode), Go, Swift, C++, etc.
             | 
             | For example, you can do `samply record ./some_binary`[0]
             | and then explore multi-threaded flamegraph once completed
             | (I use it to profile C#, it's more convenient than dotTrace
             | for preliminary perf work and is usually more than
             | sufficient).
             | 
             | [0] https://github.com/mstange/samply
        
               | TylerE wrote:
               | I mean sure, but I've never seen much in a flamegraph
               | besides noise.
        
               | neonsunset wrote:
                | My experience is the complete opposite. You just need to
                | construct a realistic load test for the code and the
                | bottlenecks will stand out (more often than not).
                | 
                | Also, there is a learning curve to grouping and
                | aggregating the data.
        
         | trueismywork wrote:
          | There's not yet a culture of writing reproducible benchmarks to
          | gauge these effects.
        
         | zmj wrote:
         | .NET is a little smarter about switch code generation these
         | days: https://github.com/dotnet/roslyn/pull/66081
        
         | KerrAvon wrote:
         | > `if(x==1) elseif(x==2)...` because switch was "faster" and
         | rejected my PR
         | 
         | Yeah, that's never been true. Old compilers would often compile
         | a switch to __slower__ code because they'd tend to always go to
         | a jump table implementation.
         | 
         | A better reason to use the switch is because it's better style
         | in C-like languages. Using an if statement for that sort of
         | thing looks like Python; it makes the code harder to maintain.
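          | 
          | To illustrate the style point (a hypothetical sketch, not code
          | from the article or that PR), modern compilers typically emit
          | the same machine code for both of these, so the choice is
          | about readability rather than speed:
          | 
          |     // if-else chain: implies the order of the tests matters
          |     int price_if(int x) {
          |         if (x == 1) return 10;
          |         else if (x == 2) return 20;
          |         else if (x == 3) return 30;
          |         else return 0;
          |     }
          | 
          |     // switch: states up front that we dispatch on one value
          |     int price_switch(int x) {
          |         switch (x) {
          |             case 1: return 10;
          |             case 2: return 20;
          |             case 3: return 30;
          |             default: return 0;
          |         }
          |     }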
        
           | wzdd wrote:
           | And it's better style because it better conveys intent. An
           | if-else chain in C/C++ implies there's something important
           | about the ordering of cases. Though I'd say that for a very
           | small number of cases it's fine.
           | 
           | (Also, Python has a switch-like construct now.)
        
         | mynameisnoone wrote:
         | Yep. "Profiling or it didn't happen." The issue is that it's
         | essentially impossible for even the most neckbeard of us to
         | predict with a high degree of accuracy and precision the
         | performance on modern systems impact of change A vs. change B
         | due to the unpredictable nature of the many variables that are
         | difficult to control including compiler optimization passes,
         | architecture gotchas (caches, branch misses), and interplay of
         | quirks on various platforms. Therefore, irreducible and
         | necessary work to profile the differences become the primary
         | viable path to resolving engineering decision points.
         | Hopefully, LLMs now and in the future will be able to help
         | build out boilerplate roughly in the direct of creating such
         | profiling benchmarks and fixtures.
         | 
         | PS: I'm presently revisiting C++14 because it's the most
         | universal statically-compiled language to quickly answer
          | interview problems. It would be unfair to impose Rust, Go,
          | Elixir, or Haskell on an interviewing software engineer.
        
           | pjmlp wrote:
            | I would say it would be safer to go up to C++17, and there
            | are some goodies there, especially for better compile-time
            | stuff.
        
         | ot1138 wrote:
         | >I don't do much C++, but I have definitely found that
         | engineers will just assert that something is "faster" without
         | any evidence to back that up.
         | 
         | Very true, though there is one case where one can be highly
         | confident that this is the case: code elimination.
         | 
         | You can't get any faster than not doing something in the first
         | place.
        
           | konstantinua00 wrote:
            | inb4 instruction (cache) alignment screws everything up
        
       | JackYoustra wrote:
       | I really wish he'd listed all the flags he used. To add on to the
       | flags already listed by some other commenters, `-mcpu` and
       | related flags are really crucial in these microbenchmarks: over
       | such a small change and such a small set of tight loops, you
        | could just be regressing on coincidences in the microarchitecture
        | scheduler vs. higher-level assumptions.
        
         | j_not_j wrote:
         | And he didn't repeat each test case 5 or 9 times, and take the
         | median (or even an average).
         | 
         | There will be operating system noise that can be in the multi-
         | percent range. This is defined as various OS services that run
         | "in the background" taking up cpu time, emptying cache lines
         | (which may be most important), and flushing a few translate
         | lookaside entries.
         | 
         | Once you recognize the variability from run to run, claiming
         | "1%" becomes less credible. Depending on the noise level, of
         | course.
         | 
         | Linux benchmarks like SPECcpu tend to be run in "single-user
         | mode" meaning almost no background processes are running.
        
       | mgraczyk wrote:
       | The main case where I use final and where I would expect benefits
       | (not covered well by the article) is when you are using an
       | external library with pure virtual interfaces that you implement.
       | 
       | For example, the AWS C++ SDK uses virtual functions for
       | everything. When you subclass their classes, marking your classes
       | as final allows the compiler to devirtualize your own calls to
       | your own functions (GCC does this reliably).
       | 
       | I'm curious to understand better how clang is producing worse
       | code in these cases. The code used for the blog post is a bit too
       | complicated for me to look at, but I would love to see some
       | microbenchmarks. My guess is that there is some kind of icache or
        | code-size problem, where inlining more produces worse code.
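        | 
        | Roughly the pattern I mean (a minimal sketch with invented
        | names, not the actual AWS SDK interfaces):
        | 
        |     // The library hands you a pure virtual interface...
        |     struct HttpClientInterface {
        |         virtual ~HttpClientInterface() = default;
        |         virtual int Send() = 0;
        |     };
        | 
        |     // ...and you implement it. Marking the class final means
        |     // nothing can be more derived than MyClient.
        |     struct MyClient final : HttpClientInterface {
        |         int Send() override { return 42; }
        |     };
        | 
        |     int Use(MyClient& c) {
        |         // Static type is the final class, so the compiler can
        |         // turn this into a direct (and inlinable) call.
        |         return c.Send();
        |     }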
        
         | cogman10 wrote:
         | Could easily just be a bad optimization pathway.
         | 
         | `final` tells the compiler that nothing extends this class.
         | That means the compiler can theoretically do things like
          | inlining class methods and eliminating virtual method calls
         | (perhaps duplicating the method)?
         | 
         | However, it's quite possible that one of those optimizations
         | makes the code bigger or misaligns things with the cache in
          | unexpected ways. Sometimes, a method call can be faster than
          | inlining, especially with hot loops.
         | 
         | All this being said, I'd expect final to offer very little
         | benefit over PGO. Its main value is the constraint it imposes
         | and not the optimization it might enable.
        
         | lpapez wrote:
         | > For example, the AWS C++ SDK uses virtual functions for
         | everything. When you subclass their classes, marking your
         | classes as final allows the compiler to devirtualize your own
         | calls to your own functions (GCC does this reliably).
         | 
         | I want to ask, and I sincerely mean no snark, what is the
         | point?
         | 
         | When working with AWS through an SDK your code will spend most
         | of the time waiting on network calls.
         | 
         | What is the point of devirtualizing your function calls to save
         | an indirection when you will be spending several orders of
         | magnitude more time just waiting for the RPC to resolve?
         | 
         | It just doesn't seem like something even worth thinking about
         | at all.
        
           | mgraczyk wrote:
            | Yeah, that was just the first public C++ library with this
           | pattern that popped into my head. I just make all my classes
           | final out of habit and don't think about it. I remove final
           | if I want to subclass, but that almost never happens.
        
       | jeffbee wrote:
       | I profiled this project and there are abundant opportunities for
       | devirtualization. The virtual interface `IHittable` is the hot
       | one. However, the WITH_FINAL define is not sufficient, because
       | the hot call is still virtual. At `hit_object |=
       | _objects[node->object_index()]->hit` I am still seeing ` mov
       | (%rdi),%rax; call *0x18(%rax)` so the application of final here
        | was not sufficient to do the job. Whatever differences are being
        | measured are caused by bogons.
        
         | gpderetta wrote:
         | I haven't looked at the code, but if you have multiple leaves,
         | even marking all of them as final won't help if the call is
         | through a base class.
        
           | jeffbee wrote:
           | Yeah the practical cases for devirtualization are when you
           | have a base class, a derived class that you actually use, and
           | another derived class that you use in tests. For your release
           | binary the tests aren't visible so that can all be
           | devirtualized.
           | 
           | In cases where you have Dog and Goose that both derive from
           | Animal and then you have std::vector<Animal>, what is the
           | compiler supposed to do?
        
             | kccqzy wrote:
             | The compiler simply knows that the actual dynamic type is
             | Animal because it is not a pointer. You need Animal* to
             | trigger all the fun virtual dispatch stuff.
        
               | froh wrote:
               | I intuit vector<Animal*> is what was meant...
        
               | jeffbee wrote:
               | Yes. I reflexively avoid asterisks on this site because
               | they can _hose your formatting_.
        
         | akoboldfrying wrote:
         | An interface, like IHittable, can't possibly be made final
         | since its whole _purpose_ is to enable multiple different
         | concrete subclasses that implement it.
         | 
         | As you say, that's the hot one -- and making the concrete
         | subclasses themselves "final" enables no devirtualisations
         | since there are no opportunities for it.
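          | 
          | A minimal sketch of that, with the interface boiled down to a
          | toy version (not the article's actual signatures):
          | 
          |     #include <vector>
          | 
          |     struct IHittable {
          |         virtual ~IHittable() = default;
          |         virtual bool hit() const = 0;
          |     };
          | 
          |     struct Sphere final : IHittable {
          |         bool hit() const override { return true; }
          |     };
          | 
          |     struct Triangle final : IHittable {
          |         bool hit() const override { return false; }
          |     };
          | 
          |     bool any_hit(const std::vector<IHittable*>& objects) {
          |         bool h = false;
          |         // The static type here is the interface, which by
          |         // design cannot be final, so each call stays an
          |         // indirect vtable call despite `final` on the leaves.
          |         for (const auto* o : objects)
          |             h |= o->hit();
          |         return h;
          |     }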
        
       | lanza wrote:
       | If you're measuring a compiler you need to post the flags and
       | version used. Otherwise the entire experiment is in the noise.
        
       | LorenDB wrote:
       | Man, I wish this blog had an RSS feed.
        
       | magnat wrote:
       | > I created a "large test suite" to be more intensive. On my dev
       | machine it needed to run for 8 hours.
       | 
       | During such long and compute-intensive tests, how are thermal
       | considerations mitigated? Not saying that this was case here, but
       | I can see how after saturating all cores for 8 hours, the whole
        | PC might get hot to the point the CPU starts throttling, so when
        | you reboot to the next OS or start another batch, overall
        | performance could be a bit lower.
        
         | lastgeniusua wrote:
          | Having recently done similar day-and-night-long suites of
          | benchmarks (on a laptop in heat dissipation conditions worse
          | than on any decent desktop), I've found that there is no
          | correlation between the order the benchmarks are run in and
          | their performance (or energy consumption!). I would therefore
          | assume that a non-overclocked processor would not exhibit the
          | patterns you are thinking of here.
        
       | leni536 wrote:
       | This is the gist of the difference in code generation when final
       | is involved:
       | 
       | https://godbolt.org/z/7xKj6qTcj
       | 
       | edit: And a case involving inlining:
       | 
       | https://godbolt.org/z/E9qrb3hKM
        
       | fransje26 wrote:
       | I'm actually more worried about Clang being close to 100% slower
       | than GCC on Linux. That doesn't seem right.
       | 
       | I am prepared to believe that there is some performance
       | difference between the two, varying per case, but I would expect
       | a few percent difference, not twice the run time..
        
       | mastax wrote:
       | Changes in the layout of the binary can have large impacts on the
       | program performance [0] so it's possible that the unexpected
       | performance decrease is caused by unpredictable changes in the
       | layout of the binary between compilations. I think there is some
       | tool which helps ensure layout is consistent for benchmarking,
       | but I can't remember what it's called.
       | 
       | [0]: https://research.facebook.com/publications/bolt-a-
       | practical-...
        
       | akoboldfrying wrote:
       | I would expect "final" to have no effect on this type of code at
       | all. That it does in some cases cause measurable differences I
       | put down to randomly hitting internal compiler thresholds
       | (perhaps one of the inlining heuristics is "Don't inline a
       | function with more than 100 tokens", and the "final" keyword
       | pushes a couple of functions to 101).
       | 
       | Why would I expect no performance difference? I haven't looked at
       | the code, but I would expect that for each pixel, it iterates
       | through an array/vector/list etc. of objects that implement some
       | common interface, and calls one or more methods (probably
       | something called intersectRay() or similar) on that interface.
        | _By design, that interface cannot be made final, and that's what
        | counts._ Whether the concrete derived classes are final or not
        | makes no difference.
       | 
       | In order to make this a good test of "final", the pointer type of
       | that container should be constrained to a concrete object type,
       | like Sphere. Of course, this means the scene is limited to
       | spheres.
       | 
       | The only case where final can make a difference, by
       | devirtualising a call that couldn't otherwise be devirtualised,
       | is when you hold a pointer to that type, _and_ the object it
       | points at was allocated  "uncertainly", e.g., by the caller. (If
       | the object was allocated in the same basic block where the method
       | call later occurs, the compiler already knows its runtime type
       | and will devirtualise the call anyway, even without "final".)
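        | 
        | A sketch of that distinction, with invented names (Shape/Sphere
        | here are not the article's classes):
        | 
        |     struct Shape {
        |         virtual ~Shape() = default;
        |         virtual bool intersectRay() const = 0;
        |     };
        | 
        |     struct Sphere final : Shape {
        |         bool intersectRay() const override { return true; }
        |     };
        | 
        |     // Caller-allocated object, static type is the base:
        |     // nothing to devirtualise, final or not.
        |     bool hit1(const Shape& s)  { return s.intersectRay(); }
        | 
        |     // Caller-allocated object, static type is the final class:
        |     // this is the case `final` actually enables.
        |     bool hit2(const Sphere& s) { return s.intersectRay(); }
        | 
        |     // Locally constructed object: the compiler already knows
        |     // the dynamic type and devirtualises even without `final`.
        |     bool hit3() { Sphere s; return s.intersectRay(); }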
        
         | koyote wrote:
         | > (perhaps one of the inlining heuristics is "Don't inline a
         | function with more than 100 tokens", and the "final" keyword
         | pushes a couple of functions to 101).
         | 
         | That definitely is one of the heuristics in MSVC++.
         | 
         | We have some performance critical code and at one point we
         | noticed a slowdown of around ~4% in a couple of our performance
         | tests. I investigated but the only change to that code base
         | involved fixing up an error message (i.e. no logic difference
         | and not even on the direct code path of the test as it would
         | not hit that error).
         | 
          | Turns out that:
          | 
          |     int some_func() {
          |         if (bad)
          |             throw std::exception("Error");
          |         return some_int;
          |     }
         | 
         | Inlined just fine, but after adding more text to the exception
         | error message it no longer inlined, causing the slow-down. You
         | could either fix it with __forceinline or by moving the
         | exception to a function call.
        
           | Maxatar wrote:
           | Since the inlining is performed in MSVC's backend, as opposed
           | to its frontend, and hence operates strictly on MSVC's
           | intermediate representation which lacks information about
           | tokens or the AST, it's unlikely due to tokens.
           | 
           | std::exception does not take a string in its constructor, so
           | most likely you used std::runtime_error. std::runtime_error
           | has a pretty complex constructor if you pass into it a long
           | string. If it's a small string then there's no issue because
           | it stores its contents in an internal buffer, but if it's a
           | longer string then it has to use a reference counting scheme
           | to allow for its copy constructor to be noexcept.
           | 
           | That is why you can see different behavior if you use a long
           | string versus a short string. You can also see vastly
           | different codegen with plain std::string as well depending on
           | whether you pass it a short string literal or a long string
           | literal.
        
             | koyote wrote:
             | > std::exception does not take a string in its constructor
             | 
             | You're right, I used it as a short-hand for our internal
             | exception function, forgetting that the std one does not
             | take a string. Our error handling function is a simple
             | static function that takes an std::string and throws a
             | newly constructed object with that string as a field.
             | 
             | But yes, it could very well have been that the string
             | surpassed the short string optimisation threshold or
             | something similar. I did verify the assembly before and
             | after and the function definitely inlined before and no
             | longer inlined after. Moving the 'throw' (and, importantly,
             | the string literal) into a separate function that was
             | called from the same spot ensured it inlined again and the
             | performance was back to normal.
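              | 
              | Something like this shape, roughly (a from-memory sketch,
              | not our actual code; `bad`/`some_int` are stand-ins as
              | before):
              | 
              |     #include <stdexcept>
              | 
              |     extern bool bad;
              |     extern int some_int;
              | 
              |     // Hoist the throw and its string literal out of the
              |     // hot function so the latter stays small enough for
              |     // the inliner.
              |     [[noreturn]] static void throw_error() {
              |         throw std::runtime_error("long descriptive text");
              |     }
              | 
              |     int some_func() {
              |         if (bad)
              |             throw_error();
              |         return some_int;
              |     }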
        
             | akoboldfrying wrote:
             | Wow, I had no idea. And I thought I knew about most of
             | C++'s weirdnesses.
        
         | simonask wrote:
         | Actually, the compiler can only implicitly devirtualize under
         | very specific circumstances. For example, it cannot
         | devirtualize if there was previously a non-inlined call through
         | the same pointer.
         | 
         | The reason is placement new. It is legal (given that certain
         | invariants are upheld) in C++ to say `new(this) DerivedClass`,
         | and compilers must assume that each method could potentially
         | have done this, changing the vtable pointer of the object.
         | 
         | The `final` keyword somewhat counteracts this, but even GCC
          | still only opportunistically honors it - i.e. it inserts a
          | check that the vtable is the expected value before calling the
          | devirtualized function, falling back on the indirect call.
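          | 
          | A hand-written equivalent of that guarded form (conceptual
          | sketch; the real compiler-generated check compares the vptr
          | directly rather than using typeid):
          | 
          |     #include <typeinfo>
          | 
          |     struct Base {
          |         virtual ~Base() = default;
          |         virtual void run() = 0;
          |     };
          | 
          |     struct Leaf final : Base {
          |         void run() override {}
          |     };
          | 
          |     void call_run(Base& b) {
          |         if (typeid(b) == typeid(Leaf))
          |             static_cast<Leaf&>(b).run();  // direct, inlinable
          |         else
          |             b.run();                      // indirect fallback
          |     }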
        
           | akoboldfrying wrote:
           | Fascinating, though a little sad. Are there any important
           | kinds of behaviour that can only be implemented via this
           | `new(this) DerivedClass` chicanery? Because if not, it seems
           | a shame to make the optimiser pay such a heavy price just to
           | support it.
        
       | ndesaulniers wrote:
       | As an LLVM developer, I really wish the author filed a bug report
       | and waited for some analysis BEFORE publishing an article (that
       | may never get amended) that recommends not using this keyword
       | with clang for performance reasons. I suspect there's just a bug
       | in clang.
        
         | saagarjha wrote:
         | Bug, misunderstanding, weird edge case...
        
         | fransje26 wrote:
         | Is there any logical reason why Clang is 50% slower than GCC on
         | Ubuntu?
        
       | pklausler wrote:
       | Mildly related programming language trivia:
       | 
       | Fortran has virtual functions ("type bound procedures"), and
       | supports a NON_OVERRIDABLE attribute on them that is basically
       | "final". (FINAL exists but means something else.). But it also
       | has a means for localizing the non-overridable property.
       | 
       | If a type bound procedure is declared in a module, and is
       | PRIVATE, then overrides in subtypes ("extended derived types")
       | work as usual for subtypes in the same module, but can't be
       | affected by overrides that appear in other modules. This allows a
       | compiler to notice when a type has no subtypes in the same
       | module, and basically infer that it is non-overridable locally,
       | and thus resolve calls at compilation time.
       | 
       | Or it would, if compilers implemented this feature correctly.
       | It's not well described in the standard, and only half of the
       | Fortran compilers in the wild actually support it. So like too
       | many things in the Fortran world, it might be useful, but it's
       | not portable.
        
       | MathMonkeyMan wrote:
       | I think it was Chandler Carruth who said "If you're not
       | measuring, then you don't care about performance." I agree, and
       | by that measure, nobody I've ever worked with cares about
       | performance.
       | 
       | The best I'll see is somebody who cooked up a naive
       | microbenchmark to show that style 1 takes fewer wall nanoseconds
       | than style 2 on his laptop.
       | 
       | People I've worked with don't use profilers, claiming that they
       | can't trust it. Really they just can't be bothered to run it and
       | interpret the output.
       | 
       | The truth is, most of us don't write C++ because of performance;
       | we write C++ because that's the language the code is written in.
       | 
       | The performance gained by different C++ techniques seldom
       | matters, and when it does you have to measure. Profiler reports
       | almost always surprise me the first few times -- your mental
       | model of what's going on and what matters is probably wrong.
        
         | scottLobster wrote:
         | It matters to some degree. If it's just a simple technique you
         | can file away and repeat as muscle memory, well that means your
         | code is that much better.
         | 
         | From a user perspective it could be the difference between
         | software that's pleasant to use and software that's annoying to
         | use. From a philosophical perspective it's the difference
         | between software that functions vs software that works well.
         | 
         | Of course it depends on your context as to whether this is
          | valued, but I wouldn't dismiss it. One person's micro-
          | optimization is another person's polish.
        
       | chris_wot wrote:
       | Surely "final" is a conceptual thing... in other words, you don't
       | want anyone else to derive from the class for good reasons. It's
       | for conceptual understanding, surely?
        
       | manlobster wrote:
       | This seems like a reasonable use of the preprocessor to me. I've
       | seen similar use in high-quality codebases. I wonder why the
       | author is so disgusted by it.
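        | 
        | Presumably it's something along these lines (my guess at the
        | macro; WITH_FINAL is the define mentioned elsewhere in the
        | thread, the rest of the names are made up):
        | 
        |     #if defined(WITH_FINAL)
        |     #  define MAYBE_FINAL final
        |     #else
        |     #  define MAYBE_FINAL
        |     #endif
        | 
        |     struct IHittable {
        |         virtual ~IHittable() = default;
        |         virtual bool hit() = 0;
        |     };
        | 
        |     struct Sphere MAYBE_FINAL : IHittable {
        |         bool hit() override { return true; }
        |     };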
        
       | headline wrote:
       | re: final macro
       | 
       | > I would never do this in an actual product
       | 
       | what, why?
        
       | alex_smart wrote:
        | One thing that wasn't mentioned in the article, and that I wish
        | it had covered, is the size of the compiled binary with and
        | without final. The only reason I would expect the final version
        | to be slower is that we are emitting more code because of
        | inlining, resulting in more instruction cache misses.
       | 
       | Also, now that I think of it, they should have run the code under
       | perf and compared the stats.
        
         | account42 wrote:
         | Yeah, really unsatisfying that there was no attempt to explain
         | _why_ it might be slower since it just gives the compiler more
          | information to decide on optimizations, which in theory should
          | only make things faster.
        
       | kasajian wrote:
        | I'm surprised by this article. The author genuinely believes that
        | a language construct to benefit performance was added to the
        | language without anyone ever running any metrics to verify it.
        | "Just trust me bro" is the quote.
        | 
        | It's an insane level of ignorance about how these things are
        | decided by the standards committee.
        
         | kreetx wrote:
          | And yet, results from current compilers are mixed; in summary,
          | it does not make programs faster.
        
       | kookamamie wrote:
       | > And probably, that reason is performance.
       | 
       | That's the first problem I see with the article. C++ isn't a fast
       | language, as it is. There are far too many issues with e.g.
       | aliasing rules, lack of proper vectorization (for the runtime
       | arch), etc.
       | 
        | If you wish to have relatively good performance for your code,
        | try ISPC, which still allows you to get great performance with
        | vectorization up to AVX-512, without turning to intrinsics.
        
         | chipdart wrote:
         | > That's the first problem I see with the article. C++ isn't a
         | fast language, as it is. There are far too many issues with
         | e.g. aliasing rules, lack of proper vectorization (for the
         | runtime arch), etc.
         | 
         | That's a bold statement due to the way it heavily contrasts
         | with reality.
         | 
         | C++ is ever present in high performance benchmarks as either
         | the highest performing language or second only to C. It's weird
         | seeing someone claim with a straight face that "C++ isn't a
         | fast language, as it is".
         | 
         | To make matters worse, you go on confusing what a programming
         | language is, and confusing implementation details with language
         | features. It's like claiming that C++ isn't a language for
         | computational graphics just because no C++ standard dedicates a
         | chapter to it.
         | 
          | Just like in every engineering domain, you need to have deep
          | knowledge of the details to milk the last drop of performance
          | improvement out of a program. Low-latency C++ is a testament
          | to how the smallest details can be critical to performance. But
         | you need to be completely detached from reality to claim that
         | C++ isn't a fast language.
        
           | kookamamie wrote:
           | > That's a bold statement due to the way it heavily contrasts
           | with reality.
           | 
           | I'm ready to back this up. And no, I'm not confusing things -
           | I work in HPC (realtime computer vision) and in reality the
           | only thing we'd use C++ for is "glue", i.e. binding
           | implementations of the actual algorithms implemented in other
           | languages together.
           | 
           | Implementations could be e.g. in CUDA, ISPC, neural-inference
           | via TensorRT, etc.
        
             | jpc0 wrote:
             | "We use extreme vectorisation and can't do it in native C++
             | therefore the language is slow"
             | 
             | You a junior or something? For 99% of use cases C++
             | autovectorisation does plenty and will outperform the same
             | code written in higher level languages. You are literally
             | in the 1% and conflating your use case for that of the
             | general case...
        
             | chipdart wrote:
             | I've worked in computer vision and real time image
             | processing. We use C++ extensively in the field due to it's
             | high performance. OpenCV is the tool of the trade. Both iOS
             | and Android support C++ modules for performance reasons.
             | 
             | But to add to all the nonsense,you claim otherwise.
             | 
             | Frankly, your comments lack any credibility, which is
             | confirmed by your lame appeal to authority.
        
       | teeuwen wrote:
       | I do not see how the final keyword would make a difference in
       | performance at all in this case. The compiler should be able to
       | build an inheritance tree and determine by itself which classes
       | are to be treated as final.
       | 
       | Now for libraries, this is a different story. There I can imagine
       | final keyword could have an impact.
        
         | connicpu wrote:
          | But dynamically loaded libraries exist. So even if the compiler
          | knows, through LTO or something, that the class is the most
          | derived version out of all classes in the statically-linked
          | code, it won't be able to devirtualize the function calls
          | unless it can see the instantiation site or the class is
          | marked as final.
        
         | pjmlp wrote:
         | Only if the complete source code is available to the compiler.
        
       | juliangmp wrote:
       | >Personally, I'm not turning it on. And would in fact, avoid
       | using it. It doesn't seem consistent.
       | 
       | I feel like we'd have to repeat these tests quite a few times to
        | get to a decent conclusion. Hell, small variations in performance
       | could be caused by all sorts of things outside the actual
       | program.
        
         | kreetx wrote:
          | AFAIU, these tests were run 30 times each and apparently some
         | took minutes to run, so it's unlikely that you'll get any
         | different conclusions.
        
       | lionkor wrote:
       | The only thing worse than no benchmark is a bad benchmark.
       | 
       | I don't think this really shows what `final` does, not to code
       | generation, not to performance, not to the actual semantics of
       | the program. There is no magic bullet - if putting `final` on
       | every single class would always make it faster, it wouldn't be a
       | keyword, it'd be a compiler optimization.
       | 
       | `final` does one specific thing: It tells a compiler that it can
       | be sure that the given object is not going to have anything
       | derive _from it_.
        
         | opticfluorine wrote:
         | Not disagreeing with your point, but it couldn't be a compiler
         | optimization, could it? The compiler isn't able to infer that
         | the class will not be inherited anywhere else, since another
         | compilation unit unknown to the class could inherit.
        
           | ftrobro wrote:
           | I assume it could be or is part of link time optimization
           | when compiling an application rather than a library?
        
           | vedantk wrote:
           | Possibly not in the default c++ language mode, but check out
           | -fwhole-program-vtables. It can be a useful option in cases
           | where all relevant inheritance relationships are known at
           | compile time.
           | 
           | https://reviews.llvm.org/D16821
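            | 
            | Rough usage sketch (it has to ride on LTO, and it assumes no
            | vtables escape through shared libraries or plugins):
            | 
            |     clang++ -O2 -flto -fvisibility=hidden \
            |         -fwhole-program-vtables src1.cpp src2.cpp -o app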
        
             | bluGill wrote:
             | Which is good, but may not apply. I have an application
             | where I can't do that because we support plugins and so a
             | couple classes will get overridden outside of the
             | compilation (this was in hindsight a bad decision, but too
              | late to change now). Meanwhile most classes will never be
              | overridden, so I use final to say that. We are also a
              | multi-repo project (which despite the hype I think is
              | better for us than mono-repo), another reason why
              | -fwhole-program-vtables would be difficult to use - but we
              | could make it work with effort if it weren't for the
              | plugins.
        
         | paulddraper wrote:
         | > `final` does one specific thing: It tells a compiler that it
         | can be sure that the given object is not going to have anything
         | derive from it.
         | 
         | ...and the compiler can optimize using that information.
         | 
         | (It could also do the same without the keyword, with LTO.)
        
           | bluGill wrote:
           | LTO can only apply in specific situations though, if there is
           | any possibility that a plugin derived from the class LTO can
           | do nothing.
        
         | Nevermark wrote:
         | 'Final' cannot be assumed without complete knowledge of all
         | final linking cases, and knowledge that this will not change in
         | the future. The latter can never be assumed by a compiler
         | without indication.
         | 
         | "In theory" adding 'final' only gives a compiler more
         | information, so should only result in same or faster code.
         | 
         | In practice, some optimizations improve performance for more
         | expected or important cases (in the compiler writer's
         | estimation), with worse outcomes in other less expected, less
          | important cases. Without a clear understanding of the when and
          | how of these 'final' optimizations, it isn't clear, without
          | benchmarking after the fact, when to use it or not.
          | 
          | That makes any given test much less helpful, since all we know
          | is that 'final' was not helpful in this case. We have no basis
          | to know how general these results are.
         | 
         | But it would be deeply strange if 'final' was generally
         | unhelpful. Informationally it does only one purely helpful
         | thing: reduce the number of linking/runtime contexts the
         | compiler needs to worry about.
        
       | account42 wrote:
        | I'm amused at the AI advert spam in the comments here that can't
        | even be bothered to make the spam look like even vaguely normal
        | comments.
        
       | AtNightWeCode wrote:
        | Most benchmarks are wrong. I doubt this is correct. Final should
        | have been the default in the language, I think, though.
       | 
       | There are tons of these suggestions. Like always using sealed in
       | C# or never use private in Java.
        
       ___________________________________________________________________
       (page generated 2024-04-23 23:01 UTC)