[HN Gopher] The Performance Impact of C++'s `final` Keyword
       ___________________________________________________________________
        
       The Performance Impact of C++'s `final` Keyword
        
       Author : hasheddan
       Score  : 99 points
       Date   : 2024-04-22 17:32 UTC (5 hours ago)
        
 (HTM) web link (16bpp.net)
 (TXT) w3m dump (16bpp.net)
        
       | mgaunard wrote:
       | What final enables is devirtualization in certain cases. The main
       | advantage of devirtualization is that it is necessary for
       | inlining.
       | 
       | Inlining has other requirements as well -- LTO pretty much covers
       | it.
       | 
       | The article doesn't have sufficient data to tell whether the
       | testcase is built in such a way that any of these optimizations
        | can happen or would be beneficial.
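        | 
        | For what devirtualization buys you, a rough sketch (my own
        | example, not the article's benchmark): when the static type is
        | a `final` class, the compiler knows which override a virtual
        | call resolves to, so it can emit a direct call and then inline
        | it.
        | 
        |     #include <cstdio>
        | 
        |     struct shape {
        |         virtual ~shape() = default;
        |         virtual int sides() const = 0;
        |     };
        | 
        |     struct square final : shape {
        |         int sides() const override { return 4; }
        |     };
        | 
        |     // Because `square` is final, s.sides() can only be
        |     // square::sides(); the virtual call can be replaced by a
        |     // direct call and typically folded down to `return 4`.
        |     int count_sides(const square& s) { return s.sides(); }
        | 
        |     int main() { std::printf("%d\n", count_sides(square{})); }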
        
         | i80and wrote:
         | If you already have LTO, can't the compiler determine this
         | information for devirtualization purposes on its own?
        
           | nickwanninger wrote:
           | At the level that LLVM's LTO operates, no information about
           | classes or objects is left, so LLVM itself can't really
            | devirtualize C++ methods in most cases.
        
             | nwallin wrote:
             | You appear to be correct. Clang does not devirtualize in
             | LTO, but GCC does. Personally I consider this very strange.
              |     $ cat animal.h cat.cpp main.cpp
              |     // animal.h
              |     #pragma once
              | 
              |     class animal {
              |     public:
              |       virtual ~animal() {}
              |       virtual void speak() = 0;
              |     };
              | 
              |     animal& get_mystery_animal();
              | 
              |     // cat.cpp
              |     #include "animal.h"
              |     #include <cstdio>
              | 
              |     class cat final : public animal {
              |     public:
              |       ~cat() override {}
              |       void speak() override {
              |         puts("meow");
              |       }
              |     };
              | 
              |     static cat garfield{};
              | 
              |     animal& get_mystery_animal() {
              |       return garfield;
              |     }
              | 
              |     // main.cpp
              |     #include "animal.h"
              | 
              |     int main() {
              |       animal& a = get_mystery_animal();
              |       a.speak();
              |     }
              | 
              |     $ make clean && CXX=clang++ make -j && objdump --disassemble=main -C lto_test
              |     rm -f *.o lto_test
              |     clang++ -c -flto -O3 -g cat.cpp -o cat.o
              |     clang++ -c -flto -O3 -g main.cpp -o main.o
              |     clang++ -flto -O3 -g cat.o main.o -o lto_test
              | 
              |     lto_test:     file format elf64-x86-64
              | 
              |     Disassembly of section .init:
              |     Disassembly of section .plt:
              |     Disassembly of section .plt.got:
              |     Disassembly of section .text:
              | 
              |     00000000000011b0 <main>:
              |       11b0: 50                    push   %rax
              |       11b1: 48 8b 05 58 2e 00 00  mov    0x2e58(%rip),%rax  # 4010 <garfield>
              |       11b8: 48 8d 3d 51 2e 00 00  lea    0x2e51(%rip),%rdi  # 4010 <garfield>
              |       11bf: ff 50 10              call   *0x10(%rax)
              |       11c2: 31 c0                 xor    %eax,%eax
              |       11c4: 59                    pop    %rcx
              |       11c5: c3                    ret
              | 
              |     Disassembly of section .fini:
              | 
              |     $ make clean && CXX=g++ make -j && objdump --disassemble=main -C lto_test|sed -e 's,^,    ,'
              |     rm -f *.o lto_test
              |     g++ -c -flto -O3 -g cat.cpp -o cat.o
              |     g++ -c -flto -O3 -g main.cpp -o main.o
              |     g++ -flto -O3 -g cat.o main.o -o lto_test
              | 
              |     lto_test:     file format elf64-x86-64
              | 
              |     Disassembly of section .init:
              |     Disassembly of section .plt:
              |     Disassembly of section .plt.got:
              |     Disassembly of section .text:
              | 
              |     0000000000001090 <main>:
              |       1090: 48 83 ec 08           sub    $0x8,%rsp
              |       1094: 48 8d 3d 75 2f 00 00  lea    0x2f75(%rip),%rdi  # 4010 <garfield>
              |       109b: e8 50 01 00 00        call   11f0 <cat::speak()>
              |       10a0: 31 c0                 xor    %eax,%eax
              |       10a2: 48 83 c4 08           add    $0x8,%rsp
              |       10a6: c3                    ret
              | 
              |     Disassembly of section .fini:
        
           | wiml wrote:
           | If your runtime environment has dynamic linking, then the LTO
           | pass can't always be sure that a subclass won't be introduced
           | later that overrides the method.
        
             | i80and wrote:
             | Aha! That makes sense. I wasn't thinking of that case.
             | Thanks!
        
             | gpderetta wrote:
             | You can tell the compiler it is indeed compiling the whole
             | program.
        
           | adzm wrote:
            | MSVC with LTO and PGO will inline virtual calls in some
            | situations, guarded by a check for the expected vtable; if
            | the vtable is an unexpected value, it bypasses the inlined
            | code and calls the virtual function normally.
        
           | bluGill wrote:
            | Not if there is a shared library or other plugin. Then you
            | cannot determine until runtime whether there is an override.
        
           | ot wrote:
           | In general the compiler/linker cannot assume that derived
           | classes won't arrive later through a shared object.
           | 
           | You can tell it "I won't do that" though with additional
           | flags, like Clang's -fwhole-program-vtables, and even then
           | it's not that simple. There was an effort in Clang to better
           | support whole program devirtualization, but I haven't been
           | following what kind of progress has been made:
           | https://groups.google.com/g/llvm-dev/c/6LfIiAo9g68?pli=1
        
           | samus wrote:
           | This is one of the cases where JIT compiling can shine. You
           | can use a bazillion interfaces to decouple application code,
           | and the JIT will optimize the calls after it found out which
            | implementation is used. This works as long as only one or
            | two of them are actually active at runtime.
        
         | Negitivefrags wrote:
         | See this is why I find this odd.
         | 
         | Is there a theory as to how devirtualisation could hurt
         | performance?
        
           | samus wrote:
            | Devirtualization by itself, maybe not, but the inlining it
            | enables might make code fail to fit into instruction caches.
        
           | hansvm wrote:
           | There's a cost to loading more instructions, especially if
           | you have more types of instructions.
           | 
           | The main advantages to inlining are (1) avoiding a jump and
           | other function call overhead, (2) the ability to push down
           | optimizations.
           | 
           | If you execute the "same" code (same instructions, different
           | location) in many places that can cause cache evictions and
           | other slowdowns. It's worse if some minor optimizations were
           | applied by the inlining, so you have more types of
           | instructions to unpack.
           | 
           | The question, roughly, is whether the gains exceed the costs.
           | This can be a bit hard to determine because it can depend on
           | the size of the whole program and other non-local parameters,
           | leading to performance cliffs at various stages of
           | complexity. Microbenchmarks will tend to suggest inlining is
            | better in more cases than it actually is.
           | 
           | Over time you get a feel for which functions should be
           | inlined. E.g., very often you'll have guard clauses or
           | whatnot around a trivial amount of work when the caller is
           | expected to be able to prove the guarded information at
           | compile-time. A function call takes space in the generated
           | assembly too, and if you're only guarding a few instructions
           | it's usually worth forcing an inline (even in places where
           | the compiler's heuristics would choose not to because the
           | guard clauses take up too much space), regardless of the
           | potential cache costs.
        
           | masklinn wrote:
           | Code bloat causing icache evictions?
        
           | cogman10 wrote:
           | Through inlining.
           | 
           | If you have something like a `while` loop and that while
           | loop's instructions fit neatly on the cache line, then
            | executing that loop can be quite fast even if you have to
           | jump to different code locations to do the internals.
           | However, if you pump in more instructions in that loop you
           | can exceed the length of the cache line which causes you to
           | need more memory loads to do the same work.
           | 
            | It can also create more code. A function like
            | `foo(NotFinal& bar)` could be duplicated by the compiler
            | for the specialized cases, which would be bad if there are
            | a lot of implementations of `NotFinal` that end up being
            | marshalled into foo. You could end up loading multiple
            | implementations of the same function, which may be slower
            | than just keeping the virtual dispatch tables warm.
        
           | phire wrote:
            | Jumps/calls are actually pretty cheap with modern branch
            | predictors. Even indirect calls through vtables, which is
            | the opposite of most programmers' intuition.
           | 
            | And if the devirtualisation leads to inlining, that results
            | in code bloat, which can lower performance through more
            | instruction cache misses, which are not cheap.
           | 
           | Inlining is actually pretty evil. It almost always speeds
           | things up for microbenchmarks, as such benchmarks easily fit
           | in icache. So programmers and modern compilers often go out
           | of their way to do more inlining. But when you apply too much
           | inlining to a whole program, things start to slow down.
           | 
            | But it's not like inlining is universally bad in larger
            | programs; inlining can enable further optimisations, mostly
           | because it allows constant propagation to travel across
           | function boundaries.
           | 
           | Basically, compilers need better heuristics about when they
           | should be inlining. If it's just saving the overhead of a
           | lightweight call, then they shouldn't be inlining.
        
             | qsdf38100 wrote:
             | "Inlining is actually pretty evil".
             | 
             | No it's not. Except if you __force_inline__ everything, of
             | course.
             | 
              | Inlining reduces the number of instructions in a lot of
              | cases, especially when things are abstracted and factored
              | with lots of indirections into small functions that call
              | other small functions and so on. Consider an 'isEmpty'
              | function, which dissolves to one CPU instruction once
              | inlined, compared with a call/save reg/compare/return.
              | Highly dynamic code (with most functions being virtual)
              | tends to result in a fest of chained calls, jumping into
              | functions doing very little work. Yes, the stack is
              | usually hot and fast, but spending 80% of the
              | instructions doing stack management is still a big waste.
             | 
              | Compilers already have good heuristics about when they
              | should be inlining; chances are they are a lot better at
              | it than you. They don't always inline, and that's not possible
             | anyway.
             | 
              | My experience is that compilers do marvels with inlining
             | decisions when there are lots of small functions they _can_
             | inline if they want to. It gives the compiler a lot of
             | freedom. Lambdas are great for that as well.
             | 
              | Make as much compile-time information as possible
              | available to the compiler, factor your code, don't have
              | huge functions, and let the compiler do its magic. As a
              | plus, you can have high level abstractions, deep
              | hierarchies, and still get excellent performance.
        
               | grdbjydcv wrote:
               | The "evilness" is just that sometimes if you inline
               | aggressively in a microbenchmark things get faster but in
               | real programs things get slower.
               | 
               | As you say: "chances are they are a lot better at it than
               | you". Infrequently they are not.
        
           | neonsunset wrote:
           | Practically - it never does. It is always cheaper to perform
           | a direct, possibly inlined, call (devirtualization !=
           | inlining) than a virtual one.
           | 
            | Guarded devirtualization is also cheaper than virtual calls,
            | even when it has to do
            | 
            |     if (instance is SpecificType st) { st.Call(); }
            |     else { instance.Call(); }
           | 
           | or even chain multiple checks at once (with either regular
           | ifs or emitting a jump table)
           | 
           | This technique is heavily used in various forms by .NET, JVM
           | and JavaScript JIT implementations (other platforms also do
           | that, but these are the major ones)
           | 
           | The first two devirtualize virtual and interface calls
           | (important in Java because all calls default to virtual,
           | important in C# because people like to abuse interfaces and
           | occasionally inheritance, C# delegates are also
            | devirtualized/inlined now). The JS JITs (like V8) perform
            | "inline caching", which is similar: for known object shapes,
            | property access becomes a shape-identifier comparison plus a
            | direct property read instead of a keyed lookup, which is way
            | more expensive.
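            | 
            | A rough C++ analogue of the guarded pattern (my own sketch;
            | a JIT compares the type identity directly, dynamic_cast is
            | only used here to keep it in portable C++):
            | 
            |     #include <cstdio>
            | 
            |     struct animal {
            |         virtual ~animal() = default;
            |         virtual void speak() const { std::puts("..."); }
            |     };
            | 
            |     struct cat final : animal {
            |         void speak() const override { std::puts("meow"); }
            |     };
            | 
            |     void speak_guarded(const animal& a) {
            |         // Guess the hot type; fall back to virtual
            |         // dispatch on a miss.
            |         if (auto* c = dynamic_cast<const cat*>(&a)) {
            |             c->speak();   // direct call, can be inlined
            |         } else {
            |             a.speak();    // ordinary virtual call
            |         }
            |     }
            | 
            |     int main() { cat c; speak_guarded(c); }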
        
       | andrewla wrote:
       | I'm surprised that it has any impact on performance at all, and
       | I'd love to see the codegen differences between the applications.
       | 
       | Mostly the `final` keyword serves as a compile-time assertion.
       | The compiler (sometimes linker) is perfectly capable of seeing
       | that a class has no derived classes, but what `final` assures is
       | that if you attempt to derive from such a class, you will raise a
       | compile-time error.
       | 
       | This is similar to how `inline` works in practice -- rather than
       | providing a useful hint to the compiler (though the compiler is
       | free to treat it that way) it provides an assertion that if you
       | do non-inlinable operations (e.g. non-tail recursion) then the
       | compiler can flag that.
       | 
       | All of this is to say that `final` can speed up runtimes -- but
       | it does so by forcing you to organize your code such that the
       | guarantees apply. By using `final` classes, in places where
       | dynamic dispatch can be reduced to static dispatch, you force the
       | developer to not introduce patterns that would prevent static
       | dispatch.
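        | 
        | A tiny sketch of the assertion aspect (class names are mine):
        | 
        |     struct base {
        |         virtual ~base() = default;
        |         virtual void run() {}
        |     };
        | 
        |     struct leaf final : base {
        |         void run() override {}
        |     };
        | 
        |     // Any later attempt to derive breaks the build instead of
        |     // silently reintroducing dynamic dispatch:
        |     // struct deeper : leaf {};  // error: 'leaf' is marked 'final'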
        
         | bgirard wrote:
         | > The compiler (sometimes linker) is perfectly capable of
         | seeing that a class has no derived classes
         | 
         | How? The compiler doesn't see the full program.
         | 
         | The linker I'm less sure about. If the class isn't guaranteed
         | to be fully private wouldn't an optimizing linker have to be
         | conservative in case you inject a derived class?
        
         | GuB-42 wrote:
         | "inline" is confusing in C++, as it is not really about
         | inlining. Its purpose is to allow multiple definitions of the
         | same function. It is useful when you have a function defined in
         | a header file, because if included in several source files, it
         | will be present in multiple object files, and without "inline"
         | the linker will complain of multiple definitions.
         | 
          | It is also an optimization hint, but AFAIK, modern compilers
          | ignore it.
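          | 
          | A minimal sketch of the multiple-definitions point (the
          | header name is made up):
          | 
          |     // util.h, included from several .cpp files
          |     #pragma once
          | 
          |     // Without `inline`, every translation unit including
          |     // this header would emit its own external definition of
          |     // is_empty() and the linker would reject the duplicates.
          |     // With `inline`, multiple identical definitions are
          |     // allowed and merged into one.
          |     inline bool is_empty(const char* s) {
          |         return s == nullptr || *s == '\0';
          |     }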
        
           | wredue wrote:
           | I believe the wording I've seen is that compilers may not
           | respect the inline keyword, not that it is ignored.
        
           | fweimer wrote:
           | GCC does not ignore inline for inlining purposes:
           | 
           | Need a way to make inlining heuristics ignore whether a
           | function is inline
           | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008
           | 
           | (Bug saw a few updates recently, that's how I remembered.)
           | 
           | As a workaround, if you need the linkage aspect of the inline
           | keyword, you currently have to write fake templates instead.
           | Not great.
        
           | lqr wrote:
           | 10 years ago it was already folklore that compilers ignore
           | the "inline" keyword when optimizing, but that was false for
           | clang/llvm: https://stackoverflow.com/questions/27042935/are-
           | the-inline-...
        
           | jacoblambda wrote:
           | The thing with `inline` as an optimisation is that it's not
           | about optimising by inlining directly. It's a promise about
           | how you intend to use the function.
           | 
           | It's not just "you can have multiple definitions of the same
           | function" but rather a promise that the function doesn't need
           | to be address/pointer equivalent between translation units.
           | This is arguably more important than inlining directly
           | because it means the compiler can fully deduce how the
           | function may be used without any LTO or other cross
           | translation unit optimisation techniques.
           | 
           | Of course you could still technically expose a pointer to the
           | function outside a TU but doing so would be obvious to the
           | compiler and it can fall back to generating a strictly
           | conformant version of the function. Otherwise however it can
           | potentially deduce that some branches in said function are
           | unreachable and eliminate them or otherwise specialise the
           | code for the specific use cases in that TU. So it potentially
           | opens up alternative optimisations even if there's still a
           | function call and it's not inlined directly.
        
         | wheybags wrote:
         | What if I dlopen a shared object that contains a derived class,
          | then instantiate it? You cannot statically verify that I won't.
         | Or you could swap out a normally linked shared object for one
         | that creates a subclass. Etc etc. This kind of stuff is why I
         | think shared object boundaries should be limited to the lowest
         | common denominator (basically c abi). Dynamic linking high
         | level languages was a mistake. The only winning move is not to
         | play.
        
         | lanza wrote:
         | > Mostly the `final` keyword serves as a compile-time
         | assertion. The compiler (sometimes linker) is perfectly capable
         | of seeing that a class has no derived classes
         | 
         | That's incorrect. The optimizer has to assume everything
         | escapes the current optimization unit unless explicitly told
         | otherwise. It needs explicit guarantees about the visibility to
         | figure out the extent of the derivations allowed.
        
       | bluGill wrote:
        | I use final more for communication: don't look for deeper derived
        | classes, as there are none. That it results in slower code is an
        | annoying surprise.
        
       | p0w3n3d wrote:
        | I would say the biggest performance impact would come from
        | `constexpr`, followed by `const`. I wouldn't bet any money on
        | `final`, which in C++ is a guard against inheritance; a C++
        | virtual function's invocation address is resolved through the
        | `vtable`, hence `final` wouldn't change anything. Maybe the
        | author was thinking of the `final` keyword in Java.
        
         | adrianN wrote:
         | In my experience the compiler is pretty good at figuring out
         | what is constant so adding const is more documentation for
         | humans, especially in C++, where const is more of a hint than a
         | hard boundary. Devirtualization, as can happen when you add a
         | final, or the optimizations enabled by adding a restrict to a
         | pointer, are on the other hand often essential for performance
         | in hot code.
        
           | bayindirh wrote:
           | Since "const" makes things read-only, being const correct
           | makes sure that you don't do funny things with the data you
           | shouldn't mutate, which in turn eliminates tons of data bugs
           | out of the gate.
           | 
           | So, it's an opt-in security feature first, and a compiler
           | hint second.
        
       | ein0p wrote:
       | You should use final to express design intent. In fact I'd rather
       | it were the default in C++, and there was some sort of an
       | opposite ('derivable'?) keyword instead, but that ship has sailed
       | long time ago. Any measurable negative perf impact should be
       | filed as a bug and fixed.
        
         | cesarb wrote:
         | > In fact I'd rather it were the default in C++, and there was
         | some sort of an opposite ('derivable'?) keyword instead
         | 
         | Kotlin (which uses the equivalent of the Java "final" keyword
         | by default) uses the "open" keyword for that purpose.
        
         | josefx wrote:
         | Intent is nice and all that, but I would like a
         | "nonwithstanding" keyword instead that just lets me bypass that
         | kind of "intent" without having to copy paste the entire
         | implementation just to remove a pointless keyword or make a
         | destructor public when I need it.
        
         | jbverschoor wrote:
         | In general, I think things should be strict by default. Way
         | easier to optimize and less error prone.
        
         | leni536 wrote:
          | C++ doesn't have the fragile base class problem, as members
          | aren't virtual by default. The only concern with unintended
          | inheritance is with polymorphic deletion. "final" on a class
          | definition disables some tricks that you can do with private
          | inheritance.
          | 
          | Having said that, "final" on member functions is great, and I
          | like to see that instead of "override".
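          | 
          | Something like this, roughly (my own sketch):
          | 
          |     struct animal {
          |         virtual ~animal() = default;
          |         virtual void speak() const;
          |     };
          | 
          |     struct cat : animal {
          |         // `final` is only allowed on a virtual function, so
          |         // it catches the same signature mistakes `override`
          |         // catches, and it additionally promises that no class
          |         // derived from cat overrides speak() again, which is
          |         // what lets a call through a cat& be devirtualized.
          |         void speak() const final;
          |     };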
        
       | jey wrote:
       | I wonder if LTO was turned on when using Clang? Might lead to a
       | performance improvement.
        
       | pineapple_sauce wrote:
       | What should be evaluated is removing indirection and tightly
        | packing your data. I'm sure you'll gain a bigger performance
        | improvement. Virtual calls and shared_ptr are littered
        | throughout the codebase.
       | 
       | In this way: you can avoid the need for the `final` keyword and
       | do the optimization the keyword enables (de-virtualize calls).
       | 
       | >Yes, it is very hacky and I am disgusted by this myself. I would
       | never do this in an actual product
       | 
       | Why? What's with the C++ community and their disgust for macros
       | without any underlying reasoning? It reminds me of everyone
       | blindly saying "Don't use goto; it creates spaghetti code".
       | 
        | Sure, if macros are overused, the result can be hard to read and
        | maintain. But for something simple like this, you shouldn't be
       | thinking "I would never do this in an actual product".
        
         | sfink wrote:
         | Macros that are giving you some value can be ok. In this case,
         | once the performance conclusion is reached, the only reason to
         | continue using a macro is if you really need the `final`ity to
         | vary between builds. Otherwise, just delete it or use the
         | actual keyword.
         | 
         | (But I'm worse than the author; if I'm just comparing
         | performance, I'd probably put `final` everywhere applicable and
         | then do separate compiles with `-Dfinal=` and
         | `-Dfinal=final`... I'd be making the assumption that it's
         | something I either always or never want eventually, though.)
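          | 
          | (To spell out what I mean, a rough sketch; the struct names
          | are made up and the flags go in comments:)
          | 
          |     // Build A: g++ -Dfinal= bench.cpp       (final stripped)
          |     // Build B: g++ -Dfinal=final bench.cpp  (final kept)
          |     // This relies on `final` being an ordinary identifier
          |     // rather than a reserved keyword, so the preprocessor is
          |     // allowed to redefine it.
          |     struct base {
          |         virtual ~base() = default;
          |         virtual void step() {}
          |     };
          | 
          |     struct impl final : base {
          |         void step() override final {}
          |     };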
        
         | bluGill wrote:
          | Macros in C are a textual replacement, so it is hard to see
          | from a debugger how the code got like that.
        
           | pineapple_sauce wrote:
           | Yes, I'm well aware of the definition of a macro in C and
           | C++. Macros are simpler than templates. You can expand them
           | with a compiler flag.
        
             | bluGill wrote:
              | when things get complex, template error messages are
              | easier to follow. Nobody makes complex macros, but if you
              | tried, you'd see why. (Template error messages are
              | legendary for a reason; nested macros are worse.)
        
         | jandrewrogers wrote:
          | In modern C++, macros are viewed as a code smell because they
         | are strictly worse than alternatives in almost all situations.
         | It is a cultural norm; it is a bit like using "unsafe" in Rust
         | if not strictly required for some trivial case. The C++
         | language has made a concerted effort to eliminate virtually all
         | use cases for macros since C++11 and replace them with type-
          | safe first-class features in the language. It is a bit of a
          | legacy thing at this point; there are large modern C++
         | codebases with no macros at all, not even for things like
         | logging. While macros aren't going away, especially in older
         | code, the cultural norm in modern C++ has tended toward macros
         | being a legacy foot-gun and best avoided if at all possible.
         | 
         | The main remaining use case for the old C macro facility I
         | still see in new code is to support conditional compilation of
         | architecture-specific code e.g. ARM vs x86 assembly routines or
         | intrinsics.
        
           | sgerenser wrote:
           | But how would one conditionally enable or disable the "final"
           | keyword on class members without a preprocessor macro, even
           | in C++23?
        
       | gpderetta wrote:
        | 1% is nothing to scoff at. But I suspect that the variability of
        | compilation (specifically quirks of instruction selection,
        | register allocation and function alignment) more than masks any
        | gains.
       | 
        | The clang regression might be explainable by final allowing some
        | additional inlining and clang making a hash of it.
        
       | jcalvinowens wrote:
       | That's interesting. Maybe final enabled more inlining, and clang
       | is being too aggressive about it for the icache sizes in play
       | here? I'd love to see a comparison of the generated code.
       | 
       | I'm disappointed the author's conclusion is "don't use final",
       | not "something is wrong with clang".
        
         | ot wrote:
         | Or "something is wrong with my benchmark setup", which is also
         | a possibility :)
         | 
         | Without a comparison of generated code, it could be anything.
        
       | indigoabstract wrote:
       | If it does have a noticeable impact, that would be surprising, a
       | bit like going back to the days when 'inline' was supposed to
       | tell the compiler to inline the designated functions (no longer
       | its main use case nowadays).
        
       | sfink wrote:
       | tldr: sprinkled a keyword around in the hopes that it "does
       | something" to speed things up, tested it, got noisy results but
       | no miraculous speedup.
       | 
       | I started skimming this article after a while, because it seemed
       | to be going into the weeds of performance comparison without ever
       | backing up to look at what the change might be doing. Which meant
       | that I couldn't tell if I was going to be looking at the usual
       | random noise of performance testing or something real.
       | 
        | For `final`, I'd want to at least see whether it changes the
        | generated code by replacing indirect vtable calls with direct or
        | inlined calls. It might be that the compiler is already figuring
        | it out and the keyword isn't doing anything. It might be that the
        | compiler _is_ changing code, but the target address was already
        | well-predicted and it's perturbing code layout enough that it
       | gets slower (or faster). There could be something interesting
       | here, but I can't tell without at least a little assembly output
       | (or perhaps a relevant portion of some intermediate
       | representation, not that I would know which one to look at).
       | 
       | If it's not changing anything, then perhaps there could be an
       | interesting investigation into the variance of performance
       | testing in this scenario. If it's changing something, then there
       | could be an interesting investigation into when that makes things
       | faster vs slower. As it is, I can't tell what I should be looking
       | for.
        
         | sgerenser wrote:
         | This is what I was waiting for too. Especially with the large
         | regression on Clang/Ubuntu. Maybe he uncovered a Clang/LLVM
         | codegen bug, but you'd need to compare the generated assembly
         | to know.
        
       | jeffbee wrote:
       | It's difficult to discuss this stuff because the impact can be
       | negligible or negative for one person, but large and consistently
       | positive for another. You can only usefully discuss it on a given
       | baseline, and for something like final I would hope that baseline
       | would be a project that already enjoys PGO, LTO, and BOLT.
        
       | tombert wrote:
       | I don't do much C++, but I have definitely found that engineers
       | will just assert that something is "faster" without any evidence
       | to back that up.
       | 
       | Quick example, I got in an argument with someone a few years ago
       | that claimed in C# that a `switch` was better than an `if(x==1)
       | elseif(x==2)...` because switch was "faster" and rejected my PR.
       | I mentioned that that doesn't appear to be true, we went back and
       | forth until I did a compile-then-decompile of a minimal test with
       | equality-based-ifs, and showed that the compiler actually
       | converts equality-based-ifs to `switch` behind the scenes. The
       | guy accepted my PR after that.
       | 
       | But there's tons of this stuff like this in CS, and I kind of
       | blame professors for a lot of it [1]. A large part of becoming a
       | decent engineer [2] for me was learning to stop trusting what
       | professors taught me in college. Most of what they said was fine,
       | but you can't _assume_ that; what they tell you could be out of
       | date, or simply never correct to begin with, and as far as I can
       | tell you have to _always_ test these things.
       | 
       | It doesn't help that a lot of these "it's faster" arguments are
       | often reductive because they only are faster in extremely minimal
       | tests. Sometimes a microbenchmark will show that something is
       | faster, and there's value in that, but I think it's important
       | that that can also be a small percentage of the total program;
       | compilers are obscenely good at optimizing nowadays, it can be
       | difficult to determine _when_ something will be optimized, and
       | your assertion that something is  "faster" might not actually be
       | true in a non-trivial program.
       | 
       | This is why I don't really like doing any kind of major
       | optimizations before the program actually works. I try to keep
       | the program in a reasonable Big-O and I try and minimize network
       | calls cuz of latency, but I don't bother with any kind of micro-
       | optimizations in the first draft. I don't mess with bitwise, I
       | don't concern myself on which version of a particular data
       | structure is a millisecond faster, I don't focus too much on
       | whether I can get away with a smaller sized float, etc. Once I
       | know that the program is correct, _then_ I benchmark to see if
       | any kind of micro-optimizations will actually matter, and often
        | they really don't.
       | 
       | [1] That includes me up to about a year ago.
       | 
       | [2] At least I like to pretend I am.
        
         | BurningFrog wrote:
          | Even if one of these constructs is faster, _it doesn't matter_
         | 99% of the time.
         | 
         | Writing well structured readable code is typically far more
         | important than making it twice as fast. And those times can
         | rarely be predicted beforehand, so you should mostly not worry
         | about it until you see real performance problems.
        
           | tombert wrote:
           | I mostly focus on "using stuff that won't break", and yeah
           | "if it actually matters".
           | 
           | For example, much to the annoyance of a lot of people, I
           | don't typically use floating point numbers when I start out.
           | I will use the "decimal" or "money" types of the language, or
           | GMP if I'm using C. When I do that, I can be sure that I
           | won't have to worry about any kind of funky overflow issues
           | or bizarre rounding problems. There _might_ be a performance
           | overhead associated with it, but then I have to ask myself
           | "how often is this actually called?"
           | 
           | If the answer is "a billion times" or "once in every
           | iteration of the event loop" or something, then I will
           | probably eventually go back and figure out if I can use a
           | float or convert it to an integer-based thing, but in a lot
           | of cases the answer is "like ten or twenty times", and at
           | that point I'm not even 100% sure it would be even measurable
           | to change to the "faster" implementations.
           | 
           | What annoys me is that people will act like they really care
           | about speed, do all these annoying micro-optimizations, and
           | then forget that pretty much all of them get wiped out
           | immediately upon hitting the network, since the latency
           | associated with that is obscene.
        
           | apantel wrote:
           | The counter-argument to this is if you are building something
           | that is in the critical path of an application (for example,
           | parsing HTTP in a web server), you need to be performance-
            | minded from the beginning, because early design decisions
            | lead to later design decisions. If you are building
            | something in the
           | critical path of the application, the best thing to do is
           | build it from the ground up measuring the performance of what
           | you have as you go. This way, each time you add something you
           | will see the performance impact and usually there's a more
           | performant way of doing something that isn't more obscure. If
           | you do this as you build, early choices become constraints,
           | but because you chose the most performant thing at every
           | stage, the whole process takes you in the direction of a
           | highly-performant implementation.
           | 
           | Why should you care about performance?
           | 
           | I can give you my personal experience: I've been working on a
           | Java web/application server for the past 15 years and a
           | typical request (only reading, not writing to the db) would
           | take maybe 4-5 ms to execute. That includes HTTP request
           | parsing, JSON parsing, session validation, method execution,
           | JSON serialization, and HTTP response dispatch. Over the past
           | 9 months I have refactored the entire application for
           | performance and a typical request now takes about 0.25 ms or
           | 250 microseconds. The computer is doing so much less work to
           | accomplish the same tasks, it's almost silly how much work it
           | was doing before. And the result is the machine can handle
           | 20x more requests in the same amount of time. If it could
           | handle 200 requests per second per core before, now it can
           | handle 4000. That means the need to scale is felt 20x less
           | intensely, which means less complexity around scaling.
           | 
           | High performance means reduced scaling requirements.
        
             | tombert wrote:
             | But even that sort of depends right? Hardware is often
              | pretty cheap in comparison to dev-time. It really depends on
             | the project, what kind of servers you're using, the nature
             | of the application etc, but I think a lot of the time it
             | might be cheaper to just pay for 20x the servers than it
             | would be to pay a human to go find a critical path.
             | 
             | I'm not saying you completely throw caution to the wind,
             | I'm just saying that there's a finite amount of human
             | resources and it can really vary how you want to allocate
             | them. Sometimes the better path is to just throw money at
             | the problem.
             | 
             | It really depends.
        
               | apantel wrote:
               | I think it depends on what you're building and who's
               | building it. We're all benefitting from the fact that the
               | designers of NGINX made performance a priority. We like
               | using things that were designed to be performant. We like
               | high-FPS games. We like fast internet.
               | 
               | I personally don't like the idea of throwing compute at a
               | slow solution. I like when the extra effort has been put
               | into something. The good feeling I get from interacting
               | with something that is optimal or excellent is an end in
               | itself and one of the things I live for.
        
               | tombert wrote:
               | Sure, though I've mentioned a few times in this thread
               | now that the thing that bothers me more than CPU
               | optimizations is not taking into account latency,
               | particularly when hitting the network, and I think
               | focusing on that will generally pay higher dividends than
               | trying to optimize for processing.
               | 
               | CPUs are ridiculously fast now, and compilers are really
               | really good now too. I'm not going to say that processing
               | speed is a "solved" problem, but I am going to say that
               | in a lot of performance-related cases the CPU processing
               | is probably not your problem. I will admit that this kind
               | of pokes holes in my previous response, because
               | introducing more machines into the mix will almost
               | certainly increase latency, but I think it more or less
               | holds depending on context.
               | 
               | But I think it really is a matter of nuance, which you
               | hinted at. If I'm making an admin screen that's going to
               | have like a dozen users max, then a slow, crappy solution
               | is probably fine; the requests will be served fast enough
               | to where no one will notice anyway, and you can probably
               | even get away with the cheapest machine/VM. If I'm making
               | an FPS game that has 100,000 concurrent users, then it
               | almost certainly will be beneficial to squeeze out as
               | much performance out of the machine as possible, both CPU
               | _and_ latency-wise.
               | 
               | But as I keep repeating everywhere, you have to measure.
               | You cannot assume that your intuition is going to be
               | right, particularly at-scale.
        
               | apantel wrote:
               | I absolutely agree that latency is the real thing to
               | optimize for. In my case, I only leave the application to
               | access the db, and my applications tend not to be write-
               | heavy. So in my case latency-per-request == how much work
               | the computer has to do, which is constrained to one core
               | because the overhead of parallelizing any part of the
               | pipeline is greater than the work required. See, in that
               | sense, we're already close to the performance ceiling for
               | per-request processing because clock speeds aren't going
               | up. You can't make the processing of a given request
               | faster by throwing more hardware at it. You can only make
               | it faster by creating less work for the hardware to do.
               | 
               | (Ironically, HN is buckling under load right now, or some
               | other issue.)
        
               | oivey wrote:
               | It almost certainly would require more than 20x servers
               | because setting up horizontal scaling will have some sort
               | of overhead. Not only that, there is the significant
               | engineering effort to develop and maintain the code to
               | scale.
               | 
               | If your problem can fit on one server, it can massively
               | reduce engineering and infrastructure costs.
        
             | neonsunset wrote:
             | Please accept a high five from a fellow "it does so little
             | work it must have sub-millisecond request latency"
             | aficionado (though I must admit I'm guilty of abusing
             | memory caches to achieve this).
        
               | apantel wrote:
               | Caches, precomputed values, lookup tables -- it's all
               | good as long as it's well-organized and maintainable.
        
           | neonsunset wrote:
           | This attitude is part of the problem. Another part of the
           | problem is having no idea which things actually end up
           | costing performance and how much.
           | 
           | It is why many language ecosystems suffered from performance
           | issues for a really long time even if completely unwarranted.
           | 
           | Is changing ifs to switch or vice versa, as outlined in the
           | post above, a waste of time? Yes, unless you are writing some
           | encoding algorithm or a parser, it will not matter. The
           | compiler will lower trivial statements to the same codegen
           | and it will not impact the resulting performance anyway even
           | if there was difference given a problem the code was solving.
           | 
            | However, there are things that _do_ cost, like interface
            | spam, abusing lambdas to write needlessly complex
            | workflow-style patterns (which are also less readable and
            | worse in 8 out of 10 instances), not caching objects that
            | always have the same value, etc.
           | 
            | These kinds of issues, for example, plagued the .NET
            | ecosystem until the more recent culture shift, where it
            | started to be cool once again to focus on performance. It
            | wasn't helped by the notion of "well-structured code" being
            | just idiotic "clean architecture" and "GoF patterns" style
            | dogma applied to the smallest applications and simplest of
            | business domains.
           | 
            | (it is also the reason why picking slow languages in general
            | is a really bad idea - _everything_ costs more and you have
            | way less leeway, for no productivity win - Ruby, Python, and
            | JS with Node.js are less productive to write in than C#/F#,
            | Kotlin/Java, or Go (under some conditions))
        
             | tombert wrote:
             | I mean, that's kind of why I tried to emphasize measuring
             | things yourself instead of depending on tribal knowledge.
             | 
             | There are plenty of cases where even the "slow"
             | implementation is more than fast enough, and there are also
             | plenty of cases where the "correct" solution (from a big-O
             | or intuition perspective) is actually slower than the dumb
              | case. Intuition _helps_, but you _have_ to measure and/or
             | look at the compiled results if you want to ensure correct
             | numbers.
             | 
             | An example that really annoys me is how every whiteboard
             | interview ends up being "interesting ways to use a
             | hashmap", which isn't inherently an issue, but they will
             | usually be so small-scoped that an iterative "array of
             | pairs" might actually be cheaper than paying the up-front
             | cost of hashing and potentially dealing with collisions.
             | Interviews almost always ignore constant factors, and
             | that's fair enough, but in reality constant factors _can_
              | matter, and we're training future employees to ignore
             | that.
             | 
             | I'll say it again: as far as I can tell, you _have_ to
             | measure if you want to know if your result is  "faster".
             | "Measuring" might involve memory profilers, or dumb timers,
             | or a mixture of both. Gut instincts are often wrong.
        
         | leetcrew wrote:
         | agreed, especially in cases like this. final is primarily a way
         | to prohibit overriding methods and extending classes, and it
         | indicates to the reader that they should not be doing this. use
         | it when it makes conceptual sense.
         | 
         | that said, c++ is usually a language you use when you care
         | about performance, at least to an extent. it's worth
         | understanding features like nrvo and rewriting functions to
         | allow the compiler to pick the optimization if it doesn't hurt
         | readability too much.
        
         | wvenable wrote:
         | In my opinion, the only things that really matter are
         | algorithmic complexity and readability. And even algorithmic
          | complexity is usually only an issue at certain scales. Whether
         | or not an 'if' is faster than a 'switch' is the micro of micro
         | optimizations -- you better have a good reason to care. The
         | question I would have for you is was your bunch of ifs more
         | readable than a switch would be.
        
           | doctor_phil wrote:
           | But a switch and an if-else *is* a matter of algorithmic
           | complexity. (Well, at least could be for a naive compiler). A
           | switch could be converted to a constant time jump, but the
           | if-else would be trying each case linearly.
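            | 
            | For example (a sketch in C++ rather than C#, but the
            | lowering question is the same):
            | 
            |     // Two spellings of the same dispatch. An optimizer is
            |     // free to lower either one to a jump table or to a
            |     // chain of compares; a naive compiler would translate
            |     // them literally.
            |     int price_if(int x) {
            |         if (x == 1) return 10;
            |         else if (x == 2) return 20;
            |         else if (x == 3) return 30;
            |         else return 0;
            |     }
            | 
            |     int price_switch(int x) {
            |         switch (x) {
            |             case 1: return 10;
            |             case 2: return 20;
            |             case 3: return 30;
            |             default: return 0;
            |         }
            |     }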
        
             | cogman10 wrote:
             | Yup.
             | 
             | That said, the linear test is often faster due to CPU
             | caches, which is why JITs will often convert switches to
             | if/elses.
             | 
             | IMO, switch is clearer in general and potentially faster
             | (at very least the same speed) so it should be preferred
             | when dealing with 3+ if/elseif statements.
        
               | tombert wrote:
               | Hard disagree that it's "clearer". I have had to deal
               | with a ton of bugs with people trying to be clever with
               | the `break` logic, or forgetting to put `break` in there
               | at all.
               | 
               | if statements are dumber, and maybe arguably uglier, but
               | I feel like they're also more clear, and people don't try
               | and be clever with them.
        
               | cogman10 wrote:
               | Updates to languages (don't know where C# is on this)
               | have different types of switch statements that eliminate
               | the `break` problem.
               | 
                | For example, with Java there's the enhanced switch that
                | looks like this:
                | 
                |     var val = switch (foo) {
                |         case 1, 2, 3 -> bar;
                |         case 4 -> baz;
                |         default -> {
                |             yield bat();
                |         }
                |     };
               | 
               | The C style switch break stuff is definitely a language
               | mistake.
        
               | wvenable wrote:
               | C# has both switch expressions like this and also break
               | statements are not optional in traditional switch
               | statements so it actually solves both problems. You can't
               | get too clever with switch statements in C#.
               | 
               | However most languages have pretty permissive switch
               | statements just like C.
        
               | tombert wrote:
                | Yeah, fair, it's been a while since I've done any C#, so
                | my memory is a bit hazy on the details. I've been burned
                | by C switch statements, so I have a pretty strong
                | distaste for them.
        
               | neonsunset wrote:
               | C# has switch statements which are C/C++ style switches
               | and switch expressions which are like Rust's match except
                | no control flow statements inside:
                | 
                |     var len = slice switch
                |     {
                |         null => 0,
                |         "Hello" or "World" => 1,
                |         ['@', ..var tags] => tags.Length,
                |         ['{', ..var body, '}'] => body.Length,
                |         _ => slice.Length,
                |     };
               | 
               | (it supports a lot more patterns but that wouldn't fit)
        
               | gloryjulio wrote:
                | This is just forcing a return value. You either have to
                | break or return at the branches. To me they all look
                | equivalent.
        
               | neonsunset wrote:
               | Any sufficiently advanced compiler will rewrite those
               | arbitrarily depending on its heuristics. What authors
               | usually forget is that there is defined behavior and
               | specification which the compiler abides by, but it is
               | otherwise free to produce any codegen that preserves the
               | defined program order. Branch reordering, generating jump
               | tables, optimizing away or coalescing checks into
                | branchless forms are all very common. When someone says
                | "oh I write C because it lets you tell the CPU how
                | exactly to execute the code", it is simply a sign that
                | the person has never actually looked at disassembly and
                | has little to no idea how the tool they use works.
        
               | cogman10 wrote:
                | A compiler will definitely try this, but it's important
               | to note that if/else blocks tell the compiler that "you
               | will run these evaluations in order". Now, if the
               | compiler can detect that the evaluations have no side
               | effects (which, in this simple example with just integer
               | checks, is fairly likely) then yeah I can see a jump
               | table getting shoved in as an optimization.
               | 
               | However, the moment you add a side effect or something
               | more complicated like a method call, it becomes really
                | hard for the compiler to know if that sort of
               | optimization is safe to do.
               | 
               | The benefit of the switch statement is that it's already
               | well positioned for the compiler to optimize as it does
               | not have the "you must run these evaluations in order"
               | requirement. It forces you to write code that is fairly
               | compiler friendly.
               | 
               | All that said, probably a waste of time debating :D.
               | Ideally you have profiled your code and the profiler has
               | told you "this is the slow block" before you get to the
               | point of worrying about how to make it faster.
        
               | tombert wrote:
               | I agree with what you said but in this particular case,
               | it actually was a direct integer equality check, there
               | was zero risk of hitting side effects and that was
               | plainly obvious to me, the checker, and compiler.
        
               | cogman10 wrote:
               | And to your original comment, I think the reviewer was
               | wrong to reject the PR over that. Performance has to be
               | measured before you can use it to reject (or create...) a
               | PR. If someone hasn't done that then unless it's
               | something obvious like "You are making a ton of tiny heap
               | allocations in a tight loop" then I think nitpicking
               | these sorts of things is just wrong.
        
             | saurik wrote:
             | While I personally find the if statements harder to
             | immediately mentally parse/grok--as I have to prove to
             | myself that they are all using the same variable and are
             | all chained correctly in a way that is visually obvious for
             | the switch statement--I don't find "but what if we use a
             | naive compiler" at all a useful argument to make as, well,
             | we aren't using a naive compiler, and, if we were, there
             | are a ton of other things we are going to be sad about the
             | performance of leading us down a path of re-implementing a
             | number of other optimizations. The goal of the compiler is
             | to shift computational complexity from runtime to compile
             | time, and figuring out whether the switch table or the
             | comparisons are the right approach seems like a legitimate
             | use case (which maybe we have to sometimes disable, but
             | probably only very rarely).
        
             | bregma wrote:
             | But what if, and stick with me here, a compiler is capable
             | of reading and processing your code and through simple
             | scalar evolution of the conditionals and phi-reduction, it
             | can't tell the difference between a switch statement and a
             | sequence of if statements by the time it finishes its
             | single static analysis phase?
             | 
             | It turns out the algorithmic complexity of a switch
             | statement and the equivalent series of if-statements is
             | identical. The bijective mapping between them is close to
             | the identity function. Does a naive compiler exist that
             | doesn't emit the same instructions for both, at least
             | outside of toy hobby project compilers written by amateurs
             | with no experience?
        
           | tombert wrote:
           | Yeah, and it's not like I didn't know how to do the stuff I
           | was doing with a switch, I just don't like switches because
           | I've forgotten to add break statements and had code that
           | appeared correct but broke a month down the line. I've
           | also seen other people make the same mistakes. ifs, in my
           | opinion at least, are a bit harder to screw up, so I will
           | always prefer them.
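           | 
           | A minimal illustration of that footgun (hypothetical names):
           | 
           |     #include <cstdio>
           | 
           |     enum Status { CONNECTING, CONNECTED, FAILED };
           | 
           |     void handle(Status s) {
           |         switch (s) {
           |             case CONNECTING:
           |                 std::puts("starting handshake");
           |                 // missing `break;`: control falls through and
           |                 // also runs the CONNECTED branch
           |             case CONNECTED:
           |                 std::puts("sending payload");
           |                 break;
           |             default:
           |                 std::puts("error");
           |         }
           |     }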
           | 
           | But I agree, algorithmic complexity is generally the only
           | thing I focus on, and even then it's almost always a case of
           | "will that actually matter?" If I know that `n` is never
           | going to be more than like `10`, I might not bother trying to
           | optimize an O(n^2) operation.
           | 
           | What I feel often gets ignored in these conversations is
           | latency; people obsess over some "optimization" they learned
           | in college a decade ago, and ignore the 200 HTTP or Redis
           | calls being made ten lines below, despite the fact that the
           | latter will have a substantially higher impact on
           | performance.
        
         | saghm wrote:
         | > But there's tons of this stuff like this in CS
         | 
         | Reminds me of the classic
         | https://stackoverflow.com/questions/24848359/which-is-faster...
        
           | sgerenser wrote:
           | Never saw that before, that is indeed a classic.
        
         | jollyllama wrote:
         | I've encountered similar situations before. It's insane to me
         | when people hold up PRs over that kind of thing.
        
         | dosshell wrote:
         | > I can get away with a smaller sized float
         | 
         | When talking about not assuming optimizations...
         | 
         | 32bit float is slower than 64bit float on reasonably modern
         | x86-64.
         | 
         | The reason is that 32bit float is emulated by using 64bit.
         | 
         | Of course if you have several floats you need to optimize
         | against cache.
        
           | tombert wrote:
           | Sure, I clarified this in a sibling comment, but I kind of
           | meant that I will use the slower "money" or "decimal" types
           | by default. Usually those are more accurate and less error-
           | prone, and then if it actually matters I might go back to a
           | floating point or integer-based solution.
        
           | sgerenser wrote:
           | I think this is only true if using x87 floating point, which
           | anything computationally intensive is generally avoiding
           | these days in favor of SSE/AVX floats. In the latter case,
           | for a given vector width, the cpu can process twice as many
           | 32 bit floats as 64 bit floats per clock cycle.
        
             | dosshell wrote:
             | Yes, as I wrote, it is only true for one float value.
             | 
             | SIMD/MIMD will benefit from working on smaller widths. This
             | is not only because they do more work per clock but because
             | memory is slow. Super slow compared to the CPU.
             | Optimization is a lot about cache-miss optimization.
             | 
             | (But remember that the cache line is 64 bytes, so reading a
             | single value smaller than that will take the same time. So
             | it does not matter in theory when comparing one f32 against
             | one f64)
        
           | jcranmer wrote:
           | Um... no. This is 100% completely and totally wrong.
           | 
           | x86-64 requires the hardware to support SSE2, which has
           | native single-precision and double-precision instructions for
           | floating-point (e.g., scalar multiply is MULSS and MULSD,
           | respectively). Both the single precision and the double
           | precision instructions will take the same time, except for
           | DIVSS/DIVSD, where the 32-bit float version is slightly
           | faster (about 2 cycles latency faster, and reciprocal
           | throughput of 3 versus 5 per Agner's tables).
           | 
           | You might be thinking of x87 floating-point units, where all
           | arithmetic is done internally using 80-bit floating-point
           | types. But all x86 chips in like the last 20 years have had
           | SSE units--which are faster anyways. Even in the days when it
           | was the major floating-point units, it wasn't any slower,
           | since all floating-point operations took the same time
           | independent of format. It might be slower if you insisted
           | that code compilation strictly follow IEEE 754 rules, but the
           | solution everybody did was to _not_ do that and that's why
           | things like Java's strictfp or C's FLT_EVAL_METHOD were born.
           | Even in _that_ case, however, 32-bit floats would likely be
           | faster than 64-bit for the simple fact that 32-bit floats can
           | safely be emulated in 80-bit without fear of double rounding
           | but 64-bit floats cannot.
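           | 
           | If anyone wants to check this on their own hardware, a rough
           | dependent-add sketch (compile with -O2; the iteration count and
           | constants are arbitrary, and this measures scalar add latency,
           | not throughput):
           | 
           |     #include <chrono>
           |     #include <cstdio>
           | 
           |     template <typename T>
           |     double time_adds() {
           |         volatile T seed = 0;      // block constant folding
           |         T acc = seed;
           |         auto t0 = std::chrono::steady_clock::now();
           |         for (long i = 0; i < 100000000; ++i)
           |             acc += T(1.000001);   // dependent chain of adds
           |         auto t1 = std::chrono::steady_clock::now();
           |         std::printf("  (result %g)\n", double(acc));
           |         return std::chrono::duration<double>(t1 - t0).count();
           |     }
           | 
           |     int main() {
           |         std::printf("float : %.3f s\n", time_adds<float>());
           |         std::printf("double: %.3f s\n", time_adds<double>());
           |     }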
        
             | dosshell wrote:
             | I agree with you. It should take the same time when
             | thinking more about it. I remember learning this in ~2016
             | and I did a performance test on Skylake which confirmed it
             | (Windows, VS2015). I think I remember that I only tested
             | with addsd/addss. Definitely not x87. But as always, if the
             | result cannot be reproduced... I stand corrected until
             | then.
        
         | jandrewrogers wrote:
         | A significant part of it is that what engineers believe was
         | effectively true at one time. They simply haven't revisited
         | those beliefs or verified their relevance in a long time. It
         | isn't a terrible heuristic for life in general to assume that
         | what worked ten years ago will work today. But the equilibria
         | in system performance shift so rapidly with changes in hardware
         | and software environments
         | that you need to make a continuous habit of checking that your
         | understanding of how the world works maps to reality.
         | 
         | I've solved a lot of arguments with godbolt and simple
         | performance tests. Some topics are recurring themes among
         | software engineers e.g.:
         | 
         | - compilers are almost always better at micro-optimizations
         | than you are
         | 
         | - disk I/O is almost never a bottleneck in competent designs
         | 
         | - brute-force sequential scans are often optimal algorithms
         | 
         | - memory is best treated as a block device
         | 
         | - vectorization can offer large performance gains
         | 
         | - etc...
         | 
         | No one is immune to this. I am sometimes surprised at the
         | extent to which assumptions are no longer true when I revisit
         | optimization work I did 10+ years ago.
         | 
         | Most performance these days is architectural, so getting the
         | initial design right often has a bigger impact than micro-
         | optimizations and localized Big-O tweaks. You can always go
         | back and tweak algorithms or codegen later but architecture is
         | permanent.
        
           | neonsunset wrote:
           | .NET is a particularly bad case for this because there was a
           | decade of few performance improvements, which caused a
           | certain intuition to develop within the industry, then 6-8
           | years of significant changes each year (with most wins
           | compressed to the last 4 years or so). Companies moving from
           | .NET Framework 4.6/7/8 to .NET 8 experience a 10x _average_
           | performance improvement, which naturally comes with rendering
           | a lot of performance know-how obsolete overnight.
           | 
           | (the techniques that used to work were similar to those for
           | earlier Java versions and, with some exceptions, very dynamic
           | languages in general; the techniques that still work and are
           | now required are the same as in C++ or Rust)
        
           | tombert wrote:
           | Yep, completely agree with you on this. Intuition is often
           | wrong, or at least outdated.
           | 
           | When I'm building stuff I try my best to focus on
           | "correctness", and try to come up with an algorithm/design
           | that will encompass all realistic use cases. If I focus on
           | that, it's relatively easy to go back and convert my
           | `decimal` type to a float64, or even convert an if statement
           | into a switch if it's actually faster.
        
         | klyrs wrote:
         | > A large part of becoming a decent engineer [2] for me was
         | learning to stop trusting what professors taught me in college
         | 
         | When I was taught about performance, it was all about
         | benchmarking and profiling. I never needed to trust what my
         | professors taught, because they taught me to dig in and find
         | the truth for myself. This was taught alongside the big-O
         | stuff, with several examples where "fast" algorithms are slower
         | on small inputs.
        
           | TylerE wrote:
           | How do you even get meaningful profiling out of most modern
           | langs? It seems the vast majority of time and calls gets
           | spent inside tiny anonymous functions, GC allocations, and
           | stuff like that.
        
             | klyrs wrote:
             | I don't use most modern langs! And especially if I'm doing
             | work where performance is critical, I won't kneecap myself
             | by using a language that I can't reasonably profile.
        
             | neonsunset wrote:
             | This is easy in most modern programming languages.
             | 
             | The JVM ecosystem has the IntelliJ IDEA profiler and similar
             | advanced tools (AFAIK).
             | 
             | .NET has VS/Rider/dotnet-trace profilers (they are very
             | detailed) to produce flamegraphs.
             | 
             | Then there are native profilers which can work with any AOT
             | compiled language that produces canonically symbolicated
             | binaries: Rust, C#/F#(AOT mode), Go, Swift, C++, etc.
             | 
             | For example, you can do `samply record ./some_binary`[0]
             | and then explore multi-threaded flamegraph once completed
             | (I use it to profile C#, it's more convenient than dotTrace
             | for preliminary perf work and is usually more than
             | sufficient).
             | 
             | [0] https://github.com/mstange/samply
        
         | trueismywork wrote:
         | There's not yet a culture of writing reproducible benchmarks to
         | gauge these effects.
        
         | zmj wrote:
         | .NET is a little smarter about switch code generation these
         | days: https://github.com/dotnet/roslyn/pull/66081
        
       | JackYoustra wrote:
       | I really wish he'd listed all the flags he used. To add on to the
       | flags already listed by some other commenters, `-mcpu` and
       | related flags are really crucial in these microbenchmarks: over
       | such a small change and such a small set of tight loops, you
       | could just be seeing regressions from coincidences in the
       | microarchitectural scheduler rather than from higher-level effects.
        
         | j_not_j wrote:
         | And he didn't repeat each test case 5 or 9 times, and take the
         | median (or even an average).
         | 
         | There will be operating system noise that can be in the multi-
         | percent range: various OS services that run "in the
         | background" take up CPU time, evict cache lines (which may
         | be most important), and flush a few translation lookaside
         | buffer entries.
         | 
         | Once you recognize the variability from run to run, claiming
         | "1%" becomes less credible. Depending on the noise level, of
         | course.
         | 
         | Linux benchmarks like SPECcpu tend to be run in "single-user
         | mode" meaning almost no background processes are running.
        
       | mgraczyk wrote:
       | The main case where I use final and where I would expect benefits
       | (not covered well by the article) is when you are using an
       | external library with pure virtual interfaces that you implement.
       | 
       | For example, the AWS C++ SDK uses virtual functions for
       | everything. When you subclass their classes, marking your classes
       | as final allows the compiler to devirtualize your own calls to
       | your own functions (GCC does this reliably).
       | 
       | I'm curious to understand better how clang is producing worse
       | code in these cases. The code used for the blog post is a bit too
       | complicated for me to look at, but I would love to see some
       | microbenchmarks. My guess is that there is some kind of icache or
       | code size problem, where inlining more produces worse code.
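       | 
       | A stripped-down version of that pattern (illustrative names, not
       | the actual SDK types):
       | 
       |     // Library-style interface, not under our control.
       |     struct Client {
       |         virtual ~Client() = default;
       |         virtual int send(int payload) = 0;
       |     };
       | 
       |     // Our implementation; `final` promises no further overrides.
       |     struct MyClient final : Client {
       |         int send(int payload) override { return payload + 1; }
       |     };
       | 
       |     // Static type is MyClient, a final class, so the compiler can
       |     // call (and inline) send() directly instead of via the vtable.
       |     int use_directly(MyClient& c) { return c.send(41); }
       | 
       |     // Through the base class the call generally stays virtual.
       |     int use_via_base(Client& c) { return c.send(41); }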
        
         | cogman10 wrote:
         | Could easily just be a bad optimization pathway.
         | 
         | `final` tells the compiler that nothing extends this class.
         | That means the compiler can theoretically do things like
         | inlining class methods and eliminating virtual method calls
         | (perhaps duplicating the method)?
         | 
         | However, it's quite possible that one of those optimizations
         | makes the code bigger or misaligns things with the cache in
         | unexpected ways. Sometimes, a method call can be faster than
         | inlining. Especially with hot loops.
         | 
         | All this being said, I'd expect final to offer very little
         | benefit over PGO. Its main value is the constraint it imposes
         | and not the optimization it might enable.
        
       | jeffbee wrote:
       | I profiled this project and there are abundant opportunities for
       | devirtualization. The virtual interface `IHittable` is the hot
       | one. However, the WITH_FINAL define is not sufficient, because
       | the hot call is still virtual. At `hit_object |=
       | _objects[node->object_index()]->hit` I am still seeing ` mov
       | (%rdi),%rax; call *0x18(%rax)` so the application of final here
       | was not sufficient to do the job. Whatever differences are being
       | measured are caused by bogons.
        
         | gpderetta wrote:
         | I haven't looked at the code, but if you have multiple leaves,
         | even marking all of them as final won't help if the call is
         | through a base class.
        
           | jeffbee wrote:
           | Yeah the practical cases for devirtualization are when you
           | have a base class, a derived class that you actually use, and
           | another derived class that you use in tests. For your release
           | binary the tests aren't visible so that can all be
           | devirtualized.
           | 
           | In cases where you have Dog and Goose that both derive from
           | Animal and then you have std::vector<Animal>, what is the
           | compiler supposed to do?
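           | 
           | Spelled out (a sketch, using a vector of pointers so nothing
           | slices): even with the leaf classes marked final, the compiler
           | can't know each element's dynamic type here, so the calls stay
           | virtual:
           | 
           |     #include <memory>
           |     #include <vector>
           | 
           |     struct Animal {
           |         virtual ~Animal() = default;
           |         virtual int legs() const = 0;
           |     };
           |     struct Dog final : Animal {
           |         int legs() const override { return 4; }
           |     };
           |     struct Goose final : Animal {
           |         int legs() const override { return 2; }
           |     };
           | 
           |     using Zoo = std::vector<std::unique_ptr<Animal>>;
           | 
           |     int total_legs(const Zoo& zoo) {
           |         int n = 0;
           |         for (const auto& a : zoo)
           |             n += a->legs();       // still an indirect call
           |         return n;
           |     }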
        
       | lanza wrote:
       | If you're measuring a compiler you need to post the flags and
       | version used. Otherwise the entire experiment is in the noise.
        
       | LorenDB wrote:
       | Man, I wish this blog had an RSS feed.
        
       | magnat wrote:
       | > I created a "large test suite" to be more intensive. On my dev
       | machine it needed to run for 8 hours.
       | 
       | During such long and compute-intensive tests, how are thermal
       | considerations mitigated? Not saying that this was the case here,
       | but I can see how, after saturating all cores for 8 hours, the
       | whole PC might get hot to the point the CPU starts throttling, so
       | when you reboot to the next OS or start another batch, overall
       | performance could be a bit lower.
        
         | lastgeniusua wrote:
         | Having recently done similar day-and-night long suites of
         | benchmarks (on a laptop in heat dissipation conditions worse
         | than on any decent desktop), I've found that there is no
         | correlation between the order the benchmarks are run in and
         | their performance (or energy consumption!). I would therefore
         | assume that a non-overclocked processor would not exhibit the
         | patterns you are thinking of here
        
       | leni536 wrote:
       | This is the gist of the difference in code generation when final
       | is involved:
       | 
       | https://godbolt.org/z/7xKj6qTcj
       | 
       | edit: And a case involving inlining:
       | 
       | https://godbolt.org/z/E9qrb3hKM
        
       | fransje26 wrote:
       | I'm actually more worried about Clang being close to 100% slower
       | than GCC on Linux. That doesn't seem right.
       | 
       | I am prepared to believe that there is some performance
       | difference between the two, varying per case, but I would expect
       | a few percent difference, not twice the run time..
        
       | mastax wrote:
       | Changes in the layout of the binary can have large impacts on the
       | program performance [0] so it's possible that the unexpected
       | performance decrease is caused by unpredictable changes in the
       | layout of the binary between compilations. I think there is some
       | tool which helps ensure layout is consistent for benchmarking,
       | but I can't remember what it's called.
       | 
       | [0]: https://research.facebook.com/publications/bolt-a-
       | practical-...
        
       | akoboldfrying wrote:
       | I would expect "final" to have no effect on this type of code at
       | all. That it does in some cases cause measurable differences I
       | put down to randomly hitting internal compiler thresholds
       | (perhaps one of the inlining heuristics is "Don't inline a
       | function with more than 100 tokens", and the "final" keyword
       | pushes a couple of functions to 101).
       | 
       | Why would I expect no performance difference? I haven't looked at
       | the code, but I would expect that for each pixel, it iterates
       | through an array/vector/list etc. of objects that implement some
       | common interface, and calls one or more methods (probably
       | something called intersectRay() or similar) on that interface.
       | _By design, that interface cannot be made final, and that's what
       | counts._ Whether the concrete derived classes are final or not
       | makes no difference.
       | 
       | In order to make this a good test of "final", the pointer type of
       | that container should be constrained to a concrete object type,
       | like Sphere. Of course, this means the scene is limited to
       | spheres.
       | 
       | The only case where final can make a difference, by
       | devirtualising a call that couldn't otherwise be devirtualised,
       | is when you hold a pointer to that type, _and_ the object it
       | points at was allocated  "uncertainly", e.g., by the caller. (If
       | the object was allocated in the same basic block where the method
       | call later occurs, the compiler already knows its runtime type
       | and will devirtualise the call anyway, even without "final".)
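       | 
       | In code, the distinction looks roughly like this (IHittable/Sphere
       | names borrowed from the discussion, otherwise just a sketch):
       | 
       |     struct IHittable {
       |         virtual ~IHittable() = default;
       |         virtual bool hit() const = 0;
       |     };
       |     struct Sphere final : IHittable {
       |         bool hit() const override { return true; }
       |     };
       | 
       |     // Allocated locally: the compiler already knows the dynamic
       |     // type and devirtualises with or without `final`.
       |     bool local_case() {
       |         Sphere s;
       |         return s.hit();
       |     }
       | 
       |     // Pointer supplied by the caller: the static type Sphere plus
       |     // `final` is what lets the compiler prove there is no override
       |     // and call directly.
       |     bool caller_case(const Sphere* s) { return s->hit(); }
       | 
       |     // Through the interface, the call stays virtual regardless.
       |     bool interface_case(const IHittable* h) { return h->hit(); }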
        
       ___________________________________________________________________
       (page generated 2024-04-22 23:01 UTC)