[HN Gopher] The Performance Impact of C++'s `final` Keyword
___________________________________________________________________
The Performance Impact of C++'s `final` Keyword
Author : hasheddan
Score : 99 points
Date : 2024-04-22 17:32 UTC (5 hours ago)
(HTM) web link (16bpp.net)
(TXT) w3m dump (16bpp.net)
| mgaunard wrote:
| What final enables is devirtualization in certain cases. The main
| advantage of devirtualization is that it is necessary for
| inlining.
|
| Inlining has other requirements as well -- LTO pretty much covers
| it.
|
| The article doesn't have sufficient data to tell whether the
| testcase is built in such a way that any of these optimizations
| can happen or is beneficial.
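|
| Something like the following is the kind of devirtualization `final`
| licenses (illustrative types, not from the article):
|
|       struct Shape {
|           virtual double area() const = 0;
|           virtual ~Shape() = default;
|       };
|
|       struct Circle final : Shape {
|           double r = 1.0;
|           double area() const override { return 3.141592653589793 * r * r; }
|       };
|
|       double f(const Circle& c) {
|           // Circle is final: nothing can override area() below it, so the
|           // compiler may call Circle::area() directly and then inline it.
|           return c.area();
|       }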
| i80and wrote:
| If you already have LTO, can't the compiler determine this
| information for devirtualization purposes on its own?
| nickwanninger wrote:
| At the level that LLVM's LTO operates, no information about
| classes or objects is left, so LLVM itself can't really
| devirtualize C++ methods in most cases
| nwallin wrote:
| You appear to be correct. Clang does not devirtualize in
| LTO, but GCC does. Personally I consider this very strange.
|       $ cat animal.h cat.cpp main.cpp
|       // animal.h
|       #pragma once
|       class animal {
|       public:
|           virtual ~animal() {}
|           virtual void speak() = 0;
|       };
|       animal& get_mystery_animal();
|
|       // cat.cpp
|       #include "animal.h"
|       #include <cstdio>
|       class cat final : public animal {
|       public:
|           ~cat() override {}
|           void speak() override { puts("meow"); }
|       };
|       static cat garfield{};
|       animal& get_mystery_animal() { return garfield; }
|
|       // main.cpp
|       #include "animal.h"
|       int main() {
|           animal& a = get_mystery_animal();
|           a.speak();
|       }
|
|       $ make clean && CXX=clang++ make -j && objdump --disassemble=main -C lto_test
|       rm -f *.o lto_test
|       clang++ -c -flto -O3 -g cat.cpp -o cat.o
|       clang++ -c -flto -O3 -g main.cpp -o main.o
|       clang++ -flto -O3 -g cat.o main.o -o lto_test
|
|       lto_test:     file format elf64-x86-64
|
|       Disassembly of section .init:
|       Disassembly of section .plt:
|       Disassembly of section .plt.got:
|       Disassembly of section .text:
|
|       00000000000011b0 <main>:
|         11b0: 50                      push   %rax
|         11b1: 48 8b 05 58 2e 00 00    mov    0x2e58(%rip),%rax   # 4010 <garfield>
|         11b8: 48 8d 3d 51 2e 00 00    lea    0x2e51(%rip),%rdi   # 4010 <garfield>
|         11bf: ff 50 10                call   *0x10(%rax)
|         11c2: 31 c0                   xor    %eax,%eax
|         11c4: 59                      pop    %rcx
|         11c5: c3                      ret
|
|       Disassembly of section .fini:
|
|       $ make clean && CXX=g++ make -j && objdump --disassemble=main -C lto_test|sed -e 's,^, ,'
|       rm -f *.o lto_test
|       g++ -c -flto -O3 -g cat.cpp -o cat.o
|       g++ -c -flto -O3 -g main.cpp -o main.o
|       g++ -flto -O3 -g cat.o main.o -o lto_test
|
|       lto_test:     file format elf64-x86-64
|
|       Disassembly of section .init:
|       Disassembly of section .plt:
|       Disassembly of section .plt.got:
|       Disassembly of section .text:
|
|       0000000000001090 <main>:
|         1090: 48 83 ec 08             sub    $0x8,%rsp
|         1094: 48 8d 3d 75 2f 00 00    lea    0x2f75(%rip),%rdi   # 4010 <garfield>
|         109b: e8 50 01 00 00          call   11f0 <cat::speak()>
|         10a0: 31 c0                   xor    %eax,%eax
|         10a2: 48 83 c4 08             add    $0x8,%rsp
|         10a6: c3                      ret
|
|       Disassembly of section .fini:
| wiml wrote:
| If your runtime environment has dynamic linking, then the LTO
| pass can't always be sure that a subclass won't be introduced
| later that overrides the method.
| i80and wrote:
| Aha! That makes sense. I wasn't thinking of that case.
| Thanks!
| gpderetta wrote:
| You can tell the compiler it is indeed compiling the whole
| program.
| adzm wrote:
| MSVC with LTO and PGO will inline virtual calls in some
| situations along with a check for the expected vtable,
| bypassing the inlined code and calling the virtual function
| normally if it is an unexpected value.
| bluGill wrote:
| not if there is a shared libray or other plugin. Then you
| coannot determine until runtime if there is an override.
| ot wrote:
| In general the compiler/linker cannot assume that derived
| classes won't arrive later through a shared object.
|
| You can tell it "I won't do that" though with additional
| flags, like Clang's -fwhole-program-vtables, and even then
| it's not that simple. There was an effort in Clang to better
| support whole program devirtualization, but I haven't been
| following what kind of progress has been made:
| https://groups.google.com/g/llvm-dev/c/6LfIiAo9g68?pli=1
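|
| Roughly, opting in with Clang looks like this (a sketch;
| -fwhole-program-vtables requires LTO and is a promise that no vtables
| come from outside the LTO unit):
|
|       clang++ -O2 -flto -fwhole-program-vtables -c a.cpp b.cpp
|       clang++ -O2 -flto a.o b.o -o prog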
| samus wrote:
| This is one of the cases where JIT compiling can shine. You
| can use a bazillion interfaces to decouple application code,
| and the JIT will optimize the calls after it found out which
| implementation is used. This works as long as there are only
| one or two of them actually active at runtime.
| Negitivefrags wrote:
| See this is why I find this odd.
|
| Is there a theory as to how devirtualisation could hurt
| performance?
| samus wrote:
| Devirtualization maybe not necessarily, but inlining might
| make code fail to fit into instruction caches.
| hansvm wrote:
| There's a cost to loading more instructions, especially if
| you have more types of instructions.
|
| The main advantages to inlining are (1) avoiding a jump and
| other function call overhead, (2) the ability to push down
| optimizations.
|
| If you execute the "same" code (same instructions, different
| location) in many places that can cause cache evictions and
| other slowdowns. It's worse if some minor optimizations were
| applied by the inlining, so you have more types of
| instructions to unpack.
|
| The question, roughly, is whether the gains exceed the costs.
| This can be a bit hard to determine because it can depend on
| the size of the whole program and other non-local parameters,
| leading to performance cliffs at various stages of
| complexity. Microbenchmarks will tend to suggest inlining is
| better in more cases than it actually is.
|
| Over time you get a feel for which functions should be
| inlined. E.g., very often you'll have guard clauses or
| whatnot around a trivial amount of work when the caller is
| expected to be able to prove the guarded information at
| compile-time. A function call takes space in the generated
| assembly too, and if you're only guarding a few instructions
| it's usually worth forcing an inline (even in places where
| the compiler's heuristics would choose not to because the
| guard clauses take up too much space), regardless of the
| potential cache costs.
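|
| A sketch of that guard-clause case (function name made up):
|
|       // The caller can often prove `p != nullptr` at compile time, so
|       // forcing the inline lets the optimizer delete both the call and
|       // the guard, leaving only a couple of instructions of real work.
|       __attribute__((always_inline)) inline int deref_or_zero(const int* p) {
|           if (p == nullptr) return 0;   // guard clause
|           return *p;                    // trivial guarded work
|       }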
| masklinn wrote:
| Code bloat causing icache evictions?
| cogman10 wrote:
| Through inlining.
|
| If you have something like a `while` loop and that while
| loop's instructions fit neatly on the cache line, then
| executing that loop can be quite fast even if you have to
| jump to different code locations to do the internals.
| However, if you pump more instructions into that loop, you
| can exceed the length of the cache line, which causes you to
| need more memory loads to do the same work.
|
| It can also create more code. A method such as
| `foo(NotFinal& bar)` could be duplicated by the compiler for
| the specialized cases, which would be bad if there are a lot of
| implementations of `NotFinal` that end up being marshalled
| into foo. You could end up loading multiple implementations
| of the same function which may be slower than just keeping
| the virtual dispatch tables warm.
| phire wrote:
| Jumps/calls are actually pretty cheap with modern branch
| predictors. Even indirect calls through vtables, which is the
| opposite of most programmers' intuition.
|
| And if the devirtualisation leads to inlining, that results
| in code bloat which can lower performance through more
| instruction cache misses, which are not cheap.
|
| Inlining is actually pretty evil. It almost always speeds
| things up for microbenchmarks, as such benchmarks easily fit
| in icache. So programmers and modern compilers often go out
| of their way to do more inlining. But when you apply too much
| inlining to a whole program, things start to slow down.
|
| But it's not like inlining is universally bad in larger
| programs; inlining can enable further optimisations, mostly
| because it allows constant propagation to travel across
| function boundaries.
|
| Basically, compilers need better heuristics about when they
| should be inlining. If it's just saving the overhead of a
| lightweight call, then they shouldn't be inlining.
| qsdf38100 wrote:
| "Inlining is actually pretty evil".
|
| No it's not. Except if you __force_inline__ everything, of
| course.
|
| Inlining reduces the number of instructions in a lot of
| cases. Especially when things are abstracted and factored
| with lots of indirections into small functions that call
| other small functions and so on. Consider an 'isEmpty'
| function, which dissolves to one CPU instruction once
| inlined, compared with a call/save reg/compare/return.
| Highly dynamic code (with most functions being virtual)
| tends to result in a fest of chained calls, jumping into
| functions doing very little work. Yes the stack is usually
| hot and fast, but spending 80% of the instructions doing
| stack management is still a big waste.
|
| Compilers already have good heuristics about when they
| should be inlining, chances are they are a lot better at it
| than you. They don't always inline, and that's not possible
| anyway.
|
| My experience is that compilers do marvels with inlining
| decisions when there are lots of small functions they _can_
| inline if they want to. It gives the compiler a lot of
| freedom. Lambdas are great for that as well.
|
| Make sure you make the most possible compile-time
| information available to the compiler, factor your code,
| don't have huge functions, and let the compiler do its
| magic. As a plus, you can have high level abstractions,
| deep hierarchies, and still get excellent performance.
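|
| The 'isEmpty' case in code (illustrative):
|
|       struct buffer {
|           const char* begin_;
|           const char* end_;
|           bool isEmpty() const { return begin_ == end_; }
|       };
|
|       bool drained(const buffer& b) {
|           // Inlined, this whole thing is a single compare; as an
|           // out-of-line call it would be call/save reg/compare/return.
|           return b.isEmpty();
|       }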
| grdbjydcv wrote:
| The "evilness" is just that sometimes if you inline
| aggressively in a microbenchmark things get faster but in
| real programs things get slower.
|
| As you say: "chances are they are a lot better at it than
| you". Infrequently they are not.
| neonsunset wrote:
| Practically - it never does. It is always cheaper to perform
| a direct, possibly inlined, call (devirtualization !=
| inlining) than a virtual one.
|
| Guarded devirtualization is also cheaper than virtual calls,
| even when it has to do
|
|       if (instance is SpecificType st) { st.Call(); }
|       else { instance.Call(); }
|
| or even chain multiple checks at once (with either regular
| ifs or emitting a jump table)
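|
| Hand-rolled in C++ terms (hypothetical Animal/Cat types), the same
| guard looks roughly like:
|
|       #include <cstdio>
|
|       struct Animal { virtual void speak() = 0; virtual ~Animal() = default; };
|       struct Cat final : Animal { void speak() override { std::puts("meow"); } };
|
|       void speak_guessing_cat(Animal& a) {
|           if (auto* c = dynamic_cast<Cat*>(&a)) {
|               c->speak();   // guessed right: direct, inlinable call (Cat is final)
|           } else {
|               a.speak();    // guessed wrong: normal virtual dispatch
|           }
|       }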
|
| This technique is heavily used in various forms by .NET, JVM
| and JavaScript JIT implementations (other platforms also do
| that, but these are the major ones)
|
| The first two devirtualize virtual and interface calls
| (important in Java because all calls default to virtual,
| important in C# because people like to abuse interfaces and
| occasionally inheritance, C# delegates are also
| devirtualized/inlined now). A JS JIT (like V8) performs
| "inline caching", which is similar: for known object shapes,
| property access becomes a shape-type-identifier comparison
| plus a direct property read, instead of a keyed lookup, which
| is far more expensive.
| andrewla wrote:
| I'm surprised that it has any impact on performance at all, and
| I'd love to see the codegen differences between the applications.
|
| Mostly the `final` keyword serves as a compile-time assertion.
| The compiler (sometimes linker) is perfectly capable of seeing
| that a class has no derived classes, but what `final` assures is
| that if you attempt to derive from such a class, you will raise a
| compile-time error.
|
| This is similar to how `inline` works in practice -- rather than
| providing a useful hint to the compiler (though the compiler is
| free to treat it that way) it provides an assertion that if you
| do non-inlinable operations (e.g. non-tail recursion) then the
| compiler can flag that.
|
| All of this is to say that `final` can speed up runtimes -- but
| it does so by forcing you to organize your code such that the
| guarantees apply. By using `final` classes, in places where
| dynamic dispatch can be reduced to static dispatch, you force the
| developer to not introduce patterns that would prevent static
| dispatch.
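|
| That assertion aspect in a couple of lines (illustrative names):
|
|       class Renderer final { /* ... */ };
|
|       // class DebugRenderer : public Renderer {};
|       //     ^ error: cannot derive from 'final' base 'Renderer'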
| bgirard wrote:
| > The compiler (sometimes linker) is perfectly capable of
| seeing that a class has no derived classes
|
| How? The compiler doesn't see the full program.
|
| The linker I'm less sure about. If the class isn't guaranteed
| to be fully private wouldn't an optimizing linker have to be
| conservative in case you inject a derived class?
| GuB-42 wrote:
| "inline" is confusing in C++, as it is not really about
| inlining. Its purpose is to allow multiple definitions of the
| same function. It is useful when you have a function defined in
| a header file, because if included in several source files, it
| will be present in multiple object files, and without "inline"
| the linker will complain of multiple definitions.
|
| It is also an optimization hint, but AFAIK, modern compilers
| ignore it.
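|
| For example (hypothetical header), this is legal to include from many
| .cpp files precisely because of the `inline`; without it, every
| translation unit emits its own strong definition and the link fails
| with a duplicate-symbol error:
|
|       // util.h
|       #pragma once
|       inline bool is_blank(const char* s) { return s == nullptr || *s == '\0'; }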
| wredue wrote:
| I believe the wording I've seen is that compilers may not
| respect the inline keyword, not that it is ignored.
| fweimer wrote:
| GCC does not ignore inline for inlining purposes:
|
| Need a way to make inlining heuristics ignore whether a
| function is inline
| https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008
|
| (Bug saw a few updates recently, that's how I remembered.)
|
| As a workaround, if you need the linkage aspect of the inline
| keyword, you currently have to write fake templates instead.
| Not great.
| lqr wrote:
| 10 years ago it was already folklore that compilers ignore
| the "inline" keyword when optimizing, but that was false for
| clang/llvm: https://stackoverflow.com/questions/27042935/are-
| the-inline-...
| jacoblambda wrote:
| The thing with `inline` as an optimisation is that it's not
| about optimising by inlining directly. It's a promise about
| how you intend to use the function.
|
| It's not just "you can have multiple definitions of the same
| function" but rather a promise that the function doesn't need
| to be address/pointer equivalent between translation units.
| This is arguably more important than inlining directly
| because it means the compiler can fully deduce how the
| function may be used without any LTO or other cross
| translation unit optimisation techniques.
|
| Of course you could still technically expose a pointer to the
| function outside a TU but doing so would be obvious to the
| compiler and it can fall back to generating a strictly
| conformant version of the function. Otherwise however it can
| potentially deduce that some branches in said function are
| unreachable and eliminate them or otherwise specialise the
| code for the specific use cases in that TU. So it potentially
| opens up alternative optimisations even if there's still a
| function call and it's not inlined directly.
| wheybags wrote:
| What if I dlopen a shared object that contains a derived class,
| then instantiate it? You cannot statically verify that I won't.
| Or you could swap out a normally linked shared object for one
| that creates a subclass. Etc etc. This kind of stuff is why I
| think shared object boundaries should be limited to the lowest
| common denominator (basically c abi). Dynamic linking high
| level languages was a mistake. The only winning move is not to
| play.
| lanza wrote:
| > Mostly the `final` keyword serves as a compile-time
| assertion. The compiler (sometimes linker) is perfectly capable
| of seeing that a class has no derived classes
|
| That's incorrect. The optimizer has to assume everything
| escapes the current optimization unit unless explicitly told
| otherwise. It needs explicit guarantees about the visibility to
| figure out the extent of the derivations allowed.
| bluGill wrote:
| I use final more for communication: don't look for deeper derived
| classes, as there are none. That it results in slower code is an
| annoying surprise.
| p0w3n3d wrote:
| I would say the biggest performance impact would come from `constexpr`,
| followed by `const`. I wouldn't bet any money on `final`, which in
| C++ is a guard against inheritance; a virtual function's invocation
| address is resolved through the `vtable`, hence `final` wouldn't change
| anything. Maybe the author confused it with the `final` keyword in
| Java.
| adrianN wrote:
| In my experience the compiler is pretty good at figuring out
| what is constant so adding const is more documentation for
| humans, especially in C++, where const is more of a hint than a
| hard boundary. Devirtualization, as can happen when you add a
| final, or the optimizations enabled by adding a restrict to a
| pointer, are on the other hand often essential for performance
| in hot code.
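|
| Illustrative sketch: the `const` here is mostly documentation, since the
| compiler must still assume `out` and `in` might alias; `__restrict`
| (a common compiler extension) is what actually licenses the optimization:
|
|       void scale(float* __restrict out, const float* __restrict in,
|                  float k, int n) {
|           for (int i = 0; i < n; ++i)
|               out[i] = in[i] * k;   // can vectorize without runtime alias checks
|       }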
| bayindirh wrote:
| Since "const" makes things read-only, being const correct
| makes sure that you don't do funny things with the data you
| shouldn't mutate, which in turn eliminates tons of data bugs
| out of the gate.
|
| So, it's an opt-in security feature first, and a compiler
| hint second.
| ein0p wrote:
| You should use final to express design intent. In fact I'd rather
| it were the default in C++, and there was some sort of an
| opposite ('derivable'?) keyword instead, but that ship has sailed
| a long time ago. Any measurable negative perf impact should be
| filed as a bug and fixed.
| cesarb wrote:
| > In fact I'd rather it were the default in C++, and there was
| some sort of an opposite ('derivable'?) keyword instead
|
| Kotlin (which uses the equivalent of the Java "final" keyword
| by default) uses the "open" keyword for that purpose.
| josefx wrote:
| Intent is nice and all that, but I would like a
| "nonwithstanding" keyword instead that just lets me bypass that
| kind of "intent" without having to copy paste the entire
| implementation just to remove a pointless keyword or make a
| destructor public when I need it.
| jbverschoor wrote:
| In general, I think things should be strict by default. Way
| easier to optimize and less error prone.
| leni536 wrote:
| C++ doesn't have the fragile base class problem, as members aren't
| virtual by default. The only concern with unintended
| inheritance is polymorphic deletion. "final" on a class
| definition disables some tricks that you can do with private
| inheritance.
|
| Having said that, "final" on member functions is great, and I
| like to see that instead of "override".
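|
| For example (hypothetical types):
|
|       struct Widget { virtual void draw() = 0; virtual ~Widget() = default; };
|
|       struct Button : Widget {
|           // `final` here implies `override` and additionally promises that
|           // nothing deriving from Button overrides draw() again, so a call
|           // through a Button* or Button& can be made direct.
|           void draw() final;
|       };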
| jey wrote:
| I wonder if LTO was turned on when using Clang? Might lead to a
| performance improvement.
| pineapple_sauce wrote:
| What should be evaluated is removing indirection and tightly
| packing your data. I'm sure you'll gain a bigger performance
| improvement. Virtual calls and shared_ptr are littered throughout the
| codebase.
|
| In this way: you can avoid the need for the `final` keyword and
| do the optimization the keyword enables (de-virtualize calls).
|
| >Yes, it is very hacky and I am disgusted by this myself. I would
| never do this in an actual product
|
| Why? What's with the C++ community and their disgust for macros
| without any underlying reasoning? It reminds me of everyone
| blindly saying "Don't use goto; it creates spaghetti code".
|
| Sure, if macros are overly used: it can be hard to read and
| maintain. But, for something simple like this, you shouldn't be
| thinking "I would never do this in an actual product".
| sfink wrote:
| Macros that are giving you some value can be ok. In this case,
| once the performance conclusion is reached, the only reason to
| continue using a macro is if you really need the `final`ity to
| vary between builds. Otherwise, just delete it or use the
| actual keyword.
|
| (But I'm worse than the author; if I'm just comparing
| performance, I'd probably put `final` everywhere applicable and
| then do separate compiles with `-Dfinal=` and
| `-Dfinal=final`... I'd be making the assumption that it's
| something I either always or never want eventually, though.)
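|
| Roughly that throwaway comparison, assuming a Makefile that honours
| CXXFLAGS (and accepting that redefining an identifier the standard
| headers might use is its own kind of hack):
|
|       make CXXFLAGS='-O3 -Dfinal='        # every `final` compiled away
|       make CXXFLAGS='-O3 -Dfinal=final'   # every `final` kept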
| bluGill wrote:
| Macros in C are a textual replacement, so it is hard to see from a
| debugger how the code got like that.
| pineapple_sauce wrote:
| Yes, I'm well aware of the definition of a macro in C and
| C++. Macros are simpler than templates. You can expand them
| with a compiler flag.
| bluGill wrote:
| When things get complex, template error messages are easier
| to follow. Nobody makes complex macros, but if you tried...
| (Template error messages are legendary for a reason; nested
| macros are worse.)
| jandrewrogers wrote:
| In modern C++, macros are viewed as a code smell because they
| are strictly worse than alternatives in almost all situations.
| It is a cultural norm; it is a bit like using "unsafe" in Rust
| if not strictly required for some trivial case. The C++
| language has made a concerted effort to eliminate virtually all
| use cases for macros since C++11 and replace them with type-
| safe first-class features in the language. It is a bit of a
| legacy thing at this point, there are large modern C++
| codebases with no macros at all, not even for things like
| logging. While macros aren't going away, especially in older
| code, the cultural norm in modern C++ has tended toward macros
| being a legacy foot-gun and best avoided if at all possible.
|
| The main remaining use case for the old C macro facility I
| still see in new code is to support conditional compilation of
| architecture-specific code e.g. ARM vs x86 assembly routines or
| intrinsics.
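|
| i.e. the kind of thing that still genuinely needs the preprocessor
| (illustrative):
|
|       #if defined(__AVX2__)
|       #  include <immintrin.h>
|          // ... AVX2 implementation ...
|       #elif defined(__ARM_NEON)
|       #  include <arm_neon.h>
|          // ... NEON implementation ...
|       #else
|          // ... portable scalar fallback ...
|       #endif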
| sgerenser wrote:
| But how would one conditionally enable or disable the "final"
| keyword on class members without a preprocessor macro, even
| in C++23?
| gpderetta wrote:
| 1% is nothing to scoff at. But I suspect that the variability of
| compilation (specifically quirks of instruction selection,
| register allocation and function alignment) more than mask any
| gains.
|
| The clang regression might be explainable by final allowing some
| additional inlining and clang making a hash of it.
| jcalvinowens wrote:
| That's interesting. Maybe final enabled more inlining, and clang
| is being too aggressive about it for the icache sizes in play
| here? I'd love to see a comparison of the generated code.
|
| I'm disappointed the author's conclusion is "don't use final",
| not "something is wrong with clang".
| ot wrote:
| Or "something is wrong with my benchmark setup", which is also
| a possibility :)
|
| Without a comparison of generated code, it could be anything.
| indigoabstract wrote:
| If it does have a noticeable impact, that would be surprising, a
| bit like going back to the days when 'inline' was supposed to
| tell the compiler to inline the designated functions (no longer
| its main use case nowadays).
| sfink wrote:
| tldr: sprinkled a keyword around in the hopes that it "does
| something" to speed things up, tested it, got noisy results but
| no miraculous speedup.
|
| I started skimming this article after a while, because it seemed
| to be going into the weeds of performance comparison without ever
| backing up to look at what the change might be doing. Which meant
| that I couldn't tell if I was going to be looking at the usual
| random noise of performance testing or something real.
|
| For `final`, I'd want to at least see whether it changes the
| generated code by replacing indirect vtable calls with direct or
| inlined calls. It might be that the compiler is already figuring
| it out and the keyword isn't doing anything. It might be that the
| compiler _is_ changing code, but the target address was already
| well-predicted and it's perturbing code layout enough that it
| gets slower (or faster). There could be something interesting
| here, but I can't tell without at least a little assembly output
| (or perhaps a relevant portion of some intermediate
| representation, not that I would know which one to look at).
|
| If it's not changing anything, then perhaps there could be an
| interesting investigation into the variance of performance
| testing in this scenario. If it's changing something, then there
| could be an interesting investigation into when that makes things
| faster vs slower. As it is, I can't tell what I should be looking
| for.
| sgerenser wrote:
| This is what I was waiting for too. Especially with the large
| regression on Clang/Ubuntu. Maybe he uncovered a Clang/LLVM
| codegen bug, but you'd need to compare the generated assembly
| to know.
| jeffbee wrote:
| It's difficult to discuss this stuff because the impact can be
| negligible or negative for one person, but large and consistently
| positive for another. You can only usefully discuss it on a given
| baseline, and for something like final I would hope that baseline
| would be a project that already enjoys PGO, LTO, and BOLT.
| tombert wrote:
| I don't do much C++, but I have definitely found that engineers
| will just assert that something is "faster" without any evidence
| to back that up.
|
| Quick example, I got in an argument with someone a few years ago
| that claimed in C# that a `switch` was better than an `if(x==1)
| elseif(x==2)...` because switch was "faster" and rejected my PR.
| I mentioned that that doesn't appear to be true, we went back and
| forth until I did a compile-then-decompile of a minimal test with
| equality-based-ifs, and showed that the compiler actually
| converts equality-based-ifs to `switch` behind the scenes. The
| guy accepted my PR after that.
|
| But there's tons of this stuff like this in CS, and I kind of
| blame professors for a lot of it [1]. A large part of becoming a
| decent engineer [2] for me was learning to stop trusting what
| professors taught me in college. Most of what they said was fine,
| but you can't _assume_ that; what they tell you could be out of
| date, or simply never correct to begin with, and as far as I can
| tell you have to _always_ test these things.
|
| It doesn't help that a lot of these "it's faster" arguments are
| often reductive because they only are faster in extremely minimal
| tests. Sometimes a microbenchmark will show that something is
| faster, and there's value in that, but I think it's important
| that that can also be a small percentage of the total program;
| compilers are obscenely good at optimizing nowadays, it can be
| difficult to determine _when_ something will be optimized, and
| your assertion that something is "faster" might not actually be
| true in a non-trivial program.
|
| This is why I don't really like doing any kind of major
| optimizations before the program actually works. I try to keep
| the program in a reasonable Big-O and I try and minimize network
| calls cuz of latency, but I don't bother with any kind of micro-
| optimizations in the first draft. I don't mess with bitwise, I
| don't concern myself on which version of a particular data
| structure is a millisecond faster, I don't focus too much on
| whether I can get away with a smaller sized float, etc. Once I
| know that the program is correct, _then_ I benchmark to see if
| any kind of micro-optimizations will actually matter, and often
| they really don't.
|
| [1] That includes me up to about a year ago.
|
| [2] At least I like to pretend I am.
| BurningFrog wrote:
| Even if one of these constructs is faster, _it doesn't matter_
| 99% of the time.
|
| Writing well structured readable code is typically far more
| important than making it twice as fast. And those times can
| rarely be predicted beforehand, so you should mostly not worry
| about it until you see real performance problems.
| tombert wrote:
| I mostly focus on "using stuff that won't break", and yeah
| "if it actually matters".
|
| For example, much to the annoyance of a lot of people, I
| don't typically use floating point numbers when I start out.
| I will use the "decimal" or "money" types of the language, or
| GMP if I'm using C. When I do that, I can be sure that I
| won't have to worry about any kind of funky overflow issues
| or bizarre rounding problems. There _might_ be a performance
| overhead associated with it, but then I have to ask myself
| "how often is this actually called?"
|
| If the answer is "a billion times" or "once in every
| iteration of the event loop" or something, then I will
| probably eventually go back and figure out if I can use a
| float or convert it to an integer-based thing, but in a lot
| of cases the answer is "like ten or twenty times", and at
| that point I'm not even 100% sure it would be even measurable
| to change to the "faster" implementations.
|
| What annoys me is that people will act like they really care
| about speed, do all these annoying micro-optimizations, and
| then forget that pretty much all of them get wiped out
| immediately upon hitting the network, since the latency
| associated with that is obscene.
| apantel wrote:
| The counter-argument to this is if you are building something
| that is in the critical path of an application (for example,
| parsing HTTP in a web server), you need to be performance-
| minded from the beginning because design decisions lead to
| design decisions. If you are building something in the
| critical path of the application, the best thing to do is
| build it from the ground up measuring the performance of what
| you have as you go. This way, each time you add something you
| will see the performance impact and usually there's a more
| performant way of doing something that isn't more obscure. If
| you do this as you build, early choices become constraints,
| but because you chose the most performant thing at every
| stage, the whole process takes you in the direction of a
| highly-performant implementation.
|
| Why should you care about performance?
|
| I can give you my personal experience: I've been working on a
| Java web/application server for the past 15 years and a
| typical request (only reading, not writing to the db) would
| take maybe 4-5 ms to execute. That includes HTTP request
| parsing, JSON parsing, session validation, method execution,
| JSON serialization, and HTTP response dispatch. Over the past
| 9 months I have refactored the entire application for
| performance and a typical request now takes about 0.25 ms or
| 250 microseconds. The computer is doing so much less work to
| accomplish the same tasks, it's almost silly how much work it
| was doing before. And the result is the machine can handle
| 20x more requests in the same amount of time. If it could
| handle 200 requests per second per core before, now it can
| handle 4000. That means the need to scale is felt 20x less
| intensely, which means less complexity around scaling.
|
| High performance means reduced scaling requirements.
| tombert wrote:
| But even that sort of depends right? Hardware is often
| pretty cheap in comparison to dev-time. It really depends on
| the project, what kind of servers you're using, the nature
| of the application etc, but I think a lot of the time it
| might be cheaper to just pay for 20x the servers than it
| would be to pay a human to go find a critical path.
|
| I'm not saying you completely throw caution to the wind,
| I'm just saying that there's a finite amount of human
| resources and it can really vary how you want to allocate
| them. Sometimes the better path is to just throw money at
| the problem.
|
| It really depends.
| apantel wrote:
| I think it depends on what you're building and who's
| building it. We're all benefitting from the fact that the
| designers of NGINX made performance a priority. We like
| using things that were designed to be performant. We like
| high-FPS games. We like fast internet.
|
| I personally don't like the idea of throwing compute at a
| slow solution. I like when the extra effort has been put
| into something. The good feeling I get from interacting
| with something that is optimal or excellent is an end in
| itself and one of the things I live for.
| tombert wrote:
| Sure, though I've mentioned a few times in this thread
| now that the thing that bothers me more than CPU
| optimizations is not taking into account latency,
| particularly when hitting the network, and I think
| focusing on that will generally pay higher dividends than
| trying to optimize for processing.
|
| CPUs are ridiculously fast now, and compilers are really
| really good now too. I'm not going to say that processing
| speed is a "solved" problem, but I am going to say that
| in a lot of performance-related cases the CPU processing
| is probably not your problem. I will admit that this kind
| of pokes holes in my previous response, because
| introducing more machines into the mix will almost
| certainly increase latency, but I think it more or less
| holds depending on context.
|
| But I think it really is a matter of nuance, which you
| hinted at. If I'm making an admin screen that's going to
| have like a dozen users max, then a slow, crappy solution
| is probably fine; the requests will be served fast enough
| to where no one will notice anyway, and you can probably
| even get away with the cheapest machine/VM. If I'm making
| an FPS game that has 100,000 concurrent users, then it
| almost certainly will be beneficial to squeeze out as
| much performance out of the machine as possible, both CPU
| _and_ latency-wise.
|
| But as I keep repeating everywhere, you have to measure.
| You cannot assume that your intuition is going to be
| right, particularly at-scale.
| apantel wrote:
| I absolutely agree that latency is the real thing to
| optimize for. In my case, I only leave the application to
| access the db, and my applications tend not to be write-
| heavy. So in my case latency-per-request == how much work
| the computer has to do, which is constrained to one core
| because the overhead of parallelizing any part of the
| pipeline is greater than the work required. See, in that
| sense, we're already close to the performance ceiling for
| per-request processing because clock speeds aren't going
| up. You can't make the processing of a given request
| faster by throwing more hardware at it. You can only make
| it faster by creating less work for the hardware to do.
|
| (Ironically, HN is buckling under load right now, or some
| other issue.)
| oivey wrote:
| It almost certainly would require more than 20x servers
| because setting up horizontal scaling will have some sort
| of overhead. Not only that, there is the significant
| engineering effort to develop and maintain the code to
| scale.
|
| If your problem can fit on one server, it can massively
| reduce engineering and infrastructure costs.
| neonsunset wrote:
| Please accept a high five from a fellow "it does so little
| work it must have sub-millisecond request latency"
| aficionado (though I must admit I'm guilty of abusing
| memory caches to achieve this).
| apantel wrote:
| Caches, precomputed values, lookup tables -- it's all
| good as long as it's well-organized and maintainable.
| neonsunset wrote:
| This attitude is part of the problem. Another part of the
| problem is having no idea which things actually end up
| costing performance and how much.
|
| It is why many language ecosystems suffered from performance
| issues for a really long time even if completely unwarranted.
|
| Is changing ifs to switch or vice versa, as outlined in the
| post above, a waste of time? Yes, unless you are writing some
| encoding algorithm or a parser, it will not matter. The
| compiler will lower trivial statements to the same codegen
| and it will not impact the resulting performance anyway even
| if there was difference given a problem the code was solving.
|
| However, there are things that _do_ cost, like interface spam,
| abusing lambdas to write needlessly complex workflow-style
| patterns (which are also less readable and worse in 8 out of
| 10 instances), not caching objects that always have the same
| value, etc.
|
| These kinds of issues, for example, plagued .NET ecosystem
| until more recent culture shift where it started to be cool
| once again to focus on performance. It wasn't being helped by
| the notion of "well-structured code" being just idiotic
| "clean architecture" and "GoF patterns" style dogma applied
| to smallest applications and simplest of business domains.
|
| (it is also the reason why picking slow languages in general
| is a really bad idea - _everything_ costs more and you have
| way less leeway for no productivity win - Ruby and Python,
| and JS with Node.js are less productive to write in than
| C#/F#, Kotlin/Java or Go (under some conditions))
| tombert wrote:
| I mean, that's kind of why I tried to emphasize measuring
| things yourself instead of depending on tribal knowledge.
|
| There are plenty of cases where even the "slow"
| implementation is more than fast enough, and there are also
| plenty of cases where the "correct" solution (from a big-O
| or intuition perspective) is actually slower than the dumb
| case. Intuition _helps_, but you _have_ to measure and/or
| look at the compiled results if you want to ensure correct
| numbers.
|
| An example that really annoys me is how every whiteboard
| interview ends up being "interesting ways to use a
| hashmap", which isn't inherently an issue, but they will
| usually be so small-scoped that an iterative "array of
| pairs" might actually be cheaper than paying the up-front
| cost of hashing and potentially dealing with collisions.
| Interviews almost always ignore constant factors, and
| that's fair enough, but in reality constant factors _can_
| matter, and we're training future employees to ignore
| that.
|
| I'll say it again: as far as I can tell, you _have_ to
| measure if you want to know if your result is "faster".
| "Measuring" might involve memory profilers, or dumb timers,
| or a mixture of both. Gut instincts are often wrong.
| leetcrew wrote:
| agreed, especially in cases like this. final is primarily a way
| to prohibit overriding methods and extending classes, and it
| indicates to the reader that they should not be doing this. use
| it when it makes conceptual sense.
|
| that said, c++ is usually a language you use when you care
| about performance, at least to an extent. it's worth
| understanding features like nrvo and rewriting functions to
| allow the compiler to pick the optimization if it doesn't hurt
| readability too much.
| wvenable wrote:
| In my opinion, the only things that really matter are
| algorithmic complexity and readability. And even algorithmic
| complexity is usually only an issue at certain scales. Whether
| or not an 'if' is faster than a 'switch' is the micro of micro
| optimizations -- you better have a good reason to care. The
| question I would have for you is was your bunch of ifs more
| readable than a switch would be.
| doctor_phil wrote:
| But a switch and an if-else *is* a matter of algorithmic
| complexity. (Well, at least could be for a naive compiler). A
| switch could be converted to a constant time jump, but the
| if-else would be trying each case linearly.
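|
| With a modern optimizing compiler the two forms usually end up
| identical anyway; e.g. (C++ for concreteness, names made up) both of
| these are typically lowered to the same code, whether that is a jump
| table or a short compare chain:
|
|       int label_switch(int x) {
|           switch (x) {
|               case 1: return 10;
|               case 2: return 20;
|               case 3: return 30;
|               case 4: return 40;
|               default: return -1;
|           }
|       }
|
|       int label_ifs(int x) {
|           if (x == 1) return 10;
|           if (x == 2) return 20;
|           if (x == 3) return 30;
|           if (x == 4) return 40;
|           return -1;
|       }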
| cogman10 wrote:
| Yup.
|
| That said, the linear test is often faster due to CPU
| caches, which is why JITs will often convert switches to
| if/elses.
|
| IMO, switch is clearer in general and potentially faster
| (at the very least the same speed), so it should be preferred
| when dealing with 3+ if/elseif statements.
| tombert wrote:
| Hard disagree that it's "clearer". I have had to deal
| with a ton of bugs with people trying to be clever with
| the `break` logic, or forgetting to put `break` in there
| at all.
|
| if statements are dumber, and maybe arguably uglier, but
| I feel like they're also more clear, and people don't try
| and be clever with them.
| cogman10 wrote:
| Updates to languages (don't know where C# is on this)
| have different types of switch statements that eliminate
| the `break` problem.
|
| For example, with Java there's the enhanced switch, which looks
| like this:
|
|       var val = switch (foo) {
|           case 1, 2, 3 -> bar;
|           case 4 -> baz;
|           default -> { yield bat(); }
|       };
|
| The C style switch break stuff is definitely a language
| mistake.
| wvenable wrote:
| C# has both switch expressions like this, and break
| statements are not optional in traditional switch
| statements, so it actually solves both problems. You can't
| get too clever with switch statements in C#.
|
| However most languages have pretty permissive switch
| statements just like C.
| tombert wrote:
| Yeah, fair, it's been awhile since I've done any C#, so
| my memory is a bit hazy with the details. I've been
| burned by C switch statements, so I have a pretty strong
| distaste for them.
| neonsunset wrote:
| C# has switch statements which are C/C++ style switches
| and switch expressions which are like Rust's match except
| no control flow statements inside:
|
|       var len = slice switch {
|           null => 0,
|           "Hello" or "World" => 1,
|           ['@', ..var tags] => tags.Length,
|           ['{', ..var body, '}'] => body.Length,
|           _ => slice.Length,
|       };
|
| (it supports a lot more patterns but that wouldn't fit)
| gloryjulio wrote:
| This is just forcing a return value. You either have to
| break or return at the branches. To me they all look
| equivalent
| neonsunset wrote:
| Any sufficiently advanced compiler will rewrite those
| arbitrarily depending on its heuristics. What authors
| usually forget is that there is defined behavior and
| specification which the compiler abides by, but it is
| otherwise free to produce any codegen that preserves the
| defined program order. Branch reordering, generating jump
| tables, optimizing away or coalescing checks into
| branchless forms are all very common. When someone says
| "oh, I write C because it lets you tell the CPU exactly how to
| execute the code", it is simply a sign that the person has never
| actually looked at disassembly and has little to no idea
| how the tool they use works.
| cogman10 wrote:
| A compiler will definitely try this, but it's important
| to note that if/else blocks tell the compiler that "you
| will run these evaluations in order". Now, if the
| compiler can detect that the evaluations have no side
| effects (which, in this simple example with just integer
| checks, is fairly likely) then yeah I can see a jump
| table getting shoved in as an optimization.
|
| However, the moment you add a side effect or something
| more complicated like a method call, it becomes really
| hard for the complier to know if that sort of
| optimization is safe to do.
|
| The benefit of the switch statement is that it's already
| well positioned for the compiler to optimize as it does
| not have the "you must run these evaluations in order"
| requirement. It forces you to write code that is fairly
| compiler friendly.
|
| All that said, probably a waste of time debating :D.
| Ideally you have profiled your code and the profiler has
| told you "this is the slow block" before you get to the
| point of worrying about how to make it faster.
| tombert wrote:
| I agree with what you said but in this particular case,
| it actually was a direct integer equality check, there
| was zero risk of hitting side effects and that was
| plainly obvious to me, the checker, and compiler.
| cogman10 wrote:
| And to your original comment, I think the reviewer was
| wrong to reject the PR over that. Performance has to be
| measured before you can use it to reject (or create...) a
| PR. If someone hasn't done that then unless it's
| something obvious like "You are making a ton of tiny heap
| allocations in a tight loop" then I think nitpicking
| these sorts of things is just wrong.
| saurik wrote:
| While I personally find the if statements harder to
| immediately mentally parse/grok--as I have to prove to
| myself that they are all using the same variable and are
| all chained correctly in a way that is visually obvious for
| the switch statement--I don't find "but what if we use a
| naive compiler" at all a useful argument to make as, well,
| we aren't using a naive compiler, and, if we were, there
| are a ton of other things we are going to be sad about the
| performance of leading us down a path of re-implementing a
| number of other optimizations. The goal of the compiler is
| to shift computational complexity from runtime to compile
| time, and figuring out whether the switch table or the
| comparisons are the right approach seems like a legitimate
| use case (which maybe we have to sometimes disable, but
| probably only very rarely).
| bregma wrote:
| But what if, and stick with me here, a compiler is capable
| of reading and processing your code and through simple
| scalar evolution of the conditionals and phi-reduction, it
| can't tell the difference between a switch statement and a
| sequence of if statements by the time it finishes its
| single static analysis phase?
|
| It turns out the algorithmic complexity of a switch
| statement and the equivalent series of if-statements is
| identical. The bijective mapping between them is close to
| the identity function. Does a naive compiler exist that
| doesn't emit the same instructions for both, at least
| outside of toy hobby project compilers written by amateurs
| with no experience?
| tombert wrote:
| Yeah, and it's not like I didn't know how to do the stuff I
| was doing with a switch, I just don't like switches because
| I've forgotten to add break statements and had code that
| appeared correct but actually broke a month down the line. I've
| also seen other people make the same mistakes. ifs, in my
| opinion at least, are a bit harder to screw up, so I will
| always prefer them.
|
| But I agree, algorithmic complexity is generally the only
| thing I focus on, and even then it's almost always a case of
| "will that actually matter?" If I know that `n` is never
| going to be more than like `10`, I might not bother trying to
| optimize an O(n^2) operation.
|
| What I feel often gets ignored in these conversations is
| latency; people obsess over some "optimization" they learned
| in college a decade ago, and ignore the 200 HTTP or Redis
| calls being made ten lines below, despite the fact that the
| latter will have a substantially higher impact on
| performance.
| saghm wrote:
| > But there's tons of this stuff like this in CS
|
| Reminds me of the classic
| https://stackoverflow.com/questions/24848359/which-is-faster...
| sgerenser wrote:
| Never saw that before, that is indeed a classic.
| jollyllama wrote:
| I've encountered similar situations before. It's insane to me
| when people hold up PRs over that kind of thing.
| dosshell wrote:
| > I can get away with a smaller sized float
|
| When talking about not assuming optimizations...
|
| 32bit float is slower than 64bit float on reasonably modern
| x86-64.
|
| The reason is that 32bit float is emulated by using 64bit.
|
| Of course if you have several floats you need to optimize
| against cache.
| tombert wrote:
| Sure, I clarified this in a sibling comment, but I kind of
| meant that I will use the slower "money" or "decimal" types
| by default. Usually those are more accurate and less error-
| prone, and then if it actually matters I might go back to a
| floating point or integer-based solution.
| sgerenser wrote:
| I think this is only true if using x87 floating point, which
| anything computationally intensive is generally avoiding
| these days in favor of SSE/AVX floats. In the latter case,
| for a given vector width, the cpu can process twice as many
| 32 bit floats as 64 bit floats per clock cycle.
| dosshell wrote:
| Yes, as I wrote, it is only true for one float value.
|
| SIMD/MIMD will benefit from working on smaller widths. This is
| not only true because they do more work per clock but
| because memory is slow. Super slow compared to the CPU.
| Optimization is a lot about cache-miss optimization.
|
| (But remember that the cache line is 64 bytes, so reading a
| single value smaller than that will take the same time. So
| it does not matter in theory when comparing one f32 against
| one f64)
| jcranmer wrote:
| Um... no. This is 100% completely and totally wrong.
|
| x86-64 requires the hardware to support SSE2, which has
| native single-precision and double-precision instructions for
| floating-point (e.g., scalar multiply is MULSS and MULSD,
| respectively). Both the single precision and the double
| precision instructions will take the same time, except for
| DIVSS/DIVSD, where the 32-bit float version is slightly
| faster (about 2 cycles latency faster, and reciprocal
| throughput of 3 versus 5 per Agner's tables).
|
| You might be thinking of x87 floating-point units, where all
| arithmetic is done internally using 80-bit floating-point
| types. But all x86 chips in like the last 20 years have had
| SSE units--which are faster anyways. Even in the days when it
| was the major floating-point unit, it wasn't any slower,
| since all floating-point operations took the same time
| independent of format. It might be slower if you insisted
| that code compilation strictly follow IEEE 754 rules, but the
| solution everybody adopted was to _not_ do that, and that's why
| things like Java's strictfp or C's FLT_EVAL_METHOD were born.
| Even in _that_ case, however, 32-bit floats would likely be
| faster than 64-bit for the simple fact that 32-bit floats can
| safely be emulated in 80-bit without fear of double rounding
| but 64-bit floats cannot.
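|
| Concretely, on x86-64 (where SSE2 is baseline) a scalar multiply
| compiles to roughly:
|
|       float  mulf(float a,  float b) { return a * b; }   // mulss %xmm1, %xmm0
|       double muld(double a, double b) { return a * b; }  // mulsd %xmm1, %xmm0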
| dosshell wrote:
| I agree with you. It should take the same time when
| thinking more about it. I remember learning this in ~2016
| and I did performance test on Skylake which confirmed
| (Windows VS2015). I think I remember that i only tested
| with addsd/addss. Definitely not x87. But as always, if the
| result can not be reproduced... I stand corrected until
| then.
| jandrewrogers wrote:
| A significant part of it is that what engineers believe was
| effectively true at one time. They simply haven't revisited
| those beliefs or verified their relevance in a long time. It
| isn't a terrible heuristic for life in general to assume that
| what worked ten years ago will work today. The rate at which
| the equilibriums shift due to changes in hardware and software
| environments when designing for system performance is so rapid
| that you need to make a continuous habit of checking that your
| understanding of how the world works maps to reality.
|
| I've solved a lot of arguments with godbolt and simple
| performance tests. Some topics are recurring themes among
| software engineers e.g.:
|
| - compilers are almost always better at micro-optimizations
| than you are
|
| - disk I/O is almost never a bottleneck in competent designs
|
| - brute-force sequential scans are often optimal algorithms
|
| - memory is best treated as a block device
|
| - vectorization can offer large performance gains
|
| - etc...
|
| No one is immune to this. I am sometimes surprised at the
| extent to which assumptions are no longer true when I revisit
| optimization work I did 10+ years ago.
|
| Most performance these days is architectural, so getting the
| initial design right often has a bigger impact than micro-
| optimizations and localized Big-O tweaks. You can always go
| back and tweak algorithms or codegen later but architecture is
| permanent.
| neonsunset wrote:
| .NET is a particularly bad case for this because there was a
| decade of few performance improvements, which caused a
| certain intuition to develop within the industry, then 6-8
| years of significant changes each year (with most wins
| compressed to the last 4 years or so). Companies moving from
| .NET Framework 4.6/7/8 to .NET 8 experience a 10x _average_
| performance improvement, which naturally comes with rendering
| a lot of performance know-how obsolete overnight.
|
| (the techniques that used to work were similar to earlier
| Java versions and overall very dynamic languages with some
| exceptions, the techniques that still work and now are
| required today are the same as in C++ or Rust)
| tombert wrote:
| Yep, completely agree with you on this. Intuition is often
| wrong, or at least outdated.
|
| When I'm building stuff I try my best to focus on
| "correctness", and try to come up with an algorithm/design
| that will encompass all realistic use cases. If I focus on
| that, it's relatively easy to go back and convert my
| `decimal` type to a float64, or even convert an if statement
| into a switch if it's actually faster.
| klyrs wrote:
| > A large part of becoming a decent engineer [2] for me was
| learning to stop trusting what professors taught me in college
|
| When I was taught about performance, it was all about
| benchmarking and profiling. I never needed to trust what my
| professors taught, because they taught me to dig in and find
| the truth for myself. This was taught alongside the big-O
| stuff, with several examples where "fast" algorithms are slower
| on small inputs.
| TylerE wrote:
| How do you even get meaningful profiling out of most modern
| langs? It seems the vast majority of time and calls get
| spent inside tiny anonymous functions, GC allocations, and
| stuff like that.
| klyrs wrote:
| I don't use most modern langs! And especially if I'm doing
| work where performance is critical, I won't kneecap myself
| by using a language that I can't reasonably profile.
| neonsunset wrote:
| This is easy in most modern programming languages.
|
| JVM ecosystem has IntelliJ Idea profiler and similar
| advanced tools (AFAIK).
|
| .NET has VS/Rider/dotnet-trace profilers (they are very
| detailed) to produce flamegraphs.
|
| Then there are native profilers which can work with any AOT
| compiled language that produces canonically symbolicated
| binaries: Rust, C#/F#(AOT mode), Go, Swift, C++, etc.
|
| For example, you can do `samply record ./some_binary`[0]
| and then explore multi-threaded flamegraph once completed
| (I use it to profile C#, it's more convenient than dotTrace
| for preliminary perf work and is usually more than
| sufficient).
|
| [0] https://github.com/mstange/samply
| trueismywork wrote:
| There's not yet a culture of writing reproducible benchmarks to
| gauge these effects.
| zmj wrote:
| .NET is a little smarter about switch code generation these
| days: https://github.com/dotnet/roslyn/pull/66081
| JackYoustra wrote:
| I really wish he'd listed all the flags he used. To add on to the
| flags already listed by some other commenters, `-mcpu` and
| related flags are really crucial in these microbenchmarks: over
| such a small change and such a small set of tight loops, you
| could just be regressing on coincidences in the microarchitectural
| scheduler vs. higher-level assumptions.
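|
| i.e. at minimum something like the following should be reported
| (illustrative; the article's actual flags are not stated):
|
|       g++     -O3 -march=native -flto ...
|       clang++ -O3 -march=native -flto ...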
| j_not_j wrote:
| And he didn't repeat each test case 5 or 9 times, and take the
| median (or even an average).
|
| There will be operating system noise that can be in the multi-
| percent range. This is defined as various OS services that run
| "in the background" taking up cpu time, emptying cache lines
| (which may be most important), and flushing a few translate
| lookaside entries.
|
| Once you recognize the variability from run to run, claiming
| "1%" becomes less credible. Depending on the noise level, of
| course.
|
| Linux benchmarks like SPECcpu tend to be run in "single-user
| mode" meaning almost no background processes are running.
| mgraczyk wrote:
| The main case where I use final and where I would expect benefits
| (not covered well by the article) is when you are using an
| external library with pure virtual interfaces that you implement.
|
| For example, the AWS C++ SDK uses virtual functions for
| everything. When you subclass their classes, marking your classes
| as final allows the compiler to devirtualize your own calls to
| your own functions (GCC does this reliably).
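|
| A sketch of that pattern (class and method names hypothetical, not
| the real AWS SDK API):
|
|       #include <string>
|
|       class StorageClientInterface {                  // from the external library
|       public:
|           virtual ~StorageClientInterface() = default;
|           virtual bool Put(const std::string& key) = 0;
|       };
|
|       class MyStorageClient final : public StorageClientInterface {
|       public:
|           bool Put(const std::string& key) override;  // our implementation
|       };
|
|       bool upload(MyStorageClient& c) {
|           return c.Put("report.json");   // `final` lets the compiler call
|       }                                  // MyStorageClient::Put directly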
|
| I'm curious to understand better how clang is producing worse
| code in these cases. The code used for the blog post is a bit too
| complicated for me to look at, but I would love to see some
| microbenchmarks. My guess is that there is some kind of icache or
| code-size problem, where more inlining produces worse code.
| cogman10 wrote:
| Could easily just be a bad optimization pathway.
|
| `final` tells the compiler that nothing extends this class.
| That means the compiler can theoretically do things like
| inlining class methods and eliminating virtual method calls
| (perhaps duplicating the method).
|
| However, it's quite possible that one of those optimizations
| makes the code bigger or misaligns things with the cache in
| unexpected ways. Sometimes, a method call can be faster than
| inlining. Especially with hot loops.
|
| All this being said, I'd expect final to offer very little
| benefit over PGO. Its main value is the constraint it imposes
| and not the optimization it might enable.
| jeffbee wrote:
| I profiled this project and there are abundant opportunities for
| devirtualization. The virtual interface `IHittable` is the hot
| one. However, the WITH_FINAL define is not sufficient, because
| the hot call is still virtual. At `hit_object |=
| _objects[node->object_index()]->hit` I am still seeing ` mov
| (%rdi),%rax; call *0x18(%rax)` so the application of final here
| was not sufficient to do the job. Whatever differences are being
| measured are caused by bogons.
| gpderetta wrote:
| I haven't looked at the code, but if you have multiple leaves,
| even marking all of them as final won't help if the call is
| through a base class.
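|
| Which is the shape of the hot path here: even with every leaf marked
| final, a call whose static type is the interface stays virtual
| (sketch, simplified from the IHittable situation described above):
|
|       #include <vector>
|
|       struct IHittable { virtual bool hit() = 0; virtual ~IHittable() = default; };
|       struct Sphere final : IHittable { bool hit() override { return true;  } };
|       struct Plane  final : IHittable { bool hit() override { return false; } };
|
|       bool any_hit(const std::vector<IHittable*>& objects) {
|           bool h = false;
|           for (auto* o : objects)
|               h |= o->hit();   // still an indirect call through the vtable:
|           return h;            // the static type is IHittable*, and `final`
|       }                        // on the leaves doesn't change that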
| jeffbee wrote:
| Yeah the practical cases for devirtualization are when you
| have a base class, a derived class that you actually use, and
| another derived class that you use in tests. For your release
| binary the tests aren't visible so that can all be
| devirtualized.
|
| In cases where you have Dog and Goose that both derive from
| Animal and then you have std::vector<Animal>, what is the
| compiler supposed to do?
| lanza wrote:
| If you're measuring a compiler you need to post the flags and
| version used. Otherwise the entire experiment is in the noise.
| LorenDB wrote:
| Man, I wish this blog had an RSS feed.
| magnat wrote:
| > I created a "large test suite" to be more intensive. On my dev
| machine it needed to run for 8 hours.
|
| During such long and compute-intensive tests, how are thermal
| considerations mitigated? Not saying that this was case here, but
| I can see how after saturating all cores for 8 hours, the whole
| PC might get hot to the point the CPU starts throttling, so when you
| reboot to the next OS or start another batch, overall performance
| could be a bit lower.
| lastgeniusua wrote:
| having recently done similar day-and-night long suites of
| benchmarks (on a laptop in heat dissipation conditions worse
| than on any decent desktop), I've found that there is no
| correlation between the order the benchmarks are run in and
| their performance (or energy consumption!). I would therefore
| assume that a non-overclocked processor would not exhibit the
| patterns you are thinking of here
| leni536 wrote:
| This is the gist of the difference in code generation when final
| is involved:
|
| https://godbolt.org/z/7xKj6qTcj
|
| edit: And a case involving inlining:
|
| https://godbolt.org/z/E9qrb3hKM
| fransje26 wrote:
| I'm actually more worried about Clang being close to 100% slower
| than GCC on Linux. That doesn't seem right.
|
| I am prepared to believe that there is some performance
| difference between the two, varying per case, but I would expect
| a few percent difference, not twice the run time.
| mastax wrote:
| Changes in the layout of the binary can have large impacts on the
| program performance [0] so it's possible that the unexpected
| performance decrease is caused by unpredictable changes in the
| layout of the binary between compilations. I think there is some
| tool which helps ensure layout is consistent for benchmarking,
| but I can't remember what it's called.
|
| [0]: https://research.facebook.com/publications/bolt-a-
| practical-...
| akoboldfrying wrote:
| I would expect "final" to have no effect on this type of code at
| all. That it does in some cases cause measurable differences I
| put down to randomly hitting internal compiler thresholds
| (perhaps one of the inlining heuristics is "Don't inline a
| function with more than 100 tokens", and the "final" keyword
| pushes a couple of functions to 101).
|
| Why would I expect no performance difference? I haven't looked at
| the code, but I would expect that for each pixel, it iterates
| through an array/vector/list etc. of objects that implement some
| common interface, and calls one or more methods (probably
| something called intersectRay() or similar) on that interface.
| _By design, that interface cannot be made final, and that's what
| counts._ Whether the concrete derived classes are final or not
| makes no difference.
|
| In order to make this a good test of "final", the pointer type of
| that container should be constrained to a concrete object type,
| like Sphere. Of course, this means the scene is limited to
| spheres.
|
| The only case where final can make a difference, by
| devirtualising a call that couldn't otherwise be devirtualised,
| is when you hold a pointer to that type, _and_ the object it
| points at was allocated "uncertainly", e.g., by the caller. (If
| the object was allocated in the same basic block where the method
| call later occurs, the compiler already knows its runtime type
| and will devirtualise the call anyway, even without "final".)
___________________________________________________________________
(page generated 2024-04-22 23:01 UTC)