[HN Gopher] The Performance Impact of C++'s `final` Keyword
___________________________________________________________________
The Performance Impact of C++'s `final` Keyword
Author : hasheddan
Score : 233 points
Date : 2024-04-22 17:32 UTC (1 day ago)
(HTM) web link (16bpp.net)
(TXT) w3m dump (16bpp.net)
| mgaunard wrote:
| What final enables is devirtualization in certain cases. The main
| advantage of devirtualization is that it is necessary for
| inlining.
|
| Inlining has other requirements as well -- LTO pretty much covers
| it.
|
| The article doesn't have sufficient data to tell whether the
| testcase is built in such a way that any of these optimizations
| can happen or would be beneficial.
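| For a concrete picture of the devirtualization in question, a minimal
| sketch (the types here are invented, not taken from the article):
|     #include <cstdio>
|
|     struct Shape {
|         virtual ~Shape() = default;
|         virtual double area() const = 0;
|     };
|
|     struct Circle final : Shape {
|         double r;
|         explicit Circle(double r) : r(r) {}
|         double area() const override { return 3.14159265 * r * r; }
|     };
|
|     double circle_area(const Circle& c) {
|         // The static type is Circle, and Circle is final, so the compiler
|         // may call Circle::area() directly -- and then inline it -- instead
|         // of dispatching through the vtable.
|         return c.area();
|     }
|
|     int main() {
|         Circle c{2.0};
|         std::printf("%f\n", circle_area(c));
|     }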
| i80and wrote:
| If you already have LTO, can't the compiler determine this
| information for devirtualization purposes on its own?
| nickwanninger wrote:
| At the level that LLVM's LTO operates, no information about
| classes or objects is left, so LLVM itself can't really
| devirtualize C++ methods in most cases
| nwallin wrote:
| You appear to be correct. Clang does not devirtualize in
| LTO, but GCC does. Personally I consider this very strange.
| $ cat animal.h cat.cpp main.cpp
|     // animal.h
|     #pragma once
|     class animal {
|     public:
|         virtual ~animal() {}
|         virtual void speak() = 0;
|     };
|     animal& get_mystery_animal();
|
|     // cat.cpp
|     #include "animal.h"
|     #include <cstdio>
|     class cat final : public animal {
|     public:
|         ~cat() override {}
|         void speak() override { puts("meow"); }
|     };
|     static cat garfield{};
|     animal& get_mystery_animal() { return garfield; }
|
|     // main.cpp
|     #include "animal.h"
|     int main() {
|         animal& a = get_mystery_animal();
|         a.speak();
|     }
| $ make clean && CXX=clang++ make -j && objdump --disassemble=main -C lto_test
|     rm -f *.o lto_test
|     clang++ -c -flto -O3 -g cat.cpp -o cat.o
|     clang++ -c -flto -O3 -g main.cpp -o main.o
|     clang++ -flto -O3 -g cat.o main.o -o lto_test
|     lto_test:     file format elf64-x86-64
|     Disassembly of section .init:
|     Disassembly of section .plt:
|     Disassembly of section .plt.got:
|     Disassembly of section .text:
|     00000000000011b0 <main>:
|         11b0: 50                      push   %rax
|         11b1: 48 8b 05 58 2e 00 00    mov    0x2e58(%rip),%rax   # 4010 <garfield>
|         11b8: 48 8d 3d 51 2e 00 00    lea    0x2e51(%rip),%rdi   # 4010 <garfield>
|         11bf: ff 50 10                call   *0x10(%rax)
|         11c2: 31 c0                   xor    %eax,%eax
|         11c4: 59                      pop    %rcx
|         11c5: c3                      ret
|     Disassembly of section .fini:
| $ make clean && CXX=g++ make -j && objdump --disassemble=main -C lto_test|sed -e 's,^, ,'
|     rm -f *.o lto_test
|     g++ -c -flto -O3 -g cat.cpp -o cat.o
|     g++ -c -flto -O3 -g main.cpp -o main.o
|     g++ -flto -O3 -g cat.o main.o -o lto_test
|     lto_test:     file format elf64-x86-64
|     Disassembly of section .init:
|     Disassembly of section .plt:
|     Disassembly of section .plt.got:
|     Disassembly of section .text:
|     0000000000001090 <main>:
|         1090: 48 83 ec 08             sub    $0x8,%rsp
|         1094: 48 8d 3d 75 2f 00 00    lea    0x2f75(%rip),%rdi   # 4010 <garfield>
|         109b: e8 50 01 00 00          call   11f0 <cat::speak()>
|         10a0: 31 c0                   xor    %eax,%eax
|         10a2: 48 83 c4 08             add    $0x8,%rsp
|         10a6: c3                      ret
|     Disassembly of section .fini:
| ranger_danger wrote:
| What if you add -fwhole-program-vtables on clang?
| wiml wrote:
| If your runtime environment has dynamic linking, then the LTO
| pass can't always be sure that a subclass won't be introduced
| later that overrides the method.
| i80and wrote:
| Aha! That makes sense. I wasn't thinking of that case.
| Thanks!
| gpderetta wrote:
| You can tell the compiler it is indeed compiling the whole
| program.
| adzm wrote:
| MSVC with LTO and PGO will inline virtual calls in some
| situations, along with a check for the expected vtable: if the
| vtable is not the expected one, it bypasses the inlined code and
| calls the virtual function normally.
| bluGill wrote:
| not if there is a shared library or other plugin. Then you
| cannot determine until runtime if there is an override.
| ot wrote:
| In general the compiler/linker cannot assume that derived
| classes won't arrive later through a shared object.
|
| You can tell it "I won't do that" though with additional
| flags, like Clang's -fwhole-program-vtables, and even then
| it's not that simple. There was an effort in Clang to better
| support whole program devirtualization, but I haven't been
| following what kind of progress has been made:
| https://groups.google.com/g/llvm-dev/c/6LfIiAo9g68?pli=1
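| A sketch of how that flag might be passed (assumes Clang with LTO
| enabled; file names are placeholders):
|     clang++ -O2 -flto -fwhole-program-vtables -c a.cpp -o a.o
|     clang++ -O2 -flto -fwhole-program-vtables -c b.cpp -o b.o
|     clang++ -flto -fwhole-program-vtables a.o b.o -o app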
| Slix wrote:
| This optimization option isn't on by default? That sounds
| like a lot of missed optimization. Most programs aren't
| going to be loading from shared libraries.
|
| Maybe I can set this option at work. Though it's scary
| because I'd have to be certain.
| soooosn wrote:
| I think you have answered your own question: If turning
| on the setting is scary for you in a very localized
| project at your company, imagine how scary it would be to
| turn on by default for everybody :-P
| Thiez wrote:
| The JVM can actually perform this optimization
| optimistically and can undo it if the assumption is
| violated at runtime. So Java's 'everything is virtual by
| default' approach doesn't hurt. Of course relying on a
| sufficiently smart JIT comes with its own trade-offs.
| samus wrote:
| This is one of the cases where JIT compiling can shine. You
| can use a bazillion interfaces to decouple application code,
| and the JIT will optimize the calls after it finds out which
| implementation is used. This works as long as only one or two of
| them are actually active at runtime.
| account42 wrote:
| You don't need a JIT to do whole program optimization.
| samus wrote:
| AOT whole program optimization has two limits:
|
| * It is possible with `dlopen()` to load code objects
| that violate the assumptions made during compilation.
|
| * The presence of runtime configuration mechanisms and
| application input can make it impossible to anticipate
| things like the choice of implementations of an
| interface.
|
| One can always strive to reduce such situations, but it
| might simply not be necessary if a JIT is present.
| Negitivefrags wrote:
| See this is why I find this odd.
|
| Is there a theory as to how devirtualisation could hurt
| performance?
| samus wrote:
| Devirtualization maybe not necessarily, but inlining might
| make code fail to fit into instruction caches.
| hansvm wrote:
| There's a cost to loading more instructions, especially if
| you have more types of instructions.
|
| The main advantages to inlining are (1) avoiding a jump and
| other function call overhead, (2) the ability to push down
| optimizations.
|
| If you execute the "same" code (same instructions, different
| location) in many places that can cause cache evictions and
| other slowdowns. It's worse if some minor optimizations were
| applied by the inlining, so you have more types of
| instructions to unpack.
|
| The question, roughly, is whether the gains exceed the costs.
| This can be a bit hard to determine because it can depend on
| the size of the whole program and other non-local parameters,
| leading to performance cliffs at various stages of
| complexity. Microbenchmarks will tend to suggest inlining is
| better in more cases than it actually is.
|
| Over time you get a feel for which functions should be
| inlined. E.g., very often you'll have guard clauses or
| whatnot around a trivial amount of work when the caller is
| expected to be able to prove the guarded information at
| compile-time. A function call takes space in the generated
| assembly too, and if you're only guarding a few instructions
| it's usually worth forcing an inline (even in places where
| the compiler's heuristics would choose not to because the
| guard clauses take up too much space), regardless of the
| potential cache costs.
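| As a rough sketch of that guard-clause case (names made up, GCC/Clang
| attribute syntax):
|     #include <cstddef>
|
|     // A tiny guarded accessor: a couple of instructions once inlined.
|     __attribute__((always_inline)) inline
|     int get_or_zero(const int* data, std::size_t size, std::size_t i) {
|         if (data == nullptr || i >= size)   // guard clauses
|             return 0;
|         return data[i];                     // trivial guarded work
|     }
|
|     int first_two_sum(const int* data, std::size_t size) {
|         // At call sites where the compiler can prove the guards (e.g. a
|         // fixed-size array), they fold away entirely after inlining.
|         return get_or_zero(data, size, 0) + get_or_zero(data, size, 1);
|     }
|
|     int main() {
|         int v[4] = {1, 2, 3, 4};
|         return first_two_sum(v, 4);   // guards provably pass here
|     }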
| masklinn wrote:
| Code bloat causing icache evictions?
| cogman10 wrote:
| Through inlining.
|
| If you have something like a `while` loop and that while
| loop's instructions fit neatly in the cache line, then
| executing that loop can be quite fast even if you have to
| jump to different code locations to do the internals.
| However, if you pump more instructions into that loop you
| can exceed the length of the cache line, which causes you to
| need more memory loads to do the same work.
|
| It can also create more code. A method like `foo(NotFinal&
| bar)` could be duplicated by the compiler for the specialized
| cases, which would be bad if there are a lot of
| implementations of `NotFinal` that end up being marshalled
| into foo. You could end up loading multiple implementations
| of the same function, which may be slower than just keeping
| the virtual dispatch tables warm.
| phire wrote:
| Jumps/calls are actually pretty cheap with modern branch
| predictors. Even indirect calls through vtables, which is the
| opposite of most programmers' intuition.
|
| And if the devirtualisation leads to inlining, that results
| in code bloat, which can lower performance through more
| instruction cache misses, which are not cheap.
|
| Inlining is actually pretty evil. It almost always speeds
| things up for microbenchmarks, as such benchmarks easily fit
| in icache. So programmers and modern compilers often go out
| of their way to do more inlining. But when you apply too much
| inlining to a whole program, things start to slow down.
|
| But it's not like inlining is universally bad in larger
| programs; inlining can enable further optimisations, mostly
| because it allows constant propagation to travel across
| function boundaries.
|
| Basically, compilers need better heuristics about when they
| should be inlining. If it's just saving the overhead of a
| lightweight call, then they shouldn't be inlining.
| qsdf38100 wrote:
| "Inlining is actually pretty evil".
|
| No it's not. Except if you __force_inline__ everything, of
| course.
|
| Inlining reduces the number of instructions in a lot of
| cases, especially when things are abstracted and factored,
| with lots of indirections into small functions that call
| other small functions and so on. Consider an 'isEmpty'
| function, which dissolves to one CPU instruction once
| inlined, compared with a call/save regs/compare/return.
| Highly dynamic code (with most functions being virtual)
| tends to result in a fest of chained calls, jumping into
| functions doing very little work. Yes, the stack is usually
| hot and fast, but spending 80% of the instructions doing
| stack management is still a big waste.
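| For instance, a minimal sketch of that 'isEmpty' case:
|     #include <cstddef>
|
|     class Queue {
|         std::size_t count_ = 0;
|     public:
|         // Defined in the class body, so implicitly inline: this typically
|         // dissolves into a single compare/test instruction at the call site.
|         bool isEmpty() const { return count_ == 0; }
|         void push() { ++count_; }
|     };
|
|     int main() {
|         Queue q;
|         q.push();
|         return q.isEmpty() ? 1 : 0;
|     }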
|
| Compilers already have good heuristics about when they
| should be inlining, chances are they are a lot better at it
| than you. They don't always inline, and that's not possible
| anyway.
|
| My experience is that compilers do marvels with inlining
| decisions when there are lots of small functions they _can_
| inline if they want to. It gives the compiler a lot of
| freedom. Lambdas are great for that as well.
|
| Make sure you make as much compile-time information as
| possible available to the compiler, factor your code, don't
| have huge functions, and let the compiler do its magic. As a
| plus, you can have high level abstractions, deep hierarchies,
| and still get excellent performance.
| grdbjydcv wrote:
| The "evilness" is just that sometimes if you inline
| aggressively in a microbenchmark things get faster but in
| real programs things get slower.
|
| As you say: "chances are they are a lot better at it than
| you". Infrequently they are not.
| EasyMark wrote:
| doesn't the compiler usually do well enough that you
| really only need to worry about time critical sections of
| code? Even then you could go in and look at the assembler
| and see if it's being inlined, no?
| usefulcat wrote:
| I find that gcc and clang are so aggressive about
| inlining that it's usually more effective to tell them
| what _not_ to inline.
|
| In a moderately-sized codebase I regularly work on, I use
| __attribute__((noinline)) nearly ten times as often as
| __attribute__((always_inline)). And I use
| __attribute__((cold)) even more than noinline.
|
| So yeah, I can kind of see why someone would say inlining
| is 'evil', though I think it's more accurate to say that
| it's just not possible for compilers to figure out these
| kinds of details without copious hints (like PGO).
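| To illustrate those annotations (GCC/Clang attribute syntax; the
| functions are hypothetical):
|     #include <cstdio>
|     #include <cstdlib>
|
|     // Error path: keep it out of line and out of the hot instruction stream.
|     __attribute__((cold, noinline))
|     void report_and_abort(const char* msg) {
|         std::fprintf(stderr, "fatal: %s\n", msg);
|         std::abort();
|     }
|
|     // Tiny hot helper: force it inline even if the heuristics hesitate.
|     __attribute__((always_inline)) inline
|     int clamp01(int x) { return x < 0 ? 0 : (x > 1 ? 1 : x); }
|
|     int process(int x) {
|         if (x < -1000) report_and_abort("out of range");  // cold path
|         return clamp01(x);                                // hot path
|     }
|
|     int main() { return process(1); }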
| jandrewrogers wrote:
| +1 on the __attribute__((cold)). Compilers so
| aggressively optimize based on their heuristics that you
| spend more time telling them that an apparent
| optimization opportunity is not actually an optimization.
|
| When writing ultra-robust code that has to survive every
| vaguely plausible contingency in a graceful way, the code
| is littered with code paths that only exist for
| astronomically improbable situations. The branch
| predictor can figure this out but the compiler frequently
| cannot without explicit instructions to not pollute the
| i-cache.
| somenameforme wrote:
| I find the Unreal Engine source to be a reasonable
| reference for C++ discussions, because it runs just
| unbelievably well for what it does, and on a huge array
| of hardware (and software). And it's explicit with
| inlining, other hints, and even a million things that
| could be easily called micro-optimizations, to a somewhat
| absurd degree. So I'd take away two conclusions from
| this.
|
| The first is that when building a code base you don't
| necessarily know what it's being compiled with. And so
| even _if_ there were a super-amazing compiler, there's
| no guarantee that's what will be compiling your code.
| Making it explicit, so long as you have a reasonably good
| idea of what you're doing, is generally just a good idea.
| It also conveys intent to some degree, especially things
| like final.
|
| The second is that I think the saying _'premature
| optimization is the root of all evil'_ is the root of all
| evil. That mindset has gradually transitioned into being
| against optimization in general, outside of the most
| primitive things like not running critical sections in
| O(N^2) when they could be O(N). And I think it's this
| mindset that has gradually brought us to where we are today,
| where we need what would have been a literal supercomputer
| not that long ago to run a word processor. It's like death
| by a thousand cuts, and quite ridiculous.
| a_e_k wrote:
| Another for the pro side: inlining can allow for better
| branch prediction if the different call sites would tend to
| drive different code paths in the function.
| phire wrote:
| This was true 15 years ago, but not so much today.
|
| The branch predictors actually hash the history of the
| last few branches taken into the branch prediction query.
| So the exact same branch within a child function will map to
| different branch predictor entries depending on which
| parent function it was called from, and there is no
| benefit to inlining.
|
| It also means the branch predictor can learn
| correlations between branches within a function, like
| when branches at the top and bottom of a function share
| conditions, or have inverted conditions.
| neonsunset wrote:
| Practically - it never does. It is always cheaper to perform
| a direct, possibly inlined, call (devirtualization !=
| inlining) than a virtual one.
|
| Guarded devirtualization is also cheaper than virtual calls,
| even when it has to do
|     if (instance is SpecificType st) { st.Call(); }
|     else { instance.Call(); }
|
| or even chain multiple checks at once (with either regular
| ifs or emitting a jump table)
|
| This technique is heavily used in various forms by .NET, JVM
| and JavaScript JIT implementations (other platforms also do
| that, but these are the major ones)
|
| The first two devirtualize virtual and interface calls
| (important in Java because all calls default to virtual,
| important in C# because people like to abuse interfaces and
| occasionally inheritance; C# delegates are also
| devirtualized/inlined now). The JS JIT (like V8) performs
| "inline caching", which is similar: for known object shapes,
| property access becomes a shape-identifier comparison plus a
| direct property read, instead of a far more expensive keyed
| lookup.
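| A hand-written C++ analogue of that guarded devirtualization, just to
| show the shape of it (a JIT does this automatically from runtime type
| profiles, and uses a cheaper type check than dynamic_cast):
|     #include <cstdio>
|
|     struct Animal {
|         virtual ~Animal() = default;
|         virtual void speak() = 0;
|     };
|     struct Cat final : Animal {
|         void speak() override { std::puts("meow"); }
|     };
|
|     void speak_guarded(Animal& a) {
|         if (auto* c = dynamic_cast<Cat*>(&a)) {
|             c->speak();   // direct call; Cat is final, so it can be inlined
|         } else {
|             a.speak();    // fallback: ordinary virtual dispatch
|         }
|     }
|
|     int main() {
|         Cat c;
|         speak_guarded(c);
|     }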
| ynik wrote:
| Caution! If you compare across languages like that, not all
| virtual calls are implemented equally. A C++ virtual call
| is just a load from a fixed offset in the vtbl followed by
| an indirect call. This is fairly cheap, on modern CPUs
| pretty much the same as a non-virtual non-inlined call. A
| Java/C# interface call involves a lot more stuff, because
| there's no single fixed vtbl offset that's valid for all
| classes implementing the interface.
| neonsunset wrote:
| Yes, it is true that there is a difference. I'm not sure
| about JVM implementation details, but the reason the comment
| says "virtual _and_ interface" calls is to outline exactly
| that. Virtual calls in .NET are sufficiently close[0] to
| virtual calls in C++. Interface calls, however, are coded
| differently[1].
|
| Also you are correct - virtual calls are not terribly
| expensive, but they encroach on ever-limited* CPU
| resources like indirect jump and load predictors and, as
| noted in parent comments, block inlining, which is highly
| undesirable.
|
| [0] https://github.com/dotnet/runtime/blob/5111fdc0dc464f
| 01647d6...
|
| [1] https://github.com/dotnet/runtime/blob/main/docs/desi
| gn/core... (mind you, the text was initially written 18
| years ago, wow)
|
| * through great effort of our industry to take back
| whatever performance wins each generation brings with
| even more abstractions that fail to improve our
| productivity
| variadix wrote:
| It basically never should unless the inliner made a terrible
| judgement. Devirtualizing in C++ can remove 3 levels of
| pointer chasing, all of which could be cache misses. Many
| optimizations in modern compilers require the context of the
| function to be inlined to make major optimizations, which
| requires devirtualization. The only downside is I$ pressure,
| but this is generally not a problem because hot loops are
| usually tight.
| bandrami wrote:
| If it's done badly, the same code that runs N times also gets
| cached N times because it's in N different locations in
| memory rather than one location that gets jumped to. Modern
| compilers and schedulers will eliminate a lot of that (but
| probably not for anything much smaller than a page), but in
| general there's always a tradeoff.
| chipdart wrote:
| > What final enables is devirtualization in certain cases. The
| main advantage of devirtualization is that it is necessary for
| inlining.
|
| I think that enabling inlining is just one of the indirect
| consequences of devirtualization, and perhaps one that is
| largely irrelevant for performance improvements.
|
| The whole point of devirtualization is eliminating the need to
| resort to pointer dereferencing when calling virtual members.
| The main trait of a virtual class is its use of a vtable, which
| requires a pointer dereference to reach each and every virtual
| member.
|
| In classes with larger inheritance chains, you can easily have
| more than one pointer dereference taking place before you
| call a virtual member function.
|
| Once a class is final, none of that is required anymore. When a
| member is called, no vtable dereferencing needs to take place.
|
| Devirtualization helps performance because you are able to
| benefit from inheritance and not have to pay a performance
| penalty for it. Without the final keyword, a performance-
| oriented project would need to be architected to not use
| inheritance at all, or at the very least not in code on the hot
| path, because it sneaks gratuitous pointer dereferences all
| over the place, which requires running extra operations and
| has a negative impact on caching.
|
| The whole purpose of the final keyword is that compilers can
| easily eliminate all pointer dereferencing used by virtual
| members. What otherwise stops them from applying this
| optimization is that they have no information on whether the
| class will be inherited from, with a derived class overriding
| its virtual members or invoking member functions implemented
| by one of its parent classes.
|
| With the introduction of the final keyword, you are now able to
| tell the compiler "from here on, this is exactly what you get",
| and the compiler can trim out anything loose.
| simonask wrote:
| An extra indirection (indirect call versus direct call) is
| practically nothing on modern hardware. Branch predictors are
| insanely good, and this isn't something you generally have to
| worry about.
|
| Inlining is by far the most impactful optimization here,
| because it can eliminate the call altogether, and thus
| specialize the called function to the callsite, lifting
| constants, hoisting loop variables, etc.
| silvestrov wrote:
| "is practically nothing on modern hardware" _if the data is
| already present in the L2 cache._ Random RAM access that
| stalls execution is expensive.
|
| My guess is this is why he didn't see any speedup: all the
| code could fit inside the L2 cache, so he did not have to
| pay for RAM access for the dereference.
|
| The number of different classes is important, not the
| number of objects, as the objects share the same small number
| of vtable pointers.
|
| It might be different for large codebases like Chrome and
| Firefox.
| dblohm7 wrote:
| Firefox has done a lot of work on devirtualization over
| the years. There is a cost.
| ot1138 wrote:
| I had a section of code which incurred ~20 clock cycles to
| make a function call to a virtual function in a critical
| loop. That's over and above potential delays resulting from
| cache misses and the need to place multiple parameters on
| the stack.
|
| I was going to eliminate polymorphism altogether for this
| object but later figured out how to refactor so that this
| particular call could be called once a millisecond. Then if
| more work was needed, it would dispatch a task to a
| dedicated CPU.
|
| This was an incredible performance improvement which made a
| significant difference to my P&L.
| mgaunard wrote:
| Could just be inefficient spilling caused by ABI
| requirements due to the inability to inline.
|
| In general, if you're manipulating values that fit into
| registers and work on a platform with a shitty ABI, you
| need to be very careful of what your function call
| boundaries look like.
|
| The most obvious example is SIMD programming on Windows
| x86 32-bit.
| pixelpoet wrote:
| Vfuncs are only fast when they can be predicted:
| https://forwardscattering.org/post/28
| mgaunard wrote:
| Same as any other branch. They're fast if predicted
| correctly and slow if not.
|
| If they cannot be predicted, write your code accordingly.
| account42 wrote:
| > Devirtualization helps performance because you are able to
| benefit from inheritance and not have to pay a performance
| penalty for that. Without the final keyword, a performance
| oriented project would need to be architected to not use
| inheritance at all, or in the very least in code in the hot
| path, because that sneaks gratuitous pointer dereferences all
| over the place, which require running extra operations and
| has a negative impact on caching.
|
| _virtual_ inheritance. Regular old inheritance does not need
| or benefit from devirtualization. This is why the CRTP
| exists.
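| For reference, the CRTP shape being referred to (minimal sketch):
|     #include <cstdio>
|
|     template <typename Derived>
|     struct ShapeBase {
|         double area() const {
|             // Resolved at compile time: no vtable, no virtual call.
|             return static_cast<const Derived*>(this)->area_impl();
|         }
|     };
|
|     struct Square : ShapeBase<Square> {
|         double side;
|         explicit Square(double s) : side(s) {}
|         double area_impl() const { return side * side; }
|     };
|
|     int main() {
|         Square s{3.0};
|         std::printf("%f\n", s.area());
|     }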
| chipdart wrote:
| > This is why the CRTP exists.
|
| CRTP does not exist for that. CRTP was one of the many
| happy accidents in template metaprogramming that happened
| to be discovered when doing recursive templates.
|
| Also, you've missed the whole point. CRTP is a way to
| rearchitect your code to avoid dereferencing pointers to
| virtual members in inheritance. The whole point is that
| with final you do not need to pull tricks: just tell the
| compiler that you don't want the class to be inherited, and
| the compiler picks up from there and does everything for
| you.
| account42 wrote:
| If that's your point then it is simply wrong. Final does
| not allow the compiler to devirtualize calls through a
| base pointer; it only eliminates the virtualness for
| calls through pointers to the (final) derived type. The
| compiler can devirtualize calls through base pointers in
| other ways (by deducing the possible derived types via
| whole program optimization or PGO) but final does not
| help with that.
| chipdart wrote:
| > If that's your point then it is simply wrong. Final
| does not allow the compiler to devirtualize calls through
| a base pointer, it only eliminates the virtualness for
| calls through pointers to the (final) derived type.
|
| Please read my post. That's not my claim. I think I was
| very clear.
| scaredginger wrote:
| Maybe a nitpick, but virtual inheritance is a term used for
| something else entirely.
|
| What you're talking about is dynamic dispatch
| oasisaimlessly wrote:
| > In classes with larger inheritance chains, you can easily
| have more than one pointer dereferencing taking place before
| you call a virtual members function.
|
| This is not a thing in C++; vtables are flat, not nested.
| Function pointers are always 1 dereference away.
| bdjsiqoocwk wrote:
| What's devirtualization in C++?
|
| Funny how things work. From working with Julia I've built a
| good intuition for guessing when functions will be inlined.
| And yet, I'd never heard the word devirtualization until now.
| saagarjha wrote:
| In C++ virtual functions are polymorphic and indirected, with
| the target not known to the compiler. Devirtualization gives
| the compiler this information (in this case, a final method
| cannot be overridden, so the call cannot branch to something
| else).
| andrewla wrote:
| I'm surprised that it has any impact on performance at all, and
| I'd love to see the codegen differences between the applications.
|
| Mostly the `final` keyword serves as a compile-time assertion.
| The compiler (sometimes linker) is perfectly capable of seeing
| that a class has no derived classes, but what `final` assures is
| that if you attempt to derive from such a class, you will raise a
| compile-time error.
|
| This is similar to how `inline` works in practice -- rather than
| providing a useful hint to the compiler (though the compiler is
| free to treat it that way) it provides an assertion that if you
| do non-inlinable operations (e.g. non-tail recursion) then the
| compiler can flag that.
|
| All of this is to say that `final` can speed up runtimes -- but
| it does so by forcing you to organize your code such that the
| guarantees apply. By using `final` classes, in places where
| dynamic dispatch can be reduced to static dispatch, you force the
| developer to not introduce patterns that would prevent static
| dispatch.
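| What that compile-time assertion looks like in practice (illustrative
| types):
|     struct Connection final {
|         int fd = -1;
|     };
|
|     // Un-commenting this fails to compile: a final class cannot be
|     // derived from.
|     // struct PooledConnection : Connection {};
|
|     int main() {
|         Connection c;
|         return c.fd == -1 ? 0 : 1;
|     }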
| bgirard wrote:
| > The compiler (sometimes linker) is perfectly capable of
| seeing that a class has no derived classes
|
| How? The compiler doesn't see the full program.
|
| The linker I'm less sure about. If the class isn't guaranteed
| to be fully private wouldn't an optimizing linker have to be
| conservative in case you inject a derived class?
| GuB-42 wrote:
| "inline" is confusing in C++, as it is not really about
| inlining. Its purpose is to allow multiple definitions of the
| same function. It is useful when you have a function defined in
| a header file, because if included in several source files, it
| will be present in multiple object files, and without "inline"
| the linker will complain of multiple definitions.
|
| It is also an optimization hint, but AFAIK, modern compilers
| ignore it.
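| A minimal sketch of that linkage use (file names are hypothetical):
|     // util.h
|     #pragma once
|
|     // Defined in a header and included from several .cpp files, this
|     // needs `inline`: each object file then contains a definition, and
|     // the linker merges them instead of reporting a multiple-definition
|     // error.
|     inline int square(int x) { return x * x; }
|
|     // a.cpp and b.cpp both contain:  #include "util.h"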
| wredue wrote:
| I believe the wording I've seen is that compilers may not
| respect the inline keyword, not that it is ignored.
| fweimer wrote:
| GCC does not ignore inline for inlining purposes:
|
| Need a way to make inlining heuristics ignore whether a
| function is inline
| https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008
|
| (Bug saw a few updates recently, that's how I remembered.)
|
| As a workaround, if you need the linkage aspect of the inline
| keyword, you currently have to write fake templates instead.
| Not great.
| lqr wrote:
| 10 years ago it was already folklore that compilers ignore
| the "inline" keyword when optimizing, but that was false for
| clang/llvm: https://stackoverflow.com/questions/27042935/are-
| the-inline-...
| jacoblambda wrote:
| The thing with `inline` as an optimisation is that it's not
| about optimising by inlining directly. It's a promise about
| how you intend to use the function.
|
| It's not just "you can have multiple definitions of the same
| function" but rather a promise that the function doesn't need
| to be address/pointer equivalent between translation units.
| This is arguably more important than inlining directly
| because it means the compiler can fully deduce how the
| function may be used without any LTO or other cross
| translation unit optimisation techniques.
|
| Of course you could still technically expose a pointer to the
| function outside a TU but doing so would be obvious to the
| compiler and it can fall back to generating a strictly
| conformant version of the function. Otherwise however it can
| potentially deduce that some branches in said function are
| unreachable and eliminate them or otherwise specialise the
| code for the specific use cases in that TU. So it potentially
| opens up alternative optimisations even if there's still a
| function call and it's not inlined directly.
| ack_complete wrote:
| > "inline" is confusing in C++, as it is not really about
| inlining. Its purpose is to allow multiple definitions of the
| same function.
|
| No, its purpose was and is still to specify a preference for
| inlining. The C++ standard itself says this:
|
| > The inline specifier indicates to the implementation that
| inline substitution of the function body at the point of call
| is to be preferred to the usual function call mechanism.
|
| https://eel.is/c++draft/dcl.inline
| lelanthran wrote:
| > It is useful when you have a function defined in a header
| file, because if included in several source files, it will be
| present in multiple object files, and without "inline" the
| linker will complain of multiple definitions.
|
| Traditionally you'd use `static` for that use case, wouldn't
| you?
|
| After all, `inline` can be ignored, `static` can't.
| pjmlp wrote:
| No, because that would make it internal to each object
| file, while what you want is for all object files to see
| the same memory location.
| lelanthran wrote:
| > No, because that would make it internal to each object
| file, while what you want is for all object files to see
| the same memory location.
|
| I can see exactly one use for an effect like that: static
| variables within the function.
|
| Are there any other uses?
| pjmlp wrote:
| Global variables and the magic of a build system based on
| C semantics.
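| A small sketch of the difference (hypothetical header counter.h):
|     // counter.h
|     #pragma once
|
|     // `static`: every .cpp that includes this gets its own copy of the
|     // function AND its own `calls` counter, so different translation
|     // units see different counts.
|     static int count_static() {
|         static int calls = 0;
|         return ++calls;
|     }
|
|     // `inline`: one shared definition, and the standard guarantees a
|     // single shared `calls` across the whole program.
|     inline int count_inline() {
|         static int calls = 0;
|         return ++calls;
|     }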
| wheybags wrote:
| What if I dlopen a shared object that contains a derived class,
| then instantiate it. You cannot statically verify that I won't.
| Or you could swap out a normally linked shared object for one
| that creates a subclass. Etc etc. This kind of stuff is why I
| think shared object boundaries should be limited to the lowest
| common denominator (basically c abi). Dynamic linking high
| level languages was a mistake. The only winning move is not to
| play.
| lanza wrote:
| > Mostly the `final` keyword serves as a compile-time
| assertion. The compiler (sometimes linker) is perfectly capable
| of seeing that a class has no derived classes
|
| That's incorrect. The optimizer has to assume everything
| escapes the current optimization unit unless explicitly told
| otherwise. It needs explicit guarantees about the visibility to
| figure out the extent of the derivations allowed.
| sixthDot wrote:
| > I'd love to see the codegen differences between the
| applications
|
| There are two applications, dynamic calls and dynamic casts.
|
| Dynamic casts to final classes don't require checking the whole
| inheritance chain. I recently did this in styx [0]. The gain may
| appear marginal, e.g. 3 or 4 dereferences saved, but in programs
| based on OOP you can easily have *billions* of dynamic casts
| saved.
|
| [0]: https://gitlab.com/styx-
| lang/styx/-/commit/62c48e004d5485d4f....
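| To show where that applies (illustrative types): when the target of a
| dynamic_cast is a final class, the runtime check can stop at a single
| type comparison, because nothing can derive from it.
|     #include <cstdio>
|
|     struct Animal { virtual ~Animal() = default; };
|     struct Cat final : Animal {};
|     struct Dog final : Animal {};
|
|     void inspect(Animal& a) {
|         if (dynamic_cast<Cat*>(&a))   // target type is final
|             std::puts("a cat");
|         else
|             std::puts("not a cat");
|     }
|
|     int main() {
|         Cat c;
|         Dog d;
|         inspect(c);
|         inspect(d);
|     }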
| bluGill wrote:
| I use final more for communication: don't look for deeper derived
| classes, as there are none. That it results in slower code is an
| annoying surprise.
| p0w3n3d wrote:
| I would say the biggest performance impact comes from `constexpr`,
| followed by `const`. I wouldn't bet any money on `final`, which in
| C++ is a guard against inheritance; a C++ virtual function's
| invocation address is resolved through the `vtable`, hence final
| wouldn't change anything. Maybe the author was confusing it with
| the `final` keyword in Java.
| adrianN wrote:
| In my experience the compiler is pretty good at figuring out
| what is constant so adding const is more documentation for
| humans, especially in C++, where const is more of a hint than a
| hard boundary. Devirtualization, as can happen when you add a
| final, or the optimizations enabled by adding a restrict to a
| pointer, are on the other hand often essential for performance
| in hot code.
| bayindirh wrote:
| Since "const" makes things read-only, being const correct
| makes sure that you don't do funny things with the data you
| shouldn't mutate, which in turn eliminates tons of data bugs
| out of the gate.
|
| So, it's an opt-in security feature first, and a compiler
| hint second.
| Lockal wrote:
| How does const affect code generation in C/C++? Last time
| I checked, const was purely informational. Compilers can't
| eliminate reads of const pointer data, because const_cast
| exists. Compilers can't eliminate double calls to const
| methods, because inside the function definition such functions
| can still legally modify mutable variables (and have many
| side effects).
|
| What actually may help is __attribute__((pure)) and
| __attribute__((const)), but I don't see them often in real
| code (unfortunately).
| account42 wrote:
| Const affects code generation when used on _variables_.
| If you have a `const int i` then the compiler can assume
| that i never changes.
|
| But you're right that this does not hold true for const
| pointers or references.
|
| > What actually may help is __attribute__((pure)) and
| __attribute__((const)), but I don't see them often in
| real code (unfortunately).
|
| It's disappointing that these haven't been standardized.
| I'd prefer different semantics though, e.g. something
| that allows things like memoization or other forms of
| caching that are technically side effects but where you
| still are ok with allowing the compiler to remove /
| reorder / eliminate calls.
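| A small sketch of that distinction (GCC/Clang attribute syntax):
|     #include <cstdio>
|
|     const int limit = 100;   // const *object*: the compiler may assume it
|                              // never changes and fold it into every use.
|
|     // Depends only on its arguments and has no side effects: duplicate
|     // calls with the same argument may be merged.
|     __attribute__((const)) int twice(int x) { return 2 * x; }
|
|     int clamp(const int* p) {
|         // `const int*` only stops *this* code writing through p; the
|         // pointee may still change elsewhere, so no folding is allowed.
|         return *p > limit ? limit : *p;
|     }
|
|     int main() {
|         int v = 250;
|         std::printf("%d %d\n", clamp(&v), twice(3));
|     }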
| lelanthran wrote:
| > In my experience the compiler is pretty good at figuring
| out what is constant so adding const is more documentation
| for humans,
|
| In the same TU, sure. But across TU boundaries the compiler
| really can't figure out what should be const and what should
| not, so `const` in parameter or return values allows the
| compiler to tell the human _"You are attempting to make a
| modification to a value that some other TU put into RO
| memory"_, or issue similar diagnostics.
| account42 wrote:
| > followed by `const`
|
| Const can only ever possibly have a performance impact when
| used directly on variables. const pointers / references are
| purely for the benefit of the programmer - the compiler can
| assume nothing because the variable could be modified elsewhere
| or through another pointer/reference and const_cast is legal
| anyway unless the original variable was const.
| ein0p wrote:
| You should use final to express design intent. In fact I'd rather
| it were the default in C++, and there was some sort of an
| opposite ('derivable'?) keyword instead, but that ship sailed a
| long time ago. Any measurable negative perf impact should be
| filed as a bug and fixed.
| cesarb wrote:
| > In fact I'd rather it were the default in C++, and there was
| some sort of an opposite ('derivable'?) keyword instead
|
| Kotlin (which uses the equivalent of the Java "final" keyword
| by default) uses the "open" keyword for that purpose.
| josefx wrote:
| Intent is nice and all that, but I would like a
| "notwithstanding" keyword instead that just lets me bypass that
| kind of "intent" without having to copy-paste the entire
| implementation just to remove a pointless keyword or make a
| destructor public when I need it.
| jbverschoor wrote:
| In general, I think things should be strict by default. Way
| easier to optimize and less error prone.
| leni536 wrote:
| C++ doesn't have the fragile base class problem, as members
| aren't virtual by default. The only concern with unintended
| inheritance is with polymorphic deletion. "final" on a class
| definition disables some tricks that you can do with private
| inheritance.
|
| Having said that, "final" on member functions is great, and I
| like to see that instead of "override".
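| What that looks like (illustrative):
|     struct Animal {
|         virtual ~Animal() = default;
|         virtual void speak() {}
|     };
|
|     struct Cat : Animal {
|         // Overrides Animal::speak() and forbids any further overriding
|         // in classes derived from Cat; `final` implies `override` here.
|         void speak() final {}
|     };
|
|     // struct Kitten : Cat { void speak() override {} };  // error: final
|
|     int main() {
|         Cat c;
|         c.speak();
|     }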
| pjmlp wrote:
| All OOP languages have it; the issue is related to changing
| the behaviour of the base class, and the change introducing
| unforeseen consequences on the inheritance tree.
|
| Changing an existing method's way of being called (regular,
| virtual, static), changing visibility, overloading, introducing
| a name that clashes downstream, introducing a virtual
| destructor, making a data member non-copyable,...
| leni536 wrote:
| > All OOP languages have it; the issue is related to
| changing the behaviour of the base class, and the change
| introducing unforeseen consequences on the inheritance tree.
|
| C++ largely solves it by having tight encapsulation. As
| long as you don't change anything that breaks your existing
| interface, you should be good. And your interface is opt-
| in, including public members and virtual functions.
| pjmlp wrote:
| Not when you change the contents of the class itself for
| public and protected inheritance members, which is
| exactly the whole issue of fragile base class.
|
| It doesn't go away just because private members exist as
| possible language feature.
| leni536 wrote:
| That's not a fragile base, that's just a fragile class.
| You can break APIs for all kinds of users, including
| derived classes.
|
| Some APIs are aimed towards derived classes, like
| protected members and virtual functions, but that doesn't
| make the issue fundamentally different. It's just
| breaking APIs.
|
| Point is, in C++ you have to opt-in to make these API
| surfaces, they are not the default.
| pjmlp wrote:
| I give up, word games to avoid acknowledging the same
| happens.
| jstimpfle wrote:
| Now try a regular function, you will be blown away. No need
| to type "final"...
| jey wrote:
| I wonder if LTO was turned on when using Clang? Might lead to a
| performance improvement.
| pineapple_sauce wrote:
| What should be evaluated is removing indirection and tightly
| packing your data. I'm sure you'll gain a better performance
| improvement that way. Virtual calls and shared_ptr are littered
| throughout the codebase.
|
| That way you can avoid the need for the `final` keyword and still
| get the optimization the keyword enables (de-virtualized calls).
|
| >Yes, it is very hacky and I am disgusted by this myself. I would
| never do this in an actual product
|
| Why? What's with the C++ community and their disgust for macros
| without any underlying reasoning? It reminds me of everyone
| blindly saying "Don't use goto; it creates spaghetti code".
|
| Sure, if macros are overused, code can become hard to read and
| maintain. But for something simple like this, you shouldn't be
| thinking "I would never do this in an actual product".
| sfink wrote:
| Macros that are giving you some value can be ok. In this case,
| once the performance conclusion is reached, the only reason to
| continue using a macro is if you really need the `final`ity to
| vary between builds. Otherwise, just delete it or use the
| actual keyword.
|
| (But I'm worse than the author; if I'm just comparing
| performance, I'd probably put `final` everywhere applicable and
| then do separate compiles with `-Dfinal=` and
| `-Dfinal=final`... I'd be making the assumption that it's
| something I either always or never want eventually, though.)
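| A sketch of that toggle without redefining the keyword itself (macro
| and type names invented here):
|     #ifdef BENCH_USE_FINAL
|     #  define MAYBE_FINAL final
|     #else
|     #  define MAYBE_FINAL
|     #endif
|
|     struct Shape {
|         virtual ~Shape() = default;
|         virtual double area() const = 0;
|     };
|
|     struct Sphere MAYBE_FINAL : Shape {
|         double r = 1.0;
|         double area() const MAYBE_FINAL { return 4.0 * 3.14159265 * r * r; }
|     };
|
|     int main() {
|         Sphere s;
|         return s.area() > 0 ? 0 : 1;
|     }
|
| Building once with -DBENCH_USE_FINAL and once without gives the two
| configurations to compare.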
| bluGill wrote:
| Macros in C are a textual replacement, so it is hard to see from
| a debugger how the code got like that.
| pineapple_sauce wrote:
| Yes, I'm well aware of the definition of a macro in C and
| C++. Macros are simpler than templates. You can expand them
| with a compiler flag.
| bluGill wrote:
| When things get complex, template error messages are easier
| to follow. Nobody makes complex macros, but if you tried...
| (template error messages are legendary for a reason; nested
| macros are worse)
| account42 wrote:
| > nobody makes complex macros
|
| http://boost.org/libs/preprocessor
| jandrewrogers wrote:
| In modern C++, macros are viewed as a code smell because they
| are strictly worse than alternatives in almost all situations.
| It is a cultural norm; it is a bit like using "unsafe" in Rust
| when not strictly required for some trivial case. The C++
| language has made a concerted effort to eliminate virtually all
| use cases for macros since C++11 and replace them with type-
| safe first-class features in the language. It is a bit of a
| legacy thing at this point; there are large modern C++
| codebases with no macros at all, not even for things like
| logging. While macros aren't going away, especially in older
| code, the cultural norm in modern C++ has tended toward macros
| being a legacy foot-gun, best avoided if at all possible.
|
| The main remaining use case for the old C macro facility I
| still see in new code is to support conditional compilation of
| architecture-specific code e.g. ARM vs x86 assembly routines or
| intrinsics.
| sgerenser wrote:
| But how would one conditionally enable or disable the "final"
| keyword on class members without a preprocessor macro, even
| in C++23?
| jandrewrogers wrote:
| Macros are still useful for conditional compilation, as in
| this case. They've been sunsetted for anything that looks
| like code generation, which this isn't. I was more
| commenting on the reflexive "ick" reaction of the author to
| the use of macros (even when appropriate) because avoiding
| them has become so engrained in C++ culture. I'm a macro
| minimalist but I would use them here.
|
| Many people have a similar reaction to the use of "goto",
| even though it is absolutely the right choice in some
| contexts.
| gpderetta wrote:
| 1% is nothing to scoff at. But I suspect that the variability of
| compilation (specifically quirks of instruction selection,
| register allocation and function alignment) more than masks any
| gains.
|
| The clang regression might be explained by final allowing some
| additional inlining and clang making a hash of it.
| jcalvinowens wrote:
| That's interesting. Maybe final enabled more inlining, and clang
| is being too aggressive about it for the icache sizes in play
| here? I'd love to see a comparison of the generated code.
|
| I'm disappointed the author's conclusion is "don't use final",
| not "something is wrong with clang".
| ot wrote:
| Or "something is wrong with my benchmark setup", which is also
| a possibility :)
|
| Without a comparison of generated code, it could be anything.
| indigoabstract wrote:
| If it does have a noticeable impact, that would be surprising, a
| bit like going back to the days when 'inline' was supposed to
| tell the compiler to inline the designated functions (no longer
| its main use case nowadays).
| sfink wrote:
| tldr: sprinkled a keyword around in the hopes that it "does
| something" to speed things up, tested it, got noisy results but
| no miraculous speedup.
|
| I started skimming this article after a while, because it seemed
| to be going into the weeds of performance comparison without ever
| backing up to look at what the change might be doing. Which meant
| that I couldn't tell if I was going to be looking at the usual
| random noise of performance testing or something real.
|
| For `final`, I'd want to at least see whether it changes the
| generated code by replacing indirect vtable calls with direct or
| inlined calls. It might be that the compiler is already figuring
| it out and the keyword isn't doing anything. It might be that the
| compiler _is_ changing code, but the target address was already
| well-predicted and it's perturbing code layout enough that it
| gets slower (or faster).
| here, but I can't tell without at least a little assembly output
| (or perhaps a relevant portion of some intermediate
| representation, not that I would know which one to look at).
|
| If it's not changing anything, then perhaps there could be an
| interesting investigation into the variance of performance
| testing in this scenario. If it's changing something, then there
| could be an interesting investigation into when that makes things
| faster vs slower. As it is, I can't tell what I should be looking
| for.
| sgerenser wrote:
| This is what I was waiting for too. Especially with the large
| regression on Clang/Ubuntu. Maybe he uncovered a Clang/LLVM
| codegen bug, but you'd need to compare the generated assembly
| to know.
| akoboldfrying wrote:
| >changing the generated code by replacing indirect vtable calls
| with direct or inlined calls
|
| It can't possibly be doing this, if the raytracing code is like
| any other raytracer I've ever seen -- since it must be looping
| through a list of concrete objects that implement some shared
| interface, calling intersectRay() on each one, and the
| existence of those derived concrete object types means that
| that shared interface _can't_ be made final, and that's the
| only thing that would enable devirtualisation -- it makes no
| difference whether the concrete derived types themselves are
| final or not.
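| A sketch of the call pattern being described (types invented, not
| taken from the article's ray tracer):
|     #include <memory>
|     #include <vector>
|
|     struct Ray {};
|
|     struct Hittable {              // the shared interface: cannot be final
|         virtual ~Hittable() = default;
|         virtual bool intersectRay(const Ray&) const = 0;
|     };
|
|     struct Sphere final : Hittable {
|         bool intersectRay(const Ray&) const override { return true; }
|     };
|
|     int count_hits(const std::vector<std::unique_ptr<Hittable>>& scene,
|                    const Ray& r) {
|         int hits = 0;
|         for (const auto& obj : scene)
|             // The static type here is Hittable, so this stays a virtual
|             // call regardless of whether Sphere is final.
|             hits += obj->intersectRay(r) ? 1 : 0;
|         return hits;
|     }
|
|     int main() {
|         std::vector<std::unique_ptr<Hittable>> scene;
|         scene.push_back(std::make_unique<Sphere>());
|         return count_hits(scene, Ray{});
|     }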
| drivebycomment wrote:
| +1. On modern hardware and software systems, performance is
| effectively stochastic to some degree, as small random
| perturbations to the input (code, data, environments, etc.) can
| have arbitrary effects on performance. This means you
| can't draw a direct causal chain / mechanism from what you
| changed to the performance change - when it matters, you do
| need to do a deeper analysis and investigation to find the
| actual and full causal chain. I.e. correlation is not
| causation, especially so on modern hardware and
| software systems.
| jeffbee wrote:
| It's difficult to discuss this stuff because the impact can be
| negligible or negative for one person, but large and consistently
| positive for another. You can only usefully discuss it on a given
| baseline, and for something like final I would hope that baseline
| would be a project that already enjoys PGO, LTO, and BOLT.
| tombert wrote:
| I don't do much C++, but I have definitely found that engineers
| will just assert that something is "faster" without any evidence
| to back that up.
|
| Quick example, I got in an argument with someone a few years ago
| that claimed in C# that a `switch` was better than an `if(x==1)
| elseif(x==2)...` because switch was "faster" and rejected my PR.
| I mentioned that that doesn't appear to be true, we went back and
| forth until I did a compile-then-decompile of a minimal test with
| equality-based-ifs, and showed that the compiler actually
| converts equality-based-ifs to `switch` behind the scenes. The
| guy accepted my PR after that.
|
| But there's tons of stuff like this in CS, and I kind of
| blame professors for a lot of it [1]. A large part of becoming a
| decent engineer [2] for me was learning to stop trusting what
| professors taught me in college. Most of what they said was fine,
| but you can't _assume_ that; what they tell you could be out of
| date, or simply never correct to begin with, and as far as I can
| tell you have to _always_ test these things.
|
| It doesn't help that a lot of these "it's faster" arguments are
| often reductive because they only are faster in extremely minimal
| tests. Sometimes a microbenchmark will show that something is
| faster, and there's value in that, but I think it's important
| that that can also be a small percentage of the total program;
| compilers are obscenely good at optimizing nowadays, it can be
| difficult to determine _when_ something will be optimized, and
| your assertion that something is "faster" might not actually be
| true in a non-trivial program.
|
| This is why I don't really like doing any kind of major
| optimizations before the program actually works. I try to keep
| the program in a reasonable Big-O and I try and minimize network
| calls cuz of latency, but I don't bother with any kind of micro-
| optimizations in the first draft. I don't mess with bitwise, I
| don't concern myself on which version of a particular data
| structure is a millisecond faster, I don't focus too much on
| whether I can get away with a smaller sized float, etc. Once I
| know that the program is correct, _then_ I benchmark to see if
| any kind of micro-optimizations will actually matter, and often
| they really don't.
|
| [1] That includes me up to about a year ago.
|
| [2] At least I like to pretend I am.
| BurningFrog wrote:
| Even if one of these constructs is faster, _it doesn't matter_
| 99% of the time.
|
| Writing well structured readable code is typically far more
| important than making it twice as fast. And those times can
| rarely be predicted beforehand, so you should mostly not worry
| about it until you see real performance problems.
| tombert wrote:
| I mostly focus on "using stuff that won't break", and yeah
| "if it actually matters".
|
| For example, much to the annoyance of a lot of people, I
| don't typically use floating point numbers when I start out.
| I will use the "decimal" or "money" types of the language, or
| GMP if I'm using C. When I do that, I can be sure that I
| won't have to worry about any kind of funky overflow issues
| or bizarre rounding problems. There _might_ be a performance
| overhead associated with it, but then I have to ask myself
| "how often is this actually called?"
|
| If the answer is "a billion times" or "once in every
| iteration of the event loop" or something, then I will
| probably eventually go back and figure out if I can use a
| float or convert it to an integer-based thing, but in a lot
| of cases the answer is "like ten or twenty times", and at
| that point I'm not even 100% sure it would be even measurable
| to change to the "faster" implementations.
|
| What annoys me is that people will act like they really care
| about speed, do all these annoying micro-optimizations, and
| then forget that pretty much all of them get wiped out
| immediately upon hitting the network, since the latency
| associated with that is obscene.
| apantel wrote:
| The counter-argument to this is if you are building something
| that is in the critical path of an application (for example,
| parsing HTTP in a web server), you need to be performance-
| minded from the beginning because design decisions lead to
| design decisions. If you are building something in the
| critical path of the application, the best thing to do is
| build it from the ground up measuring the performance of what
| you have as you go. This way, each time you add something you
| will see the performance impact and usually there's a more
| performant way of doing something that isn't more obscure. If
| you do this as you build, early choices become constraints,
| but because you chose the most performant thing at every
| stage, the whole process takes you in the direction of a
| highly-performant implementation.
|
| Why should you care about performance?
|
| I can give you my personal experience: I've been working on a
| Java web/application server for the past 15 years and a
| typical request (only reading, not writing to the db) would
| take maybe 4-5 ms to execute. That includes HTTP request
| parsing, JSON parsing, session validation, method execution,
| JSON serialization, and HTTP response dispatch. Over the past
| 9 months I have refactored the entire application for
| performance and a typical request now takes about 0.25 ms or
| 250 microseconds. The computer is doing so much less work to
| accomplish the same tasks, it's almost silly how much work it
| was doing before. And the result is the machine can handle
| 20x more requests in the same amount of time. If it could
| handle 200 requests per second per core before, now it can
| handle 4000. That means the need to scale is felt 20x less
| intensely, which means less complexity around scaling.
|
| High performance means reduced scaling requirements.
| tombert wrote:
| But even that sort of depends right? Hardware is often
| pretty cheap in comparison to dev-time. It really depends on
| the project, what kind of servers you're using, the nature
| of the application etc, but I think a lot of the time it
| might be cheaper to just pay for 20x the servers than it
| would be to pay a human to go find a critical path.
|
| I'm not saying you completely throw caution to the wind,
| I'm just saying that there's a finite amount of human
| resources and it can really vary how you want to allocate
| them. Sometimes the better path is to just throw money at
| the problem.
|
| It really depends.
| apantel wrote:
| I think it depends on what you're building and who's
| building it. We're all benefitting from the fact that the
| designers of NGINX made performance a priority. We like
| using things that were designed to be performant. We like
| high-FPS games. We like fast internet.
|
| I personally don't like the idea of throwing compute at a
| slow solution. I like when the extra effort has been put
| into something. The good feeling I get from interacting
| with something that is optimal or excellent is an end in
| itself and one of the things I live for.
| tombert wrote:
| Sure, though I've mentioned a few times in this thread
| now that the thing that bothers me more than CPU
| optimizations is not taking into account latency,
| particularly when hitting the network, and I think
| focusing on that will generally pay higher dividends than
| trying to optimize for processing.
|
| CPUs are ridiculously fast now, and compilers are really
| really good now too. I'm not going to say that processing
| speed is a "solved" problem, but I am going to say that
| in a lot of performance-related cases the CPU processing
| is probably not your problem. I will admit that this kind
| of pokes holes in my previous response, because
| introducing more machines into the mix will almost
| certainly increase latency, but I think it more or less
| holds depending on context.
|
| But I think it really is a matter of nuance, which you
| hinted at. If I'm making an admin screen that's going to
| have like a dozen users max, then a slow, crappy solution
| is probably fine; the requests will be served fast enough
| to where no one will notice anyway, and you can probably
| even get away with the cheapest machine/VM. If I'm making
| an FPS game that has 100,000 concurrent users, then it
| almost certainly will be beneficial to squeeze out as
| much performance out of the machine as possible, both CPU
| _and_ latency-wise.
|
| But as I keep repeating everywhere, you have to measure.
| You cannot assume that your intuition is going to be
| right, particularly at-scale.
| apantel wrote:
| I absolutely agree that latency is the real thing to
| optimize for. In my case, I only leave the application to
| access the db, and my applications tend not to be write-
| heavy. So in my case latency-per-request == how much work
| the computer has to do, which is constrained to one core
| because the overhead of parallelizing any part of the
| pipeline is greater than the work required. See, in that
| sense, we're already close to the performance ceiling for
| per-request processing because clock speeds aren't going
| up. You can't make the processing of a given request
| faster by throwing more hardware at it. You can only make
| it faster by creating less work for the hardware to do.
|
| (Ironically, HN is buckling under load right now, or some
| other issue.)
| oivey wrote:
| It almost certainly would require more than 20x servers
| because setting up horizontal scaling will have some sort
| of overhead. Not only that, there is the significant
| engineering effort to develop and maintain the code to
| scale.
|
| If your problem can fit on one server, it can massively
| reduce engineering and infrastructure costs.
| neonsunset wrote:
| Please accept a high five from a fellow "it does so little
| work it must have sub-millisecond request latency"
| aficionado (though I must admit I'm guilty of abusing
| memory caches to achieve this).
| apantel wrote:
| Caches, precomputed values, lookup tables -- it's all
| good as long as it's well-organized and maintainable.
| neonsunset wrote:
| This attitude is part of the problem. Another part of the
| problem is having no idea which things actually end up
| costing performance and how much.
|
| It is why many language ecosystems suffered from performance
| issues for a really long time even if completely unwarranted.
|
| Is changing ifs to switch or vice versa, as outlined in the
| post above, a waste of time? Yes; unless you are writing some
| encoding algorithm or a parser, it will not matter. The
| compiler will lower trivial statements to the same codegen,
| and even if there were a difference, it would not affect
| performance for the problem the code is solving.
|
| However, there are things that _do_ cost: interface spam,
| abusing lambdas to write needlessly complex workflow-style
| patterns (which are also less readable and worse in 8 out of
| 10 instances), not caching objects that always have the same
| value, etc.
|
| These kinds of issues, for example, plagued the .NET ecosystem
| until a more recent culture shift where it started to be cool
| once again to focus on performance. It wasn't helped by
| the notion of "well-structured code" being just idiotic
| "clean architecture" and "GoF patterns" style dogma applied
| to the smallest applications and simplest of business domains.
|
| (it is also the reason why picking slow languages in general
| is a really bad idea - _everything_ costs more and you have
| way less leeway for no productivity win - Ruby, Python, and
| JS with Node.js are less productive to write in than C#/F#,
| Kotlin/Java or Go (under some conditions))
| tombert wrote:
| I mean, that's kind of why I tried to emphasize measuring
| things yourself instead of depending on tribal knowledge.
|
| There are plenty of cases where even the "slow"
| implementation is more than fast enough, and there are also
| plenty of cases where the "correct" solution (from a big-O
| or intuition perspective) is actually slower than the dumb
| case. Intuition _helps_, but you _have_ to measure and/or
| look at the compiled results if you want to ensure correct
| numbers.
|
| An example that really annoys me is how every whiteboard
| interview ends up being "interesting ways to use a
| hashmap", which isn't inherently an issue, but they will
| usually be so small-scoped that an iterative "array of
| pairs" might actually be cheaper than paying the up-front
| cost of hashing and potentially dealing with collisions.
| Interviews almost always ignore constant factors, and
| that's fair enough, but in reality constant factors _can_
| matter, and we're training future employees to ignore
| that.
|
| I'll say it again: as far as I can tell, you _have_ to
| measure if you want to know if your result is "faster".
| "Measuring" might involve memory profilers, or dumb timers,
| or a mixture of both. Gut instincts are often wrong.
| leetcrew wrote:
| agreed, especially in cases like this. final is primarily a way
| to prohibit overriding methods and extending classes, and it
| indicates to the reader that they should not be doing this. use
| it when it makes conceptual sense.
|
| that said, c++ is usually a language you use when you care
| about performance, at least to an extent. it's worth
| understanding features like nrvo and rewriting functions to
| allow the compiler to pick the optimization if it doesn't hurt
| readability too much.
| wvenable wrote:
| In my opinion, the only things that really matter are
| algorithmic complexity and readability. And even algorithmic
| complexity is usually only an issue at certain scales. Whether
| or not an 'if' is faster than a 'switch' is the micro of micro
| optimizations -- you'd better have a good reason to care. The
| question I would have for you is whether your bunch of ifs was
| more readable than a switch would have been.
| doctor_phil wrote:
| But a switch and an if-else *is* a matter of algorithmic
| complexity. (Well, at least could be for a naive compiler). A
| switch could be converted to a constant time jump, but the
| if-else would be trying each case linearly.
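|
| A minimal sketch of the two shapes (hypothetical values, not
| from any real PR), assuming a recent optimizing compiler:
|     // if-else chain: semantically "test each condition in order"
|     int lookup_if(int x) {
|         if (x == 1) return 10;
|         else if (x == 2) return 20;
|         else if (x == 3) return 30;
|         else return -1;
|     }
|     // switch: the compiler is free to emit a jump table
|     int lookup_switch(int x) {
|         switch (x) {
|             case 1: return 10;
|             case 2: return 20;
|             case 3: return 30;
|             default: return -1;
|         }
|     }
| With simple integer comparisons like these, modern optimizers
| usually end up emitting the same code for both forms anyway.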
| cogman10 wrote:
| Yup.
|
| That said, the linear test is often faster due to CPU
| caches, which is why JITs will often convert switches to
| if/elses.
|
| IMO, switch is clearer in general and potentially faster
| (at the very least the same speed), so it should be preferred
| when dealing with 3+ if/elseif statements.
| tombert wrote:
| Hard disagree that it's "clearer". I have had to deal
| with a ton of bugs with people trying to be clever with
| the `break` logic, or forgetting to put `break` in there
| at all.
|
| if statements are dumber, and maybe arguably uglier, but
| I feel like they're also more clear, and people don't try
| and be clever with them.
| cogman10 wrote:
| Updates to languages (don't know where C# is on this)
| have different types of switch statements that eliminate
| the `break` problem.
|
| For example, with java there's enhanced switch that looks
| like this:
|     var val = switch (foo) {
|         case 1, 2, 3 -> bar;
|         case 4 -> baz;
|         default -> { yield bat(); }
|     };
|
| The C style switch break stuff is definitely a language
| mistake.
| wvenable wrote:
| C# has both switch expressions like this, and break
| statements are also not optional in traditional switch
| statements, so it actually solves both problems. You can't
| get too clever with switch statements in C#.
|
| However most languages have pretty permissive switch
| statements just like C.
| tombert wrote:
| Yeah, fair, it's been a while since I've done any C#, so
| my memory is a bit hazy on the details. I've been
| burned by C switch statements, so I have a pretty strong
| distaste for them.
| smaudet wrote:
| I think using C as your language with which to judge
| language constructs is hardly fair - one of its main
| strengths has been as a fairly stable, unchanging code-
| to-compiler contract, i.e. little to no syntax change
| or improvement.
|
| So no offense, but I would revisit the wider world of
| language constructs before claiming that switch
| statements are "all bad". There are plenty of bad
| languages or languages with poor implementations of
| syntax, that do not make the fundamental language
| construct bad.
| neonsunset wrote:
| C# has switch statements which are C/C++ style switches
| and switch expressions which are like Rust's match except
| no control flow statements inside:
|     var len = slice switch {
|         null => 0,
|         "Hello" or "World" => 1,
|         ['@', ..var tags] => tags.Length,
|         ['{', ..var body, '}'] => body.Length,
|         _ => slice.Length,
|     };
|
| (it supports a lot more patterns but that wouldn't fit)
| gloryjulio wrote:
| This is just forcing a return value. You either have to
| break or return at the branches. To me they all look
| equivalent.
| SAI_Peregrinus wrote:
| I always set -Werror=implicit-fallthrough, among others.
| That prevents fallthrough unless explicitly annotated.
| Sadly these will forever remain optional warnings
| requiring specific compiler flags, since requiring them
| could break compilation of broken legacy code.
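|
| For example, with C++17's [[fallthrough]] attribute (Start,
| Resume and the puts calls are just placeholder names for
| illustration):
|     #include <cstdio>
|     enum State { Start, Resume, Done };
|     void step(State s) {
|         switch (s) {
|             case Start:
|                 std::puts("starting");
|                 [[fallthrough]]; // intentional, silences the warning
|             case Resume:
|                 std::puts("running");
|                 break;
|             default:
|                 break;
|         }
|     }
| Without the annotation, -Werror=implicit-fallthrough turns the
| fall-through from Start into Resume into a hard error.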
| neonsunset wrote:
| Any sufficiently advanced compiler will rewrite those
| arbitrarily depending on its heuristics. What authors
| usually forget is that there is defined behavior and
| specification which the compiler abides by, but it is
| otherwise free to produce any codegen that preserves the
| defined program order. Branch reordering, generating jump
| tables, optimizing away or coalescing checks into
| branchless forms are all very common. When someone says
| "oh I write C because it lets you tell the CPU exactly how
| to execute the code", it is simply a sign that the person
| has never actually looked at disassembly and has little to
| no idea how the tool they use works.
| cogman10 wrote:
| A compiler will definitely try this, but it's important
| to note that if/else blocks tell the compiler that "you
| will run these evaluations in order". Now, if the
| compiler can detect that the evaluations have no side
| effects (which, in this simple example with just integer
| checks, is fairly likely) then yeah I can see a jump
| table getting shoved in as an optimization.
|
| However, the moment you add a side effect or something
| more complicated like a method call, it becomes really
| hard for the compiler to know if that sort of
| optimization is safe to do.
|
| The benefit of the switch statement is that it's already
| well positioned for the compiler to optimize as it does
| not have the "you must run these evaluations in order"
| requirement. It forces you to write code that is fairly
| compiler friendly.
|
| All that said, probably a waste of time debating :D.
| Ideally you have profiled your code and the profiler has
| told you "this is the slow block" before you get to the
| point of worrying about how to make it faster.
| tombert wrote:
| I agree with what you said but in this particular case,
| it actually was a direct integer equality check, there
| was zero risk of hitting side effects and that was
| plainly obvious to me, the checker, and compiler.
| cogman10 wrote:
| And to your original comment, I think the reviewer was
| wrong to reject the PR over that. Performance has to be
| measured before you can use it to reject (or create...) a
| PR. If someone hasn't done that then unless it's
| something obvious like "You are making a ton of tiny heap
| allocations in a tight loop" then I think nitpicking
| these sorts of things is just wrong.
| saurik wrote:
| While I personally find the if statements harder to
| immediately mentally parse/grok--as I have to prove to
| myself that they are all using the same variable and are
| all chained correctly in a way that is visually obvious for
| the switch statement--I don't find "but what if we use a
| naive compiler" at all a useful argument to make as, well,
| we aren't using a naive compiler, and, if we were, there
| are a ton of other things we are going to be sad about the
| performance of leading us down a path of re-implementing a
| number of other optimizations. The goal of the compiler is
| to shift computational complexity from runtime to compile
| time, and figuring out whether the switch table or the
| comparisons are the right approach seems like a legitimate
| use case (which maybe we have to sometimes disable, but
| probably only very rarely).
| smaudet wrote:
| Per my sibling comment, I think the argument is not about
| speed, but simplicity.
|
| Awkward switch syntax aside, the switch is simpler to
| reason about. Fundamentally we should strive to keep our
| code simple to understand and verify, not worry about
| compiler optimizations (on the first pass).
| saurik wrote:
| Right, and there I would say we even agree, per my first
| sentence; however, I wanted to reply not to you, but to
| doctor_phil, who was explicitly disagreeing about speed.
| bregma wrote:
| But what if, and stick with me here, a compiler is capable
| of reading and processing your code and through simple
| scalar evolution of the conditionals and phi-reduction, it
| can't tell the difference between a switch statement and a
| sequence of if statements by the time it finishes its
| single static analysis phase?
|
| It turns out the algorithmic complexity of a switch
| statement and the equivalent series of if-statements is
| identical. The bijective mapping between them is close to
| the identity function. Does a naive compiler exist that
| doesn't emit the same instructions for both, at least
| outside of toy hobby project compilers written by amateurs
| with no experience?
| smaudet wrote:
| The issue with if statements (for compiled languages) is
| not one of "speed" but of correctness.
|
| If statements are unbounded, unconstrained logic
| constructs, whereas switch statements are type-checkable.
| The concern about missing break statements here is
| irrelevant: where your linter/compiler can warn about
| missing switch cases, it can just as easily warn about non-
| terminated (not explicitly marked as fall-through) cases.
|
| For non-compiled languages (so branch prediction is not
| possible because the code is not even loaded), switch
| statements also provide a speed-up, i.e. the parser can
| immediately evaluate the branch to execute vs being
| forced to evaluate intermediate steps (and the conditions
| to each if statement can produce side-effects), e.g.
|     if (checkAndDo()) { ... }
|     else if (checkAndDoB()) { ... }
|     else if (checkAndDoC()) { ... }
|
| Which, of course, is a potential use of if statements
| that switches cannot use (although side-effects are
| usually bad, if you listened to your CS profs)... And
| again a sort of "static analysis" guarantee that switches
| can provide that if statements cannot.
| adrianN wrote:
| Both the switch and the if have O(1) instructions, so both
| are the same from an algorithmic complexity perspective.
| yau8edq12i wrote:
| Unless the number of "else if" statements somehow grows
| e.g. linearly with the size of your input, which isn't
| plausible, the "else if" statements also execute in O(1)
| time.
| Gazoche wrote:
| It's linear with respect to the number of cases, not the
| size of inputs. It's still O(1) in the sense of algorithmic
| complexity.
| tombert wrote:
| Yeah, and it's not like I didn't know how to do the stuff I
| was doing with a switch, I just don't like switches because
| I've forgotten to add break statements and had code that
| appeared correct but actually broke a month down the line. I've
| also seen other people make the same mistakes. ifs, in my
| opinion at least, are a bit harder to screw up, so I will
| always prefer them.
|
| But I agree, algorithmic complexity is generally the only
| thing I focus on, and even then it's almost always a case of
| "will that actually matter?" If I know that `n` is never
| going to be more than like `10`, I might not bother trying to
| optimize an O(n^2) operation.
|
| What I feel often gets ignored in these conversations is
| latency; people obsess over some "optimization" they learned
| in college a decade ago, and ignore the 200 HTTP or Redis
| calls being made ten lines below, despite the fact that the
| latter will have a substantially higher impact on
| performance.
| dllthomas wrote:
| > in my opinion at least, are a bit harder to screw up, so
| I will always prefer them
|
| My experience is the opposite - a sizeable chain of ifs has
| more that can go wrong precisely because it is more
| flexible. If I'm looking at a switch, I immediately know,
| for instance, that none of the tests modifies anything.
|
| Meanwhile, while a missing break can be a brutal error in a
| language that allows it, it's usually trivial to set up
| linting to require either an explicit break or a comment
| indicating fallthrough.
| jpc0 wrote:
| ... really matter are algorithmic complexity ...
|
| This is not entirely true either... Measure. There are many
| cases where the optimiser will vectorise a certain algorithm
| but not another... In many cases O(n^2) vectorised may be
| significantly faster than O(n) or O(n log n) even for very
| large datasets, depending on your data...
|
| Make your algorithms generic and it won't matter which one
| you use; if you find that one is slower, swap it for the
| quicker one. Depending on CPU arch and compiler optimisations,
| the fastest algorithm may actually change multiple times in a
| codebase's lifetime even if the usage pattern doesn't change
| at all.
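|
| A small sketch of what "make your algorithms generic" could
| look like (hypothetical names; which variant wins is exactly
| the thing you would measure, not assume):
|     #include <algorithm>
|     #include <vector>
|     // Keep the lookup strategy swappable so it can be replaced
|     // by whichever one measures faster on your data/hardware.
|     template <typename Finder>
|     bool contains(const std::vector<int>& v, int key, Finder find) {
|         return find(v, key);
|     }
|     // Linear scan: O(n), cache-friendly, easily vectorised.
|     bool linear_find(const std::vector<int>& v, int key) {
|         return std::find(v.begin(), v.end(), key) != v.end();
|     }
|     // Binary search: O(log n), needs sorted data, branches more.
|     bool binary_find(const std::vector<int>& v, int key) {
|         return std::binary_search(v.begin(), v.end(), key);
|     }
| Call sites stay identical either way, e.g.
| contains(data, 42, linear_find) vs contains(data, 42, binary_find).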
| bluGill wrote:
| While you are not wrong, if you have a decent language you
| will discover all the useful algorithms are already in your
| standard library and so it isn't a worry. Your code should
| mostly look like applying an existing algorithm to some new
| data structure.
| jpc0 wrote:
| I don't disagree with you at all on this. However you may
| need to combine several to get to an end result. And if
| that happens a few times in a codebase, well, it makes sense
| to factor that into a library.
| saghm wrote:
| > But there's tons of this stuff like this in CS
|
| Reminds me of the classic
| https://stackoverflow.com/questions/24848359/which-is-faster...
| sgerenser wrote:
| Never saw that before, that is indeed a classic.
| jollyllama wrote:
| I've encountered similar situations before. It's insane to me
| when people hold up PRs over that kind of thing.
| dosshell wrote:
| > I can get away with a smaller sized float
|
| When talking about not assuming optimizations...
|
| 32bit float is slower than 64bit float on reasonably modern
| x86-64.
|
| The reason is that 32bit float is emulated by using 64bit.
|
| Of course if you have several floats you need to optimize
| against cache.
| tombert wrote:
| Sure, I clarified this in a sibling comment, but I kind of
| meant that I will use the slower "money" or "decimal" types
| by default. Usually those are more accurate and less error-
| prone, and then if it actually matters I might go back to a
| floating point or integer-based solution.
| sgerenser wrote:
| I think this is only true if using x87 floating point, which
| anything computationally intensive is generally avoiding
| these days in favor of SSE/AVX floats. In the latter case,
| for a given vector width, the cpu can process twice as many
| 32 bit floats as 64 bit floats per clock cycle.
| dosshell wrote:
| Yes, as I wrote, it is only true for one float value.
|
| SIMD/MIMD will benefit from working on smaller widths. This is
| not only true because they do more work per clock but
| because memory is slow. Super slow compared to the CPU.
| Optimization is a lot about cache-miss optimization.
|
| (But remember that the cache line is 64 bytes, so reading a
| single value smaller than that will take the same time. So
| it does not matter in theory when comparing one f32 against
| one f64)
| jcranmer wrote:
| Um... no. This is 100% completely and totally wrong.
|
| x86-64 requires the hardware to support SSE2, which has
| native single-precision and double-precision instructions for
| floating-point (e.g., scalar multiply is MULSS and MULSD,
| respectively). Both the single precision and the double
| precision instructions will take the same time, except for
| DIVSS/DIVSD, where the 32-bit float version is slightly
| faster (about 2 cycles latency faster, and reciprocal
| throughput of 3 versus 5 per Agner's tables).
|
| You might be thinking of x87 floating-point units, where all
| arithmetic is done internally using 80-bit floating-point
| types. But all x86 chips in like the last 20 years have had
| SSE units--which are faster anyways. Even in the days when it
| was the major floating-point units, it wasn't any slower,
| since all floating-point operations took the same time
| independent of format. It might be slower if you insisted
| that code compilation strictly follow IEEE 754 rules, but the
| solution everybody did was to _not_ do that and that 's why
| things like Java's strictfp or C's FLT_EVAL_METHOD were born.
| Even in _that_ case, however, 32-bit floats would likely be
| faster than 64-bit for the simple fact that 32-bit floats can
| safely be emulated in 80-bit without fear of double rounding
| but 64-bit floats cannot.
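|
| A quick godbolt-style illustration of that point (register
| choice may differ, but each multiply is one scalar SSE
| instruction regardless of width):
|     float  mul32(float a, float b)   { return a * b; } // mulss
|     double mul64(double a, double b) { return a * b; } // mulsd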
| dosshell wrote:
| I agree with you. It should take the same time when
| thinking more about it. I remember learning this in ~2016
| and I did a performance test on Skylake which confirmed it
| (Windows VS2015). I think I remember that I only tested
| with addsd/addss. Definitely not x87. But as always, if the
| result cannot be reproduced... I stand corrected until
| then.
| dosshell wrote:
| I tried to reproduce it on Ivybridge (Windows VS20122)
| and failed (mulss and mulsd) [0]. Single and double
| precision take the same time. I also found a behavior
| where the first batch of iterations takes more time
| regardless of precision. It is possible that this tricked
| me last time.
|
| [0] https://gist.github.com/dosshell/495680f0f768ae84a106
| eb054f2...
|
| Sorry for the confusion and spreading false information.
| jandrewrogers wrote:
| A significant part of it is that what engineers believe was
| effectively true at one time. They simply haven't revisited
| those beliefs or verified their relevance in a long time. It
| isn't a terrible heuristic for life in general to assume that
| what worked ten years ago will work today. The rate at which
| the equilibriums shift due to changes in hardware and software
| environments when designing for system performance is so rapid
| that you need to make a continuous habit of checking that your
| understanding of how the world works maps to reality.
|
| I've solved a lot of arguments with godbolt and simple
| performance tests. Some topics are recurring themes among
| software engineers e.g.:
|
| - compilers are almost always better at micro-optimizations
| than you are
|
| - disk I/O is almost never a bottleneck in competent designs
|
| - brute-force sequential scans are often optimal algorithms
|
| - memory is best treated as a block device
|
| - vectorization can offer large performance gains
|
| - etc...
|
| No one is immune to this. I am sometimes surprised at the
| extent to which assumptions are no longer true when I revisit
| optimization work I did 10+ years ago.
|
| Most performance these days is architectural, so getting the
| initial design right often has a bigger impact than micro-
| optimizations and localized Big-O tweaks. You can always go
| back and tweak algorithms or codegen later but architecture is
| permanent.
| neonsunset wrote:
| .NET is a particularly bad case for this because it was a
| decade of few performance improvements, which caused a
| certain intuition to develop within the industry, then 6-8
| years of significant changes each year (with most wins
| compressed to the last 4 years or so). Companies moving from
| .NET Framework 4.6/7/8 to .NET 8 experience a 10x _average_
| performance improvement, which naturally comes with rendering
| a lot of performance know-how obsolete overnight.
|
| (the techniques that used to work were similar to those for
| earlier Java versions and, with some exceptions, very dynamic
| languages; the techniques that still work and are required
| today are the same as in C++ or Rust)
| throwaway2037 wrote:
| .NET 4.6 to .NET 8 is a 10x "average" performance
| improvement. I find this hard to believe. In what
| scenarios? I tried to Google for it and found very little
| hard evidence.
| neonsunset wrote:
| In general purpose scenarios, particularly in codebases
| which have a high amount of abstraction, use ASP.NET Core
| and EF Core, parse and de/serialize text with the use of
| JSON, Regex and other options, have network and file IO,
| and are deployed on many-core hosts/container images.
|
| There are a few articles on msft devblogs that cover
| from-netframework migration to older versions (Core 3.1,
| 5/6/7):
|
| - https://devblogs.microsoft.com/dotnet/bing-ads-
| campaign-plat...
|
| - https://devblogs.microsoft.com/dotnet/microsoft-graph-
| dotnet...
|
| - https://devblogs.microsoft.com/dotnet/the-azure-cosmos-
| db-jo...
|
| - https://devblogs.microsoft.com/dotnet/one-service-
| journey-to...
|
| - https://devblogs.microsoft.com/dotnet/microsoft-
| commerce-dot...
|
| The tl;dr is depending on codebase the latency reduction
| was anywhere from 2x to 6x, varying per percentile, or
| the RPS was maintained with CPU usage dropping by ~2-6x.
|
| Now, these are codebases of likely above average quality.
|
| If you consider that moving 6 -> 8 yields another up to
| 15-30% on average through improved and enabled by default
| DynamicPGO, and if you also consider that the average
| codebase is of worse quality than whatever msft has,
| meaning that DPGO-reliant optimizations scale way better,
| it is not difficult to see the 10x number.
|
| Keep in mind that while a particular piece of regular
| enterprise code could have improved within the bounds of
| "poor netfx codegen" -> "not far from LLVM with FLTO and
| PGO", the bottlenecks have changed significantly where
| previously they could have been in lock contention
| (within GC or user code), object allocation, object
| memory copying, e.g. for financial domains - anything
| including possibly complex Regex queries on imported
| payment reports (these alone now differ by anywhere
| between 2x and >1000x [0]), and for pretty much every code
| base also in interface/virtual dispatch for layers upon
| layers of "clean architecture" solutions.
|
| The vast majority of performance improvements (both
| compiler+gc and CoreLib+frameworks), which is difficult
| to think about, given it was 8 years, address the above
| first and foremost. At my previous employer the migration
| from NETFX 4.6 to .NET Core 3.1, while also deploying to
| much more constrained container images compared to beefy
| Windows Server hosts, reduced latency of most requests by
| the same factor of >5x (certain request type went from 2s
| to 350ms). It was my first wow moment and when I decided to
| stay with .NET rather than move over to Go back then (I was
| never a fan of Go's syntax, and other issues that Go still
| has, which have since been fixed in .NET, are not tolerable
| for me).
|
| [0] Cumulative of
|
| https://devblogs.microsoft.com/dotnet/regex-performance-
| impr...
|
| https://devblogs.microsoft.com/dotnet/regular-expression-
| imp...
|
| https://devblogs.microsoft.com/dotnet/performance-
| improvemen...
| rerdavies wrote:
| Cheating.
|
| All of the 6x performance improvement cases seem to be
| related to using the .net based Kestrel web server
| instead of IIS web server, which requires marshalling and
| interprocess communication. Several of the 2x gains
| appear to be related to using a different database
| backend. Claims that regex performance has improved a
| thousand-fold... seem more troubling than cause for
| celebration. Were you not precompiling your regexes in
| the older code? That would be a bug.
|
| Somewhere in there, there might be 30% improvements in
| .net codegen (it's hard to tell). Profile Guided
| Optimization (PGO) seems to provide a 35% performance
| improvement over older versions of .net with PGO
| disabled. But that's dishonest. PGO was around long
| before .net Core. And claiming that PGO will provide 10x
| performance because our code is worse than Microsoft's
| code insults both our code and our intelligence.
| ygra wrote:
| Not sure about the 10x, either, and if true it would
| involve more than just the JIT changes. But changing
| ASP.NET to ASP.NET Core at the same time and the web
| server as well as other libraries may make it plausible.
| For certain applications moving from .NET Framework to
| .NET isn't so simple when they have dependencies and
| those have changed their API significantly. And in that
| case most of the newer stuff seems to be built with
| performance in mind. So you gain 30 % from the JIT, 2x
| from Kestrel, and so on. Perhaps.
|
| With a Roslyn-based compiler at work I saw 20 % perf
| improvement just by switching from .NET Core 3.1 to .NET
| 6. No idea how slow .NET Framework was, though. I
| probably can't target the code to that anymore.
|
| But for regex even with precompilation, the compiler got
| a lot better at transforming the regex into an equivalent
| regex that performs better (automatic atomic grouping to
| reduce unnecessary backtracking when it's statically
| known that backtracking won't create more matches for
| example) and it also benefits a lot from the various
| vectorized implementations of IndexOf, etc. Typically
| with each improvement of one of those core methods for
| searching stuff in memory there's a corresponding change
| that uses them in regex.
|
| So where in .NET Framework a regex might walk through a
| whole string character by character multiple times with
| backtracking it might be replaced with effectively an
| EndsWith and LastIndexOfAny call in newer versions.
| neonsunset wrote:
| Roslyn didn't have many changes in terms of
| optimizations - it compiles C# to IL so it does very little
| of that, save for switches and certain newer
| features like collection literals. You are probably
| talking about RyuJIT, also called just JIT nowadays :D
|
| (the distinction becomes important for targets serviced
| by Mono, so to outline the difference Mono is usually
| specified, while CoreCLR and RyuJIT may not be; it also
| doesn't help that the JIT - that is, the IL to machine code
| compiler - also services NativeAOT, so it gets more
| annoying to be accurate in a conversation without saying
| the generic ".net compiler"; some people refer to it as
| JIT/ILC)
| ygra wrote:
| No, I meant that we've written a compiler, based on
| Roslyn, whose runtime for compiling the code has improved
| by 20 % when switching to .NET 6.
|
| And indeed, on the C# -> IL side there's little that's
| being actually optimized. Besides collection literals
| there's also switch statements/expressions over strings,
| along with certain pattern matching constructs that get
| improved on that side.
| neonsunset wrote:
| Interesting! (I was way off the mark, not reading
| carefully, ha)
|
| Is it a public project?
| ygra wrote:
| Nope, completely internal and part of how we offer
| essentially the same product on multiple platforms with
| minimal integration work. And existing C# - anything
| compilers are typically too focused on compiling a whole
| application instead of offering a library with a stable
| and usable API on the other end, so we had to roll our
| own.
| neonsunset wrote:
| No. _Dynamic_ PGO was first introduced in .NET 6 but was
| not mature and needed two releases worth of work to
| become enabled by default. It needs no user input and is
| similar to what OpenJDK Hotspot has been doing for some
| time and then a little more. It also is required for
| major features that were strictly not available
| previously: guarded devirtualization of virtual and
| interface calls and delegate inlining.
|
| Also, IIS hosting through Http.sys is still an option
| that sees a separate set of improvements, but that's not
| relevant in most situations given the move to .NET 8 from
| Framework usually also involves replacing Windows Server
| host with a Linux container (though it works perfectly
| fine on Windows as well).
|
| On Regex, compiled and now source-generated automata have
| seen _a lot_ of work in all recent releases; it is night
| and day to what it was before - just read the articles.
| Previously linear scans against heavy internal data
| structures (matching by hashset) and heavy transient
| allocations got replaced with bloom-filter style SIMD
| search and other state of the art text search
| algorithms[0], on a completely opposite end of a
| performance spectrum.
|
| So when you have compiler improvements multiplied by
| changes to CoreLib internals multiplied by changes to
| frameworks built on top - it's achievable with relative
| ease. .NET Framework, while performing adequately, was
| still _that_ slow compared to what we got today.
|
| [0] https://github.com/dotnet/runtime/tree/main/src/libra
| ries/Sy...
| rerdavies wrote:
| Sure. But static PGO was introduced in .Net Framework
| 4.7.0. And we're talking about apps in production, so
| there's no excuse NOT to use static PGO on the .net
| framework 4.7.0 version.
|
| And you have misrepresented the contents of the blogs.
| The projects discussed in the blogs are typically
| claiming ~30% improvements (perhaps because they weren't
| using static PGO in their 4.7.0 incarnation), with two
| dramatic outliers that seem to be related to migrating
| from IIS to Kestrel.
| neonsunset wrote:
| It's a moot point. Almost no one used static PGO and its
| feature set was way more limited - it did not have
| devirtualization, which provides the biggest wins. Though
| you are welcome to disagree, it won't change the reality
| of the impact the .NET 8 release had on real-world code.
|
| It's also convenient to ignore the rest of the content at
| the links but it seems you're more interested in proving
| your argument so the data I provided doesn't matter.
| andyayers wrote:
| Something closer to a "pure codegen/runtime" example
| perhaps: I have data showing Roslyn (the C# compiler,
| itself written in C#) speeds up between ~2x and ~3x
| running on .NET 8 vs .NET 4.7.1. Roslyn is built so that
| it can run either against full framework or core, so it's
| largely the same application IL.
| tombert wrote:
| Yep, completely agree with you on this. Intuition is often
| wrong, or at least outdated.
|
| When I'm building stuff I try my best to focus on
| "correctness", and try to come up with an algorithm/design
| that will encompass all realistic use cases. If I focus on
| that, it's relatively easy to go back and convert my
| `decimal` type to a float64, or even convert an if statement
| into a switch if it's actually faster.
| klyrs wrote:
| > A large part of becoming a decent engineer [2] for me was
| learning to stop trusting what professors taught me in college
|
| When I was taught about performance, it was all about
| benchmarking and profiling. I never needed to trust what my
| professors taught, because they taught me to dig in and find
| the truth for myself. This was taught alongside the big-O
| stuff, with several examples where "fast" algorithms are slower
| on small inputs.
| TylerE wrote:
| How do you even get meaningful profiling out of most modern
| langs? It seems the vast majority of time and calls gets
| spent inside tiny anonymous functions, GC allocations, and
| stuff like that.
| klyrs wrote:
| I don't use most modern langs! And especially if I'm doing
| work where performance is critical, I won't kneecap myself
| by using a language that I can't reasonably profile.
| neonsunset wrote:
| This is easy in most modern programming languages.
|
| The JVM ecosystem has the IntelliJ IDEA profiler and similar
| advanced tools (AFAIK).
|
| .NET has VS/Rider/dotnet-trace profilers (they are very
| detailed) to produce flamegraphs.
|
| Then there are native profilers which can work with any AOT
| compiled language that produces canonically symbolicated
| binaries: Rust, C#/F#(AOT mode), Go, Swift, C++, etc.
|
| For example, you can do `samply record ./some_binary`[0]
| and then explore multi-threaded flamegraph once completed
| (I use it to profile C#, it's more convenient than dotTrace
| for preliminary perf work and is usually more than
| sufficient).
|
| [0] https://github.com/mstange/samply
| TylerE wrote:
| I mean sure, but I've never seen much in a flamegraph
| besides noise.
| neonsunset wrote:
| My experience is the complete opposite. You just need to
| construct a realistic load test for the code and the
| bottlenecks will stand out (more often than not).
|
| Also there is a learning curve to grouping and aggregating
| data.
| trueismywork wrote:
| There's not yet a culture of writing reproducible benchmarks to
| gauge these effects.
| zmj wrote:
| .NET is a little smarter about switch code generation these
| days: https://github.com/dotnet/roslyn/pull/66081
| KerrAvon wrote:
| > `if(x==1) elseif(x==2)...` because switch was "faster" and
| rejected my PR
|
| Yeah, that's never been true. Old compilers would often compile
| a switch to __slower__ code because they'd tend to always go to
| a jump table implementation.
|
| A better reason to use the switch is because it's better style
| in C-like languages. Using an if statement for that sort of
| thing looks like Python; it makes the code harder to maintain.
| wzdd wrote:
| And it's better style because it better conveys intent. An
| if-else chain in C/C++ implies there's something important
| about the ordering of cases. Though I'd say that for a very
| small number of cases it's fine.
|
| (Also, Python has a switch-like construct now.)
| mynameisnoone wrote:
| Yep. "Profiling or it didn't happen." The issue is that it's
| essentially impossible for even the most neckbeard of us to
| predict with a high degree of accuracy and precision the
| performance impact on modern systems of change A vs. change B
| due to the unpredictable nature of the many variables that are
| difficult to control including compiler optimization passes,
| architecture gotchas (caches, branch misses), and interplay of
| quirks on various platforms. Therefore, the irreducible and
| necessary work of profiling the differences becomes the primary
| viable path to resolving engineering decision points.
| Hopefully, LLMs now and in the future will be able to help
| build out boilerplate roughly in the direction of creating such
| profiling benchmarks and fixtures.
|
| PS: I'm presently revisiting C++14 because it's the most
| universal statically-compiled language to quickly answer
| interview problems. It would be unfair to impose Rust, Go,
| Elixir, or Haskell on a software engineer conducting the interview.
| pjmlp wrote:
| I would say it would be safer to go up to C++17, and there
| are some goodies there, especially for better compile-time
| stuff.
| ot1138 wrote:
| >I don't do much C++, but I have definitely found that
| engineers will just assert that something is "faster" without
| any evidence to back that up.
|
| Very true, though there is one case where one can be highly
| confident that this is the case: code elimination.
|
| You can't get any faster than not doing something in the first
| place.
| konstantinua00 wrote:
| inb4 instruction (cache) alignment screws everything up
| JackYoustra wrote:
| I really wish he'd listed all the flags he used. To add on to the
| flags already listed by some other commenters, `-mcpu` and
| related flags are really crucial in these microbenchmarks: over
| such a small change and such a small set of tight loops, you
| could just be regressing on coincidences in the microarchitecture
| scheduler vs higher level assumptions.
| j_not_j wrote:
| And he didn't repeat each test case 5 or 9 times, and take the
| median (or even an average).
|
| There will be operating system noise that can be in the multi-
| percent range. This is defined as various OS services that run
| "in the background" taking up cpu time, emptying cache lines
| (which may be most important), and flushing a few translation
| lookaside buffer entries.
|
| Once you recognize the variability from run to run, claiming
| "1%" becomes less credible. Depending on the noise level, of
| course.
|
| Linux benchmarks like SPECcpu tend to be run in "single-user
| mode" meaning almost no background processes are running.
| mgraczyk wrote:
| The main case where I use final and where I would expect benefits
| (not covered well by the article) is when you are using an
| external library with pure virtual interfaces that you implement.
|
| For example, the AWS C++ SDK uses virtual functions for
| everything. When you subclass their classes, marking your classes
| as final allows the compiler to devirtualize your own calls to
| your own functions (GCC does this reliably).
|
| I'm curious to understand better how clang is producing worse
| code in these cases. The code used for the blog post is a bit too
| complicated for me to look at, but I would love to see some
| microbenchmarks. My guess is that there is some kind of icache or
| code-size problem, where inlining more produces worse code.
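|
| A rough sketch of the pattern described above (IRequestHandler
| and MyHandler are hypothetical stand-ins, not actual AWS SDK
| types):
|     struct IRequestHandler {        // library-style pure interface
|         virtual ~IRequestHandler() = default;
|         virtual void on_response(int code) = 0;
|     };
|     struct MyHandler final : IRequestHandler {
|         int last_code = 0;
|         void on_response(int code) override { last_code = code; }
|     };
|     void process(MyHandler& h) {
|         // MyHandler is final, so nothing more derived can exist;
|         // the compiler may turn this into a direct (inlinable)
|         // call instead of a vtable dispatch.
|         h.on_response(200);
|     }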
| cogman10 wrote:
| Could easily just be a bad optimization pathway.
|
| `final` tells the compiler that nothing extends this class.
| That means the compiler can theoretically do things like
| inlining class methods and eliminating virtual method calls
| (perhaps duplicating the method).
|
| However, it's quite possible that one of those optimizations
| makes the code bigger or misaligns things with the cache in
| unexpected ways. Sometimes, a method call can be faster than
| inlining. Especially with hot loops.
|
| All this being said, I'd expect final to offer very little
| benefit over PGO. Its main value is the constraint it imposes
| and not the optimization it might enable.
| lpapez wrote:
| > For example, the AWS C++ SDK uses virtual functions for
| everything. When you subclass their classes, marking your
| classes as final allows the compiler to devirtualize your own
| calls to your own functions (GCC does this reliably).
|
| I want to ask, and I sincerely mean no snark, what is the
| point?
|
| When working with AWS through an SDK your code will spend most
| of the time waiting on network calls.
|
| What is the point of devirtualizing your function calls to save
| an indirection when you will be spending several orders of
| magnitude more time just waiting for the RPC to resolve?
|
| It just doesn't seem like something even worth thinking about
| at all.
| mgraczyk wrote:
| Yeah, that was just the first public C++ library with this
| pattern that popped into my head. I just make all my classes
| final out of habit and don't think about it. I remove final
| if I want to subclass, but that almost never happens.
| jeffbee wrote:
| I profiled this project and there are abundant opportunities for
| devirtualization. The virtual interface `IHittable` is the hot
| one. However, the WITH_FINAL define is not sufficient, because
| the hot call is still virtual. At `hit_object |=
| _objects[node->object_index()]->hit` I am still seeing ` mov
| (%rdi),%rax; call *0x18(%rax)` so the application of final here
| was not sufficient to do the job. Whatever differences are being
| measured are caused by bogons.
| gpderetta wrote:
| I haven't looked at the code, but if you have multiple leaves,
| even marking all of them as final won't help if the call is
| through a base class.
| jeffbee wrote:
| Yeah the practical cases for devirtualization are when you
| have a base class, a derived class that you actually use, and
| another derived class that you use in tests. For your release
| binary the tests aren't visible so that can all be
| devirtualized.
|
| In cases where you have Dog and Goose that both derive from
| Animal and then you have std::vector<Animal>, what is the
| compiler supposed to do?
| kccqzy wrote:
| The compiler simply knows that the actual dynamic type is
| Animal because it is not a pointer. You need Animal* to
| trigger all the fun virtual dispatch stuff.
| froh wrote:
| I intuit vector<Animal*> is what was meant...
| jeffbee wrote:
| Yes. I reflexively avoid asterisks on this site because
| they can _hose your formatting_.
| akoboldfrying wrote:
| An interface, like IHittable, can't possibly be made final
| since its whole _purpose_ is to enable multiple different
| concrete subclasses that implement it.
|
| As you say, that's the hot one -- and making the concrete
| subclasses themselves "final" enables no devirtualisations
| since there are no opportunities for it.
| lanza wrote:
| If you're measuring a compiler you need to post the flags and
| version used. Otherwise the entire experiment is in the noise.
| LorenDB wrote:
| Man, I wish this blog had an RSS feed.
| magnat wrote:
| > I created a "large test suite" to be more intensive. On my dev
| machine it needed to run for 8 hours.
|
| During such long and compute-intensive tests, how are thermal
| considerations mitigated? Not saying that this was case here, but
| I can see how after saturating all cores for 8 hours, the whole
| PC might get hot to the point CPU starts throttling, so when you
| reboot to next OS or start another batch, overall performance
| could be a bit lower.
| lastgeniusua wrote:
| having recently done similar day-and-night long suites of
| benchmarks (on a laptop in heat dissipation conditions worse
| than on any decent desktop), I've found that there is no
| correlation between the order the benchmarks are run in and
| their performance (or energy consumption!). i would therefore
| assume that a non-overclocked processor would not exhibit the
| patterns you are thinking of here
| leni536 wrote:
| This is the gist of the difference in code generation when final
| is involved:
|
| https://godbolt.org/z/7xKj6qTcj
|
| edit: And a case involving inlining:
|
| https://godbolt.org/z/E9qrb3hKM
| fransje26 wrote:
| I'm actually more worried about Clang being close to 100% slower
| than GCC on Linux. That doesn't seem right.
|
| I am prepared to believe that there is some performance
| difference between the two, varying per case, but I would expect
| a few percent difference, not twice the run time..
| mastax wrote:
| Changes in the layout of the binary can have large impacts on the
| program performance [0] so it's possible that the unexpected
| performance decrease is caused by unpredictable changes in the
| layout of the binary between compilations. I think there is some
| tool which helps ensure layout is consistent for benchmarking,
| but I can't remember what it's called.
|
| [0]: https://research.facebook.com/publications/bolt-a-
| practical-...
| akoboldfrying wrote:
| I would expect "final" to have no effect on this type of code at
| all. That it does in some cases cause measurable differences I
| put down to randomly hitting internal compiler thresholds
| (perhaps one of the inlining heuristics is "Don't inline a
| function with more than 100 tokens", and the "final" keyword
| pushes a couple of functions to 101).
|
| Why would I expect no performance difference? I haven't looked at
| the code, but I would expect that for each pixel, it iterates
| through an array/vector/list etc. of objects that implement some
| common interface, and calls one or more methods (probably
| something called intersectRay() or similar) on that interface.
| _By design, that interface cannot be made final, and that's what
| counts._ Whether the concrete derived classes are final or not
| makes no difference.
|
| In order to make this a good test of "final", the pointer type of
| that container should be constrained to a concrete object type,
| like Sphere. Of course, this means the scene is limited to
| spheres.
|
| The only case where final can make a difference, by
| devirtualising a call that couldn't otherwise be devirtualised,
| is when you hold a pointer to that type, _and_ the object it
| points at was allocated "uncertainly", e.g., by the caller. (If
| the object was allocated in the same basic block where the method
| call later occurs, the compiler already knows its runtime type
| and will devirtualise the call anyway, even without "final".)
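|
| A small sketch of why the concrete classes' finality is
| irrelevant here (names loosely modelled on the post's IHittable
| interface, otherwise hypothetical):
|     #include <memory>
|     #include <vector>
|     struct IHittable {          // the interface itself can't be final
|         virtual ~IHittable() = default;
|         virtual bool hit() const = 0;
|     };
|     struct Sphere final : IHittable {
|         bool hit() const override { return true; }
|     };
|     bool any_hit(const std::vector<std::unique_ptr<IHittable>>& objs) {
|         bool result = false;
|         for (const auto& obj : objs)
|             result |= obj->hit(); // static type is IHittable, so this
|                                   // stays a virtual call; `final` on
|                                   // Sphere changes nothing here
|         return result;
|     }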
| koyote wrote:
| > (perhaps one of the inlining heuristics is "Don't inline a
| function with more than 100 tokens", and the "final" keyword
| pushes a couple of functions to 101).
|
| That definitely is one of the heuristics in MSVC++.
|
| We have some performance critical code and at one point we
| noticed a slowdown of around ~4% in a couple of our performance
| tests. I investigated but the only change to that code base
| involved fixing up an error message (i.e. no logic difference
| and not even on the direct code path of the test as it would
| not hit that error).
|
| Turns out that:
|     int some_func() {
|         if (bad) throw std::exception("Error");
|         return some_int;
|     }
|
| Inlined just fine, but after adding more text to the exception
| error message it no longer inlined, causing the slow-down. You
| could either fix it with __forceinline or by moving the
| exception to a function call.
| Maxatar wrote:
| Since the inlining is performed in MSVC's backend, as opposed
| to its frontend, and hence operates strictly on MSVC's
| intermediate representation which lacks information about
| tokens or the AST, it's unlikely due to tokens.
|
| std::exception does not take a string in its constructor, so
| most likely you used std::runtime_error. std::runtime_error
| has a pretty complex constructor if you pass into it a long
| string. If it's a small string then there's no issue because
| it stores its contents in an internal buffer, but if it's a
| longer string then it has to use a reference counting scheme
| to allow for its copy constructor to be noexcept.
|
| That is why you can see different behavior if you use a long
| string versus a short string. You can also see vastly
| different codegen with plain std::string as well depending on
| whether you pass it a short string literal or a long string
| literal.
| koyote wrote:
| > std::exception does not take a string in its constructor
|
| You're right, I used it as a short-hand for our internal
| exception function, forgetting that the std one does not
| take a string. Our error handling function is a simple
| static function that takes an std::string and throws a
| newly constructed object with that string as a field.
|
| But yes, it could very well have been that the string
| surpassed the short string optimisation threshold or
| something similar. I did verify the assembly before and
| after and the function definitely inlined before and no
| longer inlined after. Moving the 'throw' (and, importantly,
| the string literal) into a separate function that was
| called from the same spot ensured it inlined again and the
| performance was back to normal.
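|
| A hypothetical version of that workaround: keep the hot
| function tiny so the inliner accepts it, and park the throw
| (with its long string literal) in a separate, cold helper:
|     #include <stdexcept>
|     #include <string>
|     [[noreturn]] void fail(const std::string& message) {
|         throw std::runtime_error(message);
|     }
|     int some_func(bool bad, int some_int) {
|         if (bad)
|             fail("a much longer, more descriptive error message");
|         return some_int;
|     }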
| akoboldfrying wrote:
| Wow, I had no idea. And I thought I knew about most of
| C++'s weirdnesses.
| simonask wrote:
| Actually, the compiler can only implicitly devirtualize under
| very specific circumstances. For example, it cannot
| devirtualize if there was previously a non-inlined call through
| the same pointer.
|
| The reason is placement new. It is legal (given that certain
| invariants are upheld) in C++ to say `new(this) DerivedClass`,
| and compilers must assume that each method could potentially
| have done this, changing the vtable pointer of the object.
|
| The `final` keyword somewhat counteracts this, but even GCC
| still only opportunistically honors it - i.e. it inserts a
| check if the vtable is the expected value before calling the
| devirtualized function, falling back on the indirect call.
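|
| A rough sketch of the kind of code the optimizer has to worry
| about (heavily simplified, all names made up; the standard only
| allows this under narrow conditions, but the compiler usually
| can't prove a given virtual call doesn't do it):
|     #include <new>
|     struct Base {
|         virtual ~Base() = default;
|         virtual int value() { return 1; }
|         virtual void morph();
|     };
|     struct Derived : Base {
|         int value() override { return 2; }
|         void morph() override {}
|     };
|     void Base::morph() {
|         // Ends the lifetime of *this and builds a Derived in the
|         // same storage -- the vtable pointer changes under the
|         // caller's feet.
|         this->~Base();
|         ::new (static_cast<void*>(this)) Derived();
|     }
|     int observe(Base& b) {
|         int before = b.value(); // virtual call
|         b.morph();              // might have replaced the object...
|         return before + b.value(); // ...so the vtable is re-loaded
|     }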
| akoboldfrying wrote:
| Fascinating, though a little sad. Are there any important
| kinds of behaviour that can only be implemented via this
| `new(this) DerivedClass` chicanery? Because if not, it seems
| a shame to make the optimiser pay such a heavy price just to
| support it.
| ndesaulniers wrote:
| As an LLVM developer, I really wish the author filed a bug report
| and waited for some analysis BEFORE publishing an article (that
| may never get amended) that recommends not using this keyword
| with clang for performance reasons. I suspect there's just a bug
| in clang.
| saagarjha wrote:
| Bug, misunderstanding, weird edge case...
| fransje26 wrote:
| Is there any logical reason why Clang is 50% slower than GCC on
| Ubuntu?
| pklausler wrote:
| Mildly related programming language trivia:
|
| Fortran has virtual functions ("type bound procedures"), and
| supports a NON_OVERRIDABLE attribute on them that is basically
| "final". (FINAL exists but means something else.). But it also
| has a means for localizing the non-overridable property.
|
| If a type bound procedure is declared in a module, and is
| PRIVATE, then overrides in subtypes ("extended derived types")
| work as usual for subtypes in the same module, but can't be
| affected by overrides that appear in other modules. This allows a
| compiler to notice when a type has no subtypes in the same
| module, and basically infer that it is non-overridable locally,
| and thus resolve calls at compilation time.
|
| Or it would, if compilers implemented this feature correctly.
| It's not well described in the standard, and only half of the
| Fortran compilers in the wild actually support it. So like too
| many things in the Fortran world, it might be useful, but it's
| not portable.
| MathMonkeyMan wrote:
| I think it was Chandler Carruth who said "If you're not
| measuring, then you don't care about performance." I agree, and
| by that measure, nobody I've ever worked with cares about
| performance.
|
| The best I'll see is somebody who cooked up a naive
| microbenchmark to show that style 1 takes fewer wall nanoseconds
| than style 2 on his laptop.
|
| People I've worked with don't use profilers, claiming that they
| can't trust it. Really they just can't be bothered to run it and
| interpret the output.
|
| The truth is, most of us don't write C++ because of performance;
| we write C++ because that's the language the code is written in.
|
| The performance gained by different C++ techniques seldom
| matters, and when it does you have to measure. Profiler reports
| almost always surprise me the first few times -- your mental
| model of what's going on and what matters is probably wrong.
| scottLobster wrote:
| It matters to some degree. If it's just a simple technique you
| can file away and repeat as muscle memory, well that means your
| code is that much better.
|
| From a user perspective it could be the difference between
| software that's pleasant to use and software that's annoying to
| use. From a philosophical perspective it's the difference
| between software that functions vs software that works well.
|
| Of course it depends on your context as to whether this is
| valued, but I wouldn't dismiss it. One person's micro-
| optimization is another person's polish.
| chris_wot wrote:
| Surely "final" is a conceptual thing... in other words, you don't
| want anyone else to derive from the class for good reasons. It's
| for conceptual understanding, surely?
| manlobster wrote:
| This seems like a reasonable use of the preprocessor to me. I've
| seen similar use in high-quality codebases. I wonder why the
| author is so disgusted by it.
| headline wrote:
| re: final macro
|
| > I would never do this in an actual product
|
| what, why?
| alex_smart wrote:
| One thing that wasn't mentioned in the article, and that I wish
| had been, is the size of the compiled binary with and without
| final. The only reason I would expect the final version to be
| slower is that we are emitting more code because of inlining,
| and that is resulting in a larger share of instruction cache
| misses.
|
| Also, now that I think of it, they should have run the code under
| perf and compared the stats.
| account42 wrote:
| Yeah, really unsatisfying that there was no attempt to explain
| _why_ it might be slower since it just gives the compiler more
| information to decide on optimizations which in theory should
| only make things faster.
| kasajian wrote:
| I'm surprised by this article. The author genuinely believes that
| a language construct to benefit performance was added to the
| language without anyone ever running any metrics to verify. "just
| trust me bro", is the quote.
|
| It's an insane level of ignorance about how these things are
| decided by the standards committee.
| kreetx wrote:
| And yet, results from current compilers are mixed; in summary,
| it does not make programs faster.
| kookamamie wrote:
| > And probably, that reason is performance.
|
| That's the first problem I see with the article. C++ isn't a fast
| language, as it is. There are far too many issues with e.g.
| aliasing rules, lack of proper vectorization (for the runtime
| arch), etc.
|
| If you wish to have relatively good performance for your code,
| try ISPC, which still allows you to get great performance with
| vectorization up to AVX-512, without turning to intrinsics.
| chipdart wrote:
| > That's the first problem I see with the article. C++ isn't a
| fast language, as it is. There are far too many issues with
| e.g. aliasing rules, lack of proper vectorization (for the
| runtime arch), etc.
|
| That's a bold statement due to the way it heavily contrasts
| with reality.
|
| C++ is ever present in high performance benchmarks as either
| the highest performing language or second only to C. It's weird
| seeing someone claim with a straight face that "C++ isn't a
| fast language, as it is".
|
| To make matters worse, you go on confusing what a programming
| language is, and confusing implementation details with language
| features. It's like claiming that C++ isn't a language for
| computational graphics just because no C++ standard dedicates a
| chapter to it.
|
| Just like in every engineering domain, you need deep
| knowledge of the details to milk the last drop of performance
| out of a program. Low-latency C++ is a testament
| to how the smallest details can be critical to performance. But
| you need to be completely detached from reality to claim that
| C++ isn't a fast language.
| kookamamie wrote:
| > That's a bold statement due to the way it heavily contrasts
| with reality.
|
| I'm ready to back this up. And no, I'm not confusing things -
| I work in HPC (realtime computer vision) and in reality the
| only thing we'd use C++ for is "glue", i.e. binding
| implementations of the actual algorithms implemented in other
| languages together.
|
| Implementations could be e.g. in CUDA, ISPC, neural-inference
| via TensorRT, etc.
| jpc0 wrote:
| "We use extreme vectorisation and can't do it in native C++
| therefore the language is slow"
|
| You a junior or something? For 99% of use cases C++
| autovectorisation does plenty and will outperform the same
| code written in higher-level languages. You are literally
| in the 1% and conflating your use case with the general
| case...
| chipdart wrote:
| I've worked in computer vision and real-time image
| processing. We use C++ extensively in the field due to its
| high performance. OpenCV is the tool of the trade. Both iOS
| and Android support C++ modules for performance reasons.
|
| But to add to all the nonsense, you claim otherwise.
|
| Frankly, your comments lack any credibility, which is
| confirmed by your lame appeal to authority.
| teeuwen wrote:
| I do not see how the final keyword would make a difference in
| performance at all in this case. The compiler should be able to
| build an inheritance tree and determine by itself which classes
| are to be treated as final.
|
| Now for libraries, this is a different story. There I can imagine
| the final keyword could have an impact.
| connicpu wrote:
| But dynamically loaded libraries exist. So even if the compiler
| knows, through LTO or something, that the class is the most
| derived version out of all classes in the statically linked
| code, it won't be able to devirtualize the function calls unless
| it can see the instantiation site or the class is marked as
| final.
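|
| A minimal sketch of that difference (Widget/Gadget are made-up
| names): a call through a reference to a final class can be
| devirtualized locally, without the compiler ever seeing where
| the object was constructed.
|
|     struct Widget {
|         virtual ~Widget() = default;
|         virtual int poke() { return 0; }
|     };
|
|     struct Gadget final : Widget {
|         int poke() override { return 42; }
|     };
|
|     int use(Gadget& g) {
|         // Gadget is final, so g.poke() must be Gadget::poke;
|         // no vtable load is needed and the call can be inlined.
|         return g.poke();
|     }
|
|     int use_base(Widget& w) {
|         // Dynamic type unknown: without whole-program knowledge
|         // this stays an indirect call through the vtable.
|         return w.poke();
|     }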
| pjmlp wrote:
| Only if the complete source code is available to the compiler.
| juliangmp wrote:
| >Personally, I'm not turning it on. And would in fact, avoid
| using it. It doesn't seem consistent.
|
| I feel like we'd have to repeat these tests quite a few times to
| reach a decent conclusion. Hell, small variations in performance
| could be caused by all sorts of things outside the actual
| program.
| kreetx wrote:
| AFAIU, these tests were run 30 times each and apparently some
| took minutes to run, so it's unlikely that you'd reach any
| different conclusions.
| lionkor wrote:
| The only thing worse than no benchmark is a bad benchmark.
|
| I don't think this really shows what `final` does, not to code
| generation, not to performance, not to the actual semantics of
| the program. There is no magic bullet - if putting `final` on
| every single class always made it faster, it wouldn't be a
| keyword, it'd be a compiler optimization.
|
| `final` does one specific thing: it tells the compiler that it
| can be sure that the given class is not going to have anything
| derive _from it_.
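|
| As a small sketch of the two placements (Base/Leaf are made-up
| names): final can go on a class, forbidding derivation, or on a
| virtual member function, forbidding further overrides.
|
|     struct Base {
|         virtual ~Base() = default;
|         virtual void draw();       // may still be overridden
|         virtual void log() final;  // no override allowed below
|     };
|
|     struct Leaf final : Base {     // nothing may derive from Leaf
|         void draw() override;
|     };
|
|     // struct Deeper : Leaf {};                       // error
|     // struct Noisy : Base { void log() override; };  // error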
| opticfluorine wrote:
| Not disagreeing with your point, but it couldn't be a compiler
| optimization, could it? The compiler isn't able to infer that
| the class will not be inherited anywhere else, since another
| compilation unit it hasn't seen could inherit from it.
| ftrobro wrote:
| I assume it could be or is part of link time optimization
| when compiling an application rather than a library?
| vedantk wrote:
| Possibly not in the default c++ language mode, but check out
| -fwhole-program-vtables. It can be a useful option in cases
| where all relevant inheritance relationships are known at
| compile time.
|
| https://reviews.llvm.org/D16821
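|
| As a rough sketch of how it might be invoked (clang flags; this
| assumes nothing outside the LTO'd code derives from the classes
| involved, and a.cpp/b.cpp are placeholder names):
|
|     $ clang++ -O2 -flto -fwhole-program-vtables -c a.cpp -o a.o
|     $ clang++ -O2 -flto -fwhole-program-vtables -c b.cpp -o b.o
|     $ clang++ -flto a.o b.o -o app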
| bluGill wrote:
| Which is good, but may not apply. I have an application
| where I can't do that because we support plugins, so a
| couple of classes will get overridden outside of the
| compilation (in hindsight this was a bad decision, but it's
| too late to change now). Meanwhile most classes will never
| be overridden, so I use final to say that. We are also a
| multi-repo project (which despite the hype I think is
| better for us than mono-repo), another reason why
| -fwhole-program-vtables would be difficult to use - but we
| could make it work with effort if it wasn't for the
| plugins.
| paulddraper wrote:
| > `final` does one specific thing: it tells the compiler that it
| can be sure that the given class is not going to have anything
| derive from it.
|
| ...and the compiler can optimize using that information.
|
| (It could also do the same without the keyword, with LTO.)
| bluGill wrote:
| LTO can only apply in specific situations, though; if there is
| any possibility that a plugin derives from the class, LTO can
| do nothing.
| Nevermark wrote:
| 'Final' cannot be assumed without complete knowledge of
| everything that will ultimately be linked, and knowledge that
| this will not change in the future. The latter can never be
| assumed by a compiler without an explicit indication.
|
| "In theory" adding 'final' only gives a compiler more
| information, so should only result in same or faster code.
|
| In practice, some optimizations improve performance for the
| cases the compiler writers expected or considered important,
| with worse outcomes in other, less expected, less important
| cases. Without a clear understanding of when and how these
| 'final' optimizations are applied, it isn't clear when to use
| it, short of benchmarking after the fact.
|
| That makes any given test much less helpful, since all we know
| is that 'final' was not helpful in this case. We have no basis
| for knowing how general these results are.
|
| But it would be deeply strange if 'final' was generally
| unhelpful. Informationally it does only one purely helpful
| thing: reduce the number of linking/runtime contexts the
| compiler needs to worry about.
| account42 wrote:
| I'm amused by the AI advert spam in the comments here that can't
| even be bothered to make the spam look like vaguely normal
| comments.
| AtNightWeCode wrote:
| Most benchmarks are wrong. I doubt this one is correct. I do
| think final should have been the default in the language,
| though.
|
| There are tons of these suggestions, like always using sealed in
| C# or never using private in Java.
___________________________________________________________________
(page generated 2024-04-23 23:01 UTC)