[HN Gopher] Performance of the Python 3.14 tail-call interpreter
___________________________________________________________________
Performance of the Python 3.14 tail-call interpreter
Author : signa11
Score : 493 points
Date : 2025-03-10 06:44 UTC (16 hours ago)
(HTM) web link (blog.nelhage.com)
(TXT) w3m dump (blog.nelhage.com)
| IshKebab wrote:
| Very interesting! Clearly something else is going on though
| if the 2% vs 9% thing is true.
| jeeybee wrote:
| Kudos to the author for diving in and uncovering the real story
| here. The Python 3.14 tail-call interpreter is still a nice
| improvement (any few-percent gain in a language runtime is hard-
| won), just not a magic 15% free lunch. More importantly, this
| incident gave us valuable lessons about benchmarking rigor and
| the importance of testing across environments. It even helped
| surface a compiler bug that can now be fixed for everyone's
| benefit. It's the kind of deep-dive that makes you double-check
| the next big performance claim. Perhaps the most thought-
| provoking question is: how many other "X% faster" results out
| there are actually due to benchmarking artifacts or unknown
| regressions? And how can we better guard against these pitfalls
| in the future?
| ehsankia wrote:
| I guess the bigger question for me is, how was a 10% drop in
| Python performance not detected when that faulty compiler
| feature was pushed? Do we not benchmark the compilers
| themselves? Do the existing benchmarks on the compiler or
| python side not use that specific compiler?
| asicsp wrote:
| Related discussions:
|
| https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-...
| --> https://news.ycombinator.com/item?id=42999672 _(66 points |
| 25 days ago | 22 comments)_
|
| https://blog.reverberate.org/2025/02/10/tail-call-updates.ht...
| --> https://news.ycombinator.com/item?id=43076088 _(124 points |
| 18 days ago | 92 comments)_
| tempay wrote:
| Trying to assess the performance of a python build is extremely
| difficult as there are a lot of build tricks you can do to
| improve it. Recently the astral folks ran into this showing how
| the conda-forge build is notably faster than most others:
|
| https://github.com/astral-sh/python-build-standalone/pull/54...
|
| I'd be interested to know how the tail-call interpreter performs
| with other build optimisations that exist.
| eru wrote:
| Compare https://donsbot.com/2009/03/09/evolving-faster-haskell-
| progr...
|
| The author uses a genetic algorithm to try out lots of
| different compiler and optimisation flag combinations.
| vkazanov wrote:
| So, the compiler is tinkering with the way the loop is organised
| so the whole tail-call interpreter thing is not as effective as
| announced... Not surprised.
|
| 1. CPU arch (and arch version) matters a lot. The problem is 95%
| about laying out the instruction dispatching code for the branch
| predictor to work optimally. C was never meant to support this.
|
| 2. The C abstract machine is also not low-level enough to express
| the intent properly. Any implementation becomes supersensitive to
| a particular compiler's (and compiler version's) quirks.
|
| Certain paranoid interpreter implementations go back to writing
| assembly directly. LuaJIT is famous for implementing a macro
| system to make its superefficient assembly loop implementation
| portable across architectures. This is also why I find it fun to
| tinker with these!
|
| Anyways, a few years ago I put together an article and a
| test of popular interpreter loop implementation approaches:
|
| https://github.com/vkazanov/bytecode-interpreters-post
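|
| To make the distinction concrete, the two classic dispatch
| shapes look roughly like this (a minimal sketch; the opcodes
| and handlers are invented for illustration):
|
|     enum { OP_INC, OP_HALT };
|
|     /* 1. switch dispatch: a single shared indirect branch
|        serves every opcode transition */
|     int run_switch(const unsigned char *pc) {
|         int acc = 0;
|         for (;;) {
|             switch (*pc++) {
|             case OP_INC:  acc++; break;
|             case OP_HALT: return acc;
|             }
|         }
|     }
|
|     /* 2. computed goto (a GCC/Clang extension): every
|        handler ends with its own indirect jump, so the
|        predictor gets a separate history per handler */
|     int run_goto(const unsigned char *pc) {
|         static void *tbl[] = { &&op_inc, &&op_halt };
|         int acc = 0;
|         goto *tbl[*pc++];
|     op_inc:
|         acc++;
|         goto *tbl[*pc++];
|     op_halt:
|         return acc;
|     }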
| nelhage wrote:
| (author here)
|
| > The problem is 95% about laying out the instruction
| dispatching code for the branch predictor to work optimally.
|
| A fun fact I learned while writing this post is that that's no
| longer true! Modern branch predictors can pretty much
| accurately predict through a single indirect jump, if the run
| is long enough and the interpreted code itself has stable
| behavior!
|
| Here's a paper that studied this (for both real hardware and a
| certain simulated branch predictor):
| https://inria.hal.science/hal-01100647/document
|
| My experiments on this project anecdotally agree; they didn't
| make it into the post but I also explored a few of the
| interpreters through hardware CPU counters and `perf stat`, and
| branch misprediction never showed up as a dominant factor.
| vkazanov wrote:
| Yes, this was already becoming true around the time I was
| writing the linked article. And I also read the paper. :-) I
| also remember I had access to a pre-Haswell era Intel CPU vs
| something a bit more recent, and could see that the more
| complicated dispatcher no longer made as much sense.
|
| Conclusion: the rise of popular interpreter-based languages
| led to CPUs with smarter branch predictors.
|
| What's interesting is that a token threaded interpreter
| dominated my benchmark (https://github.com/vkazanov/bytecode-
| interpreters-post/blob/...).
|
| This trick is meant to simplify dispatching logic and also
| spread branches in the code a bit.
| celeritascelery wrote:
| How do you reconcile that with the observation that moving to
| a computed goto style provides better codegen in Zig [1]? They
| make the claim that using their "labeled switch" (which is
| essentially computed goto) allows you to have multiple
| branches which improves branch predictor performance. They
| even get a 13% speedup in their parser from switch to this
| style. If modern CPUs are good at predicting through a
| single branch, I wouldn't expect this feature to make any
| difference.
|
| [1] https://ziglang.org/download/0.14.0/release-
| notes.html#Code-...
| dwattttt wrote:
| While it's unlikely as neat as this, the blog post we're
| all commenting on is a "I thought we had a 10-15% speedup,
| but it turned out to be an LLVM optimisation misbehaving".
| And Zig (for now) uses LLVM for optimised builds too
| kryptiskt wrote:
| This is a very good example of how C is not "close to the
| machine" or "portable assembly", modern optimizers will do
| drastic changes to the logic as long as it has no observable
| effect.
|
| As stated in the post: "Thus, we end up in this odd world where
| clang-19 compiles the computed-goto interpreter "correctly" - in
| the sense that the resulting binary produces all the same values
| we expect - but at the same time it produces an output completely
| at odds with the intention of the optimization. Moreover, we also
| see other versions of the compiler applying optimizations to the
| "naive" switch()-based interpreter, which implement the exact
| same optimization we "intended" to perform by rewriting the
| source code."
| fouronnes3 wrote:
| Stretching "no observable effect" all the way to a 10000 word
| blog post.
| jmillikin wrote:
| > This is a very good example of how C is not "close to the
| machine" or > "portable assembly",
|
| C is very much "portable assembly" from the perspective of
| other systems programming languages of the 80s-90s era. The C
| expression `a += 1` can be trusted to increment a numeric
| value, but the same expression in C++ might allocate memory or
| unwind the call stack or do who knows what. Similarly, `a =
| "a"` is a simple pointer assignment in C, but in C++ it might
| allocate memory or [... etc].
|
| The phrase "C is portable assembly" isn't a claim that each
| statement gets compiled directly to equivalent machine code.
| eru wrote:
| > The C expression `a += 1` can be trusted to increment a
| numeric value, [...]
|
| Have you heard of undefined behaviour?
| jmillikin wrote:
| Show me a C compiler that miscompiles the following code
| and I'll concede the point:
|
|     uint32_t add_1(uint32_t a) {
|         a += 1;
|         return a;
|     }
| eru wrote:
| I know that for 'int a' the statement 'a += 1' can give
| rather surprising results.
|
| And you made a universal statement that 'a += 1' can be
| trusted. Not just that it can sometimes be trusted. In
| C++ the code you gave above can also be trusted as far as
| I can tell. At least as much as the C version.
| jmillikin wrote:
| I'll expand my point to be clearer.
|
| In C there is no operator overloading, so an expression
| like `a += 1` is easy to understand as incrementing a
| numeric value by 1, where that value's type is one of a
| small set of built-in types.
|
| You'd need to look further up in the function (and maybe
| chase down some typedefs) to see what that type is, but
| the set of possible types generally boils down to "signed
| int, unsigned int, float, pointer". Each of those types
| has well-defined rules for what `+= 1` means.
|
| That means if you see `int a = some_fn(); assert(a <
| 100); a += 1` in the C code, you can expect something
| like `ADD EAX,1` somewhere in the compiler output for
| that function. Or going the other direction, when you're
| in a GDB prompt and you disassemble the current EIP and
| you see `ADD EAX,1` then you can pretty much just look at
| the C code and figure out where you are.
|
| ---
|
| Neither of those is true in C++. The combination of
| completely ad-hoc operator overloading, function
| overloading, and implicit type conversion via
| constructors means that it can be really difficult to map
| between the original source and the machine code.
|
| You'll have a core dump where EIP is somewhere in the
| middle of a function like this:
|
|     std::string some_fn() {
|         some_ns::unsigned<int> a = 1;
|         helper_fn(a, "hello");
|         a += 1;
|         return true;
|     }
|
| and the disassembly is just dozens of function calls for
| no reason you can discern, and you're staring at the
| return type of `std::string` and the returned value of
| `true`, and in that moment you'll _long_ for the happy
| days when undefined behavior on signed integer overflow
| was the worst you had to worry about.
| eru wrote:
| I heartily agree that C++ is a lot more annoying here
| than C, yes.
|
| I'm just saying that C is already plenty annoying enough
| by itself, thanks eg to undefined behaviour.
|
| > That means if you see `int a = some_fn(); assert(a <
| 100); a += 1` in the C code, you can expect something
| like `ADD EAX,1` somewhere in the compiler output for
| that function. Or going the other direction, when you're
| in a GDB prompt and you disassemble the current EIP and
| you see `ADD EAX,1` then you can pretty much just look at
| the C code and figure out where you are.
|
| No, there's no guarantee of that. C compilers are allowed
| to do all kinds of interesting things. However you are
| often right enough in practice, especially if you run
| with -O0, ie turn off the optimiser.
|
| See eg https://godbolt.org/z/YY69Ezxnv and tell me where
| the ADD instruction shows up in the compiler output.
| MaulingMonkey wrote:
|     error: could not convert 'true' from 'bool' to 'std::string'
|     {aka 'std::__cxx11::basic_string<char>'}
|
| I don't think anyone's claiming C nor C++'s dumpster
| fires have signed integer overflow at the top of the pile
| of problems, but when the optimizer starts deleting
| security or bounds checks and other fine things - because
| of signed integer overflow, or one of the million other
| causes of undefined behavior - I will _pray_ for
| something as straightforward as a core dump, no matter
| _where_ EIP has gone.
|
| Signed integer overflow UB is the kind of UB that has a
| nasty habit of causing _subtle_ heisenbugfuckery when
| triggered. The kind you _might, hopefully,_ make shallow
| with ubsan and good test suite coverage. In other words,
| the kind you _won't_ make shallow.
| jmillikin wrote:
| For context, I did not pick that type signature at
| random. It was in actual code that was shipping to
| customers. If I remember correctly there was some sort of
| bool -> int -> char -> std::string path via `operator()`
| conversions and constructors that allowed it to compile,
| though I can't remember what the value was (probably
| "\x01").
|
| ---
|
| My experience with the C/C++ optimizer is that it's
| fairly timid, and only misbehaves when the input code is
| _really_ bad. Pretty much all of the (many, many) bugs I
| 've encountered and/or written in C would have also
| existed if I'd written directly in assembly.
|
| I know there are libraries out there with build
| instructions like "compile with -O0 or the results will
| be wrong", but aside from the Linux kernel I've never
| encountered developers who put the blame on the
| _compiler_.
| MaulingMonkey wrote:
| > but aside from the Linux kernel I've never encountered
| developers who put the blame on the compiler.
|
| I encounter them _frequently_.
|
| 99.99% of the time it's undefined behavior and they're
| "wrong".
|
| Frequently novices who have been failed by their teachers
| and documentation (see previous rant using atoi as an
| example of the poor quality of documentation about UB:
| https://news.ycombinator.com/item?id=14861917 .)
|
| Less frequently, it's experienced devs half joking out of
| a need for catharsis.
|
| Rarely, experienced devs finally getting to the end of
| their rope, and are finally beginning to seriously
| consider if they've got a codegen bug. They don't, but
| they're considering it. They know they were wrong the
| last 10 times they considered it, but they're considering
| it again damnit!
|
| The linux kernel devs aren't _quite_ unique in "just
| because you can, doesn't mean you should"ing their way
| into blaming the compiler for what could be argued to be
| defects in the standard or fundamental design of the
| language (the defect being making UB so common), but
| that's probably among the rarest slice of the pie of
| people blaming the compiler for UB. Few have the will to
| tilt at that windmill and voice their opinions when the
| compiler devs can easily just blame the standard - better
| to keep such unproductive rants close to heart instead,
| or switch to another language. Something _actually
| productive_.
|
| 0.01% of the time, it's a legitimate codegen bug on well-
| defined behavior code. Last one I tracked down to a bug
| tracker, was MSVC miscompiling 4x4 matrix multiplications
| by failing to spill a 17th value to stack when it only
| had 16 SSE registers to work with. Caught by unit tests,
| but not by CI, since people updated compiler versions at
| their own random pace, and who runs `math_tests` on their
| personal machines when they're not touching `math`?
| gpderetta wrote:
| Counterexample: https://godbolt.org/z/z3jqjPT6o
| jmillikin wrote:
| I'm not sure that's a counter-example -- what assembly do
| you think should be emitted for floating-point math on an
| AVR microcontroller?
| gpderetta wrote:
| It means that "a += 1` is easy to understand as
| incrementing a numeric value by 1" is not true and
| instead "it can be really difficult to map between the
| original source and the machine code".
|
| More examples of non-trivial mapping from C code to
| generated code: https://godbolt.org/z/jab6vh6dM
| jmillikin wrote:
| All of those look pretty straightforward to me -- again,
| what assembly would you expect to be emitted in those
| cases?
|
| For contrast, here's the assembly generated for Haskell
| for integer addition: https://godbolt.org/z/vdeMKMETT
|
| And here's assembly for C++:
| https://godbolt.org/z/dedcof9x5
| gpderetta wrote:
| > All of those look pretty straightforward to me --
| again, what assembly would you expect to be emitted in
| those cases?
|
| It is very straightforward indeed, but it is still not
| mapping primitive operations to direct machine code, but
| it is forwarding to out-of-line code. Same as operator
| overloading in other languages.
|
| > And here's assembly for C++:
| https://godbolt.org/z/dedcof9x5
|
| That's just a symptom of allowing the compiler to inline
| the add code, otherwise the generated code is as
| straightforward:
|
|     addOne(Int):
|         push rax
|         mov  esi,0x1
|         call 4010c0 <add_safe(int, int)>
|
| Ref: https://godbolt.org/z/xo1es9TcW
| jmillikin wrote:
| > It is very straightforward indeed, but it is still not
| mapping primitive operations to direct machine code, but it
| is forwarding to out-of-line code.
|
| I am not claiming that C is a collection of assembler
| macros. There is no expectation that a C compiler emit
| machine code that has exact 1:1 correspondence with the
| input source code.
|
| > Same as operator overloading in other languages.
|
| The lack of operator overloading, and other hidden
| complex control flow, is the reason that someone can read
| C code and have a pretty good idea of what it compiles
| to.
|
| > That's just a symptom of allowing the compiler to inline
| the add code, otherwise the generated code is as
| straightforward:
|
| No, that's just moving the instructions around. You've
| still got dynamic allocation and stack-unwinding being
| generated for a line that doesn't have any sign of
| entering a complex control flow graph.
| pjmlp wrote:
| > ... and other hidden complex control flow,....
|
| Until someone calls longjmp() or a signal() is triggered.
| Extra bonus of fun if it happens to be multithreaded
| application, or in the middle of a non-rentrant call.
| acdha wrote:
| > That means if you see `int a = some_fn(); assert(a <
| 100); a += 1` in the C code, you can expect something
| like `ADD EAX,1` somewhere in the compiler output for
| that function.
|
| I completely agree that C++ is orders of magnitude worse
| but I've seen at least a couple counter-examples with
| code almost that simple. A researcher I used to support
| compared each release against a set of reference results,
| and got a surprise when they didn't match but his program
| was working. This turned out to be a new compiler release
| being smart enough to inline and reorder his code to use
| a fused multiply-add instruction, which had greater
| internal precision and so the result was very slightly
| different from his saved referenced set. GCC has
| -fexcess-precision=standard for this but you have to
| understand the problem first.
| TinkersW wrote:
| a+=1 will not produce any surprising results, signed
| integer overflow is well defined on all platforms that
| matter.
|
| And we all know about the looping behavior, it isn't
| surprising.
|
| The only surprising part would be if the compiler decides
| to use inc vs add, not that it really matters to the
| result.
| eru wrote:
| > a+=1 will not produce any surprising results, signed
| integer overflow is well defined on all platforms that
| matter.
|
| I'm not sure what you are talking about?
|
| There's a difference between how your processor behaves
| when given some specific instructions, and what
| shenanigans your C compiler gets up to.
|
| See eg https://godbolt.org/z/YY69Ezxnv and tell me where
| the ADD instruction shows up in the compiler output. Feel
| free to pick a different compiler target than Risc-V.
| jmillikin wrote:
| I don't think "dead-code elimination removes dead code"
| adds much to the discussion.
|
| If you change the code so that the value of `a` is used,
| then the output is as expected:
| https://godbolt.org/z/78eYx37WG
| xmcqdpt2 wrote:
| The parent example can be made clearer like this:
| https://godbolt.org/z/MKWbz9W16
|
| Dead code elimination only works here because integer
| overflow is UB.
| jmillikin wrote:
| Take a closer look at 'eru's example and my follow-up.
|
| He wrote an example where the result of `a+1` isn't
| necessary, so the compiler doesn't emit an ADDI even
| though the literal text of the C source contains the
| substring "a += 1".
|
| Your version has the same issue:
|
|     unsigned int square2(unsigned int num) {
|         unsigned int a = num;
|         a += 1;
|         if (num < a) return num * num;
|         return num;
|     }
|
| The return value doesn't depend on `a+1`, so the compiler
| can optimize it to just a comparison.
|
| If you change it to this:
|
|     unsigned int square2(unsigned int num) {
|         unsigned int a = num;
|         a += 1;
|         if (num < a) return num * a;
|         return num;
|     }
|
| then the result of `a+1` is required to compute the
| result in the first branch, and therefore the ADDI
| instruction is emitted.
|
| The (implied) disagreement is whether a language can be
| considered to be "portable assembly" if its compiler
| elides unnecessary operations from the output. I think
| that sort of optimization is allowed, but 'eru
| (presumably) thinks that it's diverging too far from the
| C source code.
| gpderetta wrote:
| In what world does the return value not depend on 'a' in
| this code?
|
|     if (num < a) return num * num;
|     /*else*/     return num;
|
| A control dependency is still a dependency
| jmillikin wrote:
| `a = num; a += 1; if (num < a)` is the same as `if (num <
| (num + 1))`, which for unsigned integer addition can be
| rewritten as `if (num != UINT_MAX)`. So there's no need
| to actually compute `a+1`, the comparison is against a
| constant.
|
| If the code returns `num * a` then the value of `a` is
| now necessary, and must be computed before the function
| returns.
|
| For signed integer addition the compiler is allowed to
| assume that `(num < (num + 1))` is true, so the
| comparison can be removed entirely.
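|
| A minimal pair to make that concrete (a sketch; the function
| names are made up):
|
|     /* unsigned: `num < num + 1u` is false only for
|        UINT_MAX, so it compiles to a comparison against
|        a constant -- no addition needs to be emitted */
|     unsigned keep_u(unsigned num) {
|         return (num < num + 1u) ? num : 0;
|     }
|
|     /* signed: overflow is UB, so the compiler may assume
|        `num < num + 1` always holds and drop the branch
|        entirely */
|     int keep_s(int num) {
|         return (num < num + 1) ? num : 0;
|     }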
| tux3 wrote:
| You show one example where C doesn't have problems, but
| that's a much weaker claim than it sounds. "Here's one
| situation where this here gun won't blow your foot off!"
|
| For what it's worth, C++ also passes your test here. You
| picked an example so simple that it's not very
| interesting.
| eru wrote:
| Actually even here C (and C++) has some problems, too:
|
| I don't think the standard says much about how to handle
| stack overflows?
| jmillikin wrote:
| 'eru implied `a += 1` has undefined behavior; I provided
| a trivial counter-example. If you'd like longer examples
| of C code that performs unsigned integer addition then
| the internet has many on offer.
|
| I'm not claiming that C (or C++) is without problems. I
| wrote code in them for ~20 years and that was more than
| enough; there's a reason I use Rust for all my new low-
| level projects. In this case, writing C without undefined
| behavior requires lots of third-party static analysis
| tooling that is unnecessary for Rust (due to being built
| in to the compiler).
|
| But if you're going to be writing C as "portable
| assembly", then the competition isn't Rust (or Zig, or
| Fortran), it's _actual_ assembly. And it 's silly to
| object to C having undefined behavior for signed integer
| addition, when the alternative is to write your VM loop
| (or whatever) five or six times in platform-specific
| assembly.
| eru wrote:
| Forth might be a better competitor for 'portable
| assembly', though.
| WJW wrote:
| C might be low level from the perspective of other
| systems languages, but that is like calling Apollo 11
| simple from the perspective of modern spacecraft. C as
| written is not all that close to what actually gets
| executed.
|
| For a small example, there are many compilers who would
| absolutely skip incrementing 'a' in the following code:
|
|     uint32_t add_and_subtract_1(uint32_t a) {
|         a += 1;
|         a -= 1;
|         return a;
|     }
|
| Even though that code contains `a += 1;` clear as day,
| the chances of any incrementing being done are quite
| small IMO. It gets even worse in bigger functions where
| out-of-order execution starts being a thing.
| eru wrote:
| > It gets even worse in bigger functions where out-of-
| order execution starts being a thing.
|
| In addition, add that your processor isn't actually
| executing x86 (nor ARM etc) instructions, but
| interprets/compiles them to something more fundamental.
|
| So there's an additional layer of out-of-order
| instructions and general shenanigans happening.
| Especially with branch prediction in the mix.
| johnisgood wrote:
| Why would you want it to increment by 1 if we decrement by 1
| from the same variable? That would be a waste of cycles
| and a good compiler knows how to optimize it out, or what
| am I misunderstanding here? What do you expect "it" to do
| and what does it really do?
|
| See: https://news.ycombinator.com/item?id=43320495
| lifthrasiir wrote:
| It is unlikely as written, but it frequently arises from
| macro expansions and inlining.
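|
| A sketch of how such a pair can appear after expansion (the
| macros are invented for illustration):
|
|     struct ctx { int depth; };
|
|     #define ENTER(c) ((c)->depth += 1)
|     #define LEAVE(c) ((c)->depth -= 1)
|
|     void noop(struct ctx *c) {
|         ENTER(c);   /* expands to c->depth += 1 */
|         LEAVE(c);   /* expands to c->depth -= 1 */
|     }               /* the compiler cancels the pair */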
| acdha wrote:
| That's a contrived example but in a serious program there
| would often be code in between or some level of
| indirection (e.g. one of those values is a lookup, a
| macro expansion, or the result of another function).
|
| Nothing about that is cheating, it just says that even C
| programmers cannot expect to look at the compiled code
| and see a direct mapping from their source code. Your
| ability to reason about what's actually executing
| requires you to internalize how the compiler works in
| addition to your understanding of the underlying hardware
| and your application.
| johnisgood wrote:
| In what language that is not assembly can you do that,
| though? The higher level the language is, the "worse" or
| more difficult it gets; perhaps I am not following the
| thread right.
| acdha wrote:
| Yes, this is normal for languages. The only pushback here
| is against the term "portable assembler" being applied to
| C, where it's incomplete often enough that many people
| feel it's no longer a helpful label.
|
| I think it's also reflecting the maturity and growth of
| the industry. A turn of the century programmer could
| relatively easily find areas where dropping down to
| assembly was useful, but over the subsequent decades
| that's become not only uncommon but often actively
| harmful: your code hand-optimized for a particular
| processor is likely slower on newer processors than what
| a modern compiler emits and is definitely a barrier to
| portability in an era where not only are ARM and
| potentially RISC-V of interest but also where code is
| being run on SIMD units or GPUs. This makes the low-level
| "portable assembler" idea less useful because there's
| less code written in that middle ground when you want
| either a higher-level representation which gives
| compilers more flexibility or precise control. For
| example, cryptography implementers want not just high
| performance but also rigid control of the emitted code to
| avoid a compiler optimizing their careful constant-time
| implementation into a vulnerability.
| vikramkr wrote:
| I'm pretty sure that's replying directly to the comment
| about how c is close to assembly and that if you add that
| line of code somewhere you know there's a variable
| getting incremented. Doesn't really matter whether or not
| it's useful, the point is that the behavior isn't exactly
| what you wrote
| jmillikin wrote:
| To reiterate, claiming that C can be described as
| "portable assembly" is not a claim that it is literally a
| package of assembler macros that emit deterministic
| machine code for each individual source expression.
|
| I linked these in another comment, but here's some
| examples of straightforward-looking integer addition
| emitting more complex compiler output for other languages
| that compile to native code:
|
| Haskell: https://godbolt.org/z/vdeMKMETT
|
| C++: https://godbolt.org/z/dedcof9x5
| johnisgood wrote:
| https://godbolt.org/z/r39jK1ddv
|
| It increments, then decrements with -O0 though.
|
| I do not see the issue still, as the behavior is expected
| with -O0; increments then decrements.
| chasd00 wrote:
| I'm not an embedded expert but a friend of mine has
| complained about compiler optimizations breaking things
| in his programs. I could see incrementing by one being
| used to set some bits in a memory location for a cycle
| that may mean something to some peripheral and then
| decrementing by one to set some other bits that may mean
| something else. In that case, the compiler removing those
| two lines would cause a very hard to debug issue.
| johnisgood wrote:
| I understand compiler optimizations having unintended
| consequences, e.g. https://godbolt.org/z/r39jK1ddv but
| there are a lot of options he may use to enable or
| disable optimizations (assuming GCC here):
| https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
|
| This is even more difficult to do with higher-level
| languages.
| mkoubaa wrote:
| If you want the compiler to treat your code as literal
| portable assembly turn off optimizations.
| johnisgood wrote:
| Exactly, pretty much what I said, or enable / disable the
| optimizations you want.
| gpderetta wrote:
| If by miscompile you mean that it fails the test that a
| "C expression `a += 1` can be trusted to increment a
| numeric value", then it is trivial:
| https://godbolt.org/z/G5dP9dM5q
| Retr0id wrote:
| Here's another one, just for fun
| https://godbolt.org/z/TM1Ke4d5E
| jmillikin wrote:
| I was being somewhat terse.
|
| The (implied) claim is that the C standard has enough
| sources of undefined behavior that even a simple integer
| addition can't be relied upon to actually perform integer
| addition.
|
| But the sources of undefined behavior for integer
| addition in C are well-known and very clear, and any
| instruction set that isn't an insane science project is
| going to have an instruction to add integers.
|
| Thus my comment. Show me a C compiler that takes that
| code and miscompiles it. I don't care if it returns a
| constant, spits out an infinite loop, jumps to 0x0000,
| calls malloc, whatever. Show me a C compiler that takes
| those four lines of C code and emits something other than
| an integer addition instruction.
| Retr0id wrote:
| Why are you talking about miscompilation? While the LLVM
| regression in the featured article makes the code slower,
| it is not a miscompilation. It is "correct" according to
| the contract of the C language.
| IshKebab wrote:
| That will be inlined by any C compiler and then pretty
| much anything can happen to the 'a += 1'.
| kryptiskt wrote:
| When the code has hit the IR in clang or gcc, there is no 'a'
| (we know that with certainty, since SSA form doesn't mutate
| but assigns to fresh variables). We don't know if there will
| be an increment of 1, the additions could be coalesced (or
| elided if the result can be inferred another way). The number
| can even decrease, say if things have been handled in chunks
| of 16, and needs to be adjusted down in the last chunk. Or
| the code may be auto-vectorized and completely rewritten, so
| that none of the variables at the C level are reflected on
| the assembler level.
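|
| Roughly, in SSA form every assignment creates a fresh name;
| a sketch, with the SSA names in comments:
|
|     a = x;      /* a1 = x                            */
|     a += 1;     /* a2 = a1 + 1                       */
|     a += 2;     /* a3 = a2 + 2, which the optimizer  */
|                 /* is free to fold to a3 = a1 + 3    */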
| jmillikin wrote:
| From a high-level academic view, yes, the compiler is
| allowed to perform any legal transformation. But in
| practice C compilers are pretty conservative about what
| they emit, especially when code is compiled without -march=.
|
| You don't have to take my word for it. Go find a moderately
| complex open-source library written in C, compile it, then
| open up the result in Hexrays/Ghidra/radare2/whatever.
| Compare the compiled functions with their original source
| and you'll see there's not that much magic going on.
| hun3 wrote:
| -O3 does autovectorization: turning your loops into a
| bunch of SIMD instructions, sometimes even drastically
| changing the performance profile.
|
| If autovectorization is "not that much magic" then idk
| what else it is.
| lifthrasiir wrote:
| Nowadays it's -O2. I was also surprised when I first
| learned this.
| gpderetta wrote:
| Any optimization you are familiar with is trivial and
| expected. Everything else is broken compilers optimizing
| UB to win benchmarks.
| cogman10 wrote:
| They are as aggressive as they can be.
|
| Here's an example of a C compiler completely eliminating
| a loop because it has figured out how to transform the
| loop into a constant calculation.
|
| https://godbolt.org/z/cfndqMj4j
|
| The place where C compilers are conservative is when
| dealing with arrays and pointers. That's because it's
| impossible for C to know if a pointer is to an element of
| an array or something completely different. Pointer math
| further complicates what a pointer could actually
| reference.
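|
| A sketch of the kind of loop that gets folded (clang at -O2
| typically reduces this to its closed form):
|
|     unsigned sum_below(unsigned n) {
|         unsigned sum = 0;
|         for (unsigned i = 0; i < n; i++)
|             sum += i;
|         return sum;   /* emitted as roughly
|                          n * (n - 1) / 2 -- no loop */
|     }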
| pjc50 wrote:
| > The phrase "C is portable assembly" isn't a claim that each
| statement gets compiled directly to equivalent machine code.
|
| Weasel words. Like a "self driving car" that requires a human
| driver with constant attention willing to take over within a
| few hundred milliseconds.
|
| People advocate for C and use it in a way that implies they
| think it can achieve specific machine outcomes, and it
| _usually_ does... except when it doesn't. If people want a
| portable assembler they should build one.
| jmillikin wrote:
| As a general rule if you're reading a technical discussion
| and every single participant is using a particular phrase
| in a way that doesn't make sense to you then you should
| probably do a quick double-check to make sure you're on the
| same page.
|
| For example, in this discussion about whether C is
| "portable assembly", you might be tempted to think back to
| the days of structured programming in assembly using
| macros. I no longer remember the exact syntax, but programs
| could be written to look like this:
| .include "some-macro-system.s" .include "posix-sym.s"
| .func _start(argc, argv) { .asciz message "Hello,
| world!" .call3 _write STDOUT message (.len message)
| .call1 _exit 0 }
|
| Assembly? Definitely! Portable? Eh, sort of! If you're
| willing to restrict yourself to DOS + POSIX and write an
| I/O abstraction layer then it'll probably run on
| i386/SPARC/Alpha/PA-RISC.
|
| But that's not really what people are discussing, is it?
|
| When someone says "C is portable assembly" they don't mean
| you can take C code and run it through a platform-specific
| macro expander. They don't mean it's _literally_ a portable
| dialect of assembly. They expect the C compiler to perform
| some transformations -- maybe propagate some constants,
| maybe inline a small function here and there. Maybe you 'd
| like to have named mutable local variables, which requires
| a register allocator. Reasonable people can disagree about
| exactly what transformations are legal, but at that point
| it's a matter of negotiation.
|
| Anyway, now you've got a language that is more portable
| than assembler macros but still compiles more-or-less
| directly to machine code -- not completely divorced from
| the underlying hardware like Lisp (RIP Symbolics). How
| would you describe it in a few words? "Like assembly but
| portable" doesn't seem unreasonable.
| pjc50 wrote:
| > still compiles more-or-less directly to machine code
|
| There's a lot hiding in "more or less". The same kind of
| example holds for e.g. C# :
| https://godbolt.org/noscript/csharp ; if you hit
| "Compile" it'll give you the native binary. If you write
| "x+1" it'll generate an add .. or be optimized away. Now
| does that mean it's portable assembler? Absolutely not.
|
| Conversely there's a bunch of things that people expect
| to do in C, do in real code, but are _not in the standard
| or are undefined or implementation-defined_. As well as
| things that are present in assemblers for various
| platforms (things like the overflow flag) which aren 't
| accessible from the C language.
|
| What people _actually_ seem to mean by "portable
| assembler" is "no guardrails". Memory unsafety as a
| feature.
|
| > Reasonable people can disagree about exactly what
| transformations are legal, but at that point it's a
| matter of negotiation
|
| And a matter of CVEs when you lose your negotiation with
| the compiler. Or less dramatic things like the
| performance fluctuations under discussion.
| pjmlp wrote:
| Systems languages that predated C could already do that; that
| is the typical myth.
| titzer wrote:
| Saying that something "is like XY" when you really mean "is
| like XY, at least in comparison to C++" isn't what most
| people mean.
|
| C is not a portable assembler.
|
| In C, "a += 1" could overflow, and signed overflow is
| undefined behavior--even though _every individual ISA_ has
| completely defined semantics for overflow, and nearly all of
| them these days do two's complement wraparound arithmetic.
| With C's notion of undefined behavior, it doesn't even give
| you the same wraparound in different places in the same
| program. In fact, wraparound is so undefined that the program
| could do absolutely anything, and the compiler is not
| required to even tell you about it. Even without all the C++
| abstraction madness, a C compiler can give you absolutely
| wild results due to optimizations, e.g. by evaluating "a +=
| 1" at compile time and using a different overflow behavior
| than the target machine. Compile-time evaluation not matching
| runtime evaluation is one of a _huge_ number of dumb things
| that C gives you.
|
| Another is that "a += 1" may not even increment the variable.
| If this occurs as an expression, and not as a statement, e.g.
| "f(a += 1, a += 1)", you might only get one increment due to
| sequence points[1]--not to mention that the order of
| evaluation might be different depending on the target.
|
| _C is not a portable assembler._
|
| C is a low-level language where vague machine-like programs
| get compiled to machine code that may or may not work,
| depending on whether it violates UB rules or not, and there
| are precious few diagnostics to tell if that happened, either
| statically or dynamically.
|
| [1] https://en.wikipedia.org/wiki/Sequence_point
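|
| A minimal sketch of the compile-time-vs-runtime point (the
| functions are invented for illustration):
|
|     #include <limits.h>
|
|     static int bump(int a) { return a + 1; }
|
|     int demo(void) {
|         /* INT_MAX + 1 is UB: the compiler may fold this
|            call at compile time to any value it likes,
|            while the same addition evaluated at run time
|            would typically wrap on two's-complement
|            hardware */
|         return bump(INT_MAX);
|     }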
| MattPalmer1086 wrote:
| Benchmarking is just insanely hard to do well. There are so many
| things which can mislead you.
|
| I recently discovered a way to make an algorithm about 15%
| faster. At least, that's what all the benchmarks said. At some
| point I duplicated the faster function in my test harness, but
| did not call the faster version, just the original slower one...
| And it was still 15% faster. So code that never executed sped up
| the original code...!!! Obviously, this was due to code and
| memory layout issues, moving something so it aligned with some
| CPU cache better.
|
| It's actually really really hard to know if speedups you get are
| because your code is actually "better" or just because you lucked
| out with some better alignment somewhere.
|
| Casey Muratori has a really interesting series about things like
| this on his Substack.
| eru wrote:
| I vaguely remember some benchmarking project that
| deliberately randomised these compiler decisions, so that they
| could give you more stable estimates of how well your code
| actually performed, and not just how well you won or lost the
| linker lottery.
| Mond_ wrote:
| You're probably thinking of "Performance Matters" by Emery
| Berger, a Strange Loop talk.
| https://youtube.com/watch?v=r-TLSBdHe1A
| MattPalmer1086 wrote:
| There was Stabilizer [1] which did this, although it is no
| longer maintained and doesn't work with modern versions of
| LLVM. I think there is something more current now that
| automates this, but can't remember what it's called.
|
| [1] https://emeryberger.com/research/stabilizer/
| alpaca128 wrote:
| As already mentioned this is likely Emery Berger's project
| with the idea of intentionally slowing down different parts
| of the program, also to find out which parts are most
| sensitive to slowdowns (aka have the biggest effect on
| overall performance), with the assumption that these are also
| the parts that profit the most from optimisations.
| FridgeSeal wrote:
| The Coz profiler from Emery Berger.
|
| It can actually go a step further and give you a decent
| estimate of which functions you need to change to have the
| desired latency/throughput increases.
| MattPalmer1086 wrote:
| Thanks, I was trying to remember that one!
| McP wrote:
| LLD has a new option "--randomize-section-padding" for this
| purpose: https://github.com/llvm/llvm-project/pull/117653
| MattPalmer1086 wrote:
| Interesting, thanks!
| igouy wrote:
| "Producing wrong data without doing anything obviously
| wrong!"
|
| https://doi.org/10.1145/1508244.1508275
| igouy wrote:
| "Producing wrong data without doing anything obviously
| wrong!"
|
| [pdf]
|
| https://users.cs.northwestern.edu/~robby/courses/322-2013-s
| p...
| porridgeraisin wrote:
| That linker lottery led to a 15% improvement? I'm surprised. Do
| you know in what cases you get such a big improvement from
| something like that? Is it rare? How did you end up reasoning
| about it?
| MattPalmer1086 wrote:
| Various research has shown that the variation can be much
| higher than 15% due to things like this. It's not that rare;
| I keep bumping into it. Compilers and linkers do a reasonable
| job but fundamentally modern CPUs are extremely complex
| beasts.
|
| I found Casey Muratori's series the best explanation of what
| is going on at the CPU level.
| MattPalmer1086 wrote:
| Some additional context. I was actually writing a
| benchmarking tool for certain kinds of search algorithm. I
| spent a long time reducing and controlling for external
| sources of noise. CPU pinning, doing hundreds of those runs
| with different data, and then repeating them several times
| and taking the best of each score with the same data (to
| control for transient performance issues due to the machine
| doing other things).
|
| I got the benchmarking tool itself to give reasonably
| repeatable measurements.
|
| The tool had high precision, but the accuracy of "which
| algorithm is better" was not reliable just due to these code
| layout issues.
|
| I basically gave up and shelved the benchmarking project at
| that point, because it wasn't actually useful to determine
| which algorithm was better.
| albertzeyer wrote:
| To clarify: The situation is still not completely understood?
| It's not just the computed gotos, but there is some other
| regression in Clang 19? Basically, the difference between
| clang19.nocg and clang19 is not really clear?
|
| Btw, what about some clang18.tc comparison, i.e. Clang18 with the
| new tail-call interpreter? I wonder how that compares to
| clang19.tc.
| aeyes wrote:
| The article includes a link to the pull request which fixes the
| regression in LLVM: https://github.com/llvm/llvm-
| project/pull/114990
| nelhage wrote:
| > Btw, what about some clang18.tc comparison
|
| (post author here) Oh, this is something I could have called
| out explicitly: The tail-calling interpreter relies on a
| feature (the `preserve_none` calling convention) that only
| landed in clang-19. That means you can only test it on that
| version. That coincidence (that 19 added both this feature,
| _and_ the regression) is part of why this was so easy to miss
| at first, and why I had to "triangulate" with so many
| different benchmarks to be confident I understood what was
| going on.
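|
| For the curious, the shape of the feature looks roughly like
| this (a sketch based on the Clang attribute docs, not
| CPython's actual generated code):
|
|     struct VM;  /* interpreter state (illustrative) */
|
|     /* preserve_none (Clang 19+) frees nearly all registers
|        for the interpreter's own use; musttail guarantees
|        the dispatch compiles to a jump rather than a
|        deepening call stack */
|     __attribute__((preserve_none))
|     void op_next(struct VM *vm);
|
|     __attribute__((preserve_none))
|     void op_inc(struct VM *vm) {
|         /* ... execute one bytecode ... */
|         __attribute__((musttail)) return op_next(vm);
|     }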
| thrdbndndn wrote:
| Great article! One detail caught my attention.
|
| In one of the referenced articles,
| https://simonwillison.net/2025/Feb/13/python-3140a5/, the author
| wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the
| benchmark (on my extremely overloaded M2 MacBook Pro)."
|
| I'm quite confused by this. Did the author run the benchmark
| while the computer was overloaded with other processes? Wouldn't
| that make the results completely unreliable? I would have thought
| these benchmarks are conducted in highly controlled environments
| to eliminate external variables.
| ambivalence wrote:
| Simon Willison is a great guy, but he's not a Python core
| developer and his ad hoc benchmark is not what the CPython core
| team members are using. For the latter, see
| https://github.com/faster-cpython/benchmarking-public
| motbus3 wrote:
| I recently did some benchmarking from Python 3.9 to 3.13, and up
| to 3.11 it only got better. Python 3.12 and 3.13 were about 10%
| slower than 3.11.
|
| I thought my homemade benchmark wasn't great enough so I deployed
| it to a core service anyway and saw the same changes in our
| collected metrics. Does anyone else have the same problem?
| sgarland wrote:
| Yes, I found a loop performance regression [0] in 3.12 and
| 3.13.
|
| [0]: https://github.com/python/cpython/issues/123540
| cb321 wrote:
| While some people here are citing 10% as "large" and 1% as
| "normal", there are optimizations like partial inlining of doubly
| recursive Fibonacci that can reduce the actual work (and so also
| time) _exponentially_ (factors >10X for double-digit arguments
| or 1000s of %, technically exponential in the difference of the
| recursion depth, not the problem size [1]).
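|
| A sketch of the Fibonacci case, assuming the compiler (or the
| programmer) merges the two identical recursive calls that
| partial inlining exposes:
|
|     /* naive doubly-recursive Fibonacci */
|     long fib(int n) {
|         return n < 2 ? n : fib(n - 1) + fib(n - 2);
|     }
|
|     /* fib(n - 1) inlined one level: the two fib(n - 2)
|        calls collapse into one shared subproblem, so the
|        call tree grows like ~1.32^n instead of ~1.62^n --
|        an exponentially growing saving */
|     long fib_inl(int n) {
|         if (n < 2) return n;
|         long f2 = fib_inl(n - 2);
|         return (n < 3 ? 1 : f2 + fib_inl(n - 3)) + f2;
|     }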
|
| C compilers can also be very finicky about their code inlining
| metrics. So, whether that enormous speed-up is realized can be
| very sensitive to the shape of your code.
|
| So, while part of the problem is definitely that CPUs have gotten
| quite fancy/complex, another aspect is that compilers "beyond -O0
| or -O1" have _also_ gotten fancy /complex.
|
| The article here, while good & worth reading, is also but one of
| many examples of how two complex things interacting can lead to
| very surprising results (and this is true outside of computing).
| People have a rather strong tendency to oversimplify -- no matter
| how many times this lesson shows up.
|
| EDIT: Also, the article at least uses _two_ CPUs, Intel & Apple
| M1 and _two_ compilers (gcc & clang), but there are realistic
| deployment scenarios across _many more_ generations /impls of
| Intel, AMD, and ARM and probably other compilers, too. So, it
| only even samples a small part of the total complexity. Also, if
| you want to be more scientific, esp. for differences like "1.01X"
| then time measurements should really have error bars of _some
| kind_ (either std.dev of the mean or maybe better for a case like
| this std.dev of the min[2]) and to minimize those measurement
| errors you probably want CPU core affinity scheduling in the OS.
|
| [1] https://stackoverflow.com/questions/360748/computational-
| com...
|
| [2] https://github.com/c-blake/bu/blob/main/doc/tim.md
| kenjin4096 wrote:
| Hello. I'm the author of the PR that landed the tail-calling
| interpreter in CPython.
|
| First, I want to say thank you to Nelson for spending almost a
| month to get to the root of this issue.
|
| Secondly, I want to say I'm extremely embarrassed and sorry that
| I made such a huge oversight. I, and probably the rest of the
| CPython team, did not expect the compiler we were using for the
| baseline to have that bug.
|
| I posted an apology blog post here. https://fidget-
| spinner.github.io/posts/apology-tail-call.htm...
| jxjnskkzxxhx wrote:
| I just wanted to say: respect for being able to say "sorry, I
| made a mistake". I hate the fake it till you make it mentality
| that seems to be the norm now.
| kzrdude wrote:
| I understand the frustration, but I don't think it needed to
| be said (the part about mentality; the thanks is of course
| cool), because that's still not the norm.
|
| Why do I even bring this up - I want to say: let's not let
| what we see in the news influence our perception of the real
| people of the world. Just because fraud and crimes get
| elevated in the news does not mean that the common man is a
| criminal or a fraud. :)
| codr7 wrote:
| Divide and conquer; we're supposed to hate each other and
| trust the state/elite/technology, at least that's the plan.
|
| The real criminals, the people we should keep an eye on,
| are the plan's designers and implementers.
| ptx wrote:
| Why didn't this regression in baseline performance show up (or
| did it?) on the faster-cpython benchmarks page [0]? Could the
| benchmarks be improved to prevent similar issues in the future?
|
| [0] https://github.com/faster-cpython/benchmarking-public
| kenjin4096 wrote:
| We don't normally test with bleeding-edge compilers on the
| faster cpython benchmarks page because that would invalidate
| historical data. E.g. if 2 years ago we used GCC 11 or
| something to compile and run a benchmark, we need to run it
| with GCC 11 again today to get comparable results.
|
| Clang 19 was released last year. We only benched it a few
| months ago. We did notice there was a significant slowdown on
| macOS, but that was against Xcode Clang, which is a different
| compiler. I thought it might've been an Xcode thing, which in
| the past has bitten CPython before (such as Xcode LTO
| working/not working versus normal Clang), so I didn't
| investigate deeper (facepalming now in retrospect) and
| chalked it up to a compiler difference.
|
| TLDR: We didn't run benchmarks of clang 19 versus 18. We only
| ran benchmarks of clang 19 versus gcc, Xcode clang, and MSVC.
| All of which are not apples-to-apples with Clang 19, so I
| naively thought it was a compiler difference.
|
| EDIT: As to how we could improve this process, I'm not too
| sure, but I know I'll be more discerning when there's a >4%
| perf hit now when upgrading compilers.
| cb321 wrote:
| That is a better than average benchmark page.
|
| As alluded to in
| https://news.ycombinator.com/item?id=43319010, I see these
| tests were collected against just 2 Intel and 2 ARM CPUs. So,
| if you are looking for feedback to improve, you should
| probably also include (at least) a AMD Zen4 or Zen5 in there.
| CPU & compiler people have been both trying to "help perf
| while not trusting the other camp" for as long as I can
| remember and I doubt that problem will ever go away.
|
| A couple more CPUs will help but not solve generalizability
| of results. E.g., if somebody tests against some ancient 2008
| Nehalem hardware, they might get _very_ different answers.
| Similarly, answers today might not reflect 2035 very well.
|
| The reality of our world of complex hardware deployment
| (getting worse with GPUs) is that "portable performance" is
| _almost_ a contradiction in terms. We all just do the best we
| can at some semblance of the combination. The result is some
| kind of weighted average of "not #ifdef'd too heavily
| source" and "a bit" faster "informally averaged over our user
| population & their informally averaged workloads" and this
| applies at _many levels_ of the whole computing stack.
|
| EDIT: And, of course, a compiled impl like Cython or Nim is
| another way to go if you care about performance, but I do
| understand the pull & network effects of the Python
| ecosystem. So, that may not always be practical.
| pseufaux wrote:
| The way this was handled is incredibly good form! Thank you for
| going to such lengths to make sure the mistake was corrected. I
| respect and thank you (and all the developers working on
| Python) for all the work you do!
| delijati wrote:
| One question arises: does the added code [1] bring any
| improvement, or does it merely complicate the source? Should it
| not be removed?
|
| [1] https://github.com/python/cpython/pull/128718
| kenjin4096 wrote:
| That's a fair question.
|
| The blog post mentions it brings a 1-5% perf improvement.
| Which is still significant for CPython. It does not
| complicate the source because we use a DSL to generate
| CPython's interpreters. So the only complexity is in
| autogenerated code, which is usually meant for machine
| consumption anyways.
|
| The other benefit (for us maintainers I guess), is that it
| compiles way faster and is more debuggable (perf and other
| tools work better) when each bytecode is a smaller function.
| So I'm inclined to keep it for perf and productivity reasons.
| coldtea wrote:
| There was a plan for a 5x speedup overall looking for
| funding back in 2022. Then a team with Guido and others
| involved (and MS backing?) got on the same bandwagon and
| made some announcements for speeding up CPython a lot.
|
| Several releases in, have we seen even a 2x speedup? Or
| more like 0.2x at best?
|
| Not trying to dismiss the interpreter changes - I just want
| to know if those speedup plans were even remotely
| realistic, and if anything close to even 1/5 of what
| was promised will really come out of them...
| Twirrim wrote:
| There have been a lot of speed-ups coming out of that
| project. It's not being done in one go, though, but spread
| out over the last several releases. So while individual
| speedups may not look that significant, remember that they
| compound on the previous release's speed-up.
| chippiewill wrote:
| The faster cpython project was for 5x over the 3.10
| baseline. CPython 3.13 is currently running at something
| like 1.6x speed compared to 3.10. With the JIT enabled it
| goes up by a few more single-digit percentage points. With
| the changes in 3.14 it'll be something like 1.8x speed-
| up.
|
| So it's slowly getting there, I think the faster cpython
| project was mostly around the idea that the JIT can get a
| lot faster as it starts to optimise more and more and
| that only just got shipped in 3.13, so there's a lot of
| headroom. We know that PyPy (an existing JIT
| implementation) is close to 5x faster than CPython a lot
| of the time already.
|
| There's also now the experimental free-threading build
| which speeds up multithreaded Python applications (Not by
| a lot right now though unfortunately).
| ot wrote:
| Being more robust to fragile compiler optimizations is also
| a nontrivial benefit. An interpreter loop is an extremely
| specialized piece of code whose control flow is too
| important to be left to compiler heuristics.
|
| If the desired call structure can be achieved in a portable
| way, that's a win IMO.
| jraph wrote:
| Reading that you are extremely embarrassed and sorry that you
| made such a huge oversight, I was imagining you had broken
| something / worsened CPython's performance.
|
| But it's nothing like this. You announced a 10-15% perf
| improvement but that improvement is more like 1-5% on a non-
| buggy compiler. It's not even that the 10-15% figure is
| _wrong_, it's just that it's correct only under very specific
| conditions, unbeknownst to you.
|
| IIUC, you did your homework: you made an improvement, you
| measured a 10-15% perf improvement, the PR was reviewed by
| other people, etc. It just so happens that this 10-15% figure
| is misleading because of an issue with the version of clang you
| happened to use to measure. Unless I'm missing something, it
| looks like a fair mistake anyone could have reasonably made. It
| even looks like it was hard not to fall into this trap. You
| _could_ have been more suspicious seeing such a high number,
| but hindsight is 20/20.
|
| Apparently, you still brought significant performance
| improvements, your work also helped uncover a compiler
| regression. The wrong number seems quite minor in comparison. I
| wonder who was actually hurt by this. I only discover the
| "case" right now but at a first glance it doesn't feel like you
| owe an apology to anyone. Kudos for all this!
| ehsankia wrote:
| In some way, by indirectly helping fix this bug, they led to
| a ~10% performance increase for everyone who was using that
| faulty compiler! That's even better than an optional flag
| that many people won't know about or use.
| chowells wrote:
| That performance regression only hit code that was using a
| very large number of paths with the same table of computed
| gotos at the end. That's likely to only be relatively
| complex interpreters that were affected. So it's not a
| broad performance improvement. But it is nice to have an
| example of the compiler's new heuristic failing, as
| evidence that it needs to be tunable.
| ncruces wrote:
| Well, that includes at least everyone using _Python_
| built with that compiler.
| rtollert wrote:
| > IIUC, you did your homework: you made an improvement, you
| measured a 10-15% perf improvement, the PR was reviewed by
| other people, etc. It just so happens that this 10-15% figure
| is misleading because of an issue with the version of clang
| you happened to use to measure. Unless I'm missing something,
| it looks like a fair mistake anyone could have reasonably
| made. It even looks like it was hard to not fall into this
| trap. You could have been more suspicious seeing such a high
| number, but hindsight is 20/20.
|
| Hah! Is this a Gettier problem [0]?
|
| 1. True: The PR improves Python performance 15-20%.
|
| 2. True: Ken believes that the PR improves Python performance
| 15-20%.
|
| 3. True: Ken is justified in believing that the PR improves
| Python performance 15-20%.
|
| Of course, PR discussions don't generally revolve around
| whether or not the PR author "knows" that the PR does what
| they claim it does. Still: these sorts of epistemological
| brain teasers seem to come up in the performance measurement
| field distressingly often. I wholeheartedly agree that Ken
| deserves all the kudos he has received; still, I also wonder
| if some of the strategies used to resolve the Gettier problem
| might be useful for code reviewers to center themselves every
| once in a while. Murphy's Law and all that.
|
| [0]: https://en.wikipedia.org/wiki/Gettier_problem
| acdha wrote:
| I feel bad for you since a change on the order of 5% would
| still have been seen as a very nice bit of work. I appreciate
| the class with which you're handling an honest mistake, and
| all of the hard work you've given the Python community.
| DannyBee wrote:
| FWIW - the fix was merged since you wrote that blog post ;)
|
| Beyond that - 3-5% is a lot for something as old as the Python
| interpreter, if it holds up. I would still be highly proud of
| that.
|
| After 30 years, I've learned (as I expect you have) to be
| suspicious of any significant (i.e. >1%) performance
| improvement in a system that has existed a long time.
|
| They happen for sure, but are less common. Often, people are
| shifting time around, so it just isn't part of your benchmark
| anymore[1]. Second, benchmarking is often done in controlled
| environments to try to isolate the effect, which seems like
| the right thing to do. But then the software is run in
| non-isolated environments (i.e. with a million other things on
| a VM or desktop computer), which isn't what you benchmarked it
| in. I've watched plenty of verifiably huge isolated gains
| disappear or go negative when put in production environments.
|
| You have the particularly hard job that you have to target lots
| of environments - you can't even do something like say "okay
| look if it doesn't actually speed it up in production, it
| didn't really speed it up", because you have no single
| production target. That's a really hard world to live in and
| try to improve.
|
| In the end, performance tuning and measurement is really hard.
| You have nothing to be sorry for, except learning that :)
|
| Don't let this cause you to be afraid to get it wrong - you
| will get it wrong anyway. We all do. Just do what you are doing
| here - say 'whoops, I think we screwed this up', try to figure
| out how to deal with it, and try to figure out how to avoid it
| in the future (if you can).
|
| [1] This is common not just in performance, but in almost
| anything, including human processes. For example, to make
| something up, the team working on the code review tool would
| say "we've reduced code review time by 15% and thus sped up
| everyone's workflow!". Awesome! But actually, it turns out they
| made more work for some other part of the system, so the
| workflow didn't get any faster, they just moved the 15% into a
| part of the world they weren't measuring :)
| theLiminator wrote:
| > After 30 years, i've learned (like i expect you have) to be
| suspicious of any significant (IE >1%) performance
| improvement in a system that has existed a long time.
|
| _Laughs in corporate code_
| miroljub wrote:
| That's a different case. Corporate code is never optimized
| for performance; performance doesn't factor in at all.
| DannyBee wrote:
| Sure, let me amend it to "I'm suspicious of any significant
| performance improvement in a system where performance
| actually matters, and has existed in a state where
| performance matters for a long time".
| colechristensen wrote:
| I'll believe you if you say 0.5% improvement, I'll believe
| you if you say 10,000% improvement, but 10%? That's fishy.
| rbetts wrote:
| Thank you for doing work that makes python better for millions
| of people.
| robertlagrant wrote:
| You're doing great work pushing Python forwards. Thanks for all
| your efforts.
| tiagod wrote:
| It's still a performance improvement, which at the scale of
| python deployment will probably mean quite a bit of global
| power savings :)
| 9cb14c1ec0 wrote:
| Don't feel so bad. A 1 to 5% speedup is still an extremely
| valuable contribution.
| haberman wrote:
| I think it's important to note that a primary motivation of the
| tail call interpreter design is to be less vulnerable to the
| whims of the optimizer. From my original blog article about
| this technique
| (https://blog.reverberate.org/2021/04/21/musttail-
| efficient-i...):
|
| > Theoretically, this control flow graph paired with a profile
| should give the compiler all of the information it needs to
| generate the most optimal code [for a traditional
| switch()-based interpreter]. In practice, when a function is
| this big and connected, we often find ourselves fighting the
| compiler. It spills an important variable when we want it to
| keep it in a register. It hoists stack frame manipulation that
| we want to shrink wrap around a fallback function invocation.
| It merges identical code paths that we wanted to keep separate
| for branch prediction reasons. The experience can end up
| feeling like trying to play the piano while wearing mittens.
|
| That second-to-last sentence is exactly what has happened here.
| The "buggy" compiler merged identical code paths, leading to
| worse performance.
|
| The "fixed" compiler no longer does this, but the fix is
| basically just tweaking a heuristic inside the compiler.
| There's no actual guarantee that this compiler (or another
| compiler) will continue to have the heuristic tweaked in the
| way that benefits us the most.
|
| The tail call interpreter, on the other hand, lets us express
| the desired machine code pattern in the interpreter itself.
| Between "musttail", "noinline", and "preserve_none" attributes,
| we can basically constrain the problem such that we are much
| less at the mercy of optimizer heuristics.
|
| For this reason, I think the benefit of the tail call
| interpreter is more than just a 3-5% performance improvement.
| It's a _reliable_ performance improvement that may be even
| greater than 3-5% on some compilers.
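|
| In sketch form, that constrained shape looks something like
| this (simplified toy code, not CPython's actual sources):
|
|     /* Clang 19+: preserve_none frees nearly all registers
|        to carry interpreter state from handler to handler. */
|     #define CC __attribute__((preserve_none))
|
|     struct vm;  /* value stack, frame pointer, etc. */
|     typedef CC void handler_t(struct vm *vm,
|                               const unsigned char *pc);
|     extern handler_t *table[256];
|
|     CC __attribute__((noinline))
|     void op_add(struct vm *vm, const unsigned char *pc)
|     {
|         /* ... the actual work for ADD goes here ... */
|         /* musttail guarantees a tail call - an indirect
|            jump, one per handler - no matter what the
|            optimizer's heuristics would otherwise prefer. */
|         __attribute__((musttail))
|         return table[pc[0]](vm, pc + 1);
|     }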
| sunshowers wrote:
| In full agreement with this. There is tremendous value in
| having code whose performance is robust to various compiler
| configurations.
| sundbry wrote:
| You don't have to apologize, you did great work either way.
| immibis wrote:
| You don't need a long heartfelt apology - a simple "god dammit"
| is enough for this. You made no mistake - you only got unlucky.
|
| It's possible to improve your luck by applying more care, but
| it's also possible to apply so much care that you do not try
| things that _would_ have been useful, so I'd rather you keep
| erring on the side that you did!
| SmellTheGlove wrote:
| I don't think you should be embarrassed or apologize. You still
| did a thing that improved performance - in the near term it
| worked around a bug that you weren't even aware of, and long
| term there are still gains even with that bug fixed. But even
| if that weren't the case, the only way anything gets done in
| software is to rely on abstractions. You can either get things
| done or know exactly how every last bit of your stack is
| implemented, but probably not both. It's a very reasonable
| trade off.
|
| Smart people go and build things. Other smart people find
| problems. Nothing's broken with that.
| ggu wrote:
| Hello, I was one of the people who contributed to this
| interpreter change. Thank you Nelson for the excellent write-up
| and for going to such lengths to get to the bottom of this.
| That said, I wanted to defend Ken a bit. :-)
|
| Ken Jin is a volunteer who's been tirelessly working to make
| CPython faster over the past couple of years, for too little
| credit.
| nothing to be embarrassed about. The fact that it took a month
| (more if you consider the folks helping to reproduce the
| original results) for anyone to notice anything wrong shows how
| complicated the situation was! Put yourself in Ken's shoes --
| having updated to LLVM-19 to enable the preserve_none flag,
| would you then hypothesize that LLVM-19 may have introduced an
| unrelated performance regression that no one noticed for five
| months? Lots of people underestimate how hard benchmarking is
| IMO.
|
| A 1-5% performance improvement is also pretty valuable, just
| not quite as spectacular as we thought originally :-)
| ok_dad wrote:
| Yea, don't feel too bad, you're just unlucky that your mistake
| was so public, most of us have these kinds of mistakes weekly
| in a more private setting. I think everyone who knows anything
| about how optimized Python is would be impressed that you
| managed even a few percent improvement in speed! That will also
| probably save many, many GWh of power, even those few percent!
|
| You have a new account here and your blog is just one article
| so far so you might be a new-ish developer(?), but you are
| doing great, keep it up! If you are not a new developer, you
| are still doing great, keep it up!
| bjourne wrote:
| This is exactly the kind of content I love to see on HN. But I
| do wonder how this optimization relates to tail-call
| optimization. How the interpreter jump table is implemented
| shouldn't affect how stack frames are created, should it?
| mbel wrote:
| It's explained under the first link from the article [0]:
|
| "A new type of interpreter has been added to CPython. It uses
| tail calls between small C functions that implement individual
| Python opcodes, rather than one large C case statement."
|
| [0]
| https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-...
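|
| Concretely, the difference in shape, as a toy sketch rather
| than CPython's actual code:
|
|     enum { OP_INC, OP_HALT };
|
|     /* Old shape: one big function, one switch. */
|     static int run_switch(const unsigned char *pc)
|     {
|         int acc = 0;
|         for (;;) {
|             switch (*pc++) {
|             case OP_INC:  acc += 1; break;
|             case OP_HALT: return acc;
|             }
|         }
|     }
|
|     /* New shape: one small function per opcode; each
|        handler ends by tail-calling the next one. */
|     typedef int handler_t(const unsigned char *pc, int acc);
|     static handler_t op_inc, op_halt;
|     static handler_t *const table[] = { op_inc, op_halt };
|
|     static int op_inc(const unsigned char *pc, int acc)
|     {
|         __attribute__((musttail))
|         return table[pc[0]](pc + 1, acc + 1);
|     }
|
|     static int op_halt(const unsigned char *pc, int acc)
|     {
|         (void)pc;
|         return acc;
|     }
|
| The tail calls are between C functions, one per opcode; with
| musttail they compile to jumps, so the C stack doesn't grow.
| Python-level frames are untouched.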
| ambivalence wrote:
| A big chunk of the performance gain comes from the
| `preserve_none` calling convention: handlers skip saving and
| restoring callee-saved registers, so interpreter state can
| stay in registers across the tail calls.
| thunkingdeep wrote:
| Well if you'd bother to read it, you'd discover this is about
| tail calls in C, not in Python. It has nothing to do with tail
| recursion in Python. Guido has explicitly said that Python will
| never have it.
| throw132128 wrote:
| That is what happens in a project if you purge all opposition and
| a few people work on corporate projects that are supposed to
| succeed.
|
| The free-threading work had massive regressions (50% slower);
| this one has almost no speedup.
|
| People are scared of corporate reprisals or reprisals from
| the Steering Council, so they shut up like in the Soviet Union.
| Retr0id wrote:
| It's getting harder to characterize the C language as "close to
| the metal". Obviously optimizing compilers are a good thing, but
| it's frustrating when it feels like you're fighting with them to
| get the code emitted just the way you want.
| Malp wrote:
| This is some solid work and a great blog post to go hand-in-hand
| with. Wow.
| AlexSW wrote:
| I'm struggling to confirm my intuition about why the compiler
| optimisation is faster than the naive switch expansion.
|
| Is it about the layout of where the assembly instructions end up,
| and spacing around them? Or the CPU pipelining working better?
| Or...?
| csjh wrote:
| Distributing the dispatch instructions lets the predictor pick
| up on specific patterns within the bytecode - for example, a
| comparison instruction is probably followed by a conditional
| branch instruction.
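|
| A sketch of the idea with GNU C computed gotos (toy opcodes,
| not CPython's actual table):
|
|     static int run(const unsigned char *pc)
|     {
|         static void *const dispatch[] = { &&do_inc, &&do_halt };
|         int acc = 0;
|         goto *dispatch[*pc++];
|
|     do_inc:
|         acc += 1;
|         /* Each handler ends with its own copy of the
|            dispatch, so the predictor can learn each
|            opcode's typical successors. */
|         goto *dispatch[*pc++];
|
|     do_halt:
|         return acc;
|     }
|
| The regression discussed in the article effectively merged
| those per-handler copies back into one shared indirect jump.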
| AlexSW wrote:
| Ah of course! It's improving the branch target predictor.
| Makes sense, thanks!
| anarazel wrote:
| Hah, I found something similar in gcc a few years back. Affecting
| postgres' expression evaluation interpreter:
|
| https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
| ram_rar wrote:
| Kudos to the author and the CPython team for owning up to
| this. Nevertheless, it's still a very important improvement in
| the positive direction, and that's all that matters. This is a
| great story about why benchmarking is so hard to get right. I
| always tell my team to benchmark your real-world use case as
| closely as possible and not to rely on external results.
| ashvardanian wrote:
| Thanks for the post! Would love to read similarly styled dives
| into GIL performance and experimental multi-threaded Python
| builds!
___________________________________________________________________
(page generated 2025-03-10 23:00 UTC)