[HN Gopher] Performance of the Python 3.14 tail-call interpreter
       ___________________________________________________________________
        
       Performance of the Python 3.14 tail-call interpreter
        
       Author : signa11
       Score  : 493 points
       Date   : 2025-03-10 06:44 UTC (16 hours ago)
        
 (HTM) web link (blog.nelhage.com)
 (TXT) w3m dump (blog.nelhage.com)
        
       | IshKebab wrote:
        | Very interesting! Clearly something else is going on though if
        | the 2% vs 9% thing is true.
        
       | jeeybee wrote:
       | Kudos to the author for diving in and uncovering the real story
       | here. The Python 3.14 tail-call interpreter is still a nice
       | improvement (any few-percent gain in a language runtime is hard-
       | won), just not a magic 15% free lunch. More importantly, this
       | incident gave us valuable lessons about benchmarking rigor and
       | the importance of testing across environments. It even helped
       | surface a compiler bug that can now be fixed for everyone's
       | benefit. It's the kind of deep-dive that makes you double-check
       | the next big performance claim. Perhaps the most thought-
       | provoking question is: how many other "X% faster" results out
       | there are actually due to benchmarking artifacts or unknown
       | regressions? And how can we better guard against these pitfalls
       | in the future?
        
         | ehsankia wrote:
         | I guess the bigger question for me is, how was a 10% drop in
         | Python performance not detected when that faulty compiler
         | feature was pushed? Do we not benchmark the compilers
         | themselves? Do the existing benchmarks on the compiler or
          | Python side not use that specific compiler?
        
       | asicsp wrote:
       | Related discussions:
       | 
       | https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-...
       | --> https://news.ycombinator.com/item?id=42999672 _(66 points |
       | 25 days ago | 22 comments)_
       | 
       | https://blog.reverberate.org/2025/02/10/tail-call-updates.ht...
       | --> https://news.ycombinator.com/item?id=43076088 _(124 points |
       | 18 days ago | 92 comments)_
        
       | tempay wrote:
       | Trying to assess the performance of a python build is extremely
       | difficult as there are a lot of build tricks you can do to
        | improve it. Recently the Astral folks ran into this, showing how
        | the conda-forge build is notably faster than most others:
       | 
       | https://github.com/astral-sh/python-build-standalone/pull/54...
       | 
       | I'd be interested to know how the tail-call interpreter performs
       | with other build optimisations that exist.
        
         | eru wrote:
         | Compare https://donsbot.com/2009/03/09/evolving-faster-haskell-
         | progr...
         | 
         | The author uses a genetic algorithm to try out lots of
         | different compiler and optimisation flag combinations.
        
       | vkazanov wrote:
       | So, the compiler is tinkering with the way the loop is organised
       | so the whole tail-call interpreter thing is not as effective as
       | announced... Not surprised.
       | 
       | 1. CPU arch (and arch version) matters a lot. The problem is 95%
       | about laying out the instruction dispatching code for the branch
       | predictor to work optimally. C was never meant to support this.
       | 
       | 2. The C abstract machine is also not low-level enough to express
        | the intent properly. Any implementation becomes supersensitive
        | to a particular compiler's (and compiler version's) quirks.
       | 
        | Certain paranoid interpreter implementations go back to writing
        | assembly directly. LuaJIT is famous for implementing a macro
        | system to make its super-efficient assembly loop implementation
        | portable across architectures. This is also why I find it fun to
        | tinker with these!
        | 
        | Anyway, a few years ago I put together an article and a test of
        | popular interpreter loop implementation approaches:
       | 
       | https://github.com/vkazanov/bytecode-interpreters-post
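        | 
        | (For illustration, a minimal sketch of the two classic dispatch
        | styles discussed here, using a hypothetical two-opcode bytecode;
        | the second variant relies on the GNU C labels-as-values
        | extension supported by gcc and clang:)
        | 
        |     enum { OP_INC, OP_HALT };
        | 
        |     int run_switch(const unsigned char *code) {
        |         int acc = 0;
        |         for (;;) {
        |             switch (*code++) {   /* one shared indirect branch */
        |             case OP_INC:  acc++; break;
        |             case OP_HALT: return acc;
        |             }
        |         }
        |     }
        | 
        |     int run_goto(const unsigned char *code) {
        |         static void *tbl[] = { &&inc, &&halt };
        |         int acc = 0;
        |         goto *tbl[*code++];   /* a separate indirect branch... */
        |     inc:
        |         acc++;
        |         goto *tbl[*code++];   /* ...after every opcode handler */
        |     halt:
        |         return acc;
        |     }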
        
         | nelhage wrote:
         | (author here)
         | 
         | > The problem is 95% about laying out the instruction
         | dispatching code for the branch predictor to work optimally.
         | 
         | A fun fact I learned while writing this post is that that's no
         | longer true! Modern branch predictors can pretty much
         | accurately predict through a single indirect jump, if the run
         | is long enough and the interpreted code itself has stable
         | behavior!
         | 
         | Here's a paper that studied this (for both real hardware and a
         | certain simulated branch predictor):
         | https://inria.hal.science/hal-01100647/document
         | 
         | My experiments on this project anecdotally agree; they didn't
         | make it into the post but I also explored a few of the
         | interpreters through hardware CPU counters and `perf stat`, and
         | branch misprediction never showed up as a dominant factor.
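          | 
          | (For anyone who wants to reproduce that kind of measurement,
          | the stock hardware counters are enough; something along the
          | lines of:)
          | 
          |     perf stat -e instructions,branches,branch-misses \
          |         ./python bench.py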
        
           | vkazanov wrote:
           | Yes, this was already becoming true around the time I was
           | writing the linked article. And I also read the paper. :-) I
            | also remember I had access to pre-Haswell era Intel CPUs vs
           | something a bit more recent, and could see that the more
           | complicated dispatcher no longer made as much sense.
           | 
           | Conclusion: the rise of popular interpreter-based languages
            | led to CPUs with smarter branch predictors.
           | 
           | What's interesting is that a token threaded interpreter
           | dominated my benchmark (https://github.com/vkazanov/bytecode-
           | interpreters-post/blob/...).
           | 
           | This trick is meant to simplify dispatching logic and also
           | spread branches in the code a bit.
        
           | celeritascelery wrote:
           | How do you reconcile that with the observation that moving to
            | a computed goto style provides better codegen in Zig [1]? They
           | make the claim that using their "labeled switch" (which is
           | essentially computed goto) allows you to have multiple
           | branches which improves branch predictor performance. They
           | even get a 13% speedup in their parser from switch to this
            | style. If modern CPUs are good at predicting through a
           | single branch, I wouldn't expect this feature to make any
           | difference.
           | 
           | [1] https://ziglang.org/download/0.14.0/release-
           | notes.html#Code-...
        
             | dwattttt wrote:
              | While it's unlikely to be as neat as this, the blog post
              | we're all commenting on is an "I thought we had a 10-15%
              | speedup, but it turned out to be an LLVM optimisation
              | misbehaving". And Zig (for now) uses LLVM for optimised
              | builds too.
        
       | kryptiskt wrote:
       | This is a very good example of how C is not "close to the
       | machine" or "portable assembly", modern optimizers will do
       | drastic changes to the logic as long as it has no observable
       | effect.
       | 
       | As stated in the post: "Thus, we end up in this odd world where
       | clang-19 compiles the computed-goto interpreter "correctly" - in
       | the sense that the resulting binary produces all the same value
       | we expect - but at the same time it produces an output completely
       | at odds with the intention of the optimization. Moreover, we also
       | see other versions of the compiler applying optimizations to the
       | "naive" switch()-based interpreter, which implement the exact
       | same optimization we "intended" to perform by rewriting the
       | source code."
        
         | fouronnes3 wrote:
         | Stretching "no observable effect" all the way to a 10000 word
         | blog post.
        
         | jmillikin wrote:
          | > This is a very good example of how C is not "close to the
          | > machine" or "portable assembly"
         | 
         | C is very much "portable assembly" from the perspective of
         | other systems programming languages of the 80s-90s era. The C
         | expression `a += 1` can be trusted to increment a numeric
         | value, but the same expression in C++ might allocate memory or
         | unwind the call stack or do who knows what. Similarly, `a =
         | "a"` is a simple pointer assignment in C, but in C++ it might
         | allocate memory or [... etc].
         | 
         | The phrase "C is portable assembly" isn't a claim that each
         | statement gets compiled directly to equivalent machine code.
        
           | eru wrote:
           | > The C expression `a += 1` can be trusted to increment a
           | numeric value, [...]
           | 
           | Have you heard of undefined behaviour?
        
             | jmillikin wrote:
              | Show me a C compiler that miscompiles the following code
              | and I'll concede the point:
              | 
              |     uint32_t add_1(uint32_t a) {
              |         a += 1;
              |         return a;
              |     }
        
               | eru wrote:
               | I know that for 'int a' the statement 'a += 1' can give
               | rather surprising results.
               | 
               | And you made a universal statement that 'a += 1' can be
               | trusted. Not just that it can sometimes be trusted. In
               | C++ the code you gave above can also be trusted as far as
               | I can tell. At least as much as the C version.
        
               | jmillikin wrote:
               | I'll expand my point to be clearer.
               | 
               | In C there is no operator overloading, so an expression
               | like `a += 1` is easy to understand as incrementing a
               | numeric value by 1, where that value's type is one of a
               | small set of built-in types.
               | 
               | You'd need to look further up in the function (and maybe
               | chase down some typedefs) to see what that type is, but
               | the set of possible types generally boils down to "signed
               | int, unsigned int, float, pointer". Each of those types
               | has well-defined rules for what `+= 1` means.
               | 
               | That means if you see `int a = some_fn(); assert(a <
               | 100); a += 1` in the C code, you can expect something
               | like `ADD EAX,1` somewhere in the compiler output for
               | that function. Or going the other direction, when you're
               | in a GDB prompt and you disassemble the current EIP and
               | you see `ADD EAX,1` then you can pretty much just look at
               | the C code and figure out where you are.
               | 
               | ---
               | 
               | Neither of those is true in C++. The combination of
               | completely ad-hoc operator overloading, function
               | overloading, and implicit type conversion via
               | constructors means that it can be really difficult to map
               | between the original source and the machine code.
               | 
               | You'll have a core dump where EIP is somewhere in the
               | middle of a function like this:
                | 
                |     std::string some_fn() {
                |       some_ns::unsigned<int> a = 1;
                |       helper_fn(a, "hello");
                |       a += 1;
                |       return true;
                |     }
               | 
               | and the disassembly is just dozens of function calls for
               | no reason you can discern, and you're staring at the
               | return type of `std::string` and the returned value of
               | `true`, and in that moment you'll _long_ for the happy
               | days when undefined behavior on signed integer overflow
               | was the worst you had to worry about.
        
               | eru wrote:
               | I heartily agree that C++ is a lot more annoying here
               | than C, yes.
               | 
               | I'm just saying that C is already plenty annoying enough
               | by itself, thanks eg to undefined behaviour.
               | 
               | > That means if you see `int a = some_fn(); assert(a <
               | 100); a += 1` in the C code, you can expect something
               | like `ADD EAX,1` somewhere in the compiler output for
               | that function. Or going the other direction, when you're
               | in a GDB prompt and you disassemble the current EIP and
               | you see `ADD EAX,1` then you can pretty much just look at
               | the C code and figure out where you are.
               | 
               | No, there's no guarantee of that. C compilers are allowed
               | to do all kinds of interesting things. However you are
               | often right enough in practice, especially if you run
               | with -O0, ie turn off the optimiser.
               | 
               | See eg https://godbolt.org/z/YY69Ezxnv and tell me where
               | the ADD instruction shows up in the compiler output.
        
               | MaulingMonkey wrote:
                |     error: could not convert 'true' from 'bool' to
                |     'std::string' {aka 'std::__cxx11::basic_string<char>'}
               | 
               | I don't think anyone's claiming C nor C++'s dumpster
               | fires have signed integer overflow at the top of the pile
               | of problems, but when the optimizer starts deleting
               | security or bounds checks and other fine things - because
               | of signed integer overflow, or one of the million other
               | causes of undefined behavior - I will _pray_ for
               | something as straightforward as a core dump, no matter
               | _where_ EIP has gone.
               | 
               | Signed integer overflow UB is the kind of UB that has a
               | nasty habit of causing _subtle_ heisenbugfuckery when
               | triggered. The kind you _might, hopefully,_ make shallow
               | with ubsan and good test suite coverage. In other words,
                | the kind you _won't_ make shallow.
        
               | jmillikin wrote:
               | For context, I did not pick that type signature at
               | random. It was in actual code that was shipping to
               | customers. If I remember correctly there was some sort of
               | bool -> int -> char -> std::string path via `operator()`
               | conversions and constructors that allowed it to compile,
               | though I can't remember what the value was (probably
               | "\x01").
               | 
               | ---
               | 
               | My experience with the C/C++ optimizer is that it's
               | fairly timid, and only misbehaves when the input code is
               | _really_ bad. Pretty much all of the (many, many) bugs I
               | 've encountered and/or written in C would have also
               | existed if I'd written directly in assembly.
               | 
               | I know there are libraries out there with build
               | instructions like "compile with -O0 or the results will
               | be wrong", but aside from the Linux kernel I've never
               | encountered developers who put the blame on the
               | _compiler_.
        
               | MaulingMonkey wrote:
               | > but aside from the Linux kernel I've never encountered
               | developers who put the blame on the compiler.
               | 
               | I encounter them _frequently_.
               | 
               | 99.99% of the time it's undefined behavior and they're
               | "wrong".
               | 
               | Frequently novices who have been failed by their teachers
               | and documentation (see previous rant using atoi as an
               | example of the poor quality of documentation about UB:
               | https://news.ycombinator.com/item?id=14861917 .)
               | 
               | Less frequently, it's experienced devs half joking out of
               | a need for catharsis.
               | 
               | Rarely, experienced devs finally getting to the end of
               | their rope, and are finally beginning to seriously
               | consider if they've got a codegen bug. They don't, but
               | they're considering it. They know they were wrong the
               | last 10 times they considered it, but they're considering
               | it again damnit!
               | 
                | The Linux kernel devs aren't _quite_ unique in "just
               | because you can, doesn't mean you should"ing their way
               | into blaming the compiler for what could be argued to be
               | defects in the standard or fundamental design of the
               | language (the defect being making UB so common), but
               | that's probably among the rarest slice of the pie of
               | people blaming the compiler for UB. Few have the will to
               | tilt at that windmill and voice their opinions when the
               | compiler devs can easily just blame the standard - better
               | to keep such unproductive rants close to heart instead,
               | or switch to another language. Something _actually
               | productive_.
               | 
               | 0.01% of the time, it's a legitimate codegen bug on well-
               | defined behavior code. Last one I tracked down to a bug
               | tracker, was MSVC miscompiling 4x4 matrix multiplications
               | by failing to spill a 17th value to stack when it only
               | had 16 SSE register to work with. Caught by unit tests,
               | but not by CI, since people updated compiler versions at
               | their own random pace, and who runs `math_tests` on their
               | personal machines when they're not touching `math`?
        
               | gpderetta wrote:
               | Counterexample: https://godbolt.org/z/z3jqjPT6o
        
               | jmillikin wrote:
               | I'm not sure that's a counter-example -- what assembly do
               | you think should be emitted for floating-point math on an
               | AVR microcontroller?
        
               | gpderetta wrote:
                | It means that "`a += 1` is easy to understand as
                | incrementing a numeric value by 1" is not true and
               | instead "it can be really difficult to map between the
               | original source and the machine code".
               | 
               | More examples of non-trivial mapping from C code to
               | generated code: https://godbolt.org/z/jab6vh6dM
        
               | jmillikin wrote:
               | All of those look pretty straightforward to me -- again,
               | what assembly would you expect to be emitted in those
               | cases?
               | 
               | For contrast, here's the assembly generated for Haskell
               | for integer addition: https://godbolt.org/z/vdeMKMETT
               | 
               | And here's assembly for C++:
               | https://godbolt.org/z/dedcof9x5
        
               | gpderetta wrote:
               | > All of those look pretty straightforward to me --
               | again, what assembly would you expect to be emitted in
               | those cases?
               | 
               | It is very straightforward indeed, but it is still not
               | mapping primitive operations to direct machine code, but
               | it is forwarding to out-of-line code. Same as operator
               | overloading in other languages.
               | 
               | > And here's assembly for C++:
               | https://godbolt.org/z/dedcof9x5
               | 
               | That's just a symptom of allowing the compiler to inline
               | the add code, otherwise the generated code is as
                | straightforward:
                | 
                |     addOne(Int):
                |         push   rax
                |         mov    esi,0x1
                |         call   4010c0 <add_safe(int, int)>
               | 
               | Ref: https://godbolt.org/z/xo1es9TcW
        
               | jmillikin wrote:
                | > It is very straightforward indeed, but it is still not
                | > mapping primitive operations to direct machine code,
                | > but it is forwarding to out-of-line code. Same as
                | > operator overloading in other languages.
                | 
                | I am not claiming that C is a collection of assembler
                | macros. There is no expectation that a C compiler emit
                | machine code that has exact 1:1 correspondence with the
                | input source code.
                | 
                | > Same as operator overloading in other languages.
                | 
                | The lack of operator overloading, and other hidden
                | complex control flow, is the reason that someone can read
                | C code and have a pretty good idea of what it compiles
                | to.
                | 
                | > That's just a symptom of allowing the compiler to
                | > inline the add code, otherwise the generated code is as
                | > straightforward:
               | 
               | No, that's just moving the instructions around. You've
               | still got dynamic allocation and stack-unwinding being
               | generated for a line that doesn't have any sign of
               | entering a complex control flow graph.
        
               | pjmlp wrote:
               | > ... and other hidden complex control flow,....
               | 
               | Until someone calls longjmp() or a signal() is triggered.
                | Extra bonus of fun if it happens to be a multithreaded
                | application, or in the middle of a non-reentrant call.
        
               | acdha wrote:
               | > That means if you see `int a = some_fn(); assert(a <
               | 100); a += 1` in the C code, you can expect something
               | like `ADD EAX,1` somewhere in the compiler output for
               | that function.
               | 
               | I completely agree that C++ is orders of magnitude worse
               | but I've seen at least a couple counter-examples with
               | code almost that simple. A researcher I used to support
               | compared each release against a set of reference results,
               | and got a surprise when they didn't match but his program
               | was working. This turned out to be a new compiler release
               | being smart enough to inline and reorder his code to use
               | a fused multiply-add instruction, which had greater
               | internal precision and so the result was very slightly
                | different from his saved reference set. GCC has
               | -fexcess-precision=standard for this but you have to
               | understand the problem first.
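                | 
                | (A sketch of the kind of expression involved, assuming a
                | target and flags where contraction is enabled: the
                | multiply and add below can fuse into a single FMA with
                | one rounding step instead of two, so results differ in
                | the last bits from separate multiply-then-add:)
                | 
                |     double dot_step(double a, double b, double acc) {
                |         /* may compile to fma(a, b, acc) */
                |         return acc + a * b;
                |     }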
        
               | TinkersW wrote:
               | a+=1 will not produce any surprising results, signed
               | integer overflow is well defined on all platforms that
               | matter.
               | 
               | And we all know about the looping behavior, it isn't
               | surprising.
               | 
               | The only surprising part would be if the compiler decides
               | to use inc vs add, not that it really matters to the
               | result.
        
               | eru wrote:
               | > a+=1 will not produce any surprising results, signed
               | integer overflow is well defined on all platforms that
               | matter.
               | 
               | I'm not sure what you are talking about?
               | 
               | There's a difference between how your processor behaves
               | when given some specific instructions, and what
               | shenanigans your C compiler gets up to.
               | 
               | See eg https://godbolt.org/z/YY69Ezxnv and tell me where
               | the ADD instruction shows up in the compiler output. Feel
               | free to pick a different compiler target than Risc-V.
        
               | jmillikin wrote:
               | I don't think "dead-code elimination removes dead code"
               | adds much to the discussion.
               | 
               | If you change the code so that the value of `a` is used,
               | then the output is as expected:
               | https://godbolt.org/z/78eYx37WG
        
               | xmcqdpt2 wrote:
               | The parent example can be made clearer like this:
               | https://godbolt.org/z/MKWbz9W16
               | 
               | Dead code elimination only works here because integer
               | overflow is UB.
        
               | jmillikin wrote:
               | Take a closer look at 'eru's example and my follow-up.
               | 
               | He wrote an example where the result of `a+1` isn't
               | necessary, so the compiler doesn't emit an ADDI even
               | though the literal text of the C source contains the
               | substring "a += 1".
               | 
                | Your version has the same issue:
                | 
                |     unsigned int square2(unsigned int num) {
                |         unsigned int a = num;
                |         a += 1;
                |         if (num < a) return num * num;
                |         return num;
                |     }
               | 
               | The return value doesn't depend on `a+1`, so the compiler
               | can optimize it to just a comparison.
               | 
                | If you change it to this:
                | 
                |     unsigned int square2(unsigned int num) {
                |         unsigned int a = num;
                |         a += 1;
                |         if (num < a) return num * a;
                |         return num;
                |     }
               | 
               | then the result of `a+1` is required to compute the
               | result in the first branch, and therefore the ADDI
               | instruction is emitted.
               | 
               | The (implied) disagreement is whether a language can be
               | considered to be "portable assembly" if its compiler
               | elides unnecessary operations from the output. I think
               | that sort of optimization is allowed, but 'eru
               | (presumably) thinks that it's diverging too far from the
               | C source code.
        
               | gpderetta wrote:
                | In what world does the return value not depend on 'a' in
                | this code?
                | 
                |     if (num < a) return num * num;
                |     /*else*/     return num;
               | 
               | A control dependency is still a dependency
        
               | jmillikin wrote:
               | `a = num; a += 1; if (num < a)` is the same as `if (num <
               | (num + 1))`, which for unsigned integer addition can be
               | rewritten as `if (num != UINT_MAX)`. So there's no need
               | to actually compute `a+1`, the comparison is against a
               | constant.
               | 
               | If the code returns `num * a` then the value of `a` is
               | now necessary, and must be computed before the function
               | returns.
               | 
               | For signed integer addition the compiler is allowed to
               | assume that `(num < (num + 1))` is true, so the
               | comparison can be removed entirely.
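                | 
                | (Concretely, a sketch of what the optimizer effectively
                | produces for the unsigned version:)
                | 
                |     #include <limits.h>
                | 
                |     unsigned int square2_opt(unsigned int num) {
                |         /* num < num + 1 unless the add wraps at UINT_MAX */
                |         if (num != UINT_MAX) return num * num;
                |         return num;
                |     }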
        
               | tux3 wrote:
               | You show one example where C doesn't have problems, but
               | that's a much weaker claim than it sounds. "Here's one
               | situation where this here gun won't blow your foot off!"
               | 
               | For what it's worth, C++ also passes your test here. You
               | picked an example so simple that it's not very
               | interesting.
        
               | eru wrote:
                | Actually, even here C (and C++) has some problems, too:
               | 
               | I don't think the standard says much about how to handle
               | stack overflows?
        
               | jmillikin wrote:
               | 'eru implied `a += 1` has undefined behavior; I provided
               | a trivial counter-example. If you'd like longer examples
               | of C code that performs unsigned integer addition then
               | the internet has many on offer.
               | 
               | I'm not claiming that C (or C++) is without problems. I
               | wrote code in them for ~20 years and that was more than
               | enough; there's a reason I use Rust for all my new low-
               | level projects. In this case, writing C without undefined
               | behavior requires lots of third-party static analysis
               | tooling that is unnecessary for Rust (due to being built
               | in to the compiler).
               | 
               | But if you're going to be writing C as "portable
               | assembly", then the competition isn't Rust (or Zig, or
                | Fortran), it's _actual_ assembly. And it's silly to
               | object to C having undefined behavior for signed integer
               | addition, when the alternative is to write your VM loop
               | (or whatever) five or six times in platform-specific
               | assembly.
        
               | eru wrote:
               | Forth might be a better competition for 'portable
               | assembly', though.
        
               | WJW wrote:
               | C might be low level from the perspective of other
               | systems languages, but that is like calling Apollo 11
               | simple from the perspective of modern spacecraft. C as
               | written is not all that close to what actually gets
               | executed.
               | 
                | For a small example, there are many compilers that would
                | absolutely skip incrementing 'a' in the following code:
                | 
                |     uint32_t add_and_subtract_1(uint32_t a) {
                |         a += 1;
                |         a -= 1;
                |         return a;
                |     }
               | 
               | Even though that code contains `a += 1;` clear as day,
               | the chances of any incrementing being done are quite
               | small IMO. It gets even worse in bigger functions where
               | out-of-order execution starts being a thing.
        
               | eru wrote:
               | > It gets even worse in bigger functions where out-of-
               | order execution starts being a thing.
               | 
                | In addition, consider that your processor isn't actually
                | executing x86 (nor ARM etc.) instructions, but
                | interpreting/compiling them into something more
                | fundamental.
               | 
               | So there's an additional layer of out-of-order
               | instructions and general shenanigans happening.
               | Especially with branch prediction in the mix.
        
               | johnisgood wrote:
                | Why would you want it to increment by 1 if we decrement
                | by 1 from the same variable? That would be a waste of
                | cycles, and a good compiler knows how to optimize it out,
                | or what am I misunderstanding here? What do you expect
                | "it" to do, and what does it really do?
               | 
               | See: https://news.ycombinator.com/item?id=43320495
        
               | lifthrasiir wrote:
               | It is unlikely as is, but it frequently arises from macro
               | expansions and inlining.
        
               | acdha wrote:
               | That's a contrived example but in a serious program there
               | would often be code in between or some level of
               | indirection (e.g. one of those values is a lookup, a
                | macro expansion, or the result of another function).
               | 
               | Nothing about that is cheating, it just says that even C
               | programmers cannot expect to look at the compiled code
               | and see a direct mapping from their source code. Your
               | ability to reason about what's actually executing
               | requires you to internalize how the compiler works in
               | addition to your understanding of the underlying hardware
               | and your application.
        
               | johnisgood wrote:
                | In what language can you do that, other than assembly,
                | though? The higher-level the language is, the "worse" or
                | more difficult it gets; perhaps I am not following the
                | thread right.
        
               | acdha wrote:
               | Yes, this is normal for languages. The only pushback here
               | is against the term "portable assembler" being applied to
               | C, where it's incomplete often enough that many people
               | feel it's no longer a helpful label.
               | 
               | I think it's also reflecting the maturity and growth of
               | the industry. A turn of the century programmer could
               | relatively easily find areas where dropping down to
               | assembly was useful, but over the subsequent decades
               | that's become not only uncommon but often actively
               | harmful: your code hand-optimized for a particular
               | processor is likely slower on newer processors than what
               | a modern compiler emits and is definitely a barrier to
               | portability in an era where not only are ARM and
               | potentially RISC-V of interest but also where code is
               | being run on SIMD units or GPUs. This makes the low-level
               | "portable assembler" idea less useful because there's
               | less code written in that middle ground when you want
               | either a higher-level representation which gives
               | compilers more flexibility or precise control. For
               | example, cryptography implementers want not just high
               | performance but also rigid control of the emitted code to
               | avoid a compiler optimizing their careful constant-time
               | implementation into a vulnerability.
        
               | vikramkr wrote:
                | I'm pretty sure that's replying directly to the comment
                | about how C is close to assembly and that if you add that
                | line of code somewhere you know there's a variable
                | getting incremented. It doesn't really matter whether or
                | not it's useful; the point is that the behavior isn't
                | exactly what you wrote.
        
               | jmillikin wrote:
               | To reiterate, claiming that C can be described as
               | "portable assembly" is not a claim that it is literally a
               | package of assembler macros that emit deterministic
               | machine code for each individual source expression.
               | 
               | I linked these in another comment, but here's some
               | examples of straightforward-looking integer addition
               | emitting more complex compiler output for other languages
               | that compile to native code:
               | 
               | Haskell: https://godbolt.org/z/vdeMKMETT
               | 
               | C++: https://godbolt.org/z/dedcof9x5
        
               | johnisgood wrote:
               | https://godbolt.org/z/r39jK1ddv
               | 
               | It increments, then decrements with -O0 though.
               | 
                | I still do not see the issue, as the behavior is as
                | expected with -O0: it increments, then decrements.
        
               | chasd00 wrote:
               | I'm not an embedded expert but a friend of mine has
               | complained about compiler optimizations breaking things
               | in his programs. I could see incrementing by one being
               | used to set some bits in a memory location for a cycle
               | that may mean something to some peripheral and then
               | decrementing by one to set some other bits that may mean
               | something else. In that case, the compiler removing those
               | two lines would cause a very hard to debug issue.
        
               | johnisgood wrote:
               | I understand compiler optimizations having unintended
               | consequences, e.g. https://godbolt.org/z/r39jK1ddv but
               | there are a lot of options he may use to enable or
               | disable optimizations (assuming GCC here):
               | https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
               | 
               | This is even more difficult to do with higher-level
               | languages.
        
               | mkoubaa wrote:
               | If you want the compiler to treat your code as literal
               | portable assembly turn off optimizations.
        
               | johnisgood wrote:
               | Exactly, pretty much what I said, or enable / disable the
               | optimizations you want.
        
               | gpderetta wrote:
                | If by miscompile, you mean that it fails the test that a
               | "C expression `a += 1` can be trusted to increment a
               | numeric value", then it is trivial:
               | https://godbolt.org/z/G5dP9dM5q
        
               | Retr0id wrote:
               | Here's another one, just for fun
               | https://godbolt.org/z/TM1Ke4d5E
        
               | jmillikin wrote:
               | I was being somewhat terse.
               | 
               | The (implied) claim is that the C standard has enough
               | sources of undefined behavior that even a simple integer
               | addition can't be relied upon to actually perform integer
               | addition.
               | 
               | But the sources of undefined behavior for integer
               | addition in C are well-known and very clear, and any
               | instruction set that isn't an insane science project is
               | going to have an instruction to add integers.
               | 
               | Thus my comment. Show me a C compiler that takes that
               | code and miscompiles it. I don't care if it returns a
               | constant, spits out an infinite loop, jumps to 0x0000,
               | calls malloc, whatever. Show me a C compiler that takes
               | those four lines of C code and emits something other than
               | an integer addition instruction.
        
               | Retr0id wrote:
               | Why are you talking about miscompilation? While the LLVM
               | regression in the featured article makes the code slower,
               | it is not a miscompilation. It is "correct" according to
               | the contract of the C language.
        
               | IshKebab wrote:
                | That will be inlined by any C compiler and then pretty
               | much anything can happen to the 'a += 1'.
        
           | kryptiskt wrote:
           | When the code has hit the IR in clang or gcc, there is no 'a'
           | (we know that with certainty, since SSA form doesn't mutate
           | but assigns to fresh variables). We don't know if there will
           | be an increment of 1, the additions could be coalesced (or
           | elided if the result can be inferred another way). The number
           | can even decrease, say if things have been handled in chunks
           | of 16, and needs to be adjusted down in the last chunk. Or
           | the code may be auto-vectorized and completely rewritten, so
           | that none of the variables at the C level are reflected on
           | the assembler level.
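            | 
            | (A sketch of the coalescing case: at -O1 and above, gcc and
            | clang will typically fold the two increments below into a
            | single add of 2, so no single instruction corresponds to
            | either individual `a += 1`:)
            | 
            |     int bump_twice(int a) {
            |         a += 1;
            |         a += 1;
            |         return a;   /* emitted as one "add 2" */
            |     }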
        
             | jmillikin wrote:
             | From a high-level academic view, yes, the compiler is
             | allowed to perform any legal transformation. But in
             | practice C compilers are pretty conservative about what
              | they emit, especially when code is compiled without
              | -march=.
             | 
             | You don't have to take my word for it. Go find a moderately
             | complex open-source library written in C, compile it, then
             | open up the result in Hexrays/Ghidra/radare2/whatever.
             | Compare the compiled functions with their original source
             | and you'll see there's not that much magic going on.
        
               | hun3 wrote:
               | -O3 does autovectorization: turning your loops into a
               | bunch of SIMD instructions, sometimes even drastically
                | changing the performance profile.
               | 
               | If autovectorization is "not that much magic" then idk
               | what else it is.
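                | 
                | (A sketch of the kind of loop this applies to; with
                | restrict-qualified pointers, a recent clang or gcc will
                | typically emit SIMD loads and adds here rather than a
                | scalar loop:)
                | 
                |     void add_arrays(float *restrict dst,
                |                     const float *restrict src, int n) {
                |         for (int i = 0; i < n; i++)
                |             dst[i] += src[i];
                |     }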
        
               | lifthrasiir wrote:
               | Nowadays it's -O2. I was also surprised when I first
               | learned this.
        
               | gpderetta wrote:
               | Any optimization you are familiar with is trivial and
               | expected. Everything else is broken compilers optimizing
               | UB to win benchmarks.
        
               | cogman10 wrote:
               | They are as aggressive as they can be.
               | 
               | Here's an example of a C compiler completely eliminating
                | a loop because it has figured out how to transform the
               | loop into a constant calculation.
               | 
               | https://godbolt.org/z/cfndqMj4j
               | 
               | The place where C compilers are conservative is when
               | dealing with arrays and pointers. That's because it's
               | impossible for C to know if a pointer is to an element of
               | an array or something completely different. Pointer math
               | further complicates what a pointer could actually
               | reference.
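                | 
                | (A sketch of the same effect, assuming a recent clang at
                | -O2: the loop below compiles to a closed-form expression,
                | with no loop left in the output:)
                | 
                |     int sum_to(int n) {
                |         int s = 0;
                |         for (int i = 0; i < n; i++)
                |             s += i;
                |         return s;   /* emitted as roughly n*(n-1)/2 */
                |     }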
        
           | pjc50 wrote:
           | > The phrase "C is portable assembly" isn't a claim that each
           | statement gets compiled directly to equivalent machine code.
           | 
           | Weasel words. Like a "self driving car" that requires a human
           | driver with constant attention willing to take over within a
           | few hundred milliseconds.
           | 
           | People advocate for C and use it in a way that implies they
           | think it can achieve specific machine outcomes, and it
              | _usually_ does... except when it doesn't. If people want a
           | portable assembler they should build one.
        
             | jmillikin wrote:
             | As a general rule if you're reading a technical discussion
             | and every single participant is using a particular phrase
             | in a way that doesn't make sense to you then you should
             | probably do a quick double-check to make sure you're on the
             | same page.
             | 
             | For example, in this discussion about whether C is
             | "portable assembly", you might be tempted to think back to
             | the days of structured programming in assembly using
             | macros. I no longer remember the exact syntax, but programs
             | could be written to look like this:
              | 
              |     .include "some-macro-system.s"
              |     .include "posix-sym.s"
              | 
              |     .func _start(argc, argv) {
              |         .asciz message "Hello, world!"
              |         .call3 _write STDOUT message (.len message)
              |         .call1 _exit 0
              |     }
             | 
             | Assembly? Definitely! Portable? Eh, sort of! If you're
             | willing to restrict yourself to DOS + POSIX and write an
             | I/O abstraction layer then it'll probably run on
             | i386/SPARC/Alpha/PA-RISC.
             | 
             | But that's not really what people are discussing, is it?
             | 
             | When someone says "C is portable assembly" they don't mean
             | you can take C code and run it through a platform-specific
             | macro expander. They don't mean it's _literally_ a portable
             | dialect of assembly. They expect the C compiler to perform
             | some transformations -- maybe propagate some constants,
             | maybe inline a small function here and there. Maybe you 'd
             | like to have named mutable local variables, which requires
             | a register allocator. Reasonable people can disagree about
             | exactly what transformations are legal, but at that point
             | it's a matter of negotiation.
             | 
             | Anyway, now you've got a language that is more portable
             | than assembler macros but still compiles more-or-less
             | directly to machine code -- not completely divorced from
             | the underlying hardware like Lisp (RIP Symbolics). How
             | would you describe it in a few words? "Like assembly but
             | portable" doesn't seem unreasonable.
        
               | pjc50 wrote:
               | > still compiles more-or-less directly to machine code
               | 
               | There's a lot hiding in "more or less". The same kind of
                | example holds for e.g. C#:
                | https://godbolt.org/noscript/csharp ; if you hit
                | "Compile" it'll give you the native binary. If you write
                | "x+1" it'll generate an add... or be optimized away. Now
               | does that mean it's portable assembler? Absolutely not.
               | 
               | Conversely there's a bunch of things that people expect
               | to do in C, do in real code, but are _not in the standard
               | or are undefined or implementation-defined_. As well as
               | things that are present in assemblers for various
                | platforms (things like the overflow flag) which aren't
               | accessible from the C language.
               | 
               | What people _actually_ seem to mean by  "portable
               | assembler" is "no guardrails". Memory unsafety as a
               | feature.
               | 
               | > Reasonable people can disagree about exactly what
               | transformations are legal, but at that point it's a
               | matter of negotiation
               | 
               | And a matter of CVEs when you lose your negotiation with
               | the compiler. Or less dramatic things like the
               | performance fluctuations under discussion.
        
           | pjmlp wrote:
            | Systems languages that predated C could already do that;
            | that is the typical myth.
        
           | titzer wrote:
            | Saying that something "is like XY" when you really mean "is
            | like XY, at least in comparison to C++" isn't how most
            | people use the phrase.
           | 
           | C is not a portable assembler.
           | 
           | In C, "a += 1" could overflow, and signed overflow is
           | undefined behavior--even though _every individual ISA_ has
           | completely defined semantics for overflow, and nearly all of
            | them these days do two's complement wraparound arithmetic.
           | With C's notion of undefined behavior, it doesn't even give
           | you the same wraparound in different places in the same
           | program. In fact, wraparound is so undefined that the program
           | could do absolutely anything, and the compiler is not
           | required to even tell you about it. Even without all the C++
           | abstraction madness, a C compiler can give you absolutely
           | wild results due to optimizations, e.g. by evaluating "a +=
           | 1" at compile time and using a different overflow behavior
           | than the target machine. Compile-time evaluation not matching
           | runtime evaluation is one of a _huge_ number of dumb things
           | that C gives you.
           | 
           | Another is that "a += 1" may not even increment the variable.
           | If this occurs as an expression, and not as a statement, e.g.
           | "f(a += 1, a += 1)", you might only get one increment due to
           | sequence points[1]--not to mention that the order of
           | evaluation might be different depending on the target.
           | 
           |  _C is not a portable assembler._
           | 
           | C is a low-level language where vague machine-like programs
           | get compiled to machine code that may or may not work,
           | depending on whether it violates UB rules or not, and there
           | are precious few diagnostics to tell if that happened, either
           | statically or dynamically.
           | 
           | [1] https://en.wikipedia.org/wiki/Sequence_point
        
       | MattPalmer1086 wrote:
       | Benchmarking is just insanely hard to do well. There are so many
       | things which can mislead you.
       | 
       | I recently discovered a way to make an algorithm about 15%
       | faster. At least, that's what all the benchmarks said. At some
       | point I duplicated the faster function in my test harness, but
       | did not call the faster version, just the original slower one...
       | And it was still 15% faster. So code that never executed sped up
       | the original code...!!! Obviously, this was due to code and
       | memory layout issues, moving something so it aligned with some
       | CPU cache better.
       | 
       | It's actually really really hard to know if speedups you get are
       | because your code is actually "better" or just because you lucked
       | out with some better alignment somewhere.
       | 
       | Casey Muratori has a really interesting series about things like
        | this on his Substack.
        
         | eru wrote:
          | I vaguely remember some benchmarking project that
         | deliberately randomised these compiler decisions, so that they
         | could give you more stable estimates of how well your code
         | actually performed, and not just how well you won or lost the
         | linker lottery.
        
           | Mond_ wrote:
           | You're probably thinking of "Performance Matters" by Emery
            | Berger, a Strange Loop talk.
           | https://youtube.com/watch?v=r-TLSBdHe1A
        
           | MattPalmer1086 wrote:
           | There was Stabilizer [1] which did this, although it is no
           | longer maintained and doesn't work with modern versions of
           | LLVM. I think there is something more current now that
           | automates this, but can't remember what it's called.
           | 
           | [1] https://emeryberger.com/research/stabilizer/
        
           | alpaca128 wrote:
            | As already mentioned, this is likely Emery Berger's project.
            | The idea is to intentionally slow down different parts of
            | the program to find out which parts are most sensitive to
            | slowdowns (aka have the biggest effect on overall
            | performance), on the assumption that these are also the
            | parts that profit the most from optimisations.
        
           | FridgeSeal wrote:
           | The Coz profiler from Emery Berger.
           | 
            | It can actually go a step further and give you a decent
            | estimate of which functions you need to change to get the
           | desired latency/throughput increases.
        
             | MattPalmer1086 wrote:
             | Thanks, I was trying to remember that one!
        
           | McP wrote:
           | LLD has a new option "--randomize-section-padding" for this
           | purpose: https://github.com/llvm/llvm-project/pull/117653
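            | 
            | (Usage presumably looks something like this, with the seed
            | varying the padding inserted between sections:)
            | 
            |     clang -fuse-ld=lld \
            |         -Wl,--randomize-section-padding=1234 bench.c -o bench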
        
             | MattPalmer1086 wrote:
             | Interesting, thanks!
        
           | igouy wrote:
           | "Producing wrong data without doing anything obviously
           | wrong!"
           | 
           | https://doi.org/10.1145/1508244.1508275
        
             | igouy wrote:
             | "Producing wrong data without doing anything obviously
             | wrong!"
             | 
             | [pdf]
             | 
             | https://users.cs.northwestern.edu/~robby/courses/322-2013-s
             | p...
        
         | porridgeraisin wrote:
         | That linker lottery led to a 15% improvement? I'm surprised. Do
         | you know in what cases you get such a big improvement from
         | something like that? Is it rare? How did you end up reasoning
         | about it?
        
           | MattPalmer1086 wrote:
           | Various research has shown that the variation can be much
           | higher than 15% due to things like this. It's not that rare;
           | I keep bumping into it. Compilers and linkers do a reasonable
           | job but fundamentally modern CPUs are extremely complex
           | beasts.
           | 
           | I found Casey Muratori's series the best explanation of what
           | is going on at the CPU level.
        
           | MattPalmer1086 wrote:
           | Some additional context. I was actually writing a
           | benchmarking tool for certain kinds of search algorithm. I
           | spent a long time reducing and controlling for external
           | sources of noise. CPU pinning, doing hundreds of those runs
           | with different data, and then repeating them several times
           | and taking the best of each score with the same data (to
           | control for transient performance issues due to the machine
           | doing other things).
           | 
           | I got the benchmarking tool itself to give reasonably
           | repeatable measurements.
           | 
           | The tool had high precision, but the accuracy of "which
           | algorithm is better" was not reliable just due to these code
           | layout issues.
           | 
           | I basically gave up and shelved the benchmarking project at
           | that point, because it wasn't actually useful to determine
           | which algorithm was better.
        
       | albertzeyer wrote:
       | To clarify: The situation is still not completely understood?
        | It's not just the computed gotos, but there is some other
       | regression in Clang19? Basically, the difference between
       | clang19.nocg and clang19 is not really clear?
       | 
       | Btw, what about some clang18.tc comparison, i.e. Clang18 with the
       | new tail-call interpreter? I wonder how that compares to
       | clang19.tc.
        
         | aeyes wrote:
         | The article includes a link to the pull request which fixes the
         | regression in LLVM: https://github.com/llvm/llvm-
         | project/pull/114990
        
         | nelhage wrote:
         | > Btw, what about some clang18.tc comparison
         | 
         | (post author here) Oh, this is something I could have called
         | out explicitly: The tail-calling interpreter relies on a
         | feature (the `preserve_none` calling convention) that only
         | landed in clang-19. That means you can only test it on that
         | version. That coincidence (that 19 added both this feature,
         | _and_ the regression) is part of why this was so easy to miss
          | at first, and why I had to "triangulate" with so many
         | different benchmarks to be confident I understood what was
         | going on.
        
       | thrdbndndn wrote:
       | Great article! One detail caught my attention.
       | 
       | In one of the referenced articles,
       | https://simonwillison.net/2025/Feb/13/python-3140a5/, the author
       | wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the
       | benchmark (on my extremely overloaded M2 MacBook Pro)."
       | 
       | I'm quite confused by this. Did the author run the benchmark
       | while the computer was overloaded with other processes? Wouldn't
       | that make the results completely unreliable? I would have thought
       | these benchmarks are conducted in highly controlled environments
       | to eliminate external variables.
        
         | ambivalence wrote:
         | Simon Willison is a great guy, but he's not a Python core
         | developer and his ad hoc benchmark is not what the CPython core
         | team members are using. For the latter, see
         | https://github.com/faster-cpython/benchmarking-public
        
       | motbus3 wrote:
        | I recently ran some benchmarks from Python 3.9 to 3.13. Up to
        | 3.11 it only got better; Python 3.12 and 3.13 were about 10%
        | slower than 3.11.
        | 
        | I thought my homemade benchmark just wasn't good enough, so I
        | deployed the upgrade to a core service anyway, and I saw the
        | same changes in our collected metrics. Does anyone else have
        | the same problem?
        
         | sgarland wrote:
         | Yes, I found a loop performance regression [0] in 3.12 and
         | 3.13.
         | 
         | [0]: https://github.com/python/cpython/issues/123540
        
       | cb321 wrote:
       | While some people here are citing 10% as "large" and 1% as
       | "normal", there are optimizations like partial inlining of doubly
       | recursive Fibonacci that can reduce the actual work (and so also
        | time) _exponentially_ (factors >10X for double-digit arguments
       | or 1000s of %, technically exponential in the difference of the
       | recursion depth, not the problem size [1]).
       | 
       | C compilers can also be very finicky about their code inlining
       | metrics. So, whether that enormous speed-up is realized can be
       | very sensitive to the shape of your code.
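        | 
        | For a concrete (hand-done) version of that inlining: expand
        | fib(n-1) one level and merge the two identical fib(n-2) calls
        | (legal since fib is pure). A sketch, with fib_pi as my name
        | for the rewrite:
        | 
        |   #include <stdio.h>
        | 
        |   /* naive doubly recursive Fibonacci: ~1.62^n calls */
        |   long fib(long n) {
        |       return n < 2 ? n : fib(n - 1) + fib(n - 2);
        |   }
        | 
        |   /* fib(n) = fib(n-1) + fib(n-2)
        |            = [fib(n-2) + fib(n-3)] + fib(n-2)
        |            = 2*fib(n-2) + fib(n-3)
        |      Call count now follows T(n) = T(n-2) + T(n-3),
        |      ~1.33^n: exponentially fewer calls than ~1.62^n. */
        |   long fib_pi(long n) {
        |       if (n < 2) return n;
        |       if (n == 2) return 1;
        |       return 2 * fib_pi(n - 2) + fib_pi(n - 3);
        |   }
        | 
        |   int main(void) {
        |       printf("%ld %ld\n", fib(20), fib_pi(20));  /* 6765 6765 */
        |       return 0;
        |   }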
       | 
       | So, while part of the problem is definitely that CPUs have gotten
       | quite fancy/complex, another aspect is that compilers "beyond -O0
        | or -O1" have _also_ gotten fancy/complex.
       | 
       | The article here, while good & worth reading, is also but one of
       | many examples of how two complex things interacting can lead to
       | very surprising results (and this is true outside of computing).
       | People have a rather strong tendency to oversimplify -- no matter
       | how many times this lesson shows up.
       | 
        | EDIT: Also, the article at least uses _two_ CPUs, Intel &
        | Apple M1, and _two_ compilers (gcc & clang), but there are
        | realistic deployment scenarios across _many more_
        | generations/impls of Intel, AMD, and ARM, and probably other
        | compilers, too. So, it
       | only even samples a small part of the total complexity. Also, if
        | you want to be more scientific, esp. for differences like
        | "1.01X", then time measurements should really have error bars
        | of _some kind_ (either std.dev of the mean or, maybe better
        | for a case like this, std.dev of the min [2]), and to minimize
        | those measurement errors you probably want CPU core affinity
        | scheduling in the OS.
       | 
       | [1] https://stackoverflow.com/questions/360748/computational-
       | com...
       | 
       | [2] https://github.com/c-blake/bu/blob/main/doc/tim.md
        
       | kenjin4096 wrote:
       | Hello. I'm the author of the PR that landed the tail-calling
       | interpreter in CPython.
       | 
       | First, I want to say thank you to Nelson for spending almost a
       | month to get to the root of this issue.
       | 
        | Secondly, I want to say I'm extremely embarrassed and sorry
        | that I made such a huge oversight. I, and probably the rest of
        | the CPython team, did not expect the compiler we were using
        | for the baseline to have that bug.
       | 
       | I posted an apology blog post here. https://fidget-
       | spinner.github.io/posts/apology-tail-call.htm...
        
         | jxjnskkzxxhx wrote:
         | I just wanted to say: respect for being able to say "sorry, I
         | made a mistake". I hate the fake it till you make it mentality
         | that seems to be the norm now.
        
           | kzrdude wrote:
            | I understand the frustration, but I don't think it needed
            | to be said (the part about mentality; the thanks is of
            | course cool), because that mentality is still not the
            | norm.
            | 
            | Why do I even bring this up - I want to say: let's not let
            | what we see in the news influence our perception of the
            | real people of the world. Just because fraud and crimes
            | get elevated in the news does not mean that the common man
            | is a criminal or a fraud. :)
        
             | codr7 wrote:
             | Divide and conquer; we're supposed to hate each other and
             | trust the state/elite/technology, at least that's the plan.
             | 
             | The real criminals, the people we should keep an eye on,
             | are the plan's designers and implementers.
        
         | ptx wrote:
         | Why didn't this regression in baseline performance show up (or
         | did it?) on the faster-cpython benchmarks page [0]? Could the
         | benchmarks be improved to prevent similar issues in the future?
         | 
         | [0] https://github.com/faster-cpython/benchmarking-public
        
           | kenjin4096 wrote:
           | We don't normally test with bleeding-edge compilers on the
           | faster cpython benchmarks page because that would invalidate
           | historical data. E.g. if 2 years ago we used GCC 11 or
           | something to compile and run a benchmark, we need to run it
           | with GCC 11 again today to get comparable results.
           | 
           | Clang 19 was released last year. We only benched it a few
           | months ago. We did notice there was a significant slowdown on
           | macOS, but that was against Xcode Clang, which is a different
            | compiler. I thought it might've been an Xcode thing, which
            | has bitten CPython in the past (such as Xcode LTO
            | working/not working versus normal Clang), so I didn't
            | investigate deeper (facepalming now in retrospect) and
            | chalked it up to a compiler difference.
           | 
           | TLDR: We didn't run benchmarks of clang 19 versus 18. We only
           | ran benchmarks of clang 19 versus gcc, Xcode clang, and MSVC.
           | All of which are not apples-to-apples to Clang 19, so I
           | naiively thought it was a compiler difference.
           | 
            | EDIT: As to how we could improve this process, I'm not too
            | sure, but I know I'll be more discerning now whenever a
            | compiler upgrade comes with a >4% perf change.
        
           | cb321 wrote:
           | That is a better than average benchmark page.
           | 
           | As alluded to in
           | https://news.ycombinator.com/item?id=43319010, I see these
           | tests were collected against just 2 Intel and 2 ARM CPUs. So,
           | if you are looking for feedback to improve, you should
            | probably also include (at least) an AMD Zen4 or Zen5 in
            | there. CPU & compiler people have both been trying to
            | "help perf while not trusting the other camp" for as long
            | as I can remember, and I doubt that problem will ever go
            | away.
           | 
           | A couple more CPUs will help but not solve generalizability
           | of results. E.g., if somebody tests against some ancient 2008
           | Nehalem hardware, they might get _very_ different answers.
           | Similarly, answers today might not reflect 2035 very well.
           | 
           | The reality of our world of complex hardware deployment
           | (getting worse with GPUs) is that "portable performance" is
           | _almost_ a contradiction in terms. We all just do the best we
            | can at some semblance of the combination. The result is
            | some kind of weighted average of "source that isn't too
            | heavily #ifdef'd" and "a bit faster, informally averaged
            | over our user population & their informally averaged
            | workloads", and this applies at _many levels_ of the whole
            | computing stack.
           | 
           | EDIT: And, of course, a compiled impl like Cython or Nim is
           | another way to go if you care about performance, but I do
           | understand the pull & network effects of the Python
           | ecosystem. So, that may not always be practical.
        
         | pseufaux wrote:
         | The way this was handled is incredibly good form! Thank you for
         | going to such lengths to make sure the mistake was corrected. I
          | respect and thank you (and all the developers working on
          | Python) for all the work you do!
        
         | delijati wrote:
         | One question arises: does the added code [1] bring any
         | improvement, or does it merely complicate the source? Should it
         | not be removed?
         | 
         | [1] https://github.com/python/cpython/pull/128718
        
           | kenjin4096 wrote:
           | That's a fair question.
           | 
           | The blog post mentions it brings a 1-5% perf improvement.
           | Which is still significant for CPython. It does not
           | complicate the source because we use a DSL to generate
           | CPython's interpreters. So the only complexity is in
            | autogenerated code, which is usually meant for machine
            | consumption anyway.
           | 
           | The other benefit (for us maintainers I guess), is that it
            | compiles way faster and is more debuggable (perf and other
            | tools work better) when each bytecode handler is a smaller
            | function.
           | So I'm inclined to keep it for perf and productivity reasons.
        
             | coldtea wrote:
              | There was a plan for a 5x overall speedup looking for
              | funding back in 2022. Then a team with Guido and others
              | involved (and MS backing?) got on the same bandwagon and
              | made some announcements about speeding up CPython a lot.
             | 
              | Several releases in, have we seen even a 2x speedup? Or
              | more like 0.2x (i.e. ~20%) at best?
             | 
              | Not trying to dismiss the interpreter changes - I more
              | want to know whether those speedup plans were even
              | remotely realistic, and whether anything close to even
              | 1/5 of what was promised will really come out of them...
        
               | Twirrim wrote:
                | There have been a lot of speed-ups coming out of that
                | project. It's not being done in one go, though, but
                | spread out over the last several releases. So while
                | individual speedups may not look that significant,
                | remember that each one compounds on the previous
                | release's speedup.
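                | 
                | For illustration (made-up numbers): four releases that
                | each land a modest 12% win compound to roughly 1.57x
                | overall:
                | 
                |   1.12 * 1.12 * 1.12 * 1.12 ~= 1.57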
        
               | chippiewill wrote:
               | The faster cpython project was for 5x over the 3.10
               | baseline. CPython 3.13 is currently running at something
                | like 1.6x speed compared to 3.10. With the JIT enabled
                | it goes up by a few more single-digit percentage
                | points. With the changes in 3.14 it'll be something
                | like a 1.8x speed-up.
               | 
                | So it's slowly getting there. I think the faster
                | cpython project was mostly banking on the idea that
                | the JIT can get a lot faster as it starts to optimise
                | more and more, and the JIT only just shipped in 3.13,
                | so there's a lot of headroom. We know that PyPy (an
                | existing JIT implementation) is close to 5x faster
                | than CPython a lot of the time already.
               | 
               | There's also now the experimental free-threading build
                | which speeds up multithreaded Python applications (not
                | by a lot right now, though, unfortunately).
        
             | ot wrote:
             | Being more robust to fragile compiler optimizations is also
             | a nontrivial benefit. An interpreter loop is an extremely
             | specialized piece of code whose control flow is too
             | important to be left to compiler heuristics.
             | 
             | If the desired call structure can be achieved in a portable
             | way, that's a win IMO.
        
         | jraph wrote:
         | Reading that you are extremely embarrassed and sorry that you
         | made such a huge oversight, I was imagining you had broken
         | something / worsened CPython's performance.
         | 
          | But it's nothing like this. You announced a 10-15% perf
          | improvement, but that improvement is more like 1-5% on a
          | non-buggy compiler. It's not even that the 10-15% figure is
          | _wrong_; it's just that it's correct only under very
          | specific conditions, unbeknownst to you.
         | 
         | IIUC, you did your homework: you made an improvement, you
         | measured a 10-15% perf improvement, the PR was reviewed by
         | other people, etc. It just so happens that this 10-15% figure
         | is misleading because of an issue with the version of clang you
         | happened to use to measure. Unless I'm missing something, it
         | looks like a fair mistake anyone could have reasonably made. It
          | even looks like it was hard not to fall into this trap. You
          | _could_ have been more suspicious seeing such a high number,
          | but hindsight is 20/20.
         | 
          | Apparently, you still brought significant performance
          | improvements, and your work also helped uncover a compiler
          | regression. The wrong number seems quite minor in
          | comparison. I wonder who was actually hurt by this. I'm only
          | discovering the "case" right now, but at first glance it
          | doesn't feel like you owe anyone an apology. Kudos for all
          | this!
        
           | ehsankia wrote:
           | In some way, by indirectly helping fix this bug, they led to
           | a ~10% performance increase for everyone who was using that
           | faulty compiler! That's even better than an optional flag
           | that many people won't know about or use.
        
             | chowells wrote:
              | That performance regression only hit code with a very
              | large number of paths ending in the same table of
              | computed gotos. Only relatively complex interpreters are
              | likely to have been affected, so it's not a broad
              | performance improvement. But it is nice to have an
              | example of the compiler's new heuristic failing, to
              | provide evidence that it needs to be tunable.
        
               | ncruces wrote:
               | Well, that includes at least everyone using _Python_
               | built with that compiler.
        
           | rtollert wrote:
           | > IIUC, you did your homework: you made an improvement, you
           | measured a 10-15% perf improvement, the PR was reviewed by
           | other people, etc. It just so happens that this 10-15% figure
           | is misleading because of an issue with the version of clang
           | you happened to use to measure. Unless I'm missing something,
           | it looks like a fair mistake anyone could have reasonably
            | made. It even looks like it was hard not to fall into this
           | trap. You could have been more suspicious seeing such a high
           | number, but hindsight is 20/20.
           | 
           | Hah! Is this a Gettier problem [0]?
           | 
            | 1. True: The PR improves Python performance 15-20%.
            | 
            | 2. True: Ken believes that the PR improves Python
            | performance 15-20%.
            | 
            | 3. True: Ken is justified in believing that the PR
            | improves Python performance 15-20%.
           | 
           | Of course, PR discussions don't generally revolve around
           | whether or not the PR author "knows" that the PR does what
           | they claim it does. Still: these sorts of epistemological
           | brain teasers seem to come up in the performance measurement
           | field distressingly often. I wholeheartedly agree that Ken
           | deserves all the kudos he has received; still, I also wonder
           | if some of the strategies used to resolve the Gettier problem
           | might be useful for code reviewers to center themselves every
           | once in a while. Murphy's Law and all that.
           | 
           | [0]: https://en.wikipedia.org/wiki/Gettier_problem
        
         | acdha wrote:
          | I feel bad for you, since a change on the order of 5% would
          | still have been seen as a very nice bit of work. I
          | appreciate the class with which you're dealing with an
          | honest mistake, and all of the hard work you've given the
          | Python community.
        
         | DannyBee wrote:
          | FWIW - the fix has been merged since you wrote that blog
          | post ;)
         | 
         | Beyond that - 3-5% is a lot for something as old as the python
         | interpreter if it holds up. I would still be highly proud of
         | that.
         | 
         | After 30 years, i've learned (like i expect you have) to be
         | suspicious of any significant (IE >1%) performance improvement
         | in a system that has existed a long time.
         | 
         | They happen for sure, but are less common. Often, people are
         | shifting time around, and so it just isn't part of your
         | benchmark anymore[1]. Secondarily, benchmarking is often done
         | in controlled environments, to try to isolate the effect. Which
         | seems like the right thing to do. But then the software is run
         | in non-isolated environments (IE with a million other things on
         | a VM or desktop computer), which isn't what you benchmarked it
         | in. I've watched plenty of verifiably huge isolated gains
         | disappear or go negative when put in production environments.
         | 
         | You have the particularly hard job that you have to target lots
         | of environments - you can't even do something like say "okay
         | look if it doesn't actually speed it up in production, it
         | didn't really speed it up", because you have no single
         | production target. That's a really hard world to live in and
         | try to improve.
         | 
         | In the end, performance tuning and measurement is really hard.
         | You have nothing to be sorry for, except learning that :)
         | 
         | Don't let this cause you to be afraid to get it wrong - you
         | will get it wrong anyway. We all do. Just do what you are doing
         | here - say 'whoops, i think we screwed this up', try to figure
         | out how to deal with it, and try to figure out how to avoid it
         | in the future (if you can).
         | 
         | [1] This is common not just in performance, but in almost
         | anything, including human processes. For example, to make
         | something up, the team working on the code review tool would
         | say "we've reduced code review time by 15% and thus sped up
         | everyone's workflow!". Awesome! But actually, it turns out they
         | made more work for some other part of the system, so the
         | workflow didn't get any faster, they just moved the 15% into a
         | part of the world they weren't measuring :)
        
           | theLiminator wrote:
           | > After 30 years, i've learned (like i expect you have) to be
           | suspicious of any significant (IE >1%) performance
           | improvement in a system that has existed a long time.
           | 
           |  _Laughs in corporate code_
        
             | miroljub wrote:
              | That's a different case. Corporate code is never
              | optimized for performance; performance doesn't factor in
              | at all.
        
             | DannyBee wrote:
              | Sure, let me amend it to "i'm suspicious of any
              | significant performance improvement in a system where
              | performance actually matters, and that has existed in a
              | state where performance matters for a long time".
        
             | colechristensen wrote:
             | I'll believe you if you say 0.5% improvement, I'll believe
             | you if you say 10,000% improvement, but 10%? That's fishy.
        
         | rbetts wrote:
         | Thank you for doing work that makes python better for millions
         | of people.
        
         | robertlagrant wrote:
         | You're doing great work pushing Python forwards. Thanks for all
         | your efforts.
        
         | tiagod wrote:
         | It's still a performance improvement, which at the scale of
         | python deployment will probably mean quite a bit of global
         | power savings :)
        
         | 9cb14c1ec0 wrote:
          | Don't feel so bad. A 1-5% speedup is still an extremely
          | valuable contribution.
        
         | haberman wrote:
         | I think it's important to note that a primary motivation of the
         | tail call interpreter design is to be less vulnerable to the
         | whims of the optimizer. From my original blog article about
         | this technique
         | (https://blog.reverberate.org/2021/04/21/musttail-
         | efficient-i...):
         | 
         | > Theoretically, this control flow graph paired with a profile
         | should give the compiler all of the information it needs to
         | generate the most optimal code [for a traditional
         | switch()-based interpreter]. In practice, when a function is
         | this big and connected, we often find ourselves fighting the
         | compiler. It spills an important variable when we want it to
         | keep it in a register. It hoists stack frame manipulation that
         | we want to shrink wrap around a fallback function invocation.
         | It merges identical code paths that we wanted to keep separate
         | for branch prediction reasons. The experience can end up
         | feeling like trying to play the piano while wearing mittens.
         | 
         | That second-to-last sentence is exactly what has happened here.
         | The "buggy" compiler merged identical code paths, leading to
         | worse performance.
         | 
         | The "fixed" compiler no longer does this, but the fix is
         | basically just tweaking a heuristic inside the compiler.
         | There's no actual guarantee that this compiler (or another
         | compiler) will continue to have the heuristic tweaked in the
         | way that benefits us the most.
         | 
         | The tail call interpreter, on the other hand, lets us express
         | the desired machine code pattern in the interpreter itself.
         | Between "musttail", "noinline", and "preserve_none" attributes,
         | we can basically constrain the problem such that we are much
         | less at the mercy of optimizer heuristics.
         | 
         | For this reason, I think the benefit of the tail call
         | interpreter is more than just a 3-5% performance improvement.
         | It's a _reliable_ performance improvement that may be even
         | greater than 3-5% on some compilers.
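          | 
          | To make the shape concrete, here is a toy sketch of the
          | pattern (a made-up two-opcode VM, not CPython's generated
          | code; `preserve_none` needs clang 19+, so it's guarded):
          | 
          |   #include <stdio.h>
          | 
          |   #if defined(__clang__) && __clang_major__ >= 19
          |   #define MUSTTAIL   __attribute__((musttail))
          |   #define INTERP_CC  __attribute__((preserve_none))
          |   #else
          |   #define MUSTTAIL
          |   #define INTERP_CC
          |   #endif
          | 
          |   typedef struct { const unsigned char *ip; long acc; } Vm;
          |   typedef INTERP_CC void (*Handler)(Vm *vm);
          | 
          |   static INTERP_CC void op_inc(Vm *vm);
          |   static INTERP_CC void op_halt(Vm *vm);
          | 
          |   /* one small function per opcode; each ends in its own
          |      guaranteed tail call, so hot interpreter state stays
          |      in registers between handlers */
          |   static const Handler table[] = { op_inc, op_halt };
          | 
          |   static INTERP_CC void op_inc(Vm *vm) {
          |       vm->acc++;
          |       vm->ip++;
          |       MUSTTAIL return table[*vm->ip](vm);
          |   }
          | 
          |   static INTERP_CC void op_halt(Vm *vm) {
          |       printf("acc = %ld\n", vm->acc);
          |   }
          | 
          |   int main(void) {
          |       unsigned char code[] = {0, 0, 0, 1};  /* inc x3, halt */
          |       Vm vm = { code, 0 };
          |       table[*vm.ip](&vm);                   /* prints acc = 3 */
          |       return 0;
          |   }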
        
           | sunshowers wrote:
           | In full agreement with this. There is tremendous value in
           | having code whose performance is robust to various compiler
           | configurations.
        
         | sundbry wrote:
         | You don't have to apologize, you did great work either way.
        
         | immibis wrote:
         | You don't need a long heartfelt apology - a simple "god dammit"
         | is enough for this. You made no mistake - you only got unlucky.
         | 
         | It's possible to improve your luck by applying more care, but
         | it's also possible to apply so much care that you do not try
          | things that _would_ have been useful, so I'd rather you keep
         | erring on the side that you did!
        
         | SmellTheGlove wrote:
         | I don't think you should be embarrassed or apologize. You still
         | did a thing that improved performance - in the near term it
         | worked around a bug that you weren't even aware of, and long
         | term there are still gains even with that bug fixed. But even
         | if that weren't the case, the only way anything gets done in
         | software is to rely on abstractions. You can either get things
         | done or know exactly how every last bit of your stack is
         | implemented, but probably not both. It's a very reasonable
         | trade off.
         | 
          | Smart people go and build things. Other smart people find
          | problems. Nothing's broken with that.
        
         | ggu wrote:
         | Hello, I was one of the people who contributed to this
         | interpreter change. Thank you Nelson for the excellent write-up
         | and for going to such lengths to get to the bottom of this.
         | That said, I wanted to defend Ken a bit. :-)
         | 
         | Ken Jin is a volunteer who's been tirelessly working to make
         | CPython faster over the past couple of years for not enough
         | credit. IMHO, Ken did nothing wrong here, and there's really
         | nothing to be embarrassed about. The fact that it took a month
         | (more if you consider the folks helping to reproduce the
         | original results) for anyone to notice anything wrong shows how
         | complicated the situation was! Put yourself in Ken's shoes --
         | having updated to LLVM-19 to enable the preserve_none flag,
         | would you then hypothesize that LLVM-19 may have introduced an
         | unrelated performance regression that no one noticed for five
         | months? Lots of people underestimate how hard benchmarking is
         | IMO.
         | 
         | A 1-5% performance improvement is also pretty valuable, just
         | not quite as spectacular as we thought originally :-)
        
         | ok_dad wrote:
         | Yea, don't feel too bad, you're just unlucky that your mistake
         | was so public, most of us have these kinds of mistakes weekly
         | in a more private setting. I think everyone who knows anything
         | about how optimized Python is would be impressed that you
         | managed even a few percent improvement in speed! That will also
         | probably save many, many GWh of power, even those few percent!
         | 
         | You have a new account here and your blog is just one article
         | so far so you might be a new-ish developer(?), but you are
         | doing great, keep it up! If you are not a new developer, you
         | are still doing great, keep it up!
        
       | bjourne wrote:
        | This is exactly the kind of content I love to see on HN. I
        | wonder, though, how this optimization is related to tail-call
        | optimization? How the interpreter jump table is implemented
        | shouldn't affect how stack frames are created, should it?
        
         | mbel wrote:
         | It's explained under the first link from the article [0]:
         | 
         | "A new type of interpreter has been added to CPython. It uses
         | tail calls between small C functions that implement individual
         | Python opcodes, rather than one large C case statement."
         | 
         | [0]
         | https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-...
        
         | ambivalence wrote:
          | A big chunk of the performance gain comes from the fact that
          | with tail calls the handlers don't have to save and restore
          | registers (arranged via the `preserve_none` calling
          | convention).
        
         | thunkingdeep wrote:
          | Well, if you'd bother to read it, you'd discover this is about
         | tail calls in C, not in Python. It has nothing to do with tail
         | recursion in Python. Guido has explicitly said that Python will
         | never have it.
        
       | throw132128 wrote:
       | That is what happens in a project if you purge all opposition and
       | a few people work on corporate projects that are supposed to
       | succeed.
       | 
        | The free-threading work had massive regressions (50% slower);
        | this one has almost no speedup.
       | 
        | People are scared of corporate reprisals or reprisals from
        | the Steering Council, so they shut up, like in the Soviet
        | Union.
        
       | Retr0id wrote:
       | It's getting harder to characterize the C language as "close to
       | the metal". Obviously optimizing compilers are a good thing, but
       | it's frustrating when it feels like you're fighting with them to
       | get the code emitted just the way you want.
        
       | Malp wrote:
       | This is some solid work and a great blog post to go hand-in-hand
       | with. Wow.
        
       | AlexSW wrote:
        | I'm struggling to confirm my intuition of why the compiler
        | optimisation is faster than the naive switch expansion.
       | 
       | Is it about the layout of where the assembly instructions end up,
       | and spacing around them? Or the CPU pipelining working better?
       | Or...?
        
         | csjh wrote:
         | distributing the dispatch instructions lets the predictor pick
         | up on specific patterns within the bytecode - for example,
         | comparison instructions are probably followed by a conditional
         | branch instruction
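          | 
          | a rough sketch of the "distributed" shape (GNU C computed
          | gotos, toy opcodes): each handler ends in its own indirect
          | jump, so the predictor gets per-opcode history instead of
          | the one shared entry a single switch-in-a-loop would have:
          | 
          |   #include <stdio.h>
          | 
          |   long run(const unsigned char *code) {
          |       /* gcc/clang extension: array of label addresses */
          |       static void *dispatch[] = { &&op_inc, &&op_dec,
          |                                   &&op_halt };
          |       long acc = 0;
          |       const unsigned char *ip = code;
          |       goto *dispatch[*ip];
          | 
          |   op_inc:
          |       acc++;
          |       goto *dispatch[*++ip];  /* learns what follows INC */
          |   op_dec:
          |       acc--;
          |       goto *dispatch[*++ip];  /* learns what follows DEC */
          |   op_halt:
          |       return acc;
          |   }
          | 
          |   int main(void) {
          |       unsigned char prog[] = {0, 0, 1, 2};  /* ++ ++ -- halt */
          |       printf("%ld\n", run(prog));           /* prints 1 */
          |       return 0;
          |   }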
        
           | AlexSW wrote:
           | Ah of course! It's improving the branch target predictor.
           | Makes sense, thanks!
        
       | anarazel wrote:
       | Hah, I found something similar in gcc a few years back. Affecting
       | postgres' expression evaluation interpreter:
       | 
       | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
        
       | ram_rar wrote:
        | Kudos to the author and the CPython team for reckoning with
        | this. Nevertheless, it's still a very important improvement in
        | the positive direction, and that's all that matters. This is a
        | great story about why benchmarking is so hard to get right. I
        | always tell my team to benchmark your real-world use case as
        | closely as possible and not to rely on external results.
        
       | ashvardanian wrote:
       | Thanks for the post! Would love to read similarly styled dives
       | into GIL performance and experimental multi-threaded Python
       | builds!
        
       ___________________________________________________________________
       (page generated 2025-03-10 23:00 UTC)