[HN Gopher] Parsing protobuf at 2+GB/s: how I learned to love ta...
       ___________________________________________________________________
        
       Parsing protobuf at 2+GB/s: how I learned to love tail calls in C
       (2021)
        
       Author : fanf2
       Score  : 317 points
       Date   : 2024-08-19 08:42 UTC (14 hours ago)
        
 (HTM) web link (blog.reverberate.org)
 (TXT) w3m dump (blog.reverberate.org)
        
       | gpderetta wrote:
       | > I very much hope that the attribute will catch on, spreading to
       | GCC, Visual C++, and other popular compilers,
       | 
       | AFAIK, attribute musttail is in the process of being added to GCC
       | (the patch is under review) with semantics compatible with clang.
        
         | ufo wrote:
         | What about the preserve_most attribute? Is there any chance
         | something like that will get into GCC? Without it, the non-tail
         | calls ruin the interpreter.
        
           | gpderetta wrote:
           | Maybe, but not there yet:
           | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899
           | 
           | In the meantime some inline assembly macro trickery might
           | help.
           | 
           | edit: code duplication can also be obviated by templating
           | your op function with a fast/slow parameter, with the fast
           | variant tail-calling the slow variant when it cannot perform
           | the fast path, while guarding the slow code via the compile
           | time parameter. The downside is yet more code obfuscation of
           | course.
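            | 
            | A very rough sketch of that idea in plain C (all names here
            | are invented, and a macro stands in for the template
            | parameter), just to show the shape:
            | 
            |     #include <stdint.h>
            | 
            |     struct vm { uint64_t acc; };
            | 
            |     int dispatch(struct vm *vm, const uint8_t *ip);
            |     void slow_work(struct vm *vm);  /* defined elsewhere */
            | 
            |     #define DEFINE_OP_INC(name, IS_FAST)                  \
            |       static int name(struct vm *vm, const uint8_t *ip) { \
            |         if (IS_FAST) {                                    \
            |           if (vm->acc < 1000) {       /* fast path */     \
            |             vm->acc++;                                    \
            |             return dispatch(vm, ip + 1);                  \
            |           }                                               \
            |           return op_inc_slow(vm, ip); /* bail to slow */  \
            |         }                                                 \
            |         slow_work(vm);               /* slow-only code */ \
            |         return dispatch(vm, ip + 1);                      \
            |       }
            | 
            |     DEFINE_OP_INC(op_inc_slow, 0)
            |     DEFINE_OP_INC(op_inc_fast, 1)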
        
         | katzinsky wrote:
          | Given some of the other Scheme-like features GNU C has, it's
          | surprising they're lagging on this one.
        
         | fweimer wrote:
         | It's a hard problem because many ABIs cannot do tail calls even
         | for very basic stuff, like calls to extern functions with
         | matching argument and return types. It looks like Clang has
         | some heuristics for changing call sequences for musttail calls.
         | For example, it switches to noplt calls on i686. None of this
         | is mentioned in the Clang documentation:
         | https://clang.llvm.org/docs/AttributeReference.html#musttail
         | 
         | What's realistic here is that you get a diagnostic if the
         | compiler cannot generate a tail call. For many users, that will
         | likely be good enough. Guaranteeing a tail call as in Scheme is
         | unlikely to happen.
        
       | irdc wrote:
       | I wonder how fast this would be when using a trampoline, i.e.
       | returning the next function as a function pointer and calling
       | that from an outer loop. That would have the advantage of being
       | portable C.
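        | 
        | A minimal sketch of that style (a toy counting loop, all names
        | made up): each step returns a struct describing the next call
        | instead of making it, and a plain loop drives execution, so the
        | stack never grows.
        | 
        |     #include <stdio.h>
        | 
        |     struct step;
        |     typedef struct step (*step_fn)(long n, long acc);
        |     struct step { step_fn next; long n, acc; };
        | 
        |     static struct step done(long n, long acc) {
        |         (void)n;
        |         printf("%ld\n", acc);
        |         return (struct step){ NULL, 0, 0 };
        |     }
        | 
        |     static struct step count(long n, long acc) {
        |         if (n == 0)
        |             return (struct step){ done, n, acc };
        |         return (struct step){ count, n - 1, acc + 1 };
        |     }
        | 
        |     int main(void) {
        |         /* deep "recursion" that never touches the C stack */
        |         struct step s = { count, 1000000, 0 };
        |         while (s.next)
        |             s = s.next(s.n, s.acc);
        |         return 0;
        |     }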
        
         | chii wrote:
         | the reason i suspect tail call optimization is fast would be
         | because the resultant loop is predictable and thus CPU
          | instruction prefetch and memory prefetch work very well.
         | 
         | Jumping via function pointers would probably not be as
          | predictable, and you'd be unlikely to see the same benefit.
         | 
         | Of course, one must measure, and i haven't.
        
           | gpderetta wrote:
           | The tail call interpreter is also calling through a function
           | pointer. The cost here is purely the call+ret overhead, which
           | can be non-trivial when it is per opcode; on some micro-
           | architectures there is also a limit on taken jumps per cycle
           | (sometimes as low as one taken jump every other cycle).
           | 
           | edit: trampolining would also collapse all indirect jumps to
           | a single source address which is not ideal for branch
           | prediction.
        
         | soegaard wrote:
          | C is often used as a target language for compilers from higher
         | level languages. The Scheme programming language requires all
         | tail calls not to grow the stack. Therefore implementors have
         | explored various techniques including trampolines. I don't have
         | a citation, but you can find the answer in the papers on
         | compiling Scheme to C. If there is no guarantee of TCO in the
         | target language, then the generated programs will be slower.
         | 
         | Incidentally, this is why implementors of (especially) high
         | level languages are annoyed that TCO was removed from the
          | JavaScript specification. There are even solutions for having
          | TCO and still allowing stack inspection.
         | 
         | https://github.com/schemedoc/bibliography/blob/master/page8....
        
       | stabbles wrote:
       | Maybe the example is too simple, but it does not require
       | `__attribute__((musttail))` for good code gen.
       | 
       | Also if the error handling function is unlikely, you wouldn't
       | care too much about how fast it is to call it?
       | 
        | To me it seems like a structure of the form
        | 
        |     <read data>
        |     if (unlikely(malformed))
        |         return error();
        |     <prep more stuff>
        |     switch (data_type) {
        |     case x: return handle_x();
        |     case y: return handle_y();
        |     }
        | 
        | generates a nice jump table quite reliably.
        
         | OskarS wrote:
          | Obviously compilers have been doing tail call elimination
          | forever, but for this trick "generates [tail calls] quite
         | reliably" is not enough: you have to GUARANTEE it (or fail
         | compilation), otherwise this structure does not work (it will
         | blow the stack immediately). That's the point of [[musttail]],
          | tail call elimination is _required_, that's the only choice
         | the compiler has.
        
           | stkdump wrote:
           | That means that any such code is not portable across
           | compilers anymore. It is effectively written in a non-
           | standard C dialect, because it requires a language extension
           | to work correctly.
        
             | ufo wrote:
             | Yes. But the alternative is assembly language, which is
             | even less portable.
        
               | kachapopopow wrote:
               | assembly is the most portable of them all even across
               | platforms! (x86 to arm compilation meme)
        
               | Karliss wrote:
               | The portable alternative is being explicit with your
               | loops, possibly in combination with gigantic unwieldy
               | switch statements or some regular goto (it is still part
          | of the standard). But that comes at the cost of readability
          | and sometimes performance. Whether it's practical depends
               | on the usecase. For something like recursive data
               | structure algorithms which are relatively small and self
               | contained, I would say it's perfectly doable. Simple
               | interpreters - maybe. Complex interpreters - here it
               | becomes messy.
        
               | mananaysiempre wrote:
               | See Mike Pall's posts on the subject--the performance
               | cost is considerable, for two reasons. First, you're
               | forcing the compiler to do register allocation for the
               | whole interpreter at once, which it can virtually never
               | do a good job of. (This is actually the more important
               | part.)
               | 
               | Second, given the existence of branch target buffers (and
               | the ruinous cost of mispredicted branches), you really
               | want the instruction dispatch to be a single indirect
               | branch at the end of each instruction implementation, and
               | for that standard tools are somewhere between unhelpful
               | (you can write a macro containing switch (*insn++) { case
               | INSN_FOO: goto impl_foo; /* ... */ } but it's anybody's
               | guess whether you're getting a single jump table for all
               | copies of that) and actively obstructive ("tail merging"
               | in older versions of Clang would actively destroy any
               | attempts at copying dispatch code). Granted, sometimes
               | things work out (new Clang versions can sometimes go
               | "looks like you're writing an interpreter" and turn a
               | vanilla switch in a loop into duplicated dispatch code).
               | Then again, sometimes they don't, and you can't actually
               | know.
        
               | OskarS wrote:
               | Yeah, that's the way virtual machines have been written
                | forever, some version of
                | 
                |     for instruction in instructions {
                |         switch (instruction) {
                |             case OPCODE_X: // ....
                |             case OPCODE_Y: // ....
                |             case OPCODE_Z: // ....
                |         }
                |     }
               | 
               | This is how VMs have been written since the dawn of time
               | (or using computed gotos, another non-standard addition
               | to C). It has problems though, like the fact that the
               | `switch` branch is extremely unpredictable, and that you
               | get a _massive_ function which is hard to optimize. This
                | [[musttail]] trick is a huge improvement. But yeah, if
                | you've got to support compilers that don't have
                | [[musttail]], you in essence have to have two
                | implementations: the [[musttail]] one and the loop/switch
                | one.
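                | 
                | For contrast, a toy sketch of the tail-call shape
                | (clang-only __attribute__((musttail)); names are mine):
                | 
                |     #include <stdint.h>
                |     #include <stdio.h>
                | 
                |     typedef int (*op_fn)(const uint8_t *ip, int acc);
                | 
                |     static int op_halt(const uint8_t *ip, int acc);
                |     static int op_inc(const uint8_t *ip, int acc);
                | 
                |     static const op_fn table[] = { op_halt, op_inc };
                | 
                |     #define DISPATCH(ip, acc)            \
                |         __attribute__((musttail))        \
                |         return table[*(ip)]((ip), (acc))
                | 
                |     static int op_halt(const uint8_t *ip, int acc) {
                |         (void)ip;
                |         return acc;
                |     }
                | 
                |     static int op_inc(const uint8_t *ip, int acc) {
                |         DISPATCH(ip + 1, acc + 1);
                |     }
                | 
                |     int main(void) {
                |         /* program: inc, inc, inc, halt */
                |         const uint8_t prog[] = { 1, 1, 1, 0 };
                |         printf("%d\n", table[prog[0]](prog, 0));
                |         return 0;
                |     }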
        
               | ufo wrote:
               | A switch-case is the default way to write an interpreter
               | and I'd even argue it's the most readable.
               | 
               | In the context of this article, it's all about the
               | performance. Switch-case generates suboptimal code for
               | the commonly used fast paths of the protobuf parser,
               | because the mere existence of the slow paths is enough to
               | interfere with the performance of the code around them.
        
             | OskarS wrote:
             | Yes, that is correct. You cannot do this trick in standard
             | C, C++ or Rust, it requires some version of [[musttail]].
             | Strong argument for adding it to the C standard, IMHO.
        
             | dist1ll wrote:
             | Fwiw, many C projects are written in a non-standard C
             | dialect, including the Linux kernel.
        
             | flohofwoe wrote:
             | The typical way to deal with this is to put the
             | __attribute__ into a C macro which expands to nothing on
             | compilers which don't understand GCC/Clang's __attribute__
             | keyword. The code without the attribute will still compile
             | and most likely also apply the tail call optimization, you
             | just don't get an error if the compiler can't apply the
             | optimization.
             | 
             | Also TBF, hardly any real-world C code is strictly standard
             | compliant. Many C compilers just agree on a common syntax
             | that includes both the C standard and some popular non-
             | standard extensions.
             | 
             | PS: C++ compilers actually ignore unknown attributes since
             | the `[[attribute]]` syntax has been standardized in C++11.
             | In GCC and Clang you'll get a warning in the standard
             | warning set, but not in MSVC.
             | 
             | PPS: C23 also standardized the `[[attribute]]` syntax and
             | also added a way to check for supported attributes:
             | 
             | https://en.cppreference.com/w/c/language/attributes
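              | 
              | The usual shape of such a macro (the name MUSTTAIL is my
              | own) looks something like this; it compiles away on
              | compilers that don't know the attribute, so the code
              | still builds, it just loses the guarantee:
              | 
              |     #if defined(__has_attribute)
              |     #  if __has_attribute(musttail)
              |     #    define MUSTTAIL __attribute__((musttail))
              |     #  endif
              |     #endif
              |     #ifndef MUSTTAIL
              |     #  define MUSTTAIL /* fall back to plain TCO */
              |     #endif
              | 
              |     int next_state(const char *p, int acc);
              | 
              |     int this_state(const char *p, int acc) {
              |         MUSTTAIL return next_state(p + 1, acc);
              |     }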
        
               | pjmlp wrote:
               | It will compile, and eventually blow up nicely with a
               | stack overflow OS fault.
               | 
               | Ah, the joys of writing "portable" C code with #ifdef
               | spaghetti, across commercial UNIXes and their own C
               | compilers, 20 years ago.
               | 
                | It only got better because, just like many people
                | assume Web == Chrome, for many C == GCC, blissfully
                | ignoring everything else.
               | 
               | Nowadays clang is also considered, mostly because a
               | couple of companies wanted to replace GCC, and naturally
               | clang needs to be able to match whatever GCC offers.
        
               | Filligree wrote:
               | > It will compile, and eventually blow up nicely with a
               | stack overflow OS fault.
               | 
               | Not at all guaranteed. Stack overflow is undefined
               | behaviour, which means compilers can optimise your
               | program on the assumption that it doesn't happen.
        
               | pjmlp wrote:
               | Ah true, yet another case that one tends to forget about.
        
               | flohofwoe wrote:
               | Well, if you use recursive code, you better know what
               | you're doing. With or without tail call optimization.
        
               | eru wrote:
               | Loops are just a special, limited case of recursion.
               | 
               | (And only necessary in languages that have trouble
               | implementing function calls properly.)
        
             | taneq wrote:
             | There's no such thing as "standard C" that you can actually
             | write, due to UB and implementation defined behaviour.
             | There's just _(C, compiler version, platform)_ that defines
              | (if only through the compiler's source code) what will
             | actually happen in any given situation.
        
               | perching_aix wrote:
               | so because there are implementation defined behaviors in
               | the standard, language extensions become okay?
        
               | flohofwoe wrote:
               | Language extensions are a feature, not a bug. They allow
               | C to evolve and C compilers to compete without requiring
               | committee consensus. Good extensions will eventually be
               | picked up by other compilers, and maybe even find their
               | way into the standard.
        
               | perching_aix wrote:
               | sure, i get the joke, i just don't like it. it's the same
               | story as browsers. proprietary extensions in the name of
               | progress because technically it's allowed, but also
               | unimplemented standardized features galore, necessitating
               | polyfill libraries and frequent checking of support
               | matrices.
               | 
               | it's just sprawl with popular support.
        
               | gpderetta wrote:
                | It is very different. A web browser is an (almost) fully
                | self-contained environment. Same with something like
                | Java. On the other hand, a standard C or C++
                | compiler+runtime has (by design) very few features.
                | Anything beyond trivial programs has to reach for
                | platform-specific features. Even POSIX is often not
                | enough.
        
               | perching_aix wrote:
               | not sure i'm following? i was referring exactly to those
               | "few" language features, e.g.:
               | https://en.cppreference.com/w/cpp/compiler_support
        
               | uecker wrote:
               | It is entirely possible to write strictly conforming C.
        
             | jonstewart wrote:
             | The article is pretty clear about this. When it comes to
             | fast lexing and parsing, it is typical for projects to make
             | portability tradeoffs in favor of performance. For example,
             | simdjson is full of assembly.
        
             | eru wrote:
             | You say it like it's a bad thing.
        
           | jerf wrote:
           | "you have to GUARANTEE it (or fail compilation)"
           | 
           | I've often pondered the utility of similar flags for other
           | optimizations. This is perhaps the largest one, but there are
           | other situations in certain code where I want to know that my
           | optimization has failed.
           | 
           | A more complicated example is, I've sometimes wanted an
           | assertion that a given function is inlined. We know from
           | hard, repeated experience over decades that letting users
           | annotate functions as inline directly doesn't end up working
           | very well, but I've often wondered about creating an
           | assertion that would fail the compile if it isn't. (Ideally
           | with at least a hint as to why the compiler failed it out,
           | but that's easier said than done.) Obviously you don't want
           | to go slapping this in some major open source library that's
           | going to be compiled by half-a-dozen compilers on dozens of
           | operating systems, but for my own code in my own situation it
           | can be an optimization that is the difference between success
           | and failure and it'd be nice to flag a failure.
           | 
           | (Bear in mind I am not proposing to blindly take any
           | particular action if it triggers. The point is to bring it up
           | to human attention and not have it buried deep in invisible,
           | yet potentially consequential, compiler decisions. The human
           | deciding "I guess this optimization isn't happening" and
           | removing the annotation would be one valid decision, for
           | instance.)
        
             | OskarS wrote:
             | I agree, this would be useful. Another one I would like is
             | auto-vectorization, a way to mark a loop with an attribute
             | and if it fails to auto-vectorize, the compiler should
             | print out the auto-vectorization report for that loop,
              | explaining why it failed. It's such a brittle
             | optimization, but it's crucial for a tiny number of
             | extremely hot loops, you would want to know if it failed
             | due to some code change or compiler upgrade. Also, it's
              | just a pain to use the auto-vectorization report normally.
        
             | ndesaulniers wrote:
             | > I've sometimes wanted an assertion that a given function
             | is inlined.
             | 
             | Try `__attribute__((error("not inlined")))` or `warning` on
             | the callee.
        
         | mananaysiempre wrote:
         | As the disassembly in the post demonstrates, the problem with
         | the fallback path (which is not necessarily the error path) is
         | not how fast the call to it is, it's that the mere existence of
         | that call can force the compiler to create a stack frame and
         | spill registers into it for the whole function, including the
         | fast path.
         | 
         | OK, maybe "force" is not the right word--nobody says the
         | compiler has to have a single stack frame structure for all
         | possible execution paths of a function. Nobody even says it has
         | to use the standard ABI for a no-linkage (static or anonymous-
         | namespace) function (that doesn't have its address taken). But
         | the reality is, all compilers I've seen do, including Clang, so
         | we want a way to tell them to not worry about the ABI and avoid
         | wasting time on preserving registers across the call.
         | 
         | Re your nice jump table, sure it does. But if you try running
         | the result under, say, perf report, and your test bytecode
         | doesn't describe a short loop, you'll see one of two things:
         | either you had a branch mispredict on each dispatch; or the
         | compiler went "looks like you're trying to write an
         | interpreter" and moved the indirect jump to the end of each
         | case (I've seen Clang do this). And either way the register
         | allocation in the resulting code probably sucks.
        
           | jcelerier wrote:
           | > so we want a way to tell them to not worry about the ABI
           | and avoid wasting time on preserving registers across the
           | call
           | 
           | that's what -fvisibility=internal already does, no?
        
             | mananaysiempre wrote:
             | That's what static _could_ do (if the function's address is
             | not taken, or given sufficiently powerful dataflow
             | analysis), but C and C++ compilers don't take advantage of
             | that. Getting that out of -fvisibility=hidden -flto would
             | also be possible, but requires even more nonexistent
              | compiler smarts. (From a quick web search, I can't figure
             | out what internal visibility brings over hidden.)
             | 
             | (Granted, it's not like this is completely impossible--I
             | seem to remember GHC and MLton can invent custom calling
             | conventions for Haskell and SML respectively. But the
             | popular C or C++ compilers can't.)
        
               | kccqzy wrote:
               | Really? I've always assumed that static gives C/C++
               | compilers the right to disregard calling conventions
               | altogether. Now that I think about it, this assumption
               | might be unjustified. Was there any strong reason for
               | compilers to keep the calling convention?
        
               | mananaysiempre wrote:
               | I mean, it does give them that right as far as I know
               | [with, again, the caveat of function pointers--you can't
               | pass a non-ABI-compliant function to qsort(), however
               | static it is--and an additional one of inline assembly],
               | it's just that they choose not to exercise it. Why? Not
               | sure. Maybe it's too difficult (register allocation and
               | spilling is NP-complete and fairly expensive). Maybe it's
               | too difficult when your compiler starts out as a
               | function-at-a-time one and builds on that. The popular
               | ones don't do that, is my point. (Would be interesting to
               | check what SDCC does, by the way, on register-starved
               | micros such tricks could make sense.)
        
               | zzo38computer wrote:
               | I think that "static register" should make the compiler
               | allowed to disregard calling conventions (if
               | optimizations are enabled), but neither "static" nor
               | "register" alone should do so. (In a discussion on an IRC
               | some time ago, I had suggested what "register" alone
               | should do in my opinion, which is something else than
               | this.)
        
       | PaulRobinson wrote:
       | Perhaps I've just been looking more (because I've been going back
       | to the books to pick up C again after a 20 year absence for a
       | side project), but it feels like in recent years there has been a
       | little uptick in people appreciating the level of control C
       | provides, and "going back to basics".
       | 
       | I think the work on C23 (which would have still been C2x when
       | this article was written), means people are starting to want to
       | see innovation in the language again. I don't think that would
       | have happened without Go and Rust showing some great thinking,
       | but it's interesting that it's happening at all.
       | 
       | Tail calls are an interesting idea, and now I want to know more
       | about both this extension, and also how to write my code in such
       | a way as to help a compiler optimize for tail calls even when
       | this extension isn't present. This somehow feels more fun to me
       | than writing more Python or Ruby...
        
         | mrweasel wrote:
         | > it feels like in recent years there has been a little uptick
         | in people appreciating the level of control C provides, and
         | "going back to basics".
         | 
         | I don't write C, and maybe it's because I somehow seek out
          | these types of articles and projects, but I'd agree, I'm seeing
         | the same thing. It might be a backlash against programming
         | languages like Rust or even JavaScript. Rust being really
         | complicated, but clearly safer, and JavaScript... well it's
         | just really hard to set up a development environment and the
         | tooling is almost more predominant than the language itself.
         | 
          | While I don't write C myself, except for fun, I can see many just
         | reverting to it, because it's "right there", it's fast, the
         | language is fairly simple, even if the standard library is
          | anything but simple, but you can take the stuff you need and
         | ignore the rest.
        
           | tazu wrote:
           | I've personally been very motivated to learn C (coming from
           | Go) by witnessing @jart's[1] progress on Cosmopolitan[2] and
           | Readbean[3]. Writing C is almost like an exercise in
           | sovereignty: ultimate freedom (and danger).
           | 
           | [1]: https://justine.lol/
           | 
           | [2]: https://github.com/jart/cosmopolitan
           | 
           | [3]: https://news.ycombinator.com/item?id=26271117
        
             | queuebert wrote:
             | jart is the best thing to happen to C since K&R.
        
             | jerf wrote:
             | I think it's fine to go back to C and maybe play around a
             | bit to learn about some of the things that can be done, but
             | I would implore you to bear in mind that the decades have
             | taught us that the "ultimate danger" in question is
             | basically that you're building sand castles in a minefield.
             | We're not talking "oh ha ha I guess I wrote a few more bugs
             | that a stronger type system would have caught", we're
             | talking "oh ha ha I guess remote unauthenticated attackers
             | can run arbitrary assembler as root in my network code
             | because I tripped over one of C's nominally well-known
             | mines that I did not personally know about and all the
             | attackers had to do was slightly tweak one of the exploit
             | scripts already in metasploit to install root kits on my
             | system and add it to their bot net".
             | 
             | The world has gotten a lot more dangerous than people
             | realize. People generally quite correctly assume that
             | hackers aren't going to spend person-months attacking their
             | system personally but don't realize that the attacker tools
             | are quite sophisticated now and _they don 't have to_.
             | Shoving code into a small little buffer overflow to lift to
             | a larger buffer overflow to load exploit code over the
             | network that will run a pre-packaged root priv escalation
             | to install a rootkit to add them to their botnet is no
             | longer the work of a team of hackers for months. It's all
             | off-the-shelf tech now and it's more like writing a 20 line
             | function now; you don't need to attract very much attention
             | now to attract that level of sophistication.
             | 
             | We are leaving C behind collectively for very good reasons.
             | If you are playing with C and you do not intimately
             | understand those reasons, you're going to relearn them the
             | hard way.
        
               | uecker wrote:
               | I do not think we are close to "leaving C behind
               | collectively" and neither should we.
        
               | lanstin wrote:
                | I wrote tens of thousands of lines of C code in the 90s
                | and early 00s (without buffer overflows that I learned
                | about; I did also write an evil network layer for
                | inducing buffer overflows in my and my dependencies'
                | code), and have been doing a lot of other languages
                | since then, and then had occasion to write some more C
                | where I was doing string allocation and manipulation
                | (for an LD_PRELOAD to monitor what various programs I
                | lacked source to were doing), and it was absolutely
                | nerve-wracking. The Linux kernel might be mostly C for a
                | long time, but it would be crazy to start a new thing in
                | C. There are growing rewrite projects from C to Rust. It
                | would be farther along except the Rust people seem to
                | dislike laboriously recreating decades of GNU long-opt
                | functionality in all these base packages to actually
                | make Rust drop-in replacements for C.
               | 
               | Maybe for embedded, I haven't done that, but for general
               | purpose, I can't imagine it being worth the risks.
        
               | jerf wrote:
               | It is definitely on its way down, and it is a good thing.
               | 
               | You may think you can handle C. I disagree. The evidence
               | is on my side. We need safer languages.
               | 
               | Heck, that even overstates it. It's not like we need
               | super-safe languages per se. Maybe the 2070s will
               | disagree with me. But we need languages that aren't
               | _grotesquely unsafe_. It 's not just that C isn't as safe
               | as a language could be; it is that it is recklessly
               | unsafe.
               | 
                | Interrupting that descent is foolishness, and pinning
                | your career to a resurrection in C is even more foolish.
               | These technologies always have this meta-meta-contrarian
               | phase to them just before they expire. I got into the
               | computer field just as the meta-meta-contrarian
               | "everything must be written in assembler" phase was at
               | its peak. It was a false resurrection and I pity anyone
               | who overinvested in learning how to write large
               | applications in 100% assembler as a result of reading
               | someone's over-enthusiastic screed about the virtues of
               | writing pure assembler in 1998.
               | 
               | So I write this in the hopes of helping some other young
               | HN reader not be fooled. C may be a thing you learn at
               | some point, some day, just as assembler is still
               | something you may learn. But don't get overexcited about
               | the last meta-meta-contrarian false resurrection before
               | death. Which is still years away, but at this point I
               | think it's pretty much set on an irrevocable course.
        
         | queuebert wrote:
         | C is great for its basics, but at some level, when you gain
         | awareness of all of your unpunished pointer lifetime sins and
         | how annoying the lack of basic data structures is in stdlib,
         | then you realize that biting the bullet and learning something
         | like Rust is the way to go. Maybe a better stdlib would help,
         | but I've found that the pain of becoming proficient in Rust has
         | paid back multiples over the continual toil of doing certain
         | repetitive tasks in C.
        
           | uecker wrote:
           | It is not terribly hard to find some library for data
           | structures in C. What I do not like about Rust is: syntax,
           | complexity, long compile times, cargo, ...
        
         | coldpie wrote:
         | C is great and I definitely recommend learning it & becoming
         | proficient in it. I don't think I would recommend it for a new
         | project, for all the reasons jerf mentions in a sibling
         | comment[1]. But there's a ton of existing code out there, and I
         | think it provides a nice little peek into what the hardware is
         | actually doing that you don't get from higher level languages.
         | 
         | > I think the work on C23 means people are starting to want to
         | see innovation in the language again.
         | 
         | I have to admit this makes me nervous. Part of what's nice
         | about C is its simplicity and its stability. It's relatively
         | easy to write a compiler, and the lack of having many different
         | ways to do the same thing means all code written in C tends to
         | "feel" roughly the same (outside some domain-specific areas),
         | which means it's easy to dive into a new codebase and get up to
         | speed. It's basically all structs & functions and abstractions
         | built upon them. That's it.
         | 
         | While I definitely am not opposed to making improvements to the
         | C spec, seeing the inline anonymous functions in a proposal
         | linked elsewhere in this thread[2] really turns me off. It
         | starts to feel like the hideous mess that C++ has turned into,
         | with oodles of implicit behavior, un-greppable code, generated
         | symbol names leading to compiler error hell, and a thousand
         | different colors for your bike shed.
         | 
         | We already have languages that can do all this stuff. I welcome
         | making changes to C to make it nicer to program in C, but I'm
         | wary of proposals that turn C into yet another messy kitchen
         | sink programming language. Let's keep it clean, obvious, and
         | simple, please.
         | 
         | [1] https://news.ycombinator.com/item?id=41291206
         | 
         | [2] https://www.open-
         | std.org/jtc1/sc22/wg14/www/docs/n3266.htm#u...
        
       | sp1rit wrote:
       | I wonder how this behaves in combination with
        | __attribute__((cleanup(...))). Especially if the to-be-cleaned
        | variable is passed into the tail-called function as a parameter.
        
         | Karliss wrote:
          | You get an error that the tail call can't be performed, which
          | is kind of the point of the tailcall attribute. Same thing
          | with regular C++ destructors or a dozen other language features
          | which interfere with tail calls, no need for extensions like
          | __attribute__((cleanup())).
         | 
         | https://godbolt.org/z/Yex74Wjz7
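          | 
          | A tiny repro of that conflict (sketch; this is expected to be
          | rejected, because the cleanup for buf has to run after the
          | call returns):
          | 
          |     #include <stdlib.h>
          | 
          |     static void freep(void **p) { free(*p); }
          | 
          |     int next(int x);
          | 
          |     int step(int x) {
          |         __attribute__((cleanup(freep))) void *buf = malloc(16);
          |         __attribute__((musttail)) return next(x);  /* error */
          |     }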
        
       | fuhsnn wrote:
        | A C standard proposal exists for tail calls [0], in the form of
       | "return goto (expression);".
       | 
       | What I like about it, over standardizing [[musttail]], is that
       | lifetimes of local objects are guaranteed to end. This makes it
       | possible to implement without extensive escape analysis.
       | 
       | [0] https://www.open-
       | std.org/jtc1/sc22/wg14/www/docs/n3266.htm#5...
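        | 
        | For illustration only (this is the proposed syntax from the
        | paper, not something current compilers accept):
        | 
        |     unsigned long fact(unsigned long n, unsigned long acc) {
        |         if (n <= 1)
        |             return acc;
        |         /* locals' lifetimes end before the call */
        |         return goto fact(n - 1, acc * n);
        |     }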
        
         | nxobject wrote:
         | As an aside, I'm excited - but also with lots of trepidation -
         | about the new energy in adding new functionality to C. Excited
         | because there are changes and additions that are crying to be
         | made, even clarifying ideas... trepidation because C++'s gung
         | ho update cadence has somehow ended up in wart patched over
          | wart. Especially when feature reacts with feature in an
          | infelicitous way far earlier than anticipated. I hope the
          | standards process finds a way to be very conservative and
          | really thoroughly test features in large and diverse codebases,
          | rather than just relying on rationales alone, when choosing to
          | include a feature.
        
           | pjmlp wrote:
           | What new energy? We have had C89, C99, C11, C17, C23
           | 
           | And yet, no fix in sight for proper strings, arrays, or at
           | very least bounds checked slices, like Dennis Ritchie
           | originally proposed to ANSI/ISO.
        
         | haberman wrote:
         | Thanks for the pointer, I'll have to check this out.
         | 
         | Can you elaborate on how "return goto" is easier to implement?
         | [[musttail]] also ends the lifetime of local objects AFAICS.
         | 
         | One thing I'll mention from a quick scan:
         | 
         | > [The] function called in tail position must have identical
          | type to the caller. This ensures both that the return value
         | does not require any conversion, and also that argument passing
         | space is available and calling convention (if relevant) is
         | maintained.
         | 
         | One complaint I've seen repeatedly about [[musttail]] (which I
         | implemented in Clang) is that this constraint is unnecessarily
         | strict, since some architectures will allow tail calls even for
         | functions that do not perfectly match:
         | https://github.com/llvm/llvm-project/issues/54964
         | 
         | "But then the code would be nonportable." True, but tail call
         | optimization is inherently nonportable, since some targets
         | fundamentally do not support tail call optimization (eg. WASM
         | without the tail call extension).
        
           | tines wrote:
           | > some targets fundamentally do not support tail call
           | optimization
           | 
           | Can't any tail call be rewritten as a loop? Couldn't a WASM
           | compiler without the tail call extension implement it this
           | way?
        
             | metadat wrote:
             | Only with the aforementioned escape analysis. The function
             | call stack frames serve a critical purpose in most
             | nontrivial logic.
        
             | tonyg wrote:
             | > Can't any tail call be rewritten as a loop?
             | 
             | No. In general tail calls cannot be rewritten into loops.
        
               | cobbal wrote:
               | More specifically, tail _recursion_ is usually easy to
               | turn into a loop. Tail calls can be difficult to turn
               | into loops when they call a different function, or worse
               | a function passed in as a variable.
        
             | slaymaker1907 wrote:
              | It can't be rewritten as a loop due to function pointers.
              | Using JS notation to avoid noise:
              | 
              |     function logAndCall(statement, func) {
              |         console.log(statement);
              |         return func();
              |     }
             | 
             | Tail call optimization is actually possible here since we
             | call func in "tail position". It might be unlikely to blow
             | up the stack, but it can definitely happen if you do a lot
             | of continuation passing.
             | 
             | Perhaps more commonly for C++/Rust, tail call optimization
             | would be enormously valuable to have behind the scenes for
             | destructors. It's actually very difficult to implement a
             | linked list in safe Rust that doesn't explode the stack for
             | large lists since no TCO is done for destroying subobjects.
             | You can still avoid stack overflow, but you have to do
             | things like manually enumerating the list.
        
               | derefr wrote:
               | I don't get the problem. The generated code for your
               | example would return the same function pointer +
               | evaluated args list to the wrapping trampoline (i.e. the
               | code that contains the loop), where that loop expects
               | each thing it invokes to return a sum type
               | NextInvoke(func, args) | Return(value).
               | 
               | And that's not a special case for passed-in function
               | pointers; that's what a trampoline always does.
               | Trampolines expect static-invoked tail-calls to be
               | codegen'ed into function-pointer references too.
        
             | keithwinstein wrote:
             | Yes, wasm2c implements the Wasm tail-call feature with
             | trampolines, exactly this way. (https://github.com/WebAssem
             | bly/wabt/blob/main/test/wasm2c/ta... has an example.)
             | 
             | Doing it with a trampoline is probably slower than if C
             | really had tail calls. On the other hand, adding "real"
             | tail calls to C would probably require changing the ABI
             | (e.g. to "tailcc" or "fastcc -tailcallopt"), and I think
             | there's some reason to think this would probably impose a
             | penalty everywhere
             | (https://llvm.org/docs/CodeGenerator.html#tail-call-
             | optimizat...).
        
               | haberman wrote:
               | > On the other hand, adding "real" tail calls to C would
               | probably require changing the ABI (e.g. to "tailcc" or
               | "fastcc -tailcallopt")
               | 
               | But [[musttail]] does exactly this while respecting
               | existing calling conventions: https://clang.llvm.org/docs
               | /AttributeReference.html#musttail
        
               | keithwinstein wrote:
               | No -- as discussed upthread, clang's musttail attribute
               | requires the target function to have the same number of
               | arguments as the caller and for each argument to be
               | similar to the corresponding caller argument. That's
               | stricter than the underlying LLVM musttail marker (when
               | targeting the tailcc/swifttailcc calling conventions) and
               | is too restrictive to implement Wasm's tail-call feature
               | (and probably Scheme's, etc.), at least if arguments are
               | getting passed to functions natively.
               | 
               | It would be nice if the more relaxed rules of the LLVM
               | musttail marker with tailcc could be exposed in clang
               | (and gcc). I think that's basically what "return goto"
               | would do.
        
           | fuhsnn wrote:
           | >[[musttail]] also ends the lifetime of local objects AFAICS.
           | 
           | That's good to know. I had this github issue [0] in the back
           | of my mind, as well as witnessing occasions of clang turning
           | [[musttail]] into inner loops, and concluded clang's
           | implementation must be more sophisticated than simply
           | replacing calls with jumps. Just a little paranoia from
           | trying to be serious with compiler dev[1]: fulfilling a laid-
           | out spec feels more sound versus imitating something out
           | there.
           | 
           | >this constraint is unnecessarily strict
           | 
            | I would agree; at least for the x86 psABI, it can be pretty
            | flexible as long as the return value is in the same register
            | and the argument stack doesn't exceed what's provided.
           | Debug/profiling side might hate it, though.
           | 
           | [0] https://github.com/llvm/llvm-project/issues/72555 [1]
           | https://github.com/fuhsnn/slimcc/
        
             | haberman wrote:
             | I certainly understand your caution. I don't have intimate
             | expertise with the implementation of musttail in the
             | backend -- when I implemented musttail in Clang, it was
             | piggybacking on an existing attribute from the LLVM IR:
             | https://llvm.org/docs/LangRef.html#call-instruction
             | 
             | That said, my rough understanding is that a tail call ends
             | the lifetime of all objects in the old stack frame. It
             | follows that it is UB to access any objects from the
             | previous stack frame after a tail call, and that would
             | include Gerben's first example in
             | https://github.com/llvm/llvm-project/issues/72555
             | 
             | Your slimcc project looks really interesting, thanks for
             | the pointer.
        
       | nu11ptr wrote:
       | If added to Clang then I would assume this means it got support
       | from LLVM. If so, does this mean that Rust can now rely on tail
       | calls in LLVM to support something like the `become` keyword?
        
         | ori_b wrote:
         | The problem with doing it in rust is that most calls aren't
         | tail calls, even if they look like tail calls. You need to
         | invoke the destructors for any code path that can drop.
        
           | nu11ptr wrote:
           | Isn't that the purpose of `become`? I thought it was to say
           | "this IS a tail call, error out if it is not". After that
           | validation is done, then the compiler can drop as needed.
        
       | iTokio wrote:
        | The remaining issue with tail calls used to switch context is
       | that you're using functions that must use a calling convention.
       | And unfortunately they waste registers to restore state on
       | function exit.
       | 
        | See the LuaJIT Remake blog for an exhaustive analysis and an
        | alternative using an intermediate compiler:
       | https://sillycross.github.io/2022/11/22/2022-11-22/
        
         | superdimwit wrote:
         | Clang recently got a new calling convention that makes these
         | tail calls much cheaper (avoids the need for the caller to
         | preserve some registers). I can never remember the name - it's
         | either preserve_all or preserve_none (whose perspective is the
         | preservation from?).
        
           | haberman wrote:
           | preserve_none is the new one. It can be applied to the
           | functions performing tail calls to allow them use of the full
           | register space.
           | 
           | I even saw an enhancement recently that will make
           | preserve_none allocate arguments in the registers that are
           | usually callee-saved: https://github.com/llvm/llvm-
           | project/pull/88333
           | 
           | This will make [[musttail]] + preserve_none a winning
           | combination when used together, particularly when making non-
           | tail calls to fallback functions that use a regular calling
           | convention, because all the arguments to [[musttail]]
           | functions can stay pinned in the callee-save registers.
           | 
           | I'm delighted, because this matches what I originally
           | proposed back in 2021, except I called it "reverse_ccc"
           | instead of "preserve_none": https://github.com/haberman/llvm-
           | project/commit/e8d9c75bb35c...
           | 
           | preserve_all also exists, and has existed for a while. You
           | could use it on fallback functions to help the tail calling
           | functions avoid spilling. But this always seemed like an
           | unsatisfactory solution to me, because it's too intrusive
           | (and requires too much diligence) to tag a bunch of regular
           | functions as preserve_all. It's much more practical to tag
           | all the core interpreter functions as preserve_none.
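            | 
            | Roughly, the combination looks like this (sketch with made-
            | up names; needs a recent clang, and the helpers are assumed
            | to exist elsewhere):
            | 
            |     #include <stdint.h>
            | 
            |     #define PN __attribute__((preserve_none))
            | 
            |     struct ctx { uint64_t val; };
            | 
            |     PN const char *dispatch(struct ctx *c, const char *p);
            |     const char *slow_path(struct ctx *c, const char *p);
            | 
            |     /* hot handler: keeps its state in registers */
            |     PN const char *op_small_varint(struct ctx *c,
            |                                    const char *p) {
            |         if ((uint8_t)*p & 0x80)
            |             return slow_path(c, p);  /* ordinary call */
            |         c->val = (uint8_t)*p;
            |         __attribute__((musttail)) return dispatch(c, p + 1);
            |     }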
        
             | superdimwit wrote:
             | Thanks!
        
             | jeffbee wrote:
             | I see signs that Google itself is using preserve_none
             | internally, since the public protobuf repo has PROTOBUF_CC
             | (Calling Convention) but it is defined as nothing
             | #define PROTOBUF_CC
             | 
             | Is there any chance of this getting out into the wild or is
             | it too dangerous for us mortals?
        
               | haberman wrote:
               | Since preserve_none is Clang-only and only available on
               | recent compilers, it would introduce an ABI hazard
               | between the core library and generated code. We don't
               | want to cause crashes if you compile protobuf with one
               | compiler and generated code with another.
               | 
               | Also, until quite recently preserve_none was incompatible
               | with the sanitizers, but I believe this may have been
               | fixed by: https://github.com/llvm/llvm-project/pull/81048
        
         | hinkley wrote:
         | I've seen a few languages over the years drop and reacquire JIT
         | layers. Some of it has to do with programmer skill and lessons
         | learned, but some is also down to CPU generation.
         | 
         | Like everything else in CS, when the cost balance shifts for
         | different kinds of operations, the best algorithm can shift
         | back to something we haven't used in fifteen, twenty years. It
         | contributes a lot to the faddishness of programming. Just
         | because we are bringing something back doesn't mean there's no
         | reason. But we forget the reasons it wasn't a panacea last time
         | so that's still a problem.
         | 
         | If your main JIT gets faster or slower, then the cost-benefit
         | for running it changes, so the threshold to trigger it gets
         | adjusted, and now the amount of code that runs in the other
         | tiers shifts, which might make the amortized cost of that tier
         | worse. It's like balancing a double pendulum.
         | 
         | If you can make a JIT tier fast and dirty enough, you can skip
         | the interpreter entirely. And, from my armchair position, it
         | seems that the cognitive load of bookkeeping tasks between the
         | interpreter and say two JITs is high enough that a few
         | languages have mothballed the interpreter and used a JIT
         | optimized for compile time not output speed.
         | 
         | And I don't recall what language, but I'm pretty sure at least
         | one team that did this ended up dropping an intermediate
         | compiler as well, because of that balancing act I mentioned
         | above. It was better to double down on two than to try to
         | handle three.
        
       | beyondCritics wrote:
       | With gcc [1] and clang [2] you always had the option "-foptimize-
       | sibling-calls", to get away with tail calls even for debug
       | builds. Of course having this standardized, guaranteed and at the
       | function level is a huge improvement.
       | 
       | [1] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html [2]
       | https://clang.llvm.org/docs/ClangCommandLineReference.html#t...
        
       | chmaynard wrote:
       | See also:
       | 
       | https://news.ycombinator.com/item?id=26931581
        
       | crabbone wrote:
        | I wrote a C Protobuf decoder / encoder as well as an IML parser
        | and bindings for Python. Here's something I have to say about
       | measuring parsing speed.
       | 
        | * Since this library is only offered as a binding to some managed
        | languages, there are some extra caveats that, in terms of
        | performance, overshadow everything else. I cannot speak to Ruby or
        | PHP, but with Python I saw a dramatic speed increase when _not_
        | using enumerators. I.e. if you translate Protobuf's enumerators
        | into Python's enumerators, any possible gains you may have in your
        | C code will be trampled by the creation times of different Python
        | objects. The difference is many orders of magnitude. Similarly,
       | you could go even further and implement all the supporting data-
       | structures in C, and only expose minimal interface to them in
       | Python. How fair would this sort of comparison be against code
       | that uses Python's built-in structures is the question I cannot
       | answer.
       | 
        | * Google's Protobuf parser for Python would still be "faster"
        | than 2+GB/s because... it doesn't parse anything besides the top-
        | level message. The internal structure of the message is parsed
        | on demand. This will probably be slower than 2+GB/s if your code
       | instantly reads everything that was parsed, but the obvious
       | question would be: how can you possibly compare these two
       | approaches in practical terms? And, again, there isn't a clear
       | answer, because the practical results will depend on the nature
       | of your application.
       | 
        | * In the general case, Protobuf parsing cannot be streamed
        | (because of the handling of duplicates). This means that, in
        | practice, code that parses Protobuf content will be bottlenecked
        | on I/O (it needs to wait for the end of the message before any
        | parsing starts). Independently, depending on the typical Protobuf
        | message
       | in your application, it might be possible to parallelize parsing,
       | which, again, will most likely outrun any single-threaded parser,
       | but, just as examples above, cannot be said to be a winning
       | strategy in general.
       | 
       | * It's usually _a lot_ more efficient to combine parsing with
       | creation of domain objects, which is a step an application will
       | almost always have to take anyways. How this functionality is
       | accessible from the parser will in many cases determine which
       | parser will win the race.
       | 
       | ----
       | 
        | Bottom line: Protobuf (or maybe parsing in general) is just a bad
        | candidate for measuring / comparing speeds. It's too low-level
       | and poorly designed to be a good measuring stick for performance
       | benchmarks.
        
         | jeffbee wrote:
         | > In general case, Protobuf parsing cannot be streamed (because
         | of the handling of duplicates)
         | 
          | I don't see how last-field-wins stops you from a streaming parse.
         | Please elaborate.
        
           | crabbone wrote:
            | Streaming in this context means that the parsing code hands
            | off some parsed structures to the code outside of the parser
            | before the entire message is processed. Suppose now you have
            | this message:
            | 
            |     { x: 1, y: 2, x: 3 }
            | 
            | The streaming parser then reads field "x" with the value of
            | "1", dispatches that value to the outside code, then reads
            | "y", then reads "x" again, but the outside code had a side-
            | effect associated with that field, which it already performed
            | (eg. the outside code was printing the value of the field).
            | Now, you have a program with an error. The right output
            | should've been:
            | 
            |     y: 2
            |     x: 3
            | 
            | but you got:
            | 
            |     x: 1
            |     y: 2
            |     x: 3
            | 
            | Might not be so bad, depending on circumstances, but if you
            | are betting on a soccer game outcome...
        
             | jeffbee wrote:
             | I see. But if this ambiguous repetition must be resolved,
             | then it must be resolved either at input or output time.
             | Protobuf seems to have optimized for the output case by
             | allowing for updates to scalar fields by appending.
        
             | 10000truths wrote:
              | You could easily design a stream parser that rejects
              | duplicates before it passes them off to the user by
              | maintaining a set of already-encountered keys within the
              | parser state. The space overhead isn't a concern unless
              | your map/set has millions of entries, but your non-
              | streaming parser would have choked on the memory usage
              | long before then anyway.
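              | 
              | A minimal sketch of that idea (hypothetical names; a real
              | parser would need a hash set, since protobuf field numbers
              | can be large):
              | 
              |     #include <stdbool.h>
              |     #include <stdint.h>
              | 
              |     /* Tracks which field numbers have already been emitted
              |        for the current message. For illustration, only
              |        fields 0..63 are handled here. */
              |     typedef struct {
              |         uint64_t seen;
              |     } dup_filter;
              | 
              |     /* Returns true if the field should be handed to the
              |        user, false if it is a duplicate. */
              |     static bool emit_once(dup_filter *f, uint32_t field) {
              |         uint64_t bit = UINT64_C(1) << field;
              |         if (f->seen & bit) return false;  /* duplicate */
              |         f->seen |= bit;
              |         return true;
              |     }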
        
       | pizlonator wrote:
       | The way interpreters usually achieve this kind of speedup in C++
       | is using computed goto. Then there's no calling convention junk
       | on the path from one opcode to the next.
       | 
       | Also, the main reason why interpreters get faster using either
       | the computed goto style or the tail style versus a classic switch
       | loop is that it reduces pressure on the branch predictor since
       | there's statically one indirect branch per opcode rather than
       | statically just one indirect branch.
        
         | tylerhou wrote:
          | (As the article claims) even with computed goto, register
          | assignment of the most frequently used variables is fragile
          | because the CFG of the function is so complicated.
          | 
          | Register assignment is much less fragile when each function
          | is small and takes the most important variables as arguments.
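          | 
          | Roughly that shape, as a simplified sketch (illustrative
          | names, not the article's actual code; musttail is the clang
          | attribute the article relies on):
          | 
          |     #include <stdint.h>
          | 
          |     struct inst;
          | 
          |     /* Hot interpreter state travels in argument registers
          |        from one op to the next instead of living in a frame. */
          |     typedef void (*op_fn)(const struct inst *ip, uint64_t *sp,
          |                           uint64_t acc);
          | 
          |     struct inst {
          |         op_fn   handler;   /* handler for this instruction */
          |         int64_t operand;
          |     };
          | 
          |     /* Every handler ends by tail-calling the next handler. */
          |     #define DISPATCH(ip, sp, acc) \
          |         __attribute__((musttail)) \
          |         return (ip)->handler(ip, sp, acc)
          | 
          |     static void op_push(const struct inst *ip, uint64_t *sp,
          |                         uint64_t acc) {
          |         *sp++ = acc;                  /* save accumulator */
          |         acc = (uint64_t)ip->operand;  /* load the operand */
          |         DISPATCH(ip + 1, sp, acc);
          |     }
          | 
          |     static void op_add(const struct inst *ip, uint64_t *sp,
          |                        uint64_t acc) {
          |         acc += *--sp;                 /* pop and add */
          |         DISPATCH(ip + 1, sp, acc);
          |     }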
        
           | pizlonator wrote:
           | It's also fragile, in a different way, if you're threading
           | state through tail calls.
           | 
           | In my experience writing computed goto interpreters, this
           | isn't a problem unless you have more state than what can be
           | stored in registers. But then you'd also have that same
           | problem with tail calls.
        
             | haberman wrote:
             | Fallback paths most definitely have more state than what
             | can be stored in registers. Fallback paths will do things
             | like allocate memory, initialize new objects, perform
             | complicated fallback logic, etc. These fallback paths will
             | inevitably spill the core interpreter state.
             | 
             | The goal is for fast paths to avoid spilling core
             | interpreter state. But the compiler empirically has a
             | difficult time doing this when the CFG is too connected. If
             | you give the compiler an op at a time, each in its own
             | function, it generally does much better.
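              | 
              | As a hedged sketch of that split (made-up names, not the
              | actual upb code): the fast path keeps everything in
              | argument registers and only tail-calls an out-of-line
              | fallback when it has to, so only the fallback ever spills.
              | 
              |     #include <stdint.h>
              | 
              |     struct inst;
              |     typedef void (*op_fn)(const struct inst *ip,
              |                           const char *ptr, const char *end,
              |                           uint64_t acc);
              |     struct inst { op_fn handler; };
              | 
              |     /* Out-of-line slow path: free to allocate, refill
              |        buffers, and spill whatever it needs. */
              |     static void field_slow(const struct inst *ip,
              |                            const char *ptr, const char *end,
              |                            uint64_t acc);
              | 
              |     /* Fast path: stays in registers, no frame needed. */
              |     static void field_fast(const struct inst *ip,
              |                            const char *ptr, const char *end,
              |                            uint64_t acc) {
              |         if (end - ptr < 16) {
              |             /* Rare case: hand off via a tail call. */
              |             __attribute__((musttail))
              |             return field_slow(ip, ptr, end, acc);
              |         }
              |         acc += (uint64_t)(unsigned char)*ptr;
              |         __attribute__((musttail))
              |         return (ip + 1)->handler(ip + 1, ptr + 1, end, acc);
              |     }
              | 
              |     static void field_slow(const struct inst *ip,
              |                            const char *ptr, const char *end,
              |                            uint64_t acc) {
              |         /* ... the hard cases live here, then dispatch
              |            continues the same way as the fast path ... */
              |         (void)ip; (void)ptr; (void)end; (void)acc;
              |     }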
        
               | pizlonator wrote:
               | I get that and that's also been my experience, just not
               | for interpreters.
               | 
               | In interpreters, my experience is that fallback paths are
               | well behaved if you just make them noinline and then
               | ensure that the amount of interpreter state is small
               | enough to fit in callee save regs.
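                | 
                | Roughly like this (an illustrative sketch; the dispatch
                | around it is elided):
                | 
                |     #include <stdint.h>
                | 
                |     /* Kept out of line so its spills and register
                |        pressure don't leak into the hot handler. */
                |     __attribute__((noinline))
                |     static uint64_t add_slow(uint64_t a, uint64_t b) {
                |         /* ... overflow handling, boxing, allocation ... */
                |         return a + b;
                |     }
                | 
                |     static uint64_t add_op(uint64_t acc, uint64_t rhs) {
                |         uint64_t out;
                |         if (__builtin_add_overflow(acc, rhs, &out))
                |             return add_slow(acc, rhs);  /* rare path */
                |         return out;                     /* hot path */
                |     }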
        
               | haberman wrote:
               | Mike Pall makes an argument that interpreters are
               | especially susceptible to this problem, and I find it
                | convincing, since it matches my experience:
                | https://web.archive.org/web/20180331141631/http://article.gm...
        
               | pizlonator wrote:
               | There are a bunch of arguments in there that don't match
               | my experience, which includes the JSC interpreter. JSC
               | had an interpreter written in C++ and one written in
               | assembly, and the main reason for using the assembly one
               | was not raw perf - it was so the interpreter knows the
               | JIT's ABI for fast JIT<->interpreter calls.
               | 
               | Mike's argument about control flow diamonds being bad for
               | optimization is especially questionable. It's only bad if
               | one branch of the diamond uses a lot more registers than
               | the other, which as I said, can be fixed by using
               | noinline.
        
           | ufo wrote:
            | Exactly. Computed goto helps with branch prediction, but it
            | does not help with register allocation and other compiler
            | optimizations.
        
             | pizlonator wrote:
             | As I mentioned in another part of the thread - the way you
             | get that under control in a computed goto interpreter (or a
             | switch loop interpreter) is careful use of noinline.
             | 
             | Also, it probably depends a lot on what you're
             | interpreting. I've written, and been tasked with
             | maintaining, computed goto interpreters for quite a few
             | systems and the top problem was always the branches and
             | never the register pressure. My guess is it's because all
             | of those systems had good noinline discipline, but it could
             | also just be how things fell out for other reasons.
        
         | highfrequency wrote:
         | > there's statically one indirect branch per opcode rather than
         | statically just one indirect branch.
         | 
         | Could you elaborate on this with a couple more paragraphs? What
         | do you mean by one indirect branch per opcode rather than just
         | one indirect branch? How is this achieved?
        
           | pizlonator wrote:
           | Sure.
           | 
            | Say you write an interpreter like this:
            | 
            |     for (;;) {
            |         switch (curInst->opcode) {
            |         case Inst::Foo:
            |             setOperand(curInst->dest,
            |                        foo(getOperand(curInst->a),
            |                            getOperand(curInst->b)));
            |             curInst += FooSize;
            |             break;
            |         case Inst::Bar:
            |             setOperand(curInst->dest,
            |                        bar(getOperand(curInst->a),
            |                            getOperand(curInst->b)));
            |             curInst += BarSize;
            |             break;
            |         ... more stuff like this
            |         }
            |     }
           | 
           | Then the generated code will have an indirect branch for the
           | switch. This is costly for two reasons:
           | 
           | - Gotta load curInst->opcode, then lookup opcode in a
           | compiler-generated jump table to get the code address of the
           | case statement.
           | 
           | - Gotta indirect jump.
           | 
            | It turns out that the top cost (or even all of the cost) is
            | the indirect jump. The table lookup may also have some
            | cost, at least on some CPUs. But remember, all jumps on
            | modern CPUs are fast if predicted - regardless of the work
            | seemingly required to do the jump. And they are slow if
            | they are not predicted - and that slowness is about much
            | more than the work required to do the jump (the cost is the
            | CPU needing to roll back its speculation).
           | 
           | Reason: the indirect jump prediction the CPU is doing is
           | keyed by the address of the indirect jump instruction. There
           | is one indirect jump in the code above. And for any real
           | program you'll execute, that indirect jump will end up
           | hitting all (or at least most) of the opcodes. Therefore, the
            | CPU has no easy path to predicting the indirect jump. Maybe
            | it'll sometimes get lucky with more sophisticated predictors
            | (like ones that track history and use it to salt the key
            | used to look up the predicted destinations, or ones that do
            | fancier things, like some neural-network predictor).
           | 
           | How do you make the indirect jump faster? Have one indirect
           | jump per instruction! Both the computed goto approach and the
           | tail call approach get you there.
           | 
            | Consider the computed goto version of the interpreter:
            | 
            |     Inst_foo_label:
            |         setOperand(curInst->dest,
            |                    foo(getOperand(curInst->a),
            |                        getOperand(curInst->b)));
            |         curInst += FooSize;
            |         goto *curInst->handler;
            |     Inst_bar_label:
            |         setOperand(curInst->dest,
            |                    bar(getOperand(curInst->a),
            |                        getOperand(curInst->b)));
            |         curInst += BarSize;
            |         goto *curInst->handler;
            |     ... more stuff like this
           | 
           | In this formulation of the interpreter, there is one indirect
           | jump per instruction, rather than just one for the
           | interpreter. Therefore, we're asking the CPU's predictor to
           | do something simpler: to predict, for each instruction, what
           | the next instruction will be. And then on top of that, the
           | predictor still gets to use its neural network, or history
           | buffer, or whatever. In terms of mathematical probability
           | theory, it's like we're giving the CPU a first-order Markov
           | chain, and that's sure to improve prediction accuracy.
           | Empirically, it improves it a lot and it's a big speed-up.
           | Here's yet another way to think of it. If I asked you to
           | predict, "what's the next instruction", without any knowledge
           | of what the prior one was, then you'd have a hard time -
           | you'd only be able to reason about which instructions are
           | most likely generally. But this computed goto interpreter is
           | instead asking: "if I tell you the instruction I just
           | executed, what's the next instruction", which gives you more
           | leverage. Maybe adds are often followed by moves, for
           | example.
           | 
           | The tail call style also achieves this, because each
           | instruction's handler will have an indirect tail call
           | (literally an indirect jump) to the next instruction, which
           | again, gives the CPU that Markov chain goodness. So let's
           | call both the computed goto style and the tail call style the
           | "Markov style". If you could rewrite the switch statement so
           | that there was one switch statement per instruction (and you
           | could convince the compiler not to combine them into one,
           | lmao) then that would also be a Markov-style interpreter.
           | 
           | As for the cost of indirectly loading from the compiler's
           | switch table, or other issues like pushing and popping
           | registers: in my experimentation, these costs are like drops
           | in the ocean compared to the cost of indirect jumps. Even
           | with the Markov style interpreter, the CPU spends most of its
           | time mispredicting and rolling back. So the details of how
           | the work happens for individual instructions are usually less
           | important than what you do about the prediction of that
           | indirect jump.
        
             | highfrequency wrote:
             | Awesome, thank you for expanding. Now I see the intuition
             | that branch prediction accuracy could be much higher using
             | the knowledge of the last opcode, so this becomes a game of
             | massaging the code to prod the CPU to use more inputs into
             | its branch prediction. Also helpful color on your empirical
             | observation that branch prediction accuracy dominates other
             | factors like switch indirection and loading registers.
             | 
             | There's one thing I'm still missing. How exactly do we
             | force the CPU to use the last opcode in its branch
             | prediction model? In your first switch example, the CPU
             | "knows" the path it has followed to get to each iteration,
             | so in theory it could use the information of the last node
             | it visited (or the last two nodes, etc.) to aid branch
             | prediction right?
             | 
             | Related to that: in your second example, what exactly
             | happens in `goto *curInst->handler;`? Doesn't this need to
             | revert back to something like a switch statement which has
             | the same problem? (Unless you are doing polymorphism /
             | dynamic dispatch in that example? Which I assume has some
             | performance penalty that you're saying is dwarfed by the
             | extra branch prediction effectiveness). Analogous to the
             | line in the OP's article that says `MUSTTAIL return
             | dispatch(UPB_PARSE_ARGS);` - doesn't the generic dispatch
             | function need another switch statement? Probably missing a
             | couple obvious things here.
             | 
             | Lastly - if you have any books/article recommendations that
             | helped you learn some of these intricacies (esp. regarding
             | intuition about which performance quirks matter vs. don't
             | matter) that would be great as well. Thanks!
        
               | pizlonator wrote:
               | > How exactly do we force the CPU to use the last opcode
               | in its branch prediction model?
               | 
               | In the tail call case: because each opcode is implemented
               | by a function that has an indirect tail call at the end.
               | 
                | In the computed goto case: each `goto *curInst->handler`
                | is its own indirect branch.
                | 
                | > Related to that: in your second example, what exactly
                | happens in `goto *curInst->handler;`?
               | 
               | `curInst->handler` is a pointer that points to some
               | label. The goto is an indirect branch to exactly that
               | pointer value, i.e. that label.
               | 
               | It's a super crazy and dangerous feature! It obviates the
               | need for a switch; it _is_ the switch.
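                | 
                | Concretely, with the GCC/Clang labels-as-values
                | extension it looks roughly like this (illustrative
                | names):
                | 
                |     struct inst {
                |         void *handler;   /* &&do_foo, &&do_bar, ... */
                |         /* ... operands ... */
                |     };
                | 
                |     void run(struct inst *curInst) {
                |         /* &&label takes a label's address; a decoder
                |            stores these into each inst's handler. */
                |         static void *const table[] = { &&do_foo, &&do_bar };
                |         (void)table;  /* used by the decoder, elided here */
                | 
                |         goto *curInst->handler;  /* jump to first handler */
                | 
                |     do_foo:
                |         /* ... foo's work, advance curInst ... */
                |         goto *curInst->handler;
                |     do_bar:
                |         /* ... bar's work, advance curInst ... */
                |         goto *curInst->handler;
                |     }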
        
           | hinkley wrote:
           | Instead of a for loop where you hit a switch at the top of
           | the loop, you jump to the code block for next instruction
           | from the end of the previous instruction. That stops you from
           | jumping to the top of the loop and then jumping a second
           | time.
           | 
            | On older CPUs, you're less likely to hit a pipeline stall.
            | This technique was called "super-pipelining" for that
            | reason. But a few years ago when this was discussed,
            | someone pointed out that it's usually not necessary
            | anymore, since branch predictors can see through double
            | jumps now.
            | 
            | But as I alluded to in another reply, CPU caches are
            | finite, and I wonder whether, in a fully realized
            | interpreter, particularly one living side by side with a
            | JIT, that microbenchmark is lying to you about how fast the
            | inner loop is under production conditions.
        
         | hinkley wrote:
          | I've heard this went out of fashion a while back - that branch
          | predictors got good enough that it's not necessary anymore.
         | 
         | But I wonder if that stays true as the size of the interpreter
         | increases.
        
           | pizlonator wrote:
           | My most recent data says it's still relevant.
           | 
           | It might not matter for very small interpreters, but it does
           | matter for anything substantial.
           | 
           | Definitely worth remeasuring though.
        
             | titzer wrote:
             | Threaded dispatch is worth 15-30% in Wizard's fast
             | interpreter.
        
       | nly wrote:
       | Perhaps the always_inline attribute could also be useful to
       | encourage the compiler to do the right thing.
        
       | highfrequency wrote:
       | Could someone clarify the example where an unlikely fallback path
       | is wrapped in a musttail?
       | 
       | My understanding is that musttail is useful because it prevents
       | stack frame construction and register spilling; the calling
       | function basically says "you can re-use the existing stack and
       | not worry about my local variables."
       | 
       | But doesn't the stack frame construction overhead / register
       | spilling only occur when the fallback path is actually invoked?
       | If so, and the fallback is unlikely and you care less about
       | performance on that slow path, does it matter whether you wrap
       | the fallback call in musttail?
       | 
       | (Is this purely a branch prediction effect, where the
       | _possibility_ of extra work needing to be done slows down the
       | fast path even when the work does not need to be done?)
        
         | fweimer wrote:
         | If the fallback path can be invoked repeatedly for one message
         | (depending on message size), the tail call becomes a
         | correctness issue: without it, your stack is probably not
         | going to be large enough to hold all the fallback frames.
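         | 
         | A hedged sketch of why (made-up names): when the fallback
         | returns to dispatch via a tail call, its frame is gone before
         | the next field is parsed, so a message with many slow fields
         | still uses O(1) stack; an ordinary call would stack one frame
         | per fallback invocation.
         | 
         |     #include <stdint.h>
         | 
         |     struct state;
         |     typedef void (*handler)(struct state *s, const char *ptr,
         |                             const char *end);
         |     struct state { handler next; };
         | 
         |     static void field_fallback(struct state *s, const char *ptr,
         |                                const char *end) {
         |         /* ... slow-path work for this one field ... */
         |         /* Tail call back into dispatch: no frame kept. */
         |         __attribute__((musttail)) return s->next(s, ptr, end);
         |     }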
        
       | mbStavola wrote:
       | For Rust enthusiasts, there is an old RFC[0] that would have
       | added a "become" keyword which guaranteed TCO. It was originally
       | postponed in favor of focusing on the 2018 edition's goals
       | (which was the right call), but the initiative has been
       | revisited recently[1]. Maybe it'll make a comeback!
       | 
       | [0]: https://github.com/rust-lang/rfcs/pull/1888 [1]:
       | https://github.com/rust-lang/rfcs/pull/3407
        
       | MathMonkeyMan wrote:
       | One of the compilation passes in a Scheme compiler (e.g. Guile or
       | Racket) is conversion to continuation passing style. Here the
       | author applies the pass manually as a code design decision,
       | because later passes of the compiler (i.e. the actual C compiler)
       | produce better code given that input. It's neat.
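       | 
       | A toy flavor of that rewrite in C (not full CPS, since C lacks
       | closures, but it shows the idea of passing the pending work
       | along so every call ends up in tail position):
       | 
       |     #include <stdio.h>
       | 
       |     /* Direct style: the multiply happens *after* the recursive
       |        call, so the call cannot be a tail call. */
       |     static unsigned long fact(unsigned long n) {
       |         if (n == 0) return 1;
       |         return n * fact(n - 1);
       |     }
       | 
       |     /* Pending work is passed along instead of done on return:
       |        the recursive call is now in tail position, so a compiler
       |        can turn it into a jump. */
       |     static unsigned long fact_tail(unsigned long n,
       |                                    unsigned long acc) {
       |         if (n == 0) return acc;
       |         return fact_tail(n - 1, acc * n);   /* tail call */
       |     }
       | 
       |     int main(void) {
       |         printf("%lu %lu\n", fact(10), fact_tail(10, 1));
       |         return 0;
       |     }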
        
       | aidenn0 wrote:
       | It mentions C++ support; it would seem to me that C++ has very
       | few tail-calls. Consider:
       | 
       |     foo()
       |     {
       |         auto a = SomeClassWithADestructor();
       |         return bar();  // This is not a tail-call because the
       |                        // destruction of a happens *after* the
       |                        // call to bar()
       |     }
        
       ___________________________________________________________________
       (page generated 2024-08-19 23:00 UTC)