[HN Gopher] Parsing protobuf at 2+GB/s: how I learned to love ta...
___________________________________________________________________
Parsing protobuf at 2+GB/s: how I learned to love tail calls in C
(2021)
Author : fanf2
Score : 317 points
Date : 2024-08-19 08:42 UTC (14 hours ago)
(HTM) web link (blog.reverberate.org)
(TXT) w3m dump (blog.reverberate.org)
| gpderetta wrote:
| > I very much hope that the attribute will catch on, spreading to
| GCC, Visual C++, and other popular compilers,
|
| AFAIK, attribute musttail is in the process of being added to GCC
| (the patch is under review) with semantics compatible with clang.
| ufo wrote:
| What about the preserve_most attribute? Is there any chance
| something like that will get into GCC? Without it, the non-tail
| calls ruin the interpreter.
| gpderetta wrote:
| Maybe, but not there yet:
| https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899
|
| In the meantime some inline assembly macro trickery might
| help.
|
| edit: code duplication can also be obviated by templating
| your op function with a fast/slow parameter, with the fast
| variant tail-calling the slow variant when it cannot perform
| the fast path, while guarding the slow code via the compile
| time parameter. The downside is yet more code obfuscation of
| course.
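gpderetta's suggestion is phrased in C++ template terms; a rough plain-C rendition, with a compile-time flag playing the template-parameter role, might look like this (all names are invented for illustration, and whether the compiler actually specializes depends on inlining and optimization level):

```c
#include <assert.h>

/* Sketch of the fast/slow split: one implementation parameterized by a
 * compile-time `fast` flag.  Called with a constant flag and inlined,
 * the compiler can drop the guarded slow-path code from the fast copy;
 * the fast copy tail-calls the slow one when its precondition fails. */
static int op_add(int fast, const int *sp, int n) {
    if (fast) {
        if (n < 2)
            return op_add(0, sp, n);  /* bail out: tail-call the slow variant */
        return sp[n - 2] + sp[n - 1]; /* hot path stays small */
    }
    /* Slow path, guarded by the compile-time parameter: handles the
     * cases the fast path refused (here, stack underflow). */
    if (n < 2)
        return -1;
    return sp[n - 2] + sp[n - 1];
}
```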
| katzinsky wrote:
| Given some of the other Scheme-like features GNU C has, it's
| surprising they're lagging on this one.
| fweimer wrote:
| It's a hard problem because many ABIs cannot do tail calls even
| for very basic stuff, like calls to extern functions with
| matching argument and return types. It looks like Clang has
| some heuristics for changing call sequences for musttail calls.
| For example, it switches to noplt calls on i686. None of this
| is mentioned in the Clang documentation:
| https://clang.llvm.org/docs/AttributeReference.html#musttail
|
| What's realistic here is that you get a diagnostic if the
| compiler cannot generate a tail call. For many users, that will
| likely be good enough. Guaranteeing a tail call as in Scheme is
| unlikely to happen.
| irdc wrote:
| I wonder how fast this would be when using a trampoline, i.e.
| returning the next function as a function pointer and calling
| that from an outer loop. That would have the advantage of being
| portable C.
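The trampoline shape irdc describes can be sketched in standard C. A function pointer can't portably round-trip through void *, so a small struct carries the "next handler" instead (the state, handlers, and toy program are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

struct state { int acc; int steps; };

/* A handler returns the next handler wrapped in a struct (fn == NULL
 * means stop); only the outer loop ever performs the calls, so the
 * stack never grows no matter how many handlers chain together. */
struct next { struct next (*fn)(struct state *); };

static struct next op_halt(struct state *s) {
    (void)s;
    return (struct next){ NULL };
}

static struct next op_inc(struct state *s) {
    s->acc++;
    return (struct next){ (--s->steps > 0) ? op_inc : op_halt };
}

static int run(struct state *s, struct next h) {
    while (h.fn != NULL)              /* the trampoline loop */
        h = h.fn(s);
    return s->acc;
}
```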
| chii wrote:
| the reason i suspect tail call optimization is fast would be
| because the resultant loop is predictable and thus CPU
| instruction prefetch and memory prefetch works very well.
|
| Jumping via function pointers would probably not be as
| predictable, and you'd be unlikely to see the same benefit.
|
| Of course, one must measure, and i haven't.
| gpderetta wrote:
| The tail call interpreter is also calling through a function
| pointer. The cost here is purely the call+ret overhead, which
| can be non-trivial when it is per opcode; on some micro-
| architectures there is also a limit on taken jumps per cycle
| (sometimes as low as one taken jump every other cycle).
|
| edit: trampolining would also collapse all indirect jumps to
| a single source address which is not ideal for branch
| prediction.
| soegaard wrote:
| C is often used as a target language for compilers from higher
| level languages. The Scheme programming language requires all
| tail calls not to grow the stack. Therefore implementors have
| explored various techniques including trampolines. I don't have
| a citation, but you can find the answer in the papers on
| compiling Scheme to C. If there is no guarantee of TCO in the
| target language, then the generated programs will be slower.
|
| Incidentally, this is why implementors of (especially) high
| level languages are annoyed that TCO was removed from the
| JavaScript specification. There are even solutions for having
| TCO while still supporting stack inspection.
|
| https://github.com/schemedoc/bibliography/blob/master/page8....
| stabbles wrote:
| Maybe the example is too simple, but it does not require
| `__attribute__((musttail))` for good code gen.
|
| Also if the error handling function is unlikely, you wouldn't
| care too much about how fast it is to call it?
|
| To me it seems like a structure of the form
|
|       <read data>
|       if (unlikely(malformed)) return error();
|       <prep more stuff>
|       switch (data_type) {
|         case x: return handle_x();
|         case y: return handle_y();
|       }
|
| generates a nice jump table quite reliably.
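For concreteness, a compilable version of that structure, with invented handlers and the usual __builtin_expect wrapper (a GCC/Clang builtin) standing in for unlikely():

```c
#include <assert.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

static int handle_x(int v) { return v + 1; }
static int handle_y(int v) { return v * 2; }
static int error(void)     { return -1; }

/* <read data> and <prep more stuff> elided; a dense switch over the
 * tag like this one does tend to compile to a jump table, and each
 * `return handle_*()` is a tail call the compiler may eliminate. */
static int dispatch(int data_type, int value, int malformed) {
    if (unlikely(malformed))
        return error();
    switch (data_type) {
    case 0:  return handle_x(value);
    case 1:  return handle_y(value);
    default: return error();
    }
}
```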
| OskarS wrote:
| Obviously compilers have been doing tail call elimination
| forever, but for this trick "generates [tail calls] quite
| reliably" is not enough: you have to GUARANTEE it (or fail
| compilation), otherwise this structure does not work (it will
| blow the stack immediately). That's the point of [[musttail]]:
| tail call elimination is _required_, that's the only choice
| the compiler has.
| stkdump wrote:
| That means that any such code is not portable across
| compilers anymore. It is effectively written in a non-
| standard C dialect, because it requires a language extension
| to work correctly.
| ufo wrote:
| Yes. But the alternative is assembly language, which is
| even less portable.
| kachapopopow wrote:
| assembly is the most portable of them all even across
| platforms! (x86 to arm compilation meme)
| Karliss wrote:
| The portable alternative is being explicit with your
| loops, possibly in combination with gigantic unwieldy
| switch statements or some regular goto (it is still part
| of the standard). But that comes at the cost of readability
| and sometimes performance. Whether it's practical depends
| on the use case. For something like recursive data
| structure algorithms which are relatively small and self
| contained, I would say it's perfectly doable. Simple
| interpreters - maybe. Complex interpreters - here it
| becomes messy.
| mananaysiempre wrote:
| See Mike Pall's posts on the subject--the performance
| cost is considerable, for two reasons. First, you're
| forcing the compiler to do register allocation for the
| whole interpreter at once, which it can virtually never
| do a good job of. (This is actually the more important
| part.)
|
| Second, given the existence of branch target buffers (and
| the ruinous cost of mispredicted branches), you really
| want the instruction dispatch to be a single indirect
| branch at the end of each instruction implementation, and
| for that standard tools are somewhere between unhelpful
| (you can write a macro containing switch (*insn++) { case
| INSN_FOO: goto impl_foo; /* ... */ } but it's anybody's
| guess whether you're getting a single jump table for all
| copies of that) and actively obstructive ("tail merging"
| in older versions of Clang would actively destroy any
| attempts at copying dispatch code). Granted, sometimes
| things work out (new Clang versions can sometimes go
| "looks like you're writing an interpreter" and turn a
| vanilla switch in a loop into duplicated dispatch code).
| Then again, sometimes they don't, and you can't actually
| know.
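The "one indirect branch at the end of each implementation" layout can be written directly with GCC/Clang's non-standard computed gotos; whether the compiler keeps the copies separate is exactly the gamble described above (opcodes and the toy program are invented):

```c
#include <assert.h>

enum { OP_INC, OP_DEC, OP_HALT };

static int run(const unsigned char *insn) {
    /* GNU extension: labels as values in a static jump table. */
    static void *const table[] = { &&impl_inc, &&impl_dec, &&impl_halt };
    int acc = 0;
    /* DISPATCH is duplicated at the end of every handler so each
     * indirect jump is a distinct branch for the predictor. */
#define DISPATCH() goto *table[*insn++]
    DISPATCH();
impl_inc:
    acc++;
    DISPATCH();                        /* dispatch copy #1 */
impl_dec:
    acc--;
    DISPATCH();                        /* dispatch copy #2 */
impl_halt:
#undef DISPATCH
    return acc;
}
```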
| OskarS wrote:
| Yeah, that's the way virtual machines have been written
| forever, some version of
|
|     for instruction in instructions {
|       switch (instruction) {
|         case OPCODE_X: // ...
|         case OPCODE_Y: // ...
|         case OPCODE_Z: // ...
|       }
|     }
|
| This is how VMs have been written since the dawn of time
| (or using computed gotos, another non-standard addition
| to C). It has problems though, like the fact that the
| `switch` branch is extremely unpredictable, and that you
| get a _massive_ function which is hard to optimize. This
| [[musttail]] trick is a huge improvement. But yeah, if
| you've got to support compilers that don't have
| [[musttail]], you in essence have to have two
| implementations, the [[musttail]] one and the loop/switch
| one.
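The classic shape, as a runnable toy (opcodes and program invented; the single switch at the top is the one unpredictable branch the comment mentions):

```c
#include <assert.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

static int run(const int *code) {
    int stack[16];
    int sp = 0;
    for (;;) {
        switch (*code++) {             /* one shared dispatch branch */
        case OP_PUSH:
            stack[sp++] = *code++;     /* operand follows the opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_HALT:
            return stack[sp - 1];
        }
    }
}
```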
| ufo wrote:
| A switch-case is the default way to write an interpreter
| and I'd even argue it's the most readable.
|
| In the context of this article, it's all about the
| performance. Switch-case generates suboptimal code for
| the commonly used fast paths of the protobuf parser,
| because the mere existence of the slow paths is enough to
| interfere with the performance of the code around them.
| OskarS wrote:
| Yes, that is correct. You cannot do this trick in standard
| C, C++ or Rust, it requires some version of [[musttail]].
| Strong argument for adding it to the C standard, IMHO.
| dist1ll wrote:
| Fwiw, many C projects are written in a non-standard C
| dialect, including the Linux kernel.
| flohofwoe wrote:
| The typical way to deal with this is to put the
| __attribute__ into a C macro which expands to nothing on
| compilers which don't understand GCC/Clang's __attribute__
| keyword. The code without the attribute will still compile
| and most likely also apply the tail call optimization, you
| just don't get an error if the compiler can't apply the
| optimization.
|
| Also TBF, hardly any real-world C code is strictly standard
| compliant. Many C compilers just agree on a common syntax
| that includes both the C standard and some popular non-
| standard extensions.
|
| PS: C++ compilers actually ignore unknown attributes since
| the `[[attribute]]` syntax has been standardized in C++11.
| In GCC and Clang you'll get a warning in the standard
| warning set, but not in MSVC.
|
| PPS: C23 also standardized the `[[attribute]]` syntax and
| also added a way to check for supported attributes:
|
| https://en.cppreference.com/w/c/language/attributes
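A sketch of that macro, gated on Clang's support so other compilers compile the same code without the guarantee (the countdown function is invented for illustration):

```c
#include <assert.h>

#if defined(__clang__) && defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL /* expands to nothing: tail call merely hoped for */
#endif

static long countdown(long n, long acc) {
    if (n == 0)
        return acc;
    /* On Clang this is a guaranteed tail call (or a compile error);
     * elsewhere it is an ordinary call the optimizer will usually,
     * but not provably, turn into a jump. */
    MUSTTAIL return countdown(n - 1, acc + 1);
}
```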
| pjmlp wrote:
| It will compile, and eventually blow up nicely with a
| stack overflow OS fault.
|
| Ah, the joys of writing "portable" C code with #ifdef
| spaghetti, across commercial UNIXes and their own C
| compilers, 20 years ago.
|
| It only got better because for many people, just as they
| assume Web == Chrome, they assume C == GCC, blissfully
| ignoring everything else.
|
| Nowadays clang is also considered, mostly because a
| couple of companies wanted to replace GCC, and naturally
| clang needs to be able to match whatever GCC offers.
| Filligree wrote:
| > It will compile, and eventually blow up nicely with a
| stack overflow OS fault.
|
| Not at all guaranteed. Stack overflow is undefined
| behaviour, which means compilers can optimise your
| program on the assumption that it doesn't happen.
| pjmlp wrote:
| Ah true, yet another case that one tends to forget about.
| flohofwoe wrote:
| Well, if you use recursive code, you better know what
| you're doing. With or without tail call optimization.
| eru wrote:
| Loops are just a special, limited case of recursion.
|
| (And only necessary in languages that have trouble
| implementing function calls properly.)
| taneq wrote:
| There's no such thing as "standard C" that you can actually
| write, due to UB and implementation defined behaviour.
| There's just _(C, compiler version, platform)_ that defines
| (if only through the compiler's source code) what will
| actually happen in any given situation.
| perching_aix wrote:
| so because there are implementation defined behaviors in
| the standard, language extensions become okay?
| flohofwoe wrote:
| Language extensions are a feature, not a bug. They allow
| C to evolve and C compilers to compete without requiring
| committee consensus. Good extensions will eventually be
| picked up by other compilers, and maybe even find their
| way into the standard.
| perching_aix wrote:
| sure, i get the joke, i just don't like it. it's the same
| story as browsers. proprietary extensions in the name of
| progress because technically it's allowed, but also
| unimplemented standardized features galore, necessitating
| polyfill libraries and frequent checking of support
| matrices.
|
| it's just sprawl with popular support.
| gpderetta wrote:
| It is very different. A web browser is an (almost) fully
| self-contained environment. Same with something like
| Java. On the other hand, a standard C or C++
| compiler+runtime has (by design) very few features.
| Anything beyond trivial programs has to reach for platform
| specific features. Even POSIX is often not enough.
| perching_aix wrote:
| not sure i'm following? i was referring exactly to those
| "few" language features, e.g.:
| https://en.cppreference.com/w/cpp/compiler_support
| uecker wrote:
| It is entirely possible to write strictly conforming C.
| jonstewart wrote:
| The article is pretty clear about this. When it comes to
| fast lexing and parsing, it is typical for projects to make
| portability tradeoffs in favor of performance. For example,
| simdjson is full of assembly.
| eru wrote:
| You say it like it's a bad thing.
| jerf wrote:
| "you have to GUARANTEE it (or fail compilation)"
|
| I've often pondered the utility of similar flags for other
| optimizations. This is perhaps the largest one, but there are
| other situations in certain code where I want to know that my
| optimization has failed.
|
| A more complicated example is, I've sometimes wanted an
| assertion that a given function is inlined. We know from
| hard, repeated experience over decades that letting users
| annotate functions as inline directly doesn't end up working
| very well, but I've often wondered about creating an
| assertion that would fail the compile if it isn't. (Ideally
| with at least a hint as to why the compiler failed it out,
| but that's easier said than done.) Obviously you don't want
| to go slapping this in some major open source library that's
| going to be compiled by half-a-dozen compilers on dozens of
| operating systems, but for my own code in my own situation it
| can be an optimization that is the difference between success
| and failure and it'd be nice to flag a failure.
|
| (Bear in mind I am not proposing to blindly take any
| particular action if it triggers. The point is to bring it up
| to human attention and not have it buried deep in invisible,
| yet potentially consequential, compiler decisions. The human
| deciding "I guess this optimization isn't happening" and
| removing the annotation would be one valid decision, for
| instance.)
| OskarS wrote:
| I agree, this would be useful. Another one I would like is
| auto-vectorization, a way to mark a loop with an attribute
| and if it fails to auto-vectorize, the compiler should
| print out the auto-vectorization report for that loop,
| explaining why it failed. It's such a brittle
| optimization, but it's crucial for a tiny number of
| extremely hot loops, so you would want to know if it failed
| due to some code change or compiler upgrade. Also, it's
| just a pain to use the auto-vectorization report normally.
| ndesaulniers wrote:
| > I've sometimes wanted an assertion that a given function
| is inlined.
|
| Try `__attribute__((error("not inlined")))` or `warning` on
| the callee.
| mananaysiempre wrote:
| As the disassembly in the post demonstrates, the problem with
| the fallback path (which is not necessarily the error path) is
| not how fast the call to it is, it's that the mere existence of
| that call can force the compiler to create a stack frame and
| spill registers into it for the whole function, including the
| fast path.
|
| OK, maybe "force" is not the right word--nobody says the
| compiler has to have a single stack frame structure for all
| possible execution paths of a function. Nobody even says it has
| to use the standard ABI for an internal-linkage (static or
| anonymous-namespace) function (that doesn't have its address taken). But
| the reality is, all compilers I've seen do, including Clang, so
| we want a way to tell them to not worry about the ABI and avoid
| wasting time on preserving registers across the call.
|
| Re your nice jump table, sure it does. But if you try running
| the result under, say, perf report, and your test bytecode
| doesn't describe a short loop, you'll see one of two things:
| either you had a branch mispredict on each dispatch; or the
| compiler went "looks like you're trying to write an
| interpreter" and moved the indirect jump to the end of each
| case (I've seen Clang do this). And either way the register
| allocation in the resulting code probably sucks.
| jcelerier wrote:
| > so we want a way to tell them to not worry about the ABI
| and avoid wasting time on preserving registers across the
| call
|
| that's what -fvisibility=internal already does, no?
| mananaysiempre wrote:
| That's what static _could_ do (if the function's address is
| not taken, or given sufficiently powerful dataflow
| analysis), but C and C++ compilers don't take advantage of
| that. Getting that out of -fvisibility=hidden -flto would
| also be possible, but requires even more nonexistent
| compiler smarts. (From a quick web search, I can't figure
| out what internal visibility brings over hidden.)
|
| (Granted, it's not like this is completely impossible--I
| seem to remember GHC and MLton can invent custom calling
| conventions for Haskell and SML respectively. But the
| popular C or C++ compilers can't.)
| kccqzy wrote:
| Really? I've always assumed that static gives C/C++
| compilers the right to disregard calling conventions
| altogether. Now that I think about it, this assumption
| might be unjustified. Was there any strong reason for
| compilers to keep the calling convention?
| mananaysiempre wrote:
| I mean, it does give them that right as far as I know
| [with, again, the caveat of function pointers--you can't
| pass a non-ABI-compliant function to qsort(), however
| static it is--and an additional one of inline assembly],
| it's just that they choose not to exercise it. Why? Not
| sure. Maybe it's too difficult (register allocation and
| spilling is NP-complete and fairly expensive). Maybe it's
| too difficult when your compiler starts out as a
| function-at-a-time one and builds on that. The popular
| ones don't do that, is my point. (Would be interesting to
| check what SDCC does, by the way, on register-starved
| micros such tricks could make sense.)
| zzo38computer wrote:
| I think that "static register" should make the compiler
| allowed to disregard calling conventions (if
| optimizations are enabled), but neither "static" nor
| "register" alone should do so. (In a discussion on an IRC
| some time ago, I had suggested what "register" alone
| should do in my opinion, which is something else than
| this.)
| PaulRobinson wrote:
| Perhaps I've just been looking more (because I've been going back
| to the books to pick up C again after a 20 year absence for a
| side project), but it feels like in recent years there has been a
| little uptick in people appreciating the level of control C
| provides, and "going back to basics".
|
| I think the work on C23 (which would still have been C2x when
| this article was written) means people are starting to want to
| see innovation in the language again. I don't think that would
| have happened without Go and Rust showing some great thinking,
| but it's interesting that it's happening at all.
|
| Tail calls are an interesting idea, and now I want to know more
| about both this extension, and also how to write my code in such
| a way as to help a compiler optimize for tail calls even when
| this extension isn't present. This somehow feels more fun to me
| than writing more Python or Ruby...
| mrweasel wrote:
| > it feels like in recent years there has been a little uptick
| in people appreciating the level of control C provides, and
| "going back to basics".
|
| I don't write C, and maybe it's because I somehow seek out
| these types of article and projects, but I'd agree, I'm seeing
| the same thing. It might be a backlash against programming
| languages like Rust or even JavaScript. Rust being really
| complicated, but clearly safer, and JavaScript... well it's
| just really hard to set up a development environment and the
| tooling is almost more predominant than the language itself.
|
| While I don't write C myself, except for fun, I can see many just
| reverting to it, because it's "right there", it's fast, the
| language is fairly simple, even if the standard library is
| anything but simple, and you can take the stuff you need and
| ignore the rest.
| tazu wrote:
| I've personally been very motivated to learn C (coming from
| Go) by witnessing @jart's[1] progress on Cosmopolitan[2] and
| Redbean[3]. Writing C is almost like an exercise in
| sovereignty: ultimate freedom (and danger).
|
| [1]: https://justine.lol/
|
| [2]: https://github.com/jart/cosmopolitan
|
| [3]: https://news.ycombinator.com/item?id=26271117
| queuebert wrote:
| jart is the best thing to happen to C since K&R.
| jerf wrote:
| I think it's fine to go back to C and maybe play around a
| bit to learn about some of the things that can be done, but
| I would implore you to bear in mind that the decades have
| taught us that the "ultimate danger" in question is
| basically that you're building sand castles in a minefield.
| We're not talking "oh ha ha I guess I wrote a few more bugs
| that a stronger type system would have caught", we're
| talking "oh ha ha I guess remote unauthenticated attackers
| can run arbitrary assembler as root in my network code
| because I tripped over one of C's nominally well-known
| mines that I did not personally know about and all the
| attackers had to do was slightly tweak one of the exploit
| scripts already in metasploit to install root kits on my
| system and add it to their bot net".
|
| The world has gotten a lot more dangerous than people
| realize. People generally quite correctly assume that
| hackers aren't going to spend person-months attacking their
| system personally but don't realize that the attacker tools
| are quite sophisticated now and _they don't have to_.
| Shoving code into a small little buffer overflow to lift to
| a larger buffer overflow to load exploit code over the
| network that will run a pre-packaged root priv escalation
| to install a rootkit to add them to their botnet is no
| longer the work of a team of hackers for months. It's all
| off-the-shelf tech now and it's more like writing a 20 line
| function now; you don't need to attract very much attention
| now to attract that level of sophistication.
|
| We are leaving C behind collectively for very good reasons.
| If you are playing with C and you do not intimately
| understand those reasons, you're going to relearn them the
| hard way.
| uecker wrote:
| I do not think we are close to "leaving C behind
| collectively" and neither should we.
| lanstin wrote:
| I wrote tens of thousands of lines of C code in the 90s
| and early 00s (without buffer overflows that I learned
| about; I did also write an evil network layer for inducing
| buffer overflows in my and my dependencies' code), and
| have been doing a lot of other languages since then, and
| then had occasion to write some more C where I was doing
| string allocation and manipulation (for an LD_PRELOAD to
| monitor what various programs I lacked source to were
| doing), and it was absolutely nerve wracking. Linux
| kernel might be mostly C for a long time, but it would be
| crazy to start a new thing in C. There's growing re-write
| projects from C to Rust. It would be farther along except
| the Rust people seem to dislike laboriously recreating
| decades of GNU long-opt functionality in all these base
| packages to actually make Rust drop-in replacements for
| C.
|
| Maybe for embedded, I haven't done that, but for general
| purpose, I can't imagine it being worth the risks.
| jerf wrote:
| It is definitely on its way down, and it is a good thing.
|
| You may think you can handle C. I disagree. The evidence
| is on my side. We need safer languages.
|
| Heck, that even overstates it. It's not like we need
| super-safe languages per se. Maybe the 2070s will
| disagree with me. But we need languages that aren't
| _grotesquely unsafe_. It's not just that C isn't as safe
| as a language could be; it is that it is recklessly
| unsafe.
|
| Interrupting that descent is foolishness, and pinning
| your career to a resurrection in C even more foolish.
| These technologies always have this meta-meta-contrarian
| phase to them just before they expire. I got into the
| computer field just as the meta-meta-contrarian
| "everything must be written in assembler" phase was at
| its peak. It was a false resurrection and I pity anyone
| who overinvested in learning how to write large
| applications in 100% assembler as a result of reading
| someone's over-enthusiastic screed about the virtues of
| writing pure assembler in 1998.
|
| So I write this in the hopes of helping some other young
| HN reader not be fooled. C may be a thing you learn at
| some point, some day, just as assembler is still
| something you may learn. But don't get overexcited about
| the last meta-meta-contrarian false resurrection before
| death. Which is still years away, but at this point I
| think it's pretty much set on an irrevocable course.
| queuebert wrote:
| C is great for its basics, but at some level, when you gain
| awareness of all of your unpunished pointer lifetime sins and
| how annoying the lack of basic data structures is in stdlib,
| then you realize that biting the bullet and learning something
| like Rust is the way to go. Maybe a better stdlib would help,
| but I've found that the pain of becoming proficient in Rust has
| paid back multiples over the continual toil of doing certain
| repetitive tasks in C.
| uecker wrote:
| It is not terribly hard to find some library for data
| structures in C. What I do not like about Rust is: syntax,
| complexity, long compile times, cargo, ...
| coldpie wrote:
| C is great and I definitely recommend learning it & becoming
| proficient in it. I don't think I would recommend it for a new
| project, for all the reasons jerf mentions in a sibling
| comment[1]. But there's a ton of existing code out there, and I
| think it provides a nice little peek into what the hardware is
| actually doing that you don't get from higher level languages.
|
| > I think the work on C23 means people are starting to want to
| see innovation in the language again.
|
| I have to admit this makes me nervous. Part of what's nice
| about C is its simplicity and its stability. It's relatively
| easy to write a compiler, and the lack of having many different
| ways to do the same thing means all code written in C tends to
| "feel" roughly the same (outside some domain-specific areas),
| which means it's easy to dive into a new codebase and get up to
| speed. It's basically all structs & functions and abstractions
| built upon them. That's it.
|
| While I definitely am not opposed to making improvements to the
| C spec, seeing the inline anonymous functions in a proposal
| linked elsewhere in this thread[2] really turns me off. It
| starts to feel like the hideous mess that C++ has turned into,
| with oodles of implicit behavior, un-greppable code, generated
| symbol names leading to compiler error hell, and a thousand
| different colors for your bike shed.
|
| We already have languages that can do all this stuff. I welcome
| making changes to C to make it nicer to program in C, but I'm
| wary of proposals that turn C into yet another messy kitchen
| sink programming language. Let's keep it clean, obvious, and
| simple, please.
|
| [1] https://news.ycombinator.com/item?id=41291206
|
| [2] https://www.open-
| std.org/jtc1/sc22/wg14/www/docs/n3266.htm#u...
| sp1rit wrote:
| I wonder how this behaves in combination with
| __attribute__((cleanup(...))). Especially if the to-be-cleaned
| variable is passed into the tail function as a parameter.
| Karliss wrote:
| You get an error that the tail call can't be performed, which is
| kind of the point of the tailcall attribute. Same thing with
| regular C++ destructors or a dozen other language features which
| interfere with tail calls; no need for extensions like
| __attribute__((cleanup())).
|
| https://godbolt.org/z/Yex74Wjz7
| fuhsnn wrote:
| A C standard proposal exists for tail call [0], in the form of
| "return goto (expression);".
|
| What I like about it, over standardizing [[musttail]], is that
| lifetimes of local objects are guaranteed to end. This makes it
| possible to implement without extensive escape analysis.
|
| [0] https://www.open-
| std.org/jtc1/sc22/wg14/www/docs/n3266.htm#5...
| nxobject wrote:
| As an aside, I'm excited - but also with lots of trepidation -
| about the new energy in adding new functionality to C. Excited
| because there are changes and additions that are crying to be
| made, even clarifying ideas... trepidation because C++'s gung
| ho update cadence has somehow ended up in wart patched over
| wart. Especially when feature reacts with feature in an
| unfelicitous way far earlier than anticipated. I hope the
| standards process finds a way to be very conservative and to
| really thoroughly test features in large and diverse codebases,
| rather than relying on rationales alone, when choosing to
| include a feature.
| pjmlp wrote:
| What new energy? We have had C89, C99, C11, C17, C23.
|
| And yet, no fix in sight for proper strings, arrays, or at
| very least bounds checked slices, like Dennis Ritchie
| originally proposed to ANSI/ISO.
| haberman wrote:
| Thanks for the pointer, I'll have to check this out.
|
| Can you elaborate on how "return goto" is easier to implement?
| [[musttail]] also ends the lifetime of local objects AFAICS.
|
| One thing I'll mention from a quick scan:
|
| > [The] function called in tail position must have identical
| type to the callee. This ensures both that the return value
| does not require any conversion, and also that argument passing
| space is available and calling convention (if relevant) is
| maintained.
|
| One complaint I've seen repeatedly about [[musttail]] (which I
| implemented in Clang) is that this constraint is unnecessarily
| strict, since some architectures will allow tail calls even for
| functions that do not perfectly match:
| https://github.com/llvm/llvm-project/issues/54964
|
| "But then the code would be nonportable." True, but tail call
| optimization is inherently nonportable, since some targets
| fundamentally do not support tail call optimization (eg. WASM
| without the tail call extension).
| tines wrote:
| > some targets fundamentally do not support tail call
| optimization
|
| Can't any tail call be rewritten as a loop? Couldn't a WASM
| compiler without the tail call extension implement it this
| way?
| metadat wrote:
| Only with the aforementioned escape analysis. The function
| call stack frames serve a critical purpose in most
| nontrivial logic.
| tonyg wrote:
| > Can't any tail call be rewritten as a loop?
|
| No. In general tail calls cannot be rewritten into loops.
| cobbal wrote:
| More specifically, tail _recursion_ is usually easy to
| turn into a loop. Tail calls can be difficult to turn
| into loops when they call a different function, or worse
| a function passed in as a variable.
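To make the distinction concrete: self tail recursion rewrites into a loop by rebinding the parameters and jumping back to the top, which is exactly what compilers do; nothing similar applies when the tail target is an arbitrary function (or function pointer). A sketch using gcd:

```c
#include <assert.h>

/* Tail-recursive form: the recursive call is the last thing done. */
static unsigned gcd_recursive(unsigned a, unsigned b) {
    if (b == 0)
        return a;
    return gcd_recursive(b, a % b);
}

/* The mechanical loop rewrite: "calling" the same function becomes
 * rebinding the parameters and jumping back to the top. */
static unsigned gcd_loop(unsigned a, unsigned b) {
    for (;;) {
        if (b == 0)
            return a;
        unsigned t = a % b;
        a = b;
        b = t;
    }
}
```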
| slaymaker1907 wrote:
| It can't be rewritten as a loop due to function pointers.
| Using JS notation to avoid noise:
|
|     function logAndCall(statement, func) {
|       console.log(statement);
|       return func();
|     }
|
| Tail call optimization is actually possible here since we
| call func in "tail position". It might be unlikely to blow
| up the stack, but it can definitely happen if you do a lot
| of continuation passing.
|
| Perhaps more commonly for C++/Rust, tail call optimization
| would be enormously valuable to have behind the scenes for
| destructors. It's actually very difficult to implement a
| linked list in safe Rust that doesn't explode the stack for
| large lists since no TCO is done for destroying subobjects.
| You can still avoid stack overflow, but you have to do
| things like manually enumerating the list.
| derefr wrote:
| I don't get the problem. The generated code for your
| example would return the same function pointer +
| evaluated args list to the wrapping trampoline (i.e. the
| code that contains the loop), where that loop expects
| each thing it invokes to return a sum type
| NextInvoke(func, args) | Return(value).
|
| And that's not a special case for passed-in function
| pointers; that's what a trampoline always does.
| Trampolines expect static-invoked tail-calls to be
| codegen'ed into function-pointer references too.
| keithwinstein wrote:
| Yes, wasm2c implements the Wasm tail-call feature with
| trampolines, exactly this way. (https://github.com/WebAssem
| bly/wabt/blob/main/test/wasm2c/ta... has an example.)
|
| Doing it with a trampoline is probably slower than if C
| really had tail calls. On the other hand, adding "real"
| tail calls to C would probably require changing the ABI
| (e.g. to "tailcc" or "fastcc -tailcallopt"), and I think
| there's some reason to think this would probably impose a
| penalty everywhere
| (https://llvm.org/docs/CodeGenerator.html#tail-call-
| optimizat...).
| haberman wrote:
| > On the other hand, adding "real" tail calls to C would
| probably require changing the ABI (e.g. to "tailcc" or
| "fastcc -tailcallopt")
|
| But [[musttail]] does exactly this while respecting
| existing calling conventions: https://clang.llvm.org/docs
| /AttributeReference.html#musttail
| keithwinstein wrote:
| No -- as discussed upthread, clang's musttail attribute
| requires the target function to have the same number of
| arguments as the caller and for each argument to be
| similar to the corresponding caller argument. That's
| stricter than the underlying LLVM musttail marker (when
| targeting the tailcc/swifttailcc calling conventions) and
| is too restrictive to implement Wasm's tail-call feature
| (and probably Scheme's, etc.), at least if arguments are
| getting passed to functions natively.
|
| It would be nice if the more relaxed rules of the LLVM
| musttail marker with tailcc could be exposed in clang
| (and gcc). I think that's basically what "return goto"
| would do.
| fuhsnn wrote:
| > [[musttail]] also ends the lifetime of local objects AFAICS.
|
| That's good to know. I had this github issue [0] in the back
| of my mind, as well as witnessing occasions of clang turning
| [[musttail]] into inner loops, and concluded clang's
| implementation must be more sophisticated than simply
| replacing calls with jumps. Just a little paranoia from
| trying to be serious with compiler dev[1]: fulfilling a laid-
| out spec feels more sound versus imitating something out
| there.
|
| > this constraint is unnecessarily strict
|
| I would agree, at least for the x86 psABI; it can be pretty
| flexible as long as the return value is in the same register
| and the argument stack doesn't exceed what's provided. The
| debug/profiling side might hate it, though.
|
| [0] https://github.com/llvm/llvm-project/issues/72555 [1]
| https://github.com/fuhsnn/slimcc/
| haberman wrote:
| I certainly understand your caution. I don't have intimate
| expertise with the implementation of musttail in the
| backend -- when I implemented musttail in Clang, it was
| piggybacking on an existing attribute from the LLVM IR:
| https://llvm.org/docs/LangRef.html#call-instruction
|
| That said, my rough understanding is that a tail call ends
| the lifetime of all objects in the old stack frame. It
| follows that it is UB to access any objects from the
| previous stack frame after a tail call, and that would
| include Gerben's first example in
| https://github.com/llvm/llvm-project/issues/72555
|
| Your slimcc project looks really interesting, thanks for
| the pointer.
| nu11ptr wrote:
| If added to Clang then I would assume this means it got support
| from LLVM. If so, does this mean that Rust can now rely on tail
| calls in LLVM to support something like the `become` keyword?
| ori_b wrote:
| The problem with doing it in rust is that most calls aren't
| tail calls, even if they look like tail calls. You need to
| invoke the destructors for any code path that can drop.
| nu11ptr wrote:
| Isn't that the purpose of `become`? I thought it was to say
| "this IS a tail call, error out if it is not". After that
| validation is done, then the compiler can drop as needed.
| iTokio wrote:
| The remaining issue with tail calls used to switch context is
| that you're using functions, which must follow a calling
| convention. And unfortunately calling conventions waste
| registers restoring state on function exit.
|
| See the luajit remake blog for an exhaustive analysis and
| alternative using an intermediate compiler
| https://sillycross.github.io/2022/11/22/2022-11-22/
| superdimwit wrote:
| Clang recently got a new calling convention that makes these
| tail calls much cheaper (avoids the need for the caller to
| preserve some registers). I can never remember the name - it's
| either preserve_all or preserve_none (whose perspective is the
| preservation from?).
| haberman wrote:
| preserve_none is the new one. It can be applied to the
| functions performing tail calls to allow them use of the full
| register space.
|
| I even saw an enhancement recently that will make
| preserve_none allocate arguments in the registers that are
| usually callee-saved: https://github.com/llvm/llvm-
| project/pull/88333
|
| This will make [[musttail]] + preserve_none a winning
| combination when used together, particularly when making non-
| tail calls to fallback functions that use a regular calling
| convention, because all the arguments to [[musttail]]
| functions can stay pinned in the callee-save registers.
|
| I'm delighted, because this matches what I originally
| proposed back in 2021, except I called it "reverse_ccc"
| instead of "preserve_none": https://github.com/haberman/llvm-
| project/commit/e8d9c75bb35c...
|
| preserve_all also exists, and has existed for a while. You
| could use it on fallback functions to help the tail calling
| functions avoid spilling. But this always seemed like an
| unsatisfactory solution to me, because it's too intrusive
| (and requires too much diligence) to tag a bunch of regular
| functions as preserve_all. It's much more practical to tag
| all the core interpreter functions as preserve_none.
| superdimwit wrote:
| Thanks!
| jeffbee wrote:
| I see signs that Google itself is using preserve_none
| internally, since the public protobuf repo has PROTOBUF_CC
| (Calling Convention) but it is defined as nothing:
|
|           #define PROTOBUF_CC
|
| Is there any chance of this getting out into the wild or is
| it too dangerous for us mortals?
| haberman wrote:
| Since preserve_none is Clang-only and only available on
| recent compilers, it would introduce an ABI hazard
| between the core library and generated code. We don't
| want to cause crashes if you compile protobuf with one
| compiler and generated code with another.
|
| Also, until quite recently preserve_none was incompatible
| with the sanitizers, but I believe this may have been
| fixed by: https://github.com/llvm/llvm-project/pull/81048
| hinkley wrote:
| I've seen a few languages over the years drop and reacquire JIT
| layers. Some of it has to do with programmer skill and lessons
| learned, but some is also down to CPU generation.
|
| Like everything else in CS, when the cost balance shifts for
| different kinds of operations, the best algorithm can shift
| back to something we haven't used in fifteen, twenty years. It
| contributes a lot to the faddishness of programming. Just
| because we are bringing something back doesn't mean there's no
| reason. But we forget the reasons it wasn't a panacea last time
| so that's still a problem.
|
| If your main JIT gets faster or slower, then the cost-benefit
| for running it changes, so the threshold to trigger it gets
| adjusted, and now the amount of code that runs in the other
| tiers shifts, which might make the amortized cost of that tier
| worse. It's like balancing a double pendulum.
|
| If you can make a JIT tier fast and dirty enough, you can skip
| the interpreter entirely. And, from my armchair position, it
| seems that the cognitive load of bookkeeping tasks between the
| interpreter and say two JITs is high enough that a few
| languages have mothballed the interpreter and used a JIT
| optimized for compile time not output speed.
|
| And I don't recall what language, but I'm pretty sure at least
| one team that did this ended up dropping an intermediate
| compiler as well, because of that balancing act I mentioned
| above. It was better to double down on two than to try to
| handle three.
| beyondCritics wrote:
| With gcc [1] and clang [2] you always had the option "-foptimize-
| sibling-calls", to get away with tail calls even for debug
| builds. Of course having this standardized, guaranteed and at the
| function level is a huge improvement.
|
| [1] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html [2]
| https://clang.llvm.org/docs/ClangCommandLineReference.html#t...
| chmaynard wrote:
| See also:
|
| https://news.ycombinator.com/item?id=26931581
| crabbone wrote:
| I wrote C Protobuf decoder / encoder as well as IML parser and
| bindings for Python. Here's something I have to say about
| measuring parsing speed.
|
| * Since this library is only offered as a binding to some
| managed languages, there are some extra caveats that, in terms
| of performance, overshadow everything else. I cannot speak to
| Ruby or PHP, but with Python I saw a dramatic speed increase
| when _not_ using enumerators. I.e. if you translate Protobuf's
| enumerators into Python's enumerators, any possible gains you
| may have in your C code will be trampled by the creation times
| of different Python objects. The difference is many orders of
| magnitude. Similarly, you could go even further and implement
| all the supporting data structures in C, and only expose a
| minimal interface to them in Python. How fair this sort of
| comparison would be against code that uses Python's built-in
| structures is a question I cannot answer.
|
| * Google's Protobuf parser for Python would still be "faster"
| than 2+GB/s because... it doesn't parse anything besides the
| top-level message. The internal structure of the message is
| parsed on demand. This will probably be slower than 2+GB/s if
| your code immediately reads everything that was parsed, but
| the obvious question is: how can you possibly compare these
| two approaches in practical terms? And, again, there isn't a
| clear answer, because the practical results will depend on the
| nature of your application.
|
| * In the general case, Protobuf parsing cannot be streamed
| (because of the handling of duplicates). This means that, in
| practice, code that parses Protobuf content will be
| bottlenecked on I/O (it needs to wait for the end of the
| message before any parsing starts). Independently, depending
| on the typical Protobuf message in your application, it might
| be possible to parallelize parsing, which will most likely
| outrun any single-threaded parser but, just as in the examples
| above, cannot be said to be a winning strategy in general.
|
| * It's usually _a lot_ more efficient to combine parsing with
| the creation of domain objects, which is a step an application
| will almost always have to take anyway. How this functionality
| is accessible from the parser will in many cases determine
| which parser wins the race.
|
| ----
|
| Bottom line: Protobuf (or maybe parsers in general) is just a
| bad candidate for measuring / comparing speeds. It's too
| low-level and poorly designed to be a good measuring stick
| for performance benchmarks.
| jeffbee wrote:
| > In general case, Protobuf parsing cannot be streamed (because
| of the handling of duplicates)
|
| I don't see how last field wins stops you from streaming parse.
| Please elaborate.
| crabbone wrote:
| Streaming in this context means that the parsing code hands
| off some parsed structures to the code outside of the parser
| before the entire message is processed. Suppose now you have
| this message:
|
|           { x: 1, y: 2, x: 3 }
|
| The streaming parser then reads field "x" with the value of
| "1", dispatches that value to the outside code, then reads
| "y", then reads "x" again, but the outside code had a side-
| effect associated with that field, which it already performed
| (eg. the outside code was printing the value of the field).
| Now, you have a program with an error. The right output
| should've been:
|
|           y: 2
|           x: 3
|
| but you got:
|
|           x: 1
|           y: 2
|           x: 3
|
| Might not be so bad, depending on circumstances, but if you
| are betting on a soccer game outcome...
| jeffbee wrote:
| I see. But if this ambiguous repetition must be resolved,
| then it must be resolved either at input or output time.
| Protobuf seems to have optimized for the output case by
| allowing for updates to scalar fields by appending.
| 10000truths wrote:
| You could easily design a stream parser that rejects
| duplicates before it passes them off to the user, by
| maintaining a set of already encountered keys within the
| parser state. The space overhead isn't a concern unless
| your map/set has millions of entries, but your non-
| streaming parser would have choked from the memory usage
| long before then, anyways.
| pizlonator wrote:
| The way interpreters usually achieve this kind of speedup in C++
| is using computed goto. Then there's no calling convention junk
| on the path from one opcode to the next.
|
| Also, the main reason why interpreters get faster using either
| the computed goto style or the tail style versus a classic switch
| loop is that it reduces pressure on the branch predictor since
| there's statically one indirect branch per opcode rather than
| statically just one indirect branch.
| tylerhou wrote:
| (As the article claims) even with computed goto, register
| assignment of the most frequently used variables is fragile
| because the CFG of the function is so complicated.
|
| Register assignment is much less fragile when each function is
| small and takes the most important variables by argument.
| pizlonator wrote:
| It's also fragile, in a different way, if you're threading
| state through tail calls.
|
| In my experience writing computed goto interpreters, this
| isn't a problem unless you have more state than what can be
| stored in registers. But then you'd also have that same
| problem with tail calls.
| haberman wrote:
| Fallback paths most definitely have more state than what
| can be stored in registers. Fallback paths will do things
| like allocate memory, initialize new objects, perform
| complicated fallback logic, etc. These fallback paths will
| inevitably spill the core interpreter state.
|
| The goal is for fast paths to avoid spilling core
| interpreter state. But the compiler empirically has a
| difficult time doing this when the CFG is too connected. If
| you give the compiler an op at a time, each in its own
| function, it generally does much better.
| pizlonator wrote:
| I get that and that's also been my experience, just not
| for interpreters.
|
| In interpreters, my experience is that fallback paths are
| well behaved if you just make them noinline and then
| ensure that the amount of interpreter state is small
| enough to fit in callee save regs.
| haberman wrote:
| Mike Pall makes an argument that interpreters are
| especially susceptible to this problem, and I find it
| convincing, since it matches my experience: https://web.a
| rchive.org/web/20180331141631/http://article.gm...
| pizlonator wrote:
| There are a bunch of arguments in there that don't match
| my experience, which includes the JSC interpreter. JSC
| had an interpreter written in C++ and one written in
| assembly, and the main reason for using the assembly one
| was not raw perf - it was so the interpreter knows the
| JIT's ABI for fast JIT<->interpreter calls.
|
| Mike's argument about control flow diamonds being bad for
| optimization is especially questionable. It's only bad if
| one branch of the diamond uses a lot more registers than
| the other, which as I said, can be fixed by using
| noinline.
| ufo wrote:
| Exactly. Computed goto helps with branch prediction, but does
| not help w.r.t register allocation & other compiler
| optimizations.
| pizlonator wrote:
| As I mentioned in another part of the thread - the way you
| get that under control in a computed goto interpreter (or a
| switch loop interpreter) is careful use of noinline.
|
| Also, it probably depends a lot on what you're
| interpreting. I've written, and been tasked with
| maintaining, computed goto interpreters for quite a few
| systems and the top problem was always the branches and
| never the register pressure. My guess is it's because all
| of those systems had good noinline discipline, but it could
| also just be how things fell out for other reasons.
| highfrequency wrote:
| > there's statically one indirect branch per opcode rather than
| statically just one indirect branch.
|
| Could you elaborate on this with a couple more paragraphs? What
| do you mean by one indirect branch per opcode rather than just
| one indirect branch? How is this achieved?
| pizlonator wrote:
| Sure.
|
| Say you write an interpreter like this:
|
|           for (;;) {
|             switch (curInst->opcode) {
|             case Inst::Foo:
|               setOperand(curInst->dest,
|                          foo(getOperand(curInst->a),
|                              getOperand(curInst->b)));
|               curInst += FooSize;
|               break;
|             case Inst::Bar:
|               setOperand(curInst->dest,
|                          bar(getOperand(curInst->a),
|                              getOperand(curInst->b)));
|               curInst += BarSize;
|               break;
|             ... more stuff like this
|             }
|           }
|
| Then the generated code will have an indirect branch for the
| switch. This is costly for two reasons:
|
| - Gotta load curInst->opcode, then lookup opcode in a
| compiler-generated jump table to get the code address of the
| case statement.
|
| - Gotta indirect jump.
|
| It turns out that the top cost (or even all of the cost) is
| the indirect jump. Maybe the table load is also some cost,
| at least on some CPUs. But remember, all jumps on modern CPUs
| are fast if predicted - regardless of the work seemingly
| required to do the jump. And they are slow if they are not
| predicted - and that slowness is about much more than the
| work required to do the jump (the cost is the CPU needing to
| roll back its speculations).
|
| Reason: the indirect jump prediction the CPU is doing is
| keyed by the address of the indirect jump instruction. There
| is one indirect jump in the code above. And for any real
| program you'll execute, that indirect jump will end up
| hitting all (or at least most) of the opcodes. Therefore, the
| CPU has no easy path to predicting the indirect jump. Maybe
| it'll sometimes get lucky with more sophisticated predictors
| (like ones that track history and use that to salt the key
| used to lookup the predicted destinations, or ones that do
| more fancy stuff, like maybe some neural network to predict).
|
| How do you make the indirect jump faster? Have one indirect
| jump per instruction! Both the computed goto approach and the
| tail call approach get you there.
|
| Consider the computed goto version of the interpreter.
|
|           Inst_foo_label:
|             setOperand(curInst->dest,
|                        foo(getOperand(curInst->a),
|                            getOperand(curInst->b)));
|             curInst += FooSize;
|             goto *curInst->handler;
|           Inst_bar_label:
|             setOperand(curInst->dest,
|                        bar(getOperand(curInst->a),
|                            getOperand(curInst->b)));
|             curInst += BarSize;
|             goto *curInst->handler;
|           ... more stuff like this
|
| In this formulation of the interpreter, there is one indirect
| jump per instruction, rather than just one for the
| interpreter. Therefore, we're asking the CPU's predictor to
| do something simpler: to predict, for each instruction, what
| the next instruction will be. And then on top of that, the
| predictor still gets to use its neural network, or history
| buffer, or whatever. In terms of mathematical probability
| theory, it's like we're giving the CPU a first-order Markov
| chain, and that's sure to improve prediction accuracy.
| Empirically, it improves it a lot and it's a big speed-up.
|
| Here's yet another way to think of it. If I asked you to
| predict, "what's the next instruction", without any knowledge
| of what the prior one was, then you'd have a hard time -
| you'd only be able to reason about which instructions are
| most likely generally. But this computed goto interpreter is
| instead asking: "if I tell you the instruction I just
| executed, what's the next instruction", which gives you more
| leverage. Maybe adds are often followed by moves, for
| example.
|
| The tail call style also achieves this, because each
| instruction's handler will have an indirect tail call
| (literally an indirect jump) to the next instruction, which
| again, gives the CPU that Markov chain goodness. So let's
| call both the computed goto style and the tail call style the
| "Markov style". If you could rewrite the switch statement so
| that there was one switch statement per instruction (and you
| could convince the compiler not to combine them into one,
| lmao) then that would also be a Markov-style interpreter.
|
| As for the cost of indirectly loading from the compiler's
| switch table, or other issues like pushing and popping
| registers: in my experimentation, these costs are like drops
| in the ocean compared to the cost of indirect jumps. Even
| with the Markov style interpreter, the CPU spends most of its
| time mispredicting and rolling back. So the details of how
| the work happens for individual instructions are usually less
| important than what you do about the prediction of that
| indirect jump.
| highfrequency wrote:
| Awesome, thank you for expanding. Now I see the intuition
| that branch prediction accuracy could be much higher using
| the knowledge of the last opcode, so this becomes a game of
| massaging the code to prod the CPU to use more inputs into
| its branch prediction. Also helpful color on your empirical
| observation that branch prediction accuracy dominates other
| factors like switch indirection and loading registers.
|
| There's one thing I'm still missing. How exactly do we
| force the CPU to use the last opcode in its branch
| prediction model? In your first switch example, the CPU
| "knows" the path it has followed to get to each iteration,
| so in theory it could use the information of the last node
| it visited (or the last two nodes, etc.) to aid branch
| prediction right?
|
| Related to that: in your second example, what exactly
| happens in `goto *curInst->handler;`? Doesn't this need to
| revert back to something like a switch statement which has
| the same problem? (Unless you are doing polymorphism /
| dynamic dispatch in that example? Which I assume has some
| performance penalty that you're saying is dwarfed by the
| extra branch prediction effectiveness). Analogous to the
| line in the OP's article that says `MUSTTAIL return
| dispatch(UPB_PARSE_ARGS);` - doesn't the generic dispatch
| function need another switch statement? Probably missing a
| couple obvious things here.
|
| Lastly - if you have any books/article recommendations that
| helped you learn some of these intricacies (esp. regarding
| intuition about which performance quirks matter vs. don't
| matter) that would be great as well. Thanks!
| pizlonator wrote:
| > How exactly do we force the CPU to use the last opcode
| in its branch prediction model?
|
| In the tail call case: because each opcode is implemented
| by a function that has an indirect tail call at the end.
|
| In the computed goto case: each `goto *curInst->handler` is
| its own indirect branch.
|
| > Related to that: in your second example, what exactly
| happens in `goto *curInst->handler;`?
|
| `curInst->handler` is a pointer that points to some
| label. The goto is an indirect branch to exactly that
| pointer value, i.e. that label.
|
| It's a super crazy and dangerous feature! It obviates the
| need for a switch; it _is_ the switch.
| hinkley wrote:
| Instead of a for loop where you hit a switch at the top of
| the loop, you jump to the code block for next instruction
| from the end of the previous instruction. That stops you from
| jumping to the top of the loop and then jumping a second
| time.
|
| On older CPUs, you're less likely to hit a pipeline stall.
| This technique was called "super-pipelining" for that reason.
| But a few years ago when this was discussed, someone pointed
| out that's usually not necessary anymore: branch predictors
| can see through double jumps now.
|
| But as I alluded to in another reply, CPU caches are finite,
| and I have doubts whether in a fully realized interpreter,
| particularly one living side by side with a JIT, if that
| microbenchmark is lying to you about how fast the inner loop
| is under production conditions.
| hinkley wrote:
| I've heard this went out of fashion a while back. That branch
| predictors got good enough that it's not necessary anymore.
|
| But I wonder if that stays true as the size of the interpreter
| increases.
| pizlonator wrote:
| My most recent data says it's still relevant.
|
| It might not matter for very small interpreters, but it does
| matter for anything substantial.
|
| Definitely worth remeasuring though.
| titzer wrote:
| Threaded dispatch is worth 15-30% in Wizard's fast
| interpreter.
| nly wrote:
| Perhaps the always_inline attribute could be useful to
| encourage the compiler to do the right thing as well.
| highfrequency wrote:
| Could someone clarify the example where an unlikely fallback path
| is wrapped in a musttail?
|
| My understanding is that musttail is useful because it prevents
| stack frame construction and register spilling; the calling
| function basically says "you can re-use the existing stack and
| not worry about my local variables."
|
| But doesn't the stack frame construction overhead / register
| spilling only occur when the fallback path is actually invoked;
| therefore if it is unlikely and you care less about performance
| in this slow path it doesn't matter if you wrap the fallback path
| in musttail?
|
| (Is this purely a branch prediction effect, where the
| _possibility_ of extra work needing to be done slows down the
| fast path even when the work does not need to be done?)
| fweimer wrote:
| If the fallback path can be invoked repeatedly for one message
| (depending on message size), the tail call is a correctness
| issue because your stack is probably not going to be large
| enough to keep all the fallback path frames.
| mbStavola wrote:
| For Rust enthusiasts, there is an old RFC[0] that would have
| added a "become" keyword which guaranteed tco. It was originally
| postponed in favor of focusing on the 2018 edition's goals (which
| was the right call) but the initiative has been revisited
| recently[1]. Maybe it'll make a comeback!
|
| [0]: https://github.com/rust-lang/rfcs/pull/1888 [1]:
| https://github.com/rust-lang/rfcs/pull/3407
| MathMonkeyMan wrote:
| One of the compilation passes in a Scheme compiler (e.g. Guile or
| Racket) is conversion to continuation passing style. Here the
| author applies the pass manually as a code design decision,
| because later passes of the compiler (i.e. the actual C compiler)
| produce better code given that input. It's neat.
| aidenn0 wrote:
| It mentions C++ support; it would seem to me that C++ has very
| few tail-calls. Consider:
|
|           void foo() {
|             auto a = SomeClassWithADestructor();
|             return bar();  // This is not a tail call because
|                            // the destruction of a happens
|                            // *after* the call to bar()
|           }
___________________________________________________________________
(page generated 2024-08-19 23:00 UTC)