[HN Gopher] Parsing Protobuf at 2+GB/S: How I Learned to Love Ta...
___________________________________________________________________
Parsing Protobuf at 2+GB/S: How I Learned to Love Tail Calls in C
Author : signa11
Score : 369 points
Date : 2021-04-25 10:04 UTC (12 hours ago)
(HTM) web link (blog.reverberate.org)
(TXT) w3m dump (blog.reverberate.org)
| z77dj3kl wrote:
| Very cool! OT but is there a list of protobuf benchmarks
| somewhere across languages/implementations? How does it compare
| to JSON or just dumping raw bytes?
| slver wrote:
| You'd get equivalent performance with "goto", but you wouldn't
| depend on implicit compiler optimizations to save you from stack
| overflow.
|
| I like the idea of tail calls, but I don't like it being an
| implicit "maybe" optimization of the compiler. Make it part of
| the standard, make the syntax explicit, and then both hands in.
| haberman wrote:
| You really wouldn't. We tried many approaches, including
| computed goto, but none of them were able to generate
| satisfactory code when compared with tail calls:
| https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
|
| Also, it is not an implicit optimization when "musttail"
| support is available in the compiler.
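|
| For reference, the explicit form looks roughly like this (toy
| sketch with made-up names, not the real upb code). With musttail,
| a call that cannot be lowered to a jump is a compile error rather
| than a silent deoptimization:
|
|     #define MUSTTAIL __attribute__((musttail))
|
|     static const char *dispatch(const char *ptr, const char *end,
|                                 int depth);
|
|     static const char *handle_one(const char *ptr, const char *end,
|                                   int depth) {
|       ptr++;  /* stand-in for real field decoding */
|       MUSTTAIL return dispatch(ptr, end, depth);
|     }
|
|     static const char *dispatch(const char *ptr, const char *end,
|                                 int depth) {
|       if (ptr >= end) return ptr;  /* the only non-tail return */
|       MUSTTAIL return handle_one(ptr, end, depth);
|     }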
| slver wrote:
| Tail call optimization is compiled as a JMP, and so is goto, so
| I'm curious where the difference would come from.
| PostThisTooFast wrote:
| One thing that appealed to us about protobufs is their alleged
| change-tolerance (versioning).
|
| Years later, this remains only alleged... not realized.
| a-dub wrote:
| even if you set aside the claimed gains over handwritten
| imperative loops, being able to tell the compiler to fail unless
| tail call conversion succeeds will be huge in terms of preventing
| performance or stack depth bugs from surfacing when someone
| accidentally makes a change that makes a tail call no longer
| convertible.
|
| if you ask me, it should be part of the language. the whole "just
| make sure you shouldn't need a stack and the compiler will
| probably take care of it" approach has bothered me for years.
| alfiedotwtf wrote:
| > the whole "just make sure you shouldn't need a stack and the
| compiler will probably take care of it" approach has bothered
| me for years.
|
| Good point. I love Rust's safety guarantees, but you're right -
| it doesn't take the stack into account, just like everyone else.
| coolreader18 wrote:
| It's funny, wasm3 was just on the front page with a tail-call
| based interpreter model, and now this is too. Now I wanna do some
| stuff with tail calls :)
| khiner wrote:
| This was a great read!
| tux1968 wrote:
| This is a really great result, but I'm curious why this isn't a
| very common and standard compiler optimization, at least as an
| option you can enable? It seems like the conditions where it can
| be applied are pretty easy to identify for a compiler.
| haberman wrote:
| Tail calls are a very common optimization, both Clang and GCC
| have been performing this optimization successfully for a
| while. What is new is getting a _guarantee_ that applies to all
| build modes, including non-optimized builds.
| tux1968 wrote:
| If you're interested in this optimization for performance
| reasons, why would you want an otherwise non-optimized build?
| It seems that the only important case is the optimized
| build... where for some reason you're not getting this
| optimization without explicitly asking for it.
|
| So the question remains... why is the compiler optimization
| missing this chance to optimize this tail call without it
| being explicitly marked for optimization?
| wffurr wrote:
| If you write this style of code and don't get the
| optimization, then your test code and debug sessions are
| dog slow. Much much slower than the old giant switch style
| parser.
|
| Similarly, if you make a mistake that would cause the
| optimization not to be applied, you'd rather get a compiler
| error than suddenly have a 10X performance regression.
| haberman wrote:
| If the optimization is not performed, you blow the stack
| because your stack usage is now O(n) instead of O(1).
|
| That's why it's important to get the optimization in debug
| mode.
| vitus wrote:
| What's the argument for building in debug mode vs
| specifying -g to the compiler to build with debug
| symbols?
|
| I've previously encountered bugs that stubbornly refused
| to reproduce in a non-optimized build.
|
| One thing that comes to mind is `#ifdef DEBUG`-style
| macros, although I'm curious if there's anything else I'm
| missing. Omitted stack frames, e.g. from inlining,
| perhaps?
| joppy wrote:
| One reason that this is not done by default is that it can
| make debugging a surprise: since each function does not
| leave a stack frame, stack traces are much less useful.
| This is not so bad in the case of a tail-recursive function
| which calls itself, but if proper tail calls are done
| between multiple functions, you could have A call B tail
| call C tail call D, and if D crashes the stack trace is
| just (D called from A).
|
| I'm sure there are some good ways around this problem, and
| I would love to have proper tail calls available in most
| languages.
| tyingq wrote:
| _" Applying this technique to protobuf parsing has yielded
| amazing results: we have managed to demonstrate protobuf parsing
| at over 2GB/s, more than double the previous state of the art."_
|
| I was somewhat surprised to see that was state of the art for
| protobufs. Simdjson boasts faster throughput, without the benefit
| of the length encoded headers that are in protobufs. I looked for
| examples of protobufs using SIMD, but could only find examples of
| speeding up varint encode/decode.
| haberman wrote:
| Article author here. I dumped my test protobuf payload (7506
| bytes) to JSON and got 19283 bytes.
|
| So parsing this particular payload at 2GB/s would be equivalent
| to parsing the JSON version at 5.1GB/s.
|
| SIMD doesn't end up helping protobuf parsing too much. Due to
| the varint-heavy nature of the format, it's hard to do very
| much SIMD parallelism. Instead our approach focuses on going
| for instruction-level parallelism, by trying to remove
| instruction dependencies as much as possible. With this design
| we get ~3.5 instructions per cycle in microbenchmarks (which
| represents a best case scenario).
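|
| To give a feel for why: each byte of a varint carries a
| continuation bit that gates the next load, so a straightforward
| decoder is a serial chain of dependent loads and branches
| (sketch, not the optimized upb code):
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Base-128 varint decode: the test on each byte decides
|        whether another byte is read, so there is little for SIMD
|        lanes to do in parallel. */
|     static const char *decode_varint(const char *ptr,
|                                      const char *end,
|                                      uint64_t *out) {
|       uint64_t val = 0;
|       for (int shift = 0; ptr < end && shift < 64; shift += 7) {
|         uint8_t b = (uint8_t)*ptr++;
|         val |= (uint64_t)(b & 0x7f) << shift;
|         if (!(b & 0x80)) { *out = val; return ptr; }
|       }
|       return NULL;  /* truncated or overlong varint */
|     }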
| jeffbee wrote:
| Making a flagrantly wasteful data format and then using its
| bloated extent as the numerator in your benchmark will not
| exactly be a fair comparison. If a protobuf has a packed,
| repeated field that looks like \x0a\x02\x7f\x7f and json has
| instead { "myFieldNameIsBob": [ 127, 127 ] } the JSON
| interpreter has to be 20x faster just to stay even.
| tyingq wrote:
| That's true, would be interesting to see an "encoded entities
| per second" comparison. Or maybe a comparison with mostly
| stringy data where the size is probably comparable.
| haberman wrote:
| Article author here. I agree that would be a very
| enlightening benchmark. Protobuf can dump to JSON, so it
| shouldn't be too much work to dump my benchmark data to
| JSON and benchmark the parsing with simdjson. Maybe I'll
| see if I can get this done while this article is still on
| the front page. :)
| tyingq wrote:
| Ah, wow, that's great. Apples-to-apples since you can
| dump the same data to JSON. And shows why HN remains a
| unique place. Wonder out loud, and maybe get an answer
| within an update to the article you just read :)
| haberman wrote:
| So I was not able to get simdjson going in time, but I
| did dump my test payload to JSON. In protobuf the payload
| was 7506 bytes, in JSON 19283 bytes.
|
| So parsing this particular payload at 2GB/s would be
| equivalent to parsing it at 5.1GB/s in JSON.
| kelnos wrote:
| I don't know if this is a fair comparison, as 2GB/s of protobuf
| will parse a lot more _information_ than 2GB/s of JSON will,
| since protobuf is a much more space-efficient way to encode
| your data.
| tyingq wrote:
| See https://news.ycombinator.com/item?id=26934063 , we may
| get a fairly sound comparison. Though I imagine the
| comparison varies a lot depending on the data. As another
| comment mentioned, a message with a lot of large strings
| would lean heavily towards protobufs. And in that case, the
| size wouldn't be much different.
| beached_whale wrote:
| I would just measure the size of the resulting data
| structures. In a language like C/C++ it could be as simple as
| measuring memory usage before and after your parsing function.
| secondcoming wrote:
| What benefit would length encoded headers provide other than to
| reduce payload size? With JSON you just have to scan for
| whitespace, whereas with protobuf you actually have to decode
| the length field.
| tyingq wrote:
| Not having to look at every byte to split out the contents.
| jeffbee wrote:
| Right, a protobuf message that consists of a single 100MB
| string will be "decoded" dramatically faster than 2GB/s,
| because you only need visit the tag and the length and
| store the address and length, aliasing the input.
|
| It's really quite impossible to discuss data encoding
| performance outside of a given schema.
| jsnell wrote:
| Isn't there also a significant difference in what the input is
| being parsed to? My expectation is for a protobuf library to
| parse messages to structs, with names resolved and giving
| constant time field access. Simdjson parses a json object to an
| iterator, with field access being linear time and requiring
| string comparisons rather than just indexing to a known memory
| offset.
|
| I.e. it seems like simdjson trades off performance at access
| time for making the parsing faster. Whether that tradeoff is
| good depends on the access pattern.
| touisteur wrote:
| But the same could be true for protobuf. Decode fields only
| when you need them, and 'parse' just to find the field
| boundaries and cardinality. Did stuff like that for internal
| protobuf-like tool and with precomputed message profiles you
| can get amazing perf. Just get the last or first bit of most
| bytes (vgather if not on AMD) and you can do some magic.
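|
| A sketch of that boundary pass over the standard wire format
| (illustrative only, not the internal tool): tags and lengths are
| decoded, values are merely skipped and their extents recorded.
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Base-128 varint decode; returns the position past the
|        varint, or NULL on truncated/overlong input. */
|     static const char *read_varint(const char *p, const char *end,
|                                    uint64_t *out) {
|       uint64_t v = 0;
|       for (int shift = 0; p < end && shift < 64; shift += 7) {
|         uint8_t b = (uint8_t)*p++;
|         v |= (uint64_t)(b & 0x7f) << shift;
|         if (!(b & 0x80)) { *out = v; return p; }
|       }
|       return NULL;
|     }
|
|     static int index_fields(const char *p, const char *end) {
|       while (p < end) {
|         uint64_t tag, val, len;
|         if (!(p = read_varint(p, end, &tag))) return 0;
|         switch (tag & 7) {           /* wire type */
|           case 0:                    /* varint: find its end only */
|             if (!(p = read_varint(p, end, &val))) return 0;
|             break;
|           case 1: if (end - p < 8) return 0; p += 8; break;
|           case 5: if (end - p < 4) return 0; p += 4; break;
|           case 2:                    /* length-delimited */
|             if (!(p = read_varint(p, end, &len))) return 0;
|             if (len > (uint64_t)(end - p)) return 0;
|             p += len;
|             break;
|           default: return 0;         /* groups not handled here */
|         }
|         /* record (tag >> 3, field start, p) in an index here */
|       }
|       return 1;
|     }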
| beached_whale wrote:
| You cannot compare them. Decoding JSON is essentially
| compressing it, and in order to compare you would need to look at
| the resulting data structures and how long it takes. Strings are
| comparable, but an integer is bigger by something like exp2(
| log10( digit_count ) ), I think.
|
| But yeah, I had the same first thought too.
| rapidlua wrote:
| I've opened an issue in LLVM bugzilla concerning a jump not
| being folded with address computation on x86, with a proposed
| fix. Would love for it to get some attention.
| https://bugs.llvm.org/show_bug.cgi?id=50042
|
| Also been working on a C language extension to enable guaranteed
| tail calls along with explicit control over registers used for
| argument passing. Provided that callee-save registers are used
| for arguments, calling fallback functions incurs no overhead.
|
| https://github.com/rapidlua/barebone-c
| haberman wrote:
| Barebone C looks really interesting! I wonder if the same gains
| could be achieved in a more architecture-independent way by
| simply introducing a new calling convention where parameters
| are passed in the registers that are normally callee-save.
|
| This would let the fast path functions use all registers
| without needing to preserve any of them. It would also let them
| call fallback functions that use a normal calling convention
| without needing to spill the parameters to the stack.
| rapidlua wrote:
| Thank you for the feedback! A new calling convention could
| probably nail it for many use cases. Sometimes you want
| something very special though. E.g. LuaJIT pre-decodes the next
| instruction prior to dispatch to squeeze some work into the
| branch-misprediction delay. There are limited callee-save
| registers available, hence it is better to use volatile
| registers for instruction arguments. It should be possible to
| reuse the implementation on different architectures and only
| tweak the register assignment.
| neoteric wrote:
| One thing I've been thinking about in C++ land is just how much
| the idiomatic usage of RAII actually prevents the compiler from
| doing its own tail call optimization. Any object instantiated in
| automatic storage with a non-trivial destructor basically
| guarantees the compiler _can't_ emit a tail call. It's rather
| unfortunate, but perhaps worth it if the tradeoff is well
| understood. Example: https://godbolt.org/z/9WcYnc8YT
| ufo wrote:
| You can still use this trick in C++ if you ensure that the
| return statements are outside the scopes that have the RAII
| objects. It's awkward but it's better than nothing.
| https://godbolt.org/z/cnbjaK85T
|
|     void foo(int x) {
|       {
|         MyObj obj;
|         // ...
|       }
|       return bar(x); // tail call
|     }
| gumby wrote:
| I am unable to understand this comment. You're saying that you
| can't generate a tail call by returning from the middle of a
| function which needs to clean things up. RAII* is merely
| syntactic sugar to write this control flow and make it
| mandatory.
|
| Perhaps it's easier to think of tail-call semantics as simply
| implementing iteration: it's another way of expressing a
| for(...) loop. And if you used RAII in the body of your for
| loop you would expect the same thing.
| neoteric wrote:
| Absolutely, RAII is an abstraction (and a useful one), but it
| has a cost in that it prevents a form of useful optimization
| because cleanup is required at the destruction of the stack
| frame. You'd expect the same in C if you explicitly had to
| call a cleanup function on return from a call.
|
| What C++ does with RAII is make this tradeoff not obvious.
| std::unique_ptr is a great example to show this: colloquially
| a std::unique_ptr is "just a pointer", but it isn't in this
| case because its non-trivial destructor prevents TCO.
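|
| For example, in plain C (sketch): if cleanup has to run after the
| callee returns, the frame has to stay alive; running cleanup
| first (when that is semantically fine) restores the tail
| position.
|
|     void cleanup(void *res);
|     int bar(int x);
|
|     int not_a_tail_call(void *res, int x) {
|       int r = bar(x);   /* frame must survive the call... */
|       cleanup(res);     /* ...because this runs after bar() */
|       return r;
|     }
|
|     int tail_call(void *res, int x) {
|       cleanup(res);     /* done with res before the call */
|       return bar(x);    /* now eligible for a jmp */
|     }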
| hvdijk wrote:
| I can understand the comment: even if the cleanup can be done
| before the call, allowing the call to be done as a tail call,
| the compiler will not move up the cleanup code for you, and
| having the cleanup code last is the default state. With local
| variables defined inside for loops, they are always destroyed
| before the next iteration.
| skybrian wrote:
| These tail-call functions are part of a program's inner loop.
| It seems like we shouldn't expect allocation within an inner
| loop to be fast? Other than local variables, that is.
|
| In an interpreter, it seems like either you're allocating
| outside the interpreter loop (the lifetime is that of the
| interpreter) or it's within a particular (slow) instruction, or
| it's part of the language being interpreted and the lifetime
| can't be handled by RAII. There will be fast instructions that
| don't allocate and slower ones that do.
|
| Interpreters are a bit weird in that there are lots of
| instructions that do almost nothing so the overhead of the loop
| is significant, combined with having lots of complication
| inside the loop and little ability to predict what instruction
| comes next. This is unlike many loops that can be unrolled.
| sesuximo wrote:
| I suppose the compiler could reorder function calls if it can
| prove there is no change in behavior? If so, then it could
| hoist dtors above the call and emit a jump. I doubt any
| compilers do this.
| _3u10 wrote:
| Aren't protobufs supposed to be faster than JSON?
|
| I mean congrats but it doesn't seem that impressive given JSON
| has been parsing at that speed for years.
| alfiedotwtf wrote:
| I've never thought of tail calls in depth before, so quick
| question about them - the point I'm getting here is to minimise
| the pushing/popping of registers to the stack from `call` by
| using a `jmp` instead. That's all good... but what stack frame
| does the jumped procedure use? I'm assuming a new stack frame is
| still created, just without the caller's pushed registers?
| dreamcompiler wrote:
| There's always some stack frame sitting there; typically the
| one that called the main loop routine in the first place. At
| some point a RET instruction will be executed, and control will
| return to that stack frame's return point. The idea of tail
| calls is to avoid creating a new stack frame when it doesn't
| add any value, but stack frames still exist--they have to exist
| when the last thing that happens in a routine is ordinary
| instruction processing rather than a CALL.
| pyler wrote:
| About the jne issue: LLVM does it intentionally.
|
| Check line 1521
|
| https://llvm.org/doxygen/BranchFolding_8cpp_source.html
| ndesaulniers wrote:
| Nice find! I wonder what they mean by changing branch
| direction?
| haberman wrote:
| Wow thanks for the pointer! That seems unfortunate, I wonder if
| there is a way to evaluate whether the extra jump is actually
| worth it, and whether this optimization could be allowed.
| neilv wrote:
| Scheme has had tail calls as a basic idiom (some call it Proper
| Implementation of Tail Calls; others call it Tail Call
| Optimization) forever, and I keep them in mind any time I'm
| implementing anything nontrivial in Scheme.
|
| There's no special syntax -- there's just the notion of _tail
| positions_ , from which tail calls can occur. For example, both
| arms of an `if` form are tail position, if the `if` itself is.
| And if you introduce a sequencing block in one of those arms, the
| last position in the block is tail position. (A specialized IDE
| like DrRacket can indicate tail positions, and DrRacket will even
| hover arrows, tracing the tail positions back to the top of the
| code.)
|
| When implementing a state machine, for example, a state
| transition is simply a function application (call) to the new
| state (optionally with arguments), where the called function
| represents a state. It's satisfying when you realize how elegant
| something that used to be messy is, and you can imagine it being
| very fast (even if your Scheme implementation isn't quite up to
| getting as fast as a C or Rust compiler that can recognize and
| special-case tail calls.)
|
| (For the state machine example, compare to more conventional
| approaches in C, of "state ID" variables, `switch` statements on
| those, and record-keeping for additional state. Or doing it in
| data, with state ID being index into arrays, again with any
| additional recordkeeping. Or lots of function calls when you
| didn't really need the time overhead and stack usage of function
| calls with full returns.)
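|
| A rough C analogue of that state-machine style, using the
| attribute from the article (purely illustrative): each state is
| a function, a transition is a guaranteed tail call, and the
| machine runs in constant stack space however long the input is.
|
|     #define MUSTTAIL __attribute__((musttail))
|
|     typedef struct { const char *input; int count; } machine;
|
|     static int state_done(machine *m) { return m->count; }
|     static int state_b(machine *m);
|
|     static int state_a(machine *m) {
|       char ch = *m->input++;
|       if (ch == '\0') { MUSTTAIL return state_done(m); }
|       if (ch == 'b')  { MUSTTAIL return state_b(m); }
|       MUSTTAIL return state_a(m);
|     }
|
|     static int state_b(machine *m) {
|       m->count++;                  /* count the 'b' characters */
|       MUSTTAIL return state_a(m);
|     }
|
| Compared with a switch over a "state ID", each transition is just
| a jmp and each state's hot variables can live in registers.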
| CodeArtisan wrote:
| Any functional programming language has to have TCO, since
| statements are forbidden (everything is a function).
| [deleted]
| bla3 wrote:
| It does make code harder to debug when something goes wrong.
| Since tail calls create no stack frames, a call sequence f->g->h
| where g tail calls h will show up as a stack of (f, h) in the
| debugger. The fix is easy, just make MUSTTAIL expand to nothing
| in debug builds. But it's something to keep in mind, and it means
| your debug mode code will have different memory use
| characteristics.
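|
| i.e. something along these lines (sketch, using NDEBUG as the
| stand-in for "debug build"):
|
|     #ifdef NDEBUG
|     #define MUSTTAIL __attribute__((musttail))
|     #else
|     #define MUSTTAIL  /* plain calls, real frames, in debug */
|     #endif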
| lupire wrote:
| Programmers have traditionally relied on the stack as an
| (expensive yet incomplete) execution logging system, but the
| easy fix for that is to use an actual log structure instead
| of abusing the stack.
| jhgb wrote:
| Seems like not much of a fix if your code depends on it. In
| Scheme such behavior would break standard compliance (and lots
| of programs). And since presumably you'd use this attribute
| _precisely_ when your code depended on it, disabling it may not
| be feasible.
|
| Fortunately the patterns generated by such code are no more
| difficult to debug than simple loops in other languages, so the
| lack of "history" of your computation seems no more concerning
| with tail calls than it is with loops (or at least that's how I
| always looked at it). And people don't seem to be avoiding
| loops in programming merely for the reason of loops not keeping
| track of their history for debugging purposes.
| cryptonector wrote:
| A while loop with a switch also creates no stack frames for
| each operation interpreted.
| ndesaulniers wrote:
| If using DWARF for unwinding there is a tag IIRC that
| represents that inlining has occurred.
| haberman wrote:
| That is true, but there is always https://rr-project.org for
| easy reverse debugging if you're having trouble figuring out
| where you came from.
|
| If the alternative is to drop to assembly, a C-based approach
| seems quite easy to debug. You can just add printf()
| statements! Previously when I had been using assembly language
| or a JIT, I had to resort to techniques like this:
| https://blog.reverberate.org/2013/06/printf-debugging-in-ass...
| jhgb wrote:
| > but there is always https://rr-project.org for easy reverse
| debugging if you're having trouble figuring out where you
| came from
|
| How did I not know about this? Thanks!
| jpgvm wrote:
| rr is really a hidden treasure. C/C++ wasn't the same for
| me after discovering it.
|
| It's sort of like the first time you learn about
| interactive debugging. Everything you did before that
| suddenly feels like "caveman coding" and you never want to
| go back.
| elcritch wrote:
| Perhaps a "ring logger" approach could be useful. Append the
| more useful bits of what would normally go into the stack
| frame but without the allocation overhead. Seems like a few
| memcpy's and modulo operations wouldn't hurt the code gen too
| much.
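|
| Something like this, maybe (sketch):
|
|     #include <string.h>
|
|     /* Fixed ring of small trace records; the write index wraps,
|        so the last N "frames" are always available without
|        keeping real stack frames around. */
|     #define RING_LEN 256  /* power of two so the wrap is a mask */
|
|     typedef struct { const void *where; unsigned short op; } entry;
|
|     static entry ring[RING_LEN];
|     static unsigned ring_pos;
|
|     static inline void trace(const void *where, unsigned short op) {
|       entry e = { where, op };
|       memcpy(&ring[ring_pos & (RING_LEN - 1)], &e, sizeof e);
|       ring_pos++;
|     }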
| kevincox wrote:
| Of course RR is too expensive to be deployed in production
| builds. So if you are getting core dumps from "production"
| you won't have this information. So while RR helps it doesn't
| completely mitigate the tradeoff.
| [deleted]
| touisteur wrote:
| You may be interested in Intel PT snapshots in Linux core
| dumps, which gdb knows how to interpret and will give you a
| detailed branch trace. Less cool than rr but still very
| interesting!
| ufo wrote:
| Only if the alternative is non-tail function calls. Often the
| alternative to tail calls is a while loop, which also doesn't
| leave a stack trace.
| leoc wrote:
| As so often, the moral is that pushing your problems in-band
| doesn't make them go away.
| OskarS wrote:
| > The fix is easy, just make MUSTTAIL expand to nothing in
| debug builds.
|
| That wouldn't work. The final call is a recursive call to
| dispatch; if that is not a tail call, it will blow the stack
| instantly.
| pjmlp wrote:
| Lisp and Scheme graphical debuggers do just fine tracking them.
| [deleted]
| CyberRabbi wrote:
| Interesting but it seems like the essential function of the tail
| call in this example is to create a strong hint to the compiler
| which values should stay in registers. Isn't there a better way
| to do that than relying on a seemingly unrelated optimization?
| AndyKelley wrote:
| Here's a related Zig proposal:
| https://github.com/ziglang/zig/issues/8220
|
| Relevant section pasted here:
|
| > Other Possible Solution: Tail Calls
|
| > Tail calls solve this problem. Each switch prong would return
| foo() (tail call) and foo() at the end of its business would
| inline call a function which would do the switch and then tail
| call the next prong.
|
| > This is reasonable in the sense that it is doable right now;
| however there are some problems:
|
| > * As far as I understand, tail calls don't work on some
| architectures. (what are these? does anybody know?)
|
| > * I'm also concerned about trying to debug when doing dispatch
| with tail calls.
|
| > * It forces you to organize your logic into functions. That's
| another jump that maybe you did not want in your hot path.
|
| See also https://dino-lang.github.io/learn/dino.pdf, section 3.1
| "Byte Code Dispatch".
| haberman wrote:
| It sounds like the main question in the proposal is: how can we
| reliably get the compiler to generate threaded/replicated
| dispatch?
|
| While that is important, it is only one small part of what we
| were trying to achieve with tail calls. Ultimately we found
| that our tail call design significantly improved the register
| allocation and overall code generation compared with computed
| goto, see:
| https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
| seg_lol wrote:
| I really want a timeline where we have tail calls everywhere
| and this drama can go away. The non-tail call folks feel like
| folks arguing for not using spaces in text.
|
| https://en.wikipedia.org/wiki/Scriptio_continua
| slaymaker1907 wrote:
| I do think tail call optimization can make debugging/stack
| traces confusing if you do general TCO. In the plain single
| function case, TCO probably makes it easier to debug, but
| it can get confusing if function A calls B and then B calls
| C in tail position so the stack trace is just C|A. It's
| probably not too bad if it is just one function getting
| elided, but this gets very difficult if multiple important
| calls are being removed from the stack.
|
| This could also be a case of insufficient tooling for
| languages with TCO. Better support for reverse debugging
| would make detailed stacks less necessary. Most of the
| stack trace isn't that useful anyway when printing out
| exceptions compared to good application logging.
| spockz wrote:
| Don't we have debug symbols or something like it to be
| able to translate the optimised version into the thing
| you saw in code? If you need to dive into the assembly
| you are in for a hell of a ride anyway due to all kinds
| of optimisations happening.
| seg_lol wrote:
| And tail recursive functions encode their state in the
| args and the body. There isn't any information in the
| stack anyway, by definition.
|
| If one does need to debug mutually tail recursive
| functions, step through them, this isn't a problem. The
| _point_ is that they don't consume stack space. If you
| _do_ want a stack of invocations, then don't write tail
| recursive code?
|
| This style of argumentation that the GP showed should
| have a name.
|
| BTW my original comment is at -1, hate but only in the
| tail position.
| haberman wrote:
| I have to admit I had little interest in tail calls myself
| until I saw what they could achieve performance-wise. Now
| that I have seen that, I am sold. :)
| seg_lol wrote:
| Performance is only one small aspect, the things you can
| express in tail recursive functions are beautiful and
| compact. If feels like being in an inductive proof
| playground.
| SavantIdiot wrote:
| Just here to say that the use of those #defines for huge
| parameter lists makes me sad. I realize that's a common pattern,
| but if your call list is that big, how about a struct pointer?
| haberman wrote:
| That would defeat the entire purpose: they need to be six
| parameters to ensure that they are all in registers.
|
| However I was considering whether they could be a struct with
| six members passed by value. If the ABI would pass that in six
| registers we could get rid of the #defines, which I agree would
| be nicer.
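|
| For context, the pattern in question looks roughly like this
| (illustrative names, not the actual upb ones): six scalar
| parameters so the x86-64/ARM64 calling conventions keep all of
| them in registers across the tail calls.
|
|     #include <stdint.h>
|
|     #define PARSE_PARAMS \
|       void *state, const char *ptr, const char *end, \
|       void *table, uint64_t hasbits, uint64_t data
|     #define PARSE_ARGS state, ptr, end, table, hasbits, data
|
|     __attribute__((noinline))
|     static const char *fallback(PARSE_PARAMS) {
|       (void)state; (void)table; (void)hasbits; (void)data;
|       return ptr < end ? ptr + 1 : end;   /* slow path */
|     }
|
|     static const char *parse_one(PARSE_PARAMS) {
|       if (ptr >= end) return end;
|       /* ...fast path, everything stays in registers... */
|       __attribute__((musttail)) return fallback(PARSE_ARGS);
|     }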
| ufo wrote:
| Unfortunately, I think that only structs with at most 2 words
| are passed via registers. Larger structs are passed via the
| stack.
|
| https://godbolt.org/z/frof3xjhW
| haberman wrote:
| Bummer, thanks for the info.
| londons_explore wrote:
| This article suggests a big reason to use this approach is to
| separate hot and cold code.
|
| I assume that's for good use of the CPU's instruction and
| microcode caches.
|
| Yet in a protobuf parser, I'm surprised there is enough code to
| fill said caches, even if you put the hot and cold code together.
| Protobuf just isn't that complicated!
|
| Am I wrong?
| tooltower wrote:
| In particular, the author is talking about CPU registers being
| spilled to memory, and the need for setting up or tearing down
| stack frames. Those things can only be eliminated by the
| compiler for extremely simple functions, and the error-handling
| code often isn't that simple.
| haberman wrote:
| > I assume that's for good use of the CPU's instruction and
| microcode caches.
|
| I don't think that is the reason. These are microbenchmark
| results, where realistically all the code will be hot in caches
| anyway.
|
| The problem is that a compiler optimizes an entire function as
| a whole. If you have slow paths in the same function as fast
| paths, it can cause the fast paths to get worse code, even if
| the slow paths are never executed!
|
| You might hope that using __builtin_expect(), aka
| LIKELY()/UNLIKELY() macros on the if statements would help.
| They do help somewhat, but not as much as just putting the slow
| paths in separate functions entirely.
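|
| A toy illustration of the difference (not the upb code): moving
| the rare case into its own noinline function keeps its spills and
| register needs from shaping the fast path, which
| __builtin_expect() alone can't fully do.
|
|     #define UNLIKELY(x) __builtin_expect(!!(x), 0)
|
|     __attribute__((noinline, cold))
|     static int handle_rare_case(int v) {
|       /* imagine error handling, fallback decoding, etc. */
|       return -v * 31;
|     }
|
|     static int fast_path(int v) {
|       if (UNLIKELY(v < 0)) return handle_rare_case(v);
|       return v + 1;   /* tight straight-line code */
|     }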
| touisteur wrote:
| Similar if using the 'hot' attribute: separate functions or
| separate labels... https://stackoverflow.com/a/15034687
| zozbot234 wrote:
| Is this really higher performance than the existing protobuf
| support in Rust+serde? That uses macro programming to generate
| code at compile time based on a high-level description of your
| data format, so it can be quite fast and will certainly be a lot
| safer than cowboy-coding raw C.
| linkdd wrote:
| Every single time there is a link about C/C++, I do a quick
| "Ctrl+F Rust", and every single time there is some rant about
| "but Rust is safer than C".
|
| Every. Single. Time.
|
| This article is about a new Clang extension:
|
| > An exciting feature just landed in the main branch of the
| Clang compiler. Using the [[clang::musttail]] or
| __attribute__((musttail)) statement attributes, you can now get
| guaranteed tail calls in C, C++, and Objective-C.
|
| (literally the first line of the article)
|
| What does Rust have to do with this? Nothing.
|
| Does the author suggest that you should only use C for high
| performance? No.
| littlestymaar wrote:
| > Every single time there is a link about C/C++, I do a quick
| "Ctrl+F Rust", and every single time there is some rant about
| "but Rust is safer than C".
|
| If you had the same reflex with other languages when reading
| other HN comment threads, you'd realize that every time there's
| a thread about _language A_ there's a ton of comments about
| _language B or C_. Rust threads are full of people talking
| about C++ or C, so why is that supposed to be more OK? Go threads
| are full of comments about Java, Python threads are full of Go
| or Julia (depending on what kind of Python this is about).
| This isn't specific to Rust in any way.
|
| Yes, the GP isn't the most useful comment ever, Rust is hype
| and there are overly enthusiastic people, but there are also
| people who seem to be overly defensive about it (doing a text
| search to count Rust occurrences, every time, really?) and
| I'm not sure the latter is less childish than the former.
| linkdd wrote:
| > If you had the same reflex with other languages
|
| I do, always, just out of curiosity because seeing peoples
| view on different languages is always insightful.
|
| > This isn't specific to Rust in any ways.
|
| From what I've seen, Rust comments are the most
| condescending.
|
| > I'm not sure the latter is less childish than the former.
|
| Is it childish to look for specific information (in this
| specific case: how C and Rust are compared by others) in
| threads that grow to hundreds of messages?
|
| FYI, I've also looked for C++ and Go comments in this
| thread, and none of them were a condescending comment like
| "C is bad, you should use Rust, C is evil and should not be
| sent to production".
| zozbot234 wrote:
| > This article is about a new Clang extension:
|
| > An exciting feature just landed in the main branch of the
| Clang compiler. Using the [[clang::musttail]] or
| __attribute__((musttail)) statement attributes, you can now
| get guaranteed tail calls in C, C++, and Objective-C.
|
| > What does Rust have to do with this? Nothing.
|
| Rust has a planned keyword for this exact feature, namely
| 'become'. So the author would be able to use it in Rust just
| as well, as soon as support for it lands in an upstream LLVM
| release.
|
| Regardless, writing raw C as the article suggests in order to
| parse a quasi-standard high-level format is cowboy coding.
| It's a nice hack to be sure, but it's not the kind of code
| that should be anywhere close to a real production workload.
| Instead, this feature should be implemented as part of some
| safe parser generator. Not necessarily written in Rust but
| something that's at least as safe.
| Cloudef wrote:
| I think you might need a reality check and see how many
| systems run C code in production.
| linkdd wrote:
| No cowboy would treat their servers like cattle.
|
| Joke aside, the Linux kernel is mainly developed in C and
| runs in production everywhere.
| linkdd wrote:
| > Rust has a planned keyword for this exact feature, namely
| 'become'. So the author would be able to use it in Rust
| just as well, as soon as support for it lands in an
| upstream LLVM release.
|
| The author is demonstrating a new Clang feature. Then,
| again, what is the point of mentioning Rust?
|
| > writing raw C as the article suggests in order to parse a
| quasi-standard high-level format is cowboy coding.
|
| So what? Cowboy coding lets you appreciate even more
| languages where you don't need to do it, AND gives you a
| better understanding of how they might work.
|
| > it's not the kind of code that should be anywhere close
| to a real production workload.
|
| Again, not a single time in the article the author suggests
| that.
| ma2rten wrote:
| Protobufs are very important for Google. A significant percentage
| of all compute cycles is used on parsing protobufs. I am
| surprised that the parsing is not done using handwritten
| assembly if it's possible to improve performance so much.
| mkoubaa wrote:
| Given that fact I'm wondering if google ever researched custom
| chips or instruction sets for marshalling pbs, like the TPUs
| they worked on for ML.
| lupire wrote:
| Problem is once you parse the protobuf, you have to
| immediately do other computations on it in the same process.
| No one needs to parse protobufs all day long like an ML model
| or doing hashes for crypto.
| jeffbee wrote:
| That doesn't seem to preclude hardware assistance. For
| example they have also explored hardware acceleration for
| the tcmalloc fast path allocation by adding circuits to
| general purpose CPUs. Arguably, Intel BMI descends from a
| request that Google made years ago to speed up protobuf
| decoding (and other workloads) on x86.
| xxpor wrote:
| Handwritten ASM for perf is almost never worth it in modern
| times. C compiled with GCC/Clang will almost always be just as
| fast or faster. You might use some inline ASM to use a specific
| instruction if the compiler doesn't support generating it yet
| (like AVX512 or AES), but even for that there's probably an
| intrinsic available. You can still inspect the output to make
| sure it's not doing anything stupid.
|
| Plus it's C so it's infinitely more maintainable and way more
| portable.
| pjmlp wrote:
| Yet Microsoft was able to make a .NET implementation faster
| than Google's current C++ one.
|
| A proof that they don't care enough about protobuf parsing
| performance.
|
| https://devblogs.microsoft.com/aspnet/grpc-performance-impro...
| haberman wrote:
| From your link:
|
| > Support for Protobuf buffer serialization was a multi-year
| effort between Microsoft and Google engineers. Changes were
| spread across multiple repositories.
|
| The implementation being discussed there is the main C#
| implementation from https://github.com/protocolbuffers/protob
| uf/tree/master/csha...
| pjmlp wrote:
| I know, the point was that they don't care to improve the
| C++ version.
| boulos wrote:
| haberman is part of the "they" here (he's on the core
| protobuf team, does things like this).
| haberman wrote:
| The C++ parser has been completely rewritten in the last
| few years to push the current code generation design to
| its limits. The biggest obstacle in the way of optimizing
| it further is legacy APIs, especially std::string for
| string fields, which forces tons of small heap
| allocations and frees and defeats the benefits of arena
| allocation.
|
| Changing very widely used APIs is not easy, these things
| take time.
| pjmlp wrote:
| Interesting, thanks for the clarification. However, why do
| libraries depend on what should be implementation details
| of generated code to start with?
|
| Something that, by good CI/CD practices, shouldn't even be
| part of checked-in code.
| haberman wrote:
| It's exposed in the API. When you access a string field
| in protobuf, say foo.my_str_field(), the return type is
| const std::string&. We have to have a real std::string
| internally to return a reference to it.
| pjmlp wrote:
| Thanks for the explanation.
| barbazoo wrote:
| In what kind of scenarios do they use Protobufs? I can think of
| messaging systems, streaming, RPC, that sort of thing?
| ch33zer wrote:
| Everything at google is built on RPCs between systems using
| protobufs
| ma2rten wrote:
| Protobuf is also used as a format to write data to disk.
| alephnan wrote:
| Why is this being down voted? It's accurate.
| CoolGuySteve wrote:
| Protobuf's abysmal performance, questionable integration into
| the C++ type system, append-only expandability, and annoying
| naming conventions and default values are why I usually try and
| steer away from it.
|
| As a lingua franca between interpreted languages it's about par
| for the course but you'd think the fast language should be the
| fast path (ie: zero parsing/marshalling overhead in Rust/C/C++,
| no allocations) as you're usually not writing in these
| languages for fun but because you need the thing to be fast.
|
| It's also the kind of choice that comes back to bite you years
| into a project if you started with something like Python and
| then need to rewrite a component in a systems language to make
| it faster. Now you not only have to rewrite your component but
| change the serialization format too.
|
| Unfortunately Protobuf gets a ton of mindshare because nobody
| ever got fired for using a Google library. IMO it's just not
| that good and you're inheriting a good chunk of Google's
| technical debt when adopting it.
| haberman wrote:
| Zero parse wire formats definitely have benefits, but they
| also have downsides such as significantly larger payloads,
| more constrained APIs, and typically more constraints on how
| the schema can evolve. They also have a wire size
| proportional to the size of the schema (declared fields)
| rather than proportional to the size of the data (present
| fields), which makes them unsuitable for some of the cases
| where protobuf is used.
|
| With the techniques described in this article, protobuf
| parsing speed is reasonably competitive, though if your
| yardstick is zero-parse, it will never match up.
| CoolGuySteve wrote:
| Situations where wire/disk bandwidth are constrained are
| usually better served by compressing the entire stream
| rather than trying to integrate some run encoding into the
| message format itself.
|
| You only need to pay for decompression once to load the
| message into ram rather than being forced to either make a
| copy or pay for decoding all throughout the program
| whenever fields are accessed. And if the link is bandwidth
| constrained then the added latency of decompression is
| probably negligible.
|
| The separation of concerns between compression format and
| encoding also allows specifically tuned compression
| algorithms to be used, for example like when switching
| zstd's many compression levels. Separating the compression
| from encoding also lets you compress/decompress on another
| processor core for higher throughput.
|
| Meanwhile you can also do a one shot decompression or skip
| compression of a stream for replay/analysis; when talking
| over a low latency high bandwidth link/IPC; or when
| serializing to/from an already compressed filesystem like
| btrfs+zstd/lzo.
|
| It's just more flexible this way with negligible drawbacks.
| gorset wrote:
| It's obviously possible to do protobuf with zero
| parsing/marshalling if you stick to fixed length messages and
| 4/8 byte fields. Not saying that's a good idea, since there
| are simpler binary encodings out there when you need that
| type of performance.
| fnord123 wrote:
| FWIW, the python protobuf library defaults to using the C++
| implementation with bindings. So even if this is a blog post
| about implementing protobuf in C, it can also help
| implementations in other languages.
|
| But yes, once you want real high performance, protobuf will
| disappoint you when you benchmark and find it responsible for
| all the CPU use. What are the options to reduce parsing
| overhead? flatbuffers? xdr?
| TheGuyWhoCodes wrote:
| Flatbuffers, Cap'n'Proto and Apache Arrow comes to mind.
| PostThisTooFast wrote:
| So important that they haven't bothered to create a protobuf
| generator for Kotlin, the primary development language for
| their own mobile operating system.
| ufo wrote:
| A very interesting trick! In my opinion, the big takeaway here is
| that if you are willing to write your C code in tail call style,
| you can get a lot of control over which variables are stored in
| machine registers. In the x86-64 ABI, the first 6 function
| parameters are guaranteed to be passed via registers.
|
| Obviously, this is architecture-specific. Other architectures may
| use a different calling convention that ruins this kind of manual
| register allocation. I'd imagine that it would be bad for
| performance in 32-bit systems, where function arguments are
| passed via the stack.
|
| --------
|
| > I think it's likely that all of the major language interpreters
| written in C (Python, Ruby, PHP, Lua, etc.) could get significant
| performance benefits by adopting this technique.
|
| I know that at least Lua and Python use "computed gotos" in their
| inner loops, which also helps the compiler generate better code.
| The architecture-dependent nature of the tail-call trick could be
| a problem here. Some of these interpreters also need to work well
| on 32-bit systems.
| haberman wrote:
| Yep, that's exactly the right takeaway. :) I have verified this
| on ARM64 also, but architectures that pass parameters on the
| stack will make this technique not really worth it.
|
| Re: PHP, Ruby, etc. yes, computed gotos help some, but I think
| tail calls would help much more. That was our experience at
| least. I wrote more about this here:
| https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
|
| Yes, there are portability issues with the tail call approach,
| so there would need to be a fallback on non-x64/ARM64 platforms.
| This would add complexity. But it's exciting to think that it
| could unlock significant performance improvements.
| spockz wrote:
| If you like tail calls, look into CPS. Many forms of (pure)
| code can be rewritten in such a way.
|
| Everyone who writes Haskell quickly learns to write their
| recursive functions so that they are (mutually) tail
| recursive.
| ufo wrote:
| On the matter of portability, I wonder if it would be
| possible to use some macro magic and/or a code generator to
| convert the tail-call version back into a more traditional
| while/switch loop, for the stack-based architectures.
|
| While the tail call version is more architecture dependent,
| it's nevertheless more portable than assembly language. It's
| still C.
| haberman wrote:
| I haven't implemented the fallback yet, but my thought was
| to keep most of the structure the same (dispatch function
| tail calls to the function for an individual operation) but
| then have the individual operation just return instead of
| tail calling back to dispatch.
|
| I think this would be portable while still avoiding some of
| the performance problems associated with a traditional
| switch (like the bounds check).
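|
| Roughly this shape, say (sketch): the handlers keep the same
| signatures, but on the fallback path they return the new position
| and a plain loop drives them instead of a tail call back into
| dispatch.
|
|     typedef struct { const char *ptr; const char *end; } state;
|
|     typedef const char *(*op_fn)(state *s, const char *ptr);
|
|     static const char *handle_op(state *s, const char *ptr) {
|       (void)s;
|       return ptr + 1;   /* stand-in for real field decoding */
|     }
|
|     static const char *run(state *s) {
|       const char *ptr = s->ptr;
|       while (ptr && ptr < s->end) {
|         op_fn h = handle_op;   /* real code picks h from a table */
|         ptr = h(s, ptr);       /* return, not a tail call */
|       }
|       return ptr;              /* NULL would signal an error */
|     }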
| elcritch wrote:
| I'm curious how this approach would fare for
| microcontrollers, well, Cortex M0-M4s actually. They're
| 32-bit ARM and may not have enough useful call registers.
| Still, parsing protobufs or interpreting WASM faster
| would translate directly into better battery life, etc.
| tomp wrote:
| Modern compilers can compile tail calls of unknown functions
| (i.e. function pointers) to jumps; so instead of using
| "computed gotos" (which IIRC is a GCC extension), one can use
| ANSII C _and_ get more efficient code (because of control of
| registers)
| mappu wrote:
| Regarding PHP specifically, the interpreter's opcodes are all
| described in a meta language, and the actual PHP interpreter
| can be code-generated in 4 different styles depending on what
| is fastest for the target C compiler.
|
| See ZEND_VM_KIND_CALL / SWITCH / GOTO / HYBRID in php-
| src/Zend/zend_vm_gen.php
| kevincox wrote:
| > the first 6 function parameters are guaranteed to be passed
| via registers.
|
| This assumes that your function is being called via the regular
| calling convention. By the as-if rule there is nothing
| guaranteeing that an internal call is using the regular calling
| convention or that it hasn't been inlined.
| haberman wrote:
| We use the "noinline" attribute to get control over inlining,
| which gives a reasonable assurance that we'll get a normal
| standard ABI call with six parameters in registers on
| x64/ARM64.
| anarazel wrote:
| Are there cases where a compiler reasonably would internally
| use a different calling convention that supports fewer
| arguments passed via register passing?
|
| Yours is still a valid point (no reason doesn't imply a
| guarantee) even if there's none, to be clear. Just curious
| because I can't really see any reason for a compiler to do
| so.
|
| I can imagine some theoretical cases where compiler
| optimizations lead to additional arguments to be passed to a
| function [version].
| kevincox wrote:
| I agree it is unlikely. I can imagine something like the
| compiler realizing that some computation is pure, so it
| caches the value across invocations of the function,
| essentially adding additional arguments that push the actual
| arguments onto the stack. Of course this is an unlikely
| edge case, but it's a reasonable set of possible optimizations
| that would work together in a surprising way.
| iamevn wrote:
| Wouldn't you still be able to get a lesser improvement from the
| tail call being able to overwrite the stack frame and jumping
| instead of calling?
___________________________________________________________________
(page generated 2021-04-25 23:00 UTC)