[HN Gopher] C Runtime Overhead (2015)
___________________________________________________________________
C Runtime Overhead (2015)
Author : Zababa
Score : 128 points
Date : 2022-01-03 17:41 UTC (5 hours ago)
(HTM) web link (ryanhileman.info)
(TXT) w3m dump (ryanhileman.info)
| cozzyd wrote:
| So, on my system (Fedora 34, glibc 2.33, gcc11, Ryzen 5 3600) a
| trivial program takes only about 1 ms according to time:
| $ cat trivial.c
| #include <stdio.h>
| int main(int nargs, char ** args)
| {
|     FILE * f = fopen("/dev/null","w");
|     for (int i = 0; i < nargs; i++) {
|         fprintf(f,"%s\n",args[i]);
|     }
|     return 0;
| }
| $ make trivial    # just default CFLAGS
| $ time ./trivial
| real    0m0.001s
| user    0m0.000s
| sys     0m0.001s
|
| But if I add strace -tt, it does indeed take 9 ms, even if I
| redirect strace output:
| $ time strace -tt ./trivial 2> /dev/null
| real    0m0.009s
| user    0m0.005s
| sys     0m0.004s
|
| So, is the author just measuring strace overhead?
| justicezyx wrote:
| But the 1ms number is also measured with strace.
|
| Plus, the article is from 2015, and the author did not
| mention the CPU or other configuration, so that also makes
| things muddy.
| cozzyd wrote:
| Yes, but if there are almost no syscalls, then strace is
| doing a lot less.
|
| Profiling my trivial program with callgrind shows
| (unsurprisingly) that the majority of the time is in dynamic
| library relocation and whatever __GI__tunables_init does:
| https://i.imgur.com/Yligh7S.png
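|
| (For anyone who wants to reproduce that kind of profile, a
| minimal recipe, assuming valgrind is installed:
| $ valgrind --tool=callgrind ./trivial
| $ callgrind_annotate callgrind.out.<pid>
| kcachegrind can render the same output graphically.)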
| matu3ba wrote:
| strace can have a significant overhead:
| https://www.brendangregg.com/blog/2014-05-11/strace-wow-much...
| What does the output of `perf trace` say?
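|
| (Something like this, assuming a reasonably recent perf;
| -s prints a per-syscall summary with timings:
| $ perf trace -s ./trivial
| perf trace hooks the raw_syscalls tracepoints rather than
| using ptrace, so its overhead is far lower than strace's.)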
| dang wrote:
| Discussed at the time:
|
| _C Runtime Overhead_ -
| https://news.ycombinator.com/item?id=8958867 - Jan 2015 (31
| comments)
| ginko wrote:
| I never quite understood why assembly wasn't part of programming
| language shootout competitions. C was always treated as the speed
| of light for computing, which didn't quite make sense to me.
| pjscott wrote:
| In my experience there's usually not much speed advantage to be
| had from assembly _unless_ you have a specific thesis about how
you're going to do better than the code that a good compiler
| would generate; e.g. the article author skipping C runtime
| setup overhead, or Mike Pall dissecting the ARM implementation
| of one of the bytecodes in LuaJIT (recommended!):
|
| https://www.reddit.com/r/programming/comments/hkzg8/author_o...
|
| Unless you can think of something you can do in assembly that a
| compiler just wouldn't be able to think of, their ability to
| generate good machine code is really quite impressive:
|
| https://ridiculousfish.com/blog/posts/will-it-optimize.html
| jart wrote:
| Normally it's better to focus on high-level time-complexity
| improving optimizations but there's still a whole lot of
| things where you really need assembly micro-optimizations,
| since they usually offer a 10x or 100x speedup. It's just
| that those things usually only concern core libraries. Stuff
| like crc32. But the rest of the time we're just writing code
| that glues all the assembly optimized subroutines together.
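|
| A toy illustration of the gap (a sketch, assuming an x86-64
| CPU with SSE4.2; compile with -msse4.2 -O2; the function
| names are made up for the example):
| #include <nmmintrin.h>   /* SSE4.2 CRC32C intrinsics */
| #include <stddef.h>
| #include <stdint.h>
| #include <string.h>
|
| /* Portable bitwise CRC32C: eight shift/xor steps per byte. */
| uint32_t crc32c_soft(uint32_t crc, const uint8_t *p, size_t n)
| {
|     crc = ~crc;
|     for (size_t i = 0; i < n; i++) {
|         crc ^= p[i];
|         for (int b = 0; b < 8; b++)
|             crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
|     }
|     return ~crc;
| }
|
| /* Hardware CRC32C: one instruction per 8 bytes. */
| uint32_t crc32c_hw(uint32_t crc, const uint8_t *p, size_t n)
| {
|     uint64_t c = ~crc;
|     while (n >= 8) {
|         uint64_t v;
|         memcpy(&v, p, 8);   /* avoids unaligned-access UB */
|         c = _mm_crc32_u64(c, v);
|         p += 8; n -= 8;
|     }
|     while (n--)
|         c = _mm_crc32_u8((uint32_t)c, *p++);
|     return ~(uint32_t)c;
| }
| On large buffers the hardware version is typically an order
| of magnitude or more faster.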
| eatonphil wrote:
| Awesome post! And (2015) for the title maybe.
| Zababa wrote:
| Thanks, added!
| 2ton_jeff wrote:
| A couple of years ago I did a "Terminal Video Series" segment
| that also highlights the runtime overhead of 12 other languages,
| with C of course being one of the best ones:
| https://2ton.com.au/videos/tvs_part1/
| tialaramex wrote:
| I was actually surprised at how expensive C is here; I'd
| expected maybe a dozen or so syscalls to set up the environment
| the provided runtime wants, but that's a lot of syscalls for
| not very much value. Both C++ and Rust are more expensive but
| they're setting up lots of stuff I might use in real programs,
| even if it's wasted for printing "hello" - but what on Earth
| can C need all those calls for when the C runtime environment
| is so impoverished anyway?
|
| Go was surprising to me in the opposite direction, they're
| setting up a pretty nice hosted environment, and they're not
| using many instructions or system calls to do that compared to
| languages that would say they're much more focused on
| performance. I'd have expected to find it closer to Perl or
| something, and it's really much lighter.
| aidenn0 wrote:
| glibc isn't particularly optimized for startup-time. It also
| defaults to dynamic linkage which adds several syscalls, and
| even if it is statically linked, it may dynamically load NSS
| plugins. (musl gets around the latter by hardcoding support
| for NSCD, a service that can cache the NSS results).
|
| C11 adds several things that benefit from early
| initialization (like everything around threads), but GCC had
| some level of support for most of them before that.
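|
| (Easy to see directly with strace's summary mode; the exact
| counts vary by distro and glibc version:
| $ gcc -o hello hello.c && strace -c ./hello
| $ gcc -static -o hello hello.c && strace -c ./hello
| The dynamic build adds the openat/mmap/fstat traffic needed
| to find and map the ld.so cache and libc.so before main ever
| runs.)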
| stabbles wrote:
| What's kinda neat about musl libc is that the interpreter/runtime
| linker and libc.so are the same executable, so that if you just
dynamically link to libc, it does not need to open the ld.so cache or
| locate and load libc.so.
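|
| (Visible with readelf on a musl-linked binary; paths are
| distro-specific, but something like
| $ readelf -l ./hello | grep interpreter
| should name /lib/ld-musl-x86_64.so.1, which is the same file
| as, or a symlink to, musl's libc.so.)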
| averne_ wrote:
| More importantly this uses the devkitPro/libnx homebrew
| toolchain, while the OP project uses the official SDK (the
| "complicated legal reasons" behind the absence of code probably
| being an NDA signature).
| josefx wrote:
| > I happened to run strace -tt against my solution (which
| provides microsecond-accurate timing information for syscalls)
|
| Weirdly, I always found strace's timing results below about
| one millisecond generally unreliable; it just seemed to add
| too much overhead itself.
|
| Also making system calls from your own code varies between
| prohibited and badly supported on most OSes. Some see calls that
| didn't pass through the system libc as a security issue and will
| intercept them, while Linux may just silently corrupt your
| process memory if you try something fancy, as the Go team had to
| find out.
| mananaysiempre wrote:
| Umm, what's the story with Go on Linux? My understanding is
| that Linux explicitly supports making syscalls from your own
| code, it's just that on platforms where the syscall story is a
| giant mess (32-bit x86) the prescription is to either use the
| slow historical interface (INT 80h) or to jump to the vDSO
| (which will do SYSENTER or SYSCALL or whatever--SYSENTER in
| particular has the lovely property of _not saving the return
| address_ , so a common stub is pretty much architecturally
| required[1]).
|
| If I'm guessing correctly about what you're referring to, the
| Go people did something in between, got broken in a Linux
| kernel release, complained at the kernel people, the kernel got
| a patch to unbreak them. The story on macOS or OpenBSD, where
| the kernel developers will cheerfully tell you to take a hike
| if you make a syscall yourself, seems much worse to me.
|
| (And yes, I'd say there _is_ a meaningful difference between a
| couple of instructions in the vDSO and Glibc's endless
| wrappers.)
|
| [1]:
| https://lore.kernel.org/lkml/Pine.LNX.4.44.0212172225410.136...
|
| ETA: Wait, no, I was thinking about the Android breakage[2].
| What's the Go story then?
|
| [2]:
| https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...
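|
| For the curious, the stable x86-64 convention is small
| enough to show inline (a freestanding sketch, not from the
| article; build with something like gcc -nostdlib -static;
| rcx and r11 are clobbered by the syscall instruction):
| /* raw.c: write(2) and exit(2) with no libc at all */
| static long sys3(long nr, long a, long b, long c)
| {
|     long ret;
|     __asm__ volatile ("syscall"
|                       : "=a"(ret)  /* rax: nr in, result out */
|                       : "a"(nr), "D"(a), "S"(b), "d"(c)
|                       : "rcx", "r11", "memory");
|     return ret;
| }
|
| void _start(void)   /* entry point when built with -nostdlib */
| {
|     sys3(1, 1, (long)"hello\n", 6);   /* __NR_write */
|     sys3(60, 0, 0, 0);                /* __NR_exit  */
| }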
| aw1621107 wrote:
| This gets rather over my head, but my best guess is this bug
| having to do with the Go runtime underestimating how much
| stack space vDSO code may need (blog post at [0], fix at
| [1]).
|
| [0]: https://marcan.st/2017/12/debugging-an-evil-go-runtime-
| bug/
|
| [1]: https://github.com/golang/go/commit/a158382b1c9c0b95a7d4
| 1865...
| mananaysiempre wrote:
| > _I built 32 kernels, one for each bit of the SHA-1
| prefix, which only took 29 minutes._
|
| Oh, that _is_ evil, thank you. I even encountered a link to
| this article a couple of months ago[1], but wasn't hooked
| enough to go and read it.
|
| Though the conclusion sounds somewhat dubious: the kernel
| doesn't document or control stack usage limits for the
| vDSO, they happen to blow way up on a system built with
| obscure (if arguably well-motivated) compiler options, a
| language runtime that tries to minimize stack usage crashes
| as a result, and somehow the fault is with the runtime in
| question for not going via libc (which happens to virtually
| always run with a large stack and a guard page, thus
| turning this insidious concurrent memory corruption bug
| into a mere extremely unlikely crash)?
|
| More like we're collectively garbage at accounting for our
| stack usage. To be fair to the kernel developers, I would
| also never guess, looking at this implementation of
| clock_gettime() [2], that you could compile it in such a
| way that it ends up requiring 4K of stack space on pain of
| memory corruption _in concurrent programs only_ (it
| originating in the kernel source tree has little to do with
| the bug, it's just weirdly-compiled userspace C code
| executing on top of a small unguarded stack).
|
| [1] https://utcc.utoronto.ca/~cks/space/blog/unix/UnixAPIAn
| dCRun... via <https://utcc.utoronto.ca/~cks/space/blog/prog
| ramming/CStackS...>, <https://utcc.utoronto.ca/~cks/space/b
| log/unix/StackSizeLimit...>, and <https://utcc.utoronto.ca/
| ~cks/space/blog/tech/StackSizeLimit...>.
|
| [2] https://elixir.bootlin.com/linux/latest/C/ident/__cvdso
| _cloc...
| badsectoracula wrote:
| > Also making system calls from your own code varies between
| prohibited and badly supported on most OSes.
|
| This is not the case on Linux, though: the official API of the
| kernel is its system calls, not any specific C library.
| mgaunard wrote:
| That's the difference between precision and accuracy.
|
| Having more precision doesn't magically make the data accurate.
| commandlinefan wrote:
| > making system calls from your own code varies between
| prohibited and badly supported
|
| Which is probably at least part of the reason that libc itself
| ends up being slower than a specific, targeted solution.
| scottlamb wrote:
| josefx> Also making system calls from your own code varies
| between prohibited and badly supported on most OSes. Some see
| calls that didn't pass through the system libc as a security
| issue and will intercept them, while Linux may just silently
| corrupt your process memory if you try something fancy as the
| Go team had to find out.
|
| On Linux, making syscalls directly is fine. Good point about
| other platforms, but many people only care about Linux, for
| better or worse. And the author's last paragraph (quoted below)
| suggests using an alternate/static-linking-friendly libc, not
| making direct syscalls yourself. Presumably on platforms where
| those alternate libcs aren't available, you continue using the
| standard libc.
|
| ryanhileman> If you're running into process startup time issues
| in a real world scenario and ever actually need to do this, it
| might be worth your time to profile and try one of the
| alternative libc implementations (like musl libc or diet libc).
|
| IMHO, the Go vDSO problem [1] wasn't due to making direct
| syscalls but basically calling a userspace library without
| meeting its assumptions. I'd describe Linux's vDSO as a
| userspace library for making certain syscalls with less
| overhead. (If you don't care about the overhead, you can call
| them as you'd call any other syscall instead.) It assumed the
| standard ABI, in which there's a guard page as well as
| typically a generous amount of stack to begin with. Golang
| called into it from a thread that used Go's non-standard ABI
| with less stack space available (Go's own functions check the
| stack size on entry and copy the stack if necessary) and no
| guard page. On some Linux builds (with -fstack-check,
apparently used by Gentoo Hardened), it actually used enough
| stack to overflow. Without a guard page, that caused memory
| corruption.
|
| [1] https://github.com/golang/go/issues/20427
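|
| (The overhead difference is easy to measure; a sketch,
| assuming x86-64 Linux with a glibc that routes clock_gettime
| through the vDSO:
| #include <stdio.h>
| #include <time.h>
| #include <sys/syscall.h>
| #include <unistd.h>
|
| static double secs(void)
| {
|     struct timespec t;
|     clock_gettime(CLOCK_MONOTONIC, &t);
|     return t.tv_sec + t.tv_nsec / 1e9;
| }
|
| int main(void)
| {
|     struct timespec ts;
|     enum { N = 1000000 };
|
|     double t0 = secs();
|     for (int i = 0; i < N; i++)   /* vDSO path, no kernel entry */
|         clock_gettime(CLOCK_MONOTONIC, &ts);
|     double t1 = secs();
|     printf("vdso:    %.0f ns/call\n", (t1 - t0) / N * 1e9);
|
|     t0 = secs();
|     for (int i = 0; i < N; i++)   /* forced trap into the kernel */
|         syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
|     t1 = secs();
|     printf("syscall: %.0f ns/call\n", (t1 - t0) / N * 1e9);
|     return 0;
| }
| Running it under strace also shows the first loop making no
| syscalls at all.)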
___________________________________________________________________
(page generated 2022-01-03 23:00 UTC)