[HN Gopher] Linux Syscall Support
___________________________________________________________________
Linux Syscall Support
Author : btdmaster
Score : 143 points
Date : 2024-11-01 22:14 UTC (6 days ago)
(HTM) web link (chromium.googlesource.com)
(TXT) w3m dump (chromium.googlesource.com)
| sph wrote:
| See also Linux's nolibc headers, which allow one to write C
| software that completely bypasses libc and instead operates
| directly through syscalls.
|
| https://github.com/torvalds/linux/tree/master/tools/include/...
|
| A sample use-case? I was developing an Erlang-like actor platform
| that should operate under Linux as well as a bare-metal
| microkernel, and all I needed was a light layer over syscalls
| instead of pulling the entire glibc. Also it provides a simple
| implementation for standard C functions (memcpy, printf) so I
| don't have to write them myself.
| guerrilla wrote:
| > What would be a use-case?
|
| Maybe bootstrapping a new language with no dependencies.
| sph wrote:
| Yes. Go for example doesn't use glibc and instead interfaces
| with syscalls directly.
|
| https://pkg.go.dev/syscall
| cyberpunk wrote:
| I'm aware of this but I really don't see the benefits of this
| approach; It causes issues in eg openbsd where you can only
| call syscalls from libc, and it seems like they're trying
| to outsmart the os developers and I just don't see an
| advantage.
|
| Is it faster? More stable?
| oguz-ismail wrote:
| > It causes issues in eg openbsd where you can only call
| syscalls from libc
|
| OpenBSD allows making syscalls from static binaries as
| well. If Go binaries are static, it shouldn't cause any
| problems.
| kbolino wrote:
| > OpenBSD allows making syscalls from static binaries as
| well.
|
| Do you have a source for this? My Google searches and
| personal recollections say that OpenBSD does not have a
| stable syscall ABI in the way that Linux does and the
| proper/supported way to make syscalls on OpenBSD is
| through dynamically linked libc; statically linking libc,
| or invoking the syscall mechanism it uses directly,
| results in binaries that can be broken on future OpenBSD
| versions.
| cesarb wrote:
| > > OpenBSD allows making syscalls from static binaries
| as well.
|
| > Do you have a source for this?
|
| One article from 2019 about this can be found at
| https://lwn.net/Articles/806776/ (later updates
| https://lwn.net/Articles/949078/ and
| https://lwn.net/Articles/959562/). Yes, it does not have
| a stable system call ABI, but as long as your program was
| statically compiled with the libc from the same OpenBSD
| release, AFAIK it should work.
| oguz-ismail wrote:
| Yeah. Do you have any information as to how/when the
| OpenBSD system call ABI has changed recently? I wouldn't
| expect that to happen very often.
| kbolino wrote:
| I upvoted for the great links, but I still don't think a
| static binary that will break in the future is meeting
| the expectations many have when static linking.
| tolciho wrote:
| Go recently got run through the wringer to remove
| syscalls (and various Go ports are probably still broken)
| due to pinsyscalls.
| masklinn wrote:
| > I just don't see an advantage.
|
| You don't have to deal with C ABI requirements with
| respect to stack, or registers management. You also don't
| need to do dynamic linking.
|
| On the other hand all of that comes back to bone you if
| you're trying to benefit from vDSO without going through
| a libc.
| titzer wrote:
| > You also don't need to do dynamic linking.
|
| This is a big one. Linking against libc on many platforms
| also means making your binaries relocatable. It's a lot
| of _unnecessary, incidental complexity_.
| rcxdude wrote:
| It also means giving up ASLR, though.
| titzer wrote:
| You can still randomize heap allocations (but not with as
| much entropy), as usually the heap segment is quite
| large. But you don't get randomization of, e.g. the code.
|
| ASLR is a weak defense. It's akin to randomizing which of
| the kitchen drawers you'll put your jewelry in. Not the
| same level of security as say, a locked safe.
|
| Attacks are increasingly sophisticated, composed of
| multiple exploits in a chain, one of which is some form
| of ASLR bypass. It's usually one of the easiest links in
| the chain.
| LegionMammal978 wrote:
| > On the other hand all of that comes back to bone you if
| you're trying to benefit from vDSO without going through
| a libc.
|
| At least the vDSO functions really don't need much in the
| way of stack space: generally there's nothing much there
| but clock_gettime() and gettimeofday(), which just read
| some values from the vvar area.
|
| The bigger pain, of course, is actually looking up the
| symbols in the vDSO, which takes at least a minimal ELF
| parser.
| t-8ch wrote:
| The kernel also provides a minimal vdso elf parser:
|
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
| lin...
| masklinn wrote:
| > At least the vDSO functions really don't need much in
| the way of stack space: generally there's nothing much
| there but clock_gettime() and gettimeofday(), which just
| read some values from the vvar area.
|
| And yet that's exactly one of the things Go fucked up in
| the past: https://marcan.st/2017/12/debugging-an-evil-go-
| runtime-bug/
| titzer wrote:
| There are several advantages to using kernel syscalls
| directly:
|
| 1. No overhead from libc; minimizes syscall cost
|
| 2. No dependency on libc and C language ABI/toolchains
|
| 3. Reduced attack surface. libc can and does have bugs
| and potentially ROP or Spectre gadgets.
|
| 4. Bootstrapping other languages, e.g. Virgil
| kllrnohj wrote:
| > 1. No overhead from libc; minimizes syscall cost
|
| The few nanoseconds of a straight function call are
| absolutely irrelevant vs the 10s of microseconds of a
| syscall cost _and_ you lose out on any of the
| optimizations a libc has that you might not or didn't
| think about (like memoization of getpid() ) _and_ you
| need to take on keeping up with syscall evolution / best
| practices which a libc generally has a good handle on.
|
| > No dependency on libc and C language ABI/toolchains
|
| This obviously doesn't apply to a C syscall header,
| though, such as the case in OP :)
| gpderetta wrote:
| A syscall can be way less than 10us. Especially if it is
| not doing I/O.
| SAI_Peregrinus wrote:
| GNU aren't the OS developers of the Linux kernel. Think
| of the Go standard library on Linux as another libc-level
| library. On the BSDs there is a single libc that's part
| of the OS, on Linux there are several options for libc.
| josefx wrote:
| Didn't they go back to Glibc in 2017 after a syscall
| silently corrupted several of their tightly packed tiny Go
| stacks? The page you link to seems to refer to a proposal
| from 2014 as "new".
| melodyogonna wrote:
| That is the documentation for the Go syscall package. If
| you scroll down to the bottom of the page you'll see
| links to the source files.
| jsheard wrote:
| IIRC that was specifically on macOS and other BSDs which
| don't have a stable syscall interface. They still use raw
| syscalls on Linux, which guarantees syscall stability on
| pain of Linus Torvalds yelling at you if you break it.
| titzer wrote:
| I'm 100% with Linus on this one.
| cesarb wrote:
| > Didn't they go back to Glibc in 2017 after a syscall
| silently corrupted several of their tightly packed tiny
| Go stacks?
|
| You must be thinking of
| https://marcan.st/2017/12/debugging-an-evil-go-runtime-
| bug/ which was about the vDSO (a virtual dynamically
| linked library which is automatically mapped by the
| kernel on every process), not system calls. You normally
| call into the vDSO instead of doing direct system calls,
| because the vDSO can do some things (like reading the
| clock) in an optimized way without entering the kernel,
| but you can always bypass it and do the system calls
| directly; doing the system calls directly will not use
| any of the userspace stack (it immediately switches to a
| kernel stack and does all the work there).
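|
| A minimal sketch of the two paths described above, assuming a
| 64-bit Linux target with glibc or musl. The function names are
| made up for illustration. The libc call is normally served by the
| vDSO without entering the kernel, while syscall(2) always does a
| real kernel entry:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Read CLOCK_MONOTONIC through libc. On most Linux setups libc
 * forwards this to the vDSO, so no kernel entry is needed. */
int now_via_libc(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Read the same clock, but bypass the vDSO and force a real system
 * call. Assumes a 64-bit target, where the kernel's timespec layout
 * matches struct timespec. */
int now_via_syscall(struct timespec *ts)
{
    return (int)syscall(SYS_clock_gettime, (long)CLOCK_MONOTONIC, ts);
}
```

| Both functions should report the same clock; the second one is
| typically slower because it pays for the kernel transition.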
| asveikau wrote:
| Except there are some platforms where you _need_ to go
| through libc and the direct syscall interface is considered
| private, or subject to change. OpenBSD is like this, and I
| believe Mac is too.
| boomskats wrote:
| Worth mentioning that the golang.org/x/sys/unix package has
| better support for syscalls than the og syscall package
| nowadays, especially for some of the newer ones like
| cachestat[0] which was added to the kernel in 6.5. AFAIK
| the original syscall package was 'frozen' a while back to
| preserve backward compatibility, and at one point there was
| even a bit of drama[1] around it being marked as deprecated
| instead of frozen.
|
| [0]: https://github.com/golang/go/issues/61917 [1]:
| https://github.com/golang/go/issues/60797
| lpapez wrote:
| Undeprecating something is truly a rare sight.
|
| So far I only knew about PHP undeprecating "is_a"
| function, so I guess this puts Go in good company ^^
| jcalabro wrote:
| Indeed, Zig does this for instance (at least for x86_64 linux
| [0]) as a way to avoid having to link libc at all
|
| [0] https://github.com/ziglang/zig/blob/ee9f00d673f2bccddc275
| 1c3...
| blueflow wrote:
| > See also Linux's nolibc headers
|
| Kind of an understatement. The existence of an official
| interface obsoletes 3rd party projects like the one posted.
| almostgotcaught wrote:
| Ya totally - those wacky people at chrome must've just never
| heard of those headers /s
|
| What you don't understand, because you don't work on Chrome,
| or Chrome sized projects, is that generic, lowest common
| denominator implementations cannot be optimal for all use-
| cases and at scale (Chrome-sized project) those
| inefficiencies matter. That's why this exists, that's why
| folly exists, that's why abseil exists, that's why, no, not
| everyone can just use boost, etc etc etc
| bla3 wrote:
| Might be a license thing? The Linux headers are probably GPL
| like the rest of Linux.
| unmole wrote:
| The Linux kernel licence explicitly says programs using the
| syscall interface are not considered derivative works and
| that GPL does not apply to them: https://github.com/torvald
| s/linux/blob/master/LICENSES/excep...
| sph wrote:
| nolibc is NOT under GPL. See first line of each file.
|
| /* SPDX-License-Identifier: LGPL-2.1 OR MIT */
|
| It's technically not part of Linux's headers either. It's
| published under the tools subdirectory, so it's something
| that ships along with the kernel, but not used by the
| kernel itself. Basically it's there as some people might
| find it useful, but could've as well been a separate repo.
| yencabulator wrote:
| nolibc seems very minimal. For example, no pread/pwrite just
| read/write, forcing you to lseek and ruining concurrent use.
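|
| A rough sketch of why that matters, using the libc pread() wrapper
| for illustration (the helper names here are hypothetical). pread()
| takes the offset as an argument and leaves the shared file offset
| alone, whereas the lseek+read fallback moves it, so two threads
| sharing one fd can race:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read n bytes at a fixed offset. The fd's file offset is untouched,
 * so concurrent callers do not interfere with each other. */
ssize_t read_at(int fd, void *buf, size_t n, off_t off)
{
    return pread(fd, buf, n, off);
}

/* The fallback when only read/lseek exist: two syscalls, and the
 * shared offset moves between them, so concurrent use is racy. */
ssize_t read_at_with_seek(int fd, void *buf, size_t n, off_t off)
{
    if (lseek(fd, off, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, n);
}

/* Returns 1 if read_at() left the file offset alone, 0 if it moved,
 * -1 on error. */
int pread_keeps_offset(const char *path)
{
    char buf[4];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t before = lseek(fd, 0, SEEK_CUR);
    ssize_t got = read_at(fd, buf, sizeof buf, 1);
    off_t after = lseek(fd, 0, SEEK_CUR);
    close(fd);
    if (got < 0)
        return -1;
    return before == after;
}
```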
| sylware wrote:
| Well... last time I had a look at the assembly code of syscall
| entry on x86_64, I was scared away... this piece of "assembly"
| does require some very niche C compiler options to be
| compatible (stack alignment and more if I recall properly).
|
| Linux "C" code hard dependency on gcc/clang/("ultra complex
| compilers") is getting worse by the day. It should (very easy
| to say, I know) have stayed very simple and plain C99+ with
| smart macro definitions to be replaced with pure assembly or
| the missing bits for modern hardware programming
| (atomics/memory barriers, explicit unaligned access, etc), but
| those abominations like _Generic (or static
| assert, __thread, etc.) are toxic additions to the C standard
| (there are too many toxic additions and not enough
| removal/simplification/hardening in ISO C, yes, we will have to
| define a "C profile" which breaks backward compatibility with
| hardening and simplifications).
|
| I don't say all extensions are bad, but I think they need more
| "smart and pertinent pick and choose" (and I know this is a
| tough call), just because they "cost". For instance, for a
| kernel, we know it must have fine grained control of ELF object
| sections... or we would get much more source files (one per
| pertinent section) or "many more source configuration macros"
| (....but there I start to wonder if it was not actually the
| "right" way instead of requiring a C compiler to support such
| extension, it moves everything to the linker script... which is
| "required" anyway for a kernel).
|
| Linus T. is not omnipotent and can only do so much, and a lot
| of "official" linux devs are putting really nasty SDK
| dependency requirements in everyday/everywhere kernels.
|
| That said, on my side, many of my user apps are now directly
| using linux syscalls... but are written in RISC-V assembly
| interpreted on x86_64 (I have a super lean interpreter/syscall
| translator written in x86_64 assembly and a super lean
| executable format wrapped in ELF executable format), or very
| plain and simple C99+ (legacy or because I want some apps to be
| a bit more 'platform crossy'... for now).
| gpderetta wrote:
| It is hard to take seriously someone that claims that thread
| locals are a toxic addition to the standard. (incidentally
| __thread is a GCC extension that predates the standard by
| almost a decade).
| jwatzman wrote:
| Can you elaborate on the complexity here for syscall entry on
| x86_64? (Or link to what you were reading?) Another commenter
| linked to Linux's own "nolibc" which is similar to, though
| simpler than, the Google project in the OP. Their x86_64 arch
| support is here, which looks simple enough, putting things
| into registers: https://github.com/torvalds/linux/blob/master
| /tools/include/...
|
| The non-arch-specific callers which use this are here, which
| also look relatively straightforward: https://github.com/torv
| alds/linux/blob/master/tools/include/...
|
| I don't see any complex stack alignment or anything which
| reads to me like it would require "niche C compiler options",
| so I'm curious if I'm missing something?
| kchr wrote:
| You linked the same file twice, was that intentional?
| rcxdude wrote:
| Linux has literally never been standard C. Linus used as many
| GCC extensions as he could from day 1.
| jlokier wrote:
| Another use-case is when you are writing threaded code that
| uses the clone() syscall instead of pthreads, usually for
| something with high performance, unusual clone flags, or a very
| small stack.
|
| Most libc functions, including the syscall wrappers and all
| pthreads functions, aren't safe to call in threads created by
| raw clone(). Anything that reads or writes errno, for example,
| is not safe.
|
| I've had to do this a couple of times. One a long time ago was
| an audio mixing real-time thread for a video game, which had to
| keep the audio device fed with low-latency frames for sound
| effects. In those days, pthreads wasn't good enough. For
| talking to the audio device, we had to use the Linux syscall
| wrapper macros, which have been replaced by nolibc now. More
| recently, a thread pool for high-performance storage I/O, which
| ran slightly faster than io_uring, and ran well on older
| kernels and ones with io_uring disabled for security.
| n_plus_1_acc wrote:
| Do I understand correctly that nolibc is just another
| implementation of the C standard library in terms of Linux
| syscalls? Comparable to, say, musl libc?
| sph wrote:
| glibc is a space shuttle, musl is a hatchback, nolibc is a
| skateboard
|
| They all do the same thing (take you from A to B), but offer
| different levels of comfort, efficiency and utility :)
| jonhohle wrote:
| Who can take their space shuttle to work these days, what
| with the price of rocket fuel!?
| sph wrote:
| It's a bit unwieldy, but the good thing is that it comes
| for free with your copy of GNU/Linux!
| nilamo wrote:
| And parking is always a nightmare for my shuttles
| malkia wrote:
| Apparently almost every linux app
| inopinatus wrote:
| simple, you merely have to compile the primary avionics
| software to webassembly first
| wbl wrote:
| And passenger safety?
| fallingsquirrel wrote:
| In a head-on collision, the space shuttle passengers will
| fare better than the hatchback. Even so, it wouldn't be
| my first choice for most destinations.
| oguz-ismail wrote:
| No it's not comparable to musl libc. Standard I/O functions
| don't support buffering and the printf implementation can't
| print floats, for example.
| yencabulator wrote:
| nolibc seems kinda neglected, or like a minimal subset of
| what's actually useful. There's no pread/pwrite etc, only
| read/write, forcing you to use lseek and ruining concurrent
| use.
| pantalaimon wrote:
| Wasn't that originally just for integration tests where you
| wanted to boot a minimal image that just runs your kernel CI
| test?
| inopinatus wrote:
| I knew a guy in the 2000s who wrote FreeBSD kernel modules to
| maximise req/sec capability when serving static content.
|
| His reasoning: why even bother with the context switch? And it
| was, I had to grudgingly admit, staggeringly fast for the time.
|
| The kicker is, it was mostly serving porn banner ads
| jagrsw wrote:
| Just a friendly reminder that syscall() is a vararg function.
| Meaning, you can't just go throwing arguments at it (so maybe
| it's better to use this wrapper to avoid problems instead).
|
| For example, on a 64-bit arch, this code would be sus.
|
| syscall(__NR_syscall_taking_6_args, 1, 2, 3, 4, 5, 6);
|
| Quiz: why
|
| PS: it's a common mistake, so I thought I'd save you a trip down
| the debugging rabbit hole.
| remram wrote:
| A quiz is _the opposite_ of saving someone effort.
| eddd-ddde wrote:
| Exactly, I am now morally bound to figure out the answer
| instead of going to work.
| oguz-ismail wrote:
| The last argument would be on the stack instead of in a
| register which is where the kernel expects to find the
| arguments. But a proper _syscall_ implementation would
| handle this just fine (e.g. <https://github.com/bminor/gli
| bc/blob/ba60be873554ecd141b55ea...>), so I don't think
| there's anything _sus_ about it.
| im3w1l wrote:
| > movq 8(%rsp),%r9
|
| This is a huge edgecase but is 8(%rsp) guaranteed to be
| readable memory?
| achierius wrote:
| Yes, see
| https://en.wikipedia.org/wiki/Red_zone_(computing)
| jagrsw wrote:
| The problem is something else (jstarks figured it
| out somewhere below). I'm not a compiler/abi eng, but it
| seems to depend on a compiler, eg. consider this with
| clang-16:
|
|     #include <sys/syscall.h>
|     #include <unistd.h>
|     #include <alloca.h>
|     #include <string.h>
|
|     void s(long a, long b, long c, long d, long e, long f, long g) { }
|
|     int main(void) {
|         long a = 0xFFFFFFFFFFFFFFFF;
|         s(a, a, a, a, a, a, a);
|         syscall(9999, 1, 2, 3, 4, 5, 6);
|         return 0;
|     }
|
| Now, strace shows:
|
|     $ strace -e process_vm_readv ./a
|     process_vm_readv(1, 0x2, 3, 0x4, 5, 18446744069414584326) = -1 EINVAL (Invalid argument)
|
| objdump -d a
|
|     117f: 48 c7 45 f0 ff ff ff   movq $0xffffffffffffffff,-0x10(%rbp)
|     1186: ff
|     1187: 48 8b 7d f0            mov -0x10(%rbp),%rdi
|     118b: 48 8b 75 f0            mov -0x10(%rbp),%rsi
|     118f: 48 8b 55 f0            mov -0x10(%rbp),%rdx
|     1193: 48 8b 4d f0            mov -0x10(%rbp),%rcx
|     1197: 4c 8b 45 f0            mov -0x10(%rbp),%r8
|     119b: 4c 8b 4d f0            mov -0x10(%rbp),%r9
|     119f: 48 8b 45 f0            mov -0x10(%rbp),%rax
|     11a3: 48 89 04 24            mov %rax,(%rsp)
|     11a7: e8 94 ff ff ff         call 1140 <s>
|     11ac: bf 36 01 00 00         mov $0x136,%edi
|     11b1: be 01 00 00 00         mov $0x1,%esi
|     11b6: ba 02 00 00 00         mov $0x2,%edx
|     11bb: b9 03 00 00 00         mov $0x3,%ecx
|     11c0: 41 b8 04 00 00 00      mov $0x4,%r8d
|     11c6: 41 b9 05 00 00 00      mov $0x5,%r9d
|     11cc: c7 04 24 06 00 00 00   movl $0x6,(%rsp)
|     11d3: b0 00                  mov $0x0,%al
|     11d5: e8 56 fe ff ff         call 1030 <syscall@plt>
|
| Only 4 bytes are put on the stack, but syscall will read
| 8.
|
| It's tricky if one doesn't control types of arguments
| used in vararg.
| im3w1l wrote:
| I think you misunderstand. The red zone is on the
| opposite side of rsp. This line is trying to read an
| argument that may not exist, relying on the fact that
| this will put garbage in the register which syscall then
| ignores. But this only works if the memory is readable.
| Brian_K_White wrote:
| I never get people who aren't grateful for pointers. Being
| shown which direction to walk is of no value, they must also
| be carried there.
|
| They didn't claim to save work, they claimed to save hitting
| a bug, and having to debug it.
|
| They said the word "vararg". They gave you everything.
| remram wrote:
| They are not some professor in my school, some valued
| colleague, or known kernel expert. They are a stranger on
| the internet. No, I can't be bothered to research every
| person that claims to have some wisdom that they won't
| articulate to cultivate an air of mystery.
|
| They gave me everything to dismiss their claim.
| jstarks wrote:
| I guess if the arch's varargs conventions do something other
| than put each 32-bit value in a 64-bit "slot" (likely for
| inputs that end up on the stack, at least), then some of the
| arguments will not line up. Probably some of the last args will
| get combined into high/low parts of a 64-bit register when
| moved into registers to pass to the kernel. And then subsequent
| register inputs will get garbage from the stack.
|
| Need to cast them to long or size_t or whatever to prevent
| this.
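|
| One way to make that casting hard to forget is a typed helper
| whose parameters are all long, so every argument is widened before
| it reaches the varargs call. This is only a sketch; syscall6 and
| write_via_syscall6 are hypothetical names, not part of any
| library:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: every parameter is a long, so each argument
 * fills a whole register-sized varargs slot when forwarded. */
long syscall6(long nr, long a1, long a2, long a3,
              long a4, long a5, long a6)
{
    return syscall(nr, a1, a2, a3, a4, a5, a6);
}

/* Write len bytes to fd. The int fd is widened implicitly by the
 * parameter type; the pointer and size_t are cast explicitly. */
long write_via_syscall6(int fd, const void *buf, size_t len)
{
    return syscall6(SYS_write, fd, (long)buf, (long)len, 0, 0, 0);
}
```

| With the helper in place, a call such as syscall6(nr, 1, 2, 3, 4,
| 5, 6) passes six full-width values regardless of how the compiler
| would have stored plain int literals in a varargs call.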
| jagrsw wrote:
| Yes
| IAmLiterallyAB wrote:
| I've been using my own version of this. Maybe I'll switch over,
| this looks more complete.
| kentonv wrote:
| Disappointing that errors are still signaled by assigning to
| `errno` (you can apparently override this to some other global,
| but it has to be a global or global-like lvalue).
|
| The kernel actually signals errors by returning a negative error
| code (on most arches), which seems like a better calling
| convention. Storing errors in something like `errno` opens a
| whole can of worms around thread safety and signal safety, while
| seemingly providing very little benefit beyond following
| tradition.
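|
| For illustration, a sketch of a one-argument raw syscall that
| hands back the kernel's negative error code directly instead of
| storing it in errno. The inline assembly covers x86_64 only; the
| fallback for other arches is an assumption that simply converts
| libc's errno back into the same convention. raw_syscall1 is a
| made-up name:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns the syscall result, or a negative errno value on failure.
 * No global or thread-local state is written on the x86_64 path. */
long raw_syscall1(long nr, long a1)
{
#if defined(__x86_64__)
    long ret;
    /* rax = syscall number, rdi = first argument. The syscall
     * instruction clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"(nr), "D"(a1)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback: go through libc and translate errno back. */
    long ret = syscall(nr, a1);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

| Closing an invalid descriptor with raw_syscall1(SYS_close, -1)
| yields -EBADF as the return value, which the caller can inspect
| without touching errno.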
| sidkshatriya wrote:
| code that uses errno is also a bit harder to understand. I like
| the way Rust does it -- if a function can fail, it returns a
| Result.
| dailykoder wrote:
| While that might be true and the industry has evolved and
| learned about "better" ways, the old systems still exist. I
| don't see any reason to complain about it.
|
| Yes, we can do better. Yes, we probably should do better. But
| in some cases you really have to think through every edge
| case and in the end someone has to do it. So just be grateful
| for what we have.
| IshKebab wrote:
| I don't think this is an "old system" though.
| alexey-salmin wrote:
| For old systems -- yes, of course. But designing a new,
| incompatible API around errno is just backwards.
| AndyKelley wrote:
| Disappointing is an understatement. Can't believe these people
| are making a browser. I'm sure they have some Google-flavored
| excuse for why to repeat this ridiculous threadlocal errno API.
| alexey-salmin wrote:
| There's a funny circular dependency in glibc sources because
| errno lives in the TLS block which is allocated using __sbrk
| which can set the errno before it's allocated (see the
| __libc_setup_tls).
|
| The branch that actually touches the errno is unlikely to be
| executed. However I did experience a puzzling crash with a
| cross-compiled libc because the compiler was smart enough to
| inject a speculative load of errno outside of the branch. Fun
| times.
| wg0 wrote:
| So web apps can make Linux sys calls? Or is it about Chrome OS?
| thinkharderdev wrote:
| The chrome browser itself I would think
| Aransentin wrote:
| > We try to hide some of the differences between arches when
| reasonably feasible. e.g. Newer architectures no longer provide
| an open syscall, but do provide openat. We will still expose a
| sys_open helper by default that calls into openat instead.
|
| Sometimes you actually want to make sure that the exact syscall
| is called; e.g. you're writing a little program protected by
| strict seccomp rules. If the layer can magically call some other
| syscall under the hood this won't work anymore.
| dundarious wrote:
| musl does this too. glibc may also, I haven't checked in a long
| time. I bet rust, etc., does too. You always need to check.
| king_geedorah wrote:
| Glibc definitely does this transparent mapping as well.
| Calling int fd = open(<path>, O_RDONLY) yields
| openat(AT_FDCWD, <path>, O_RDONLY) when running through
| strace.
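|
| If the exact syscall matters (for example under a seccomp filter),
| one option is to name it explicitly rather than rely on what the
| wrapper picks. A minimal sketch, assuming Linux with a libc that
| provides syscall(2); open_readonly_explicit is a hypothetical
| name:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open path read-only by invoking openat explicitly. Returns a file
 * descriptor, or -1 with errno set by the syscall() wrapper. The
 * integer arguments are cast to long so each fills a full slot. */
int open_readonly_explicit(const char *path)
{
    return (int)syscall(SYS_openat, (long)AT_FDCWD, path,
                        (long)O_RDONLY, 0L);
}
```

| Running a caller of this under strace should show openat, the same
| syscall the glibc open() wrapper ends up issuing.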
| jchw wrote:
| This really surprised me when I was digging into Linux
| tracing technology and noticed no `open` syscalls on my
| running system. It was all `openat`. I don't know when this
| transition happened, but I totally missed it.
| mcnichol wrote:
| 0-Day incoming
| sedatk wrote:
| Can't wait for Zig team to adopt this over libc, citing concerns
| about "libc not existing on certain configurations"[1]
|
| [1] https://github.com/ziglang/zig/issues/1840
| TUSF wrote:
| Zig on Linux already directly interfaces with syscalls,[0]
| unless your library or application directly links libc.
|
| [0]: https://ziglang.org/documentation/master/std/#std.os.linux
| AndyKelley wrote:
| Welcome to 2016.
| https://github.com/ziglang/zig/blob/5f0bfcac24036e1fff0b2bed...
| iTokio wrote:
| Using go is a nice way to do that by default as it also directly
| uses syscalls (see the _sys_ package)
___________________________________________________________________
(page generated 2024-11-07 23:00 UTC)