[HN Gopher] Murder Mystery: GCC builds failing after sbuild refa...
       ___________________________________________________________________
        
       Murder Mystery: GCC builds failing after sbuild refactoring
        
       Author : pabs3
       Score  : 105 points
       Date   : 2024-12-22 07:31 UTC (15 hours ago)
        
 (HTM) web link (www.linux.it)
 (TXT) w3m dump (www.linux.it)
        
       | rurban wrote:
       | I had a similar murder mystery investigation over the last few
       | months, and had to rewrite the whole murder concept. But in the
       | end, getting rid of signals (I hate signals) and replacing them
       | with messages didn't work, so I embraced arranging a proper group
       | and killing them at once. With some exceptions.
        
       | IshKebab wrote:
       | The standard Linux process management API is a total mess IMO. It
       | doesn't even offer a way to avoid TOCTOU. They've made some
       | improvements recently (https://lwn.net/Articles/794707/) but as
       | far as I know you still can't implement something like pstree
       | without just reading files from `/proc` which is shit.
        
         | formerly_proven wrote:
         | Who needs handles for processes when you can have reusable IDs
         | instead? /s
         | 
         | For a while the limit has been around 4 million or something
         | like that, instead of the traditional 32k. Which of course
         | broke a bunch of stuff that assumed "char pid[5];" is good
         | enough. Fun stuff.
         | 
         | The kernel offers up this (here abbreviated) example which you
         | could use for all accesses to /proc/<pid> when you have an open
         | pidfd to ensure that the two refer to the same process:
         | static int pidfd_metadata_fd(pid_t pid, int pidfd) {
         |     char path[100];
         |     snprintf(path, sizeof(path), "/proc/%d", pid);
         |     int procfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
         |     if (sys_pidfd_send_signal(pidfd, 0, NULL, 0) < 0 &&
         |         errno != EPERM) {
         |         close(procfd);
         |         procfd = -1;
         |     }
         |     return procfd;
         | }
         | 
         | And these days you even can poll for pidfds (readable = exited)
         | and you can even use P_PIDFD with waitid. So the very basic
         | process management cycle can be done through pidfds nowadays.
         | 
         | > as far as I know you still can't implement something like
         | pstree without just reading files from `/proc` which is shit.
         | 
         | It's also slow! Tools like top/htop/ps -aux etc. use a
         | surprising amount of CPU because they have to do a trillion
         | trips to the kernel and back. Compare this to the win32
         | equivalent where, iirc, there is just one or two trips to the
         | kernel to acquire a snapshot and you walk that in user-space
         | instead.
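
A minimal sketch of the pidfd-only cycle mentioned above (open a pidfd, poll it until the process exits, reap it with `waitid` and `P_PIDFD`). It is not taken from the kernel sources. It assumes Linux 5.4 or later; `sys_pidfd_open` and `run_and_reap_via_pidfd` are illustrative names, and the two fallback `#define`s are assumptions for older headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: the number used on x86-64 and most ports */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3 /* assumption: Linux UAPI value, for libcs that lack it */
#endif

/* Thin wrapper so the sketch does not depend on a recent libc. */
static int sys_pidfd_open(pid_t pid, unsigned int flags) {
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/*
 * Fork a child that exits with `code`, then wait for it through a pidfd.
 * Returns the child's exit status, or -1 with errno set on failure.
 */
static int run_and_reap_via_pidfd(int code) {
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    /* Safe here: an unreaped child of ours cannot have its PID recycled. */
    int pidfd = sys_pidfd_open(pid, 0);
    if (pidfd < 0) {
        int saved = errno;
        waitpid(pid, NULL, 0); /* reap the classic way so no zombie remains */
        errno = saved;
        return -1;
    }

    /* The pidfd becomes readable once the process has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int rc;
    do {
        rc = poll(&pfd, 1, -1);
    } while (rc < 0 && errno == EINTR);

    siginfo_t info;
    memset(&info, 0, sizeof(info));
    if (rc >= 0)
        rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED);

    int saved = errno;
    if (rc < 0)
        waitpid(pid, NULL, 0); /* fallback reap if the pidfd path failed */
    close(pidfd);
    errno = saved;

    if (rc < 0 || info.si_code != CLD_EXITED)
        return -1;
    return info.si_status;
}
```

Calling `run_and_reap_via_pidfd(7)` should return 7 on a kernel that supports these calls. On older kernels or in restrictive sandboxes it returns -1 with `errno` set.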
        
           | alexvitkov wrote:
           | char path[100];
           | snprintf(path, sizeof(path), "/proc/%d", pid);
           | 
           | You're assuming the pid will fit in 93 bytes! The exact same
           | mistake as `char pid[5];`
           | 
           | Jokes aside, sometimes userspace deserves to be broken, and
           | `char pid[5]` is IMO one of those cases. The fact that we
           | treat integers as a scarce reusable resource is bad, and the
           | fact that we now have a second-order process identifier to
           | cover that up is worse.
        
             | formerly_proven wrote:
             | Fun fact: that broken code was in the runtime used by a
             | major Fortran compiler.
        
         | pjmlp wrote:
         | UNIX/POSIX process management.
        
         | josephcsible wrote:
         | > It doesn't even offer a way to avoid TOCTOU.
         | 
         | What do you mean? I admit there are cases where it's hard to
         | avoid TOCTOU, but I can't think of any where it's impossible.
        
           | IshKebab wrote:
           | For example to kill all of the children of a process you
           | first have to read all of the files in `/proc` to build a
           | process tree, and then send your signals, by which time the
           | process tree may have changed.
        
             | maccam94 wrote:
             | Yeah, as far as I know the only way to be sure is by
             | putting the parent process into a cgroup. Then to kill all
             | of the child processes you have to freeze the cgroup,
             | enumerate the pids, send sigterm/sigkill to all of them,
             | before unfreezing again.
        
             | josephcsible wrote:
             | Doesn't the new pidfd API you linked give you a race-free
             | way to do that?
        
               | IshKebab wrote:
               | Maybe. Tbh I was trying to do something like that on
               | RHEL8 and I think it's too old for pidfd.
        
             | o11c wrote:
             | As long as you aren't trying to reimplement `systemd`
             | badly, you really shouldn't have to deal with this.
        
               | IshKebab wrote:
               | "As long as you aren't trying to use the flawed API you
               | won't run into any of its flaws."
               | 
               | Ok then.
        
         | ijustlovemath wrote:
         | With TOCTOU, the solution is often just to try to use the
         | resource and handle errors, not to assume that since you
         | checked it, it's available.
        
           | loeg wrote:
           | PID reuse is the classic example here. There is no neat
           | solution other than process handles, which Unix/Linux does
           | not historically have.
        
           | IshKebab wrote:
           | That only works if you know the resource in advance. Doesn't
           | work if for example you're trying to delete files matching a
           | glob and you have to enumerate them by name and then delete
           | them by name.
           | 
           | The filesystem API offers a solution for that. The process
           | API doesn't (maybe with pidfd; I haven't tried that API yet
           | since it's so new).
        
         | Polizeiposaune wrote:
         | The problem here wasn't a case of TOCTOU; rather, the program
         | was using a magic constant value that was outside the range of
         | getpid() but which wasn't outside of the domain of kill(2)'s
         | first parameter.
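
For context, a sketch of a guard that keeps such a value away from `kill(2)`. The wrapper name `kill_single_process` is made up for illustration; the special meanings of non-positive pids in the comment are standard `kill(2)` behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * kill(2) gives non-positive pids a special meaning:
 *     0    every process in the caller's process group
 *    -1    every process the caller is permitted to signal
 *   < -1   every process in the process group -pid
 * getpid() never returns such values, so a "no process" sentinel like -1
 * looks harmless until it reaches kill().
 *
 * This wrapper only ever targets one process and rejects everything else.
 */
static int kill_single_process(pid_t pid, int sig) {
    assert(sig >= 0);
    if (pid <= 0) {
        errno = EINVAL;
        return -1;
    }
    return kill(pid, sig);
}
```

With signal 0 the call only performs the existence and permission check, so `kill_single_process(getpid(), 0)` succeeds while `kill_single_process(-1, 0)` fails with `EINVAL`.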
        
           | nneonneo wrote:
           | Proper enum types would have solved this problem; e.g. in
           | Rust:
           |     enum ProcessHandle {
           |         Pid(pid_t),
           |         Invalid,
           |         Terminated
           |     }
           | 
           | There is no way to pass a Terminated or Invalid process
           | handle to kill, since there is no numeric identifier for it
           | anymore. And, any code that wants to get a PID from a
           | ProcessHandle has to explicitly handle non-PIDs - making it
           | impossible to forget to deal with these special cases.
        
           | IshKebab wrote:
           | Yeah I know, that was just another example of how bad the API
           | is.
        
         | ajross wrote:
         | Heh, "even". BSD process groups were around for two decades at
         | least before anyone even thought of the idea of a TOCTOU
         | vulnerability, much less coined an acronym for it. You're just
         | asking for too much here. Yes, it's an API you probably
         | shouldn't use. No, it's not going away.
         | 
         | The solution for this particular problem is cgroups. The build
         | system just wasn't using it.
        
       | usr1106 wrote:
       | Excellent writeup!
       | 
       | Some observations unrelated to the bug.
       | 
       | 1. Nice to see some details about the Debian build server
       | implementation. I have always found it rather obscure. Compared
       | to that OpenSUSE's OBS is refreshingly open. Not only can you
       | just use their cloud service for free, but you can also rather
       | easily set up your own instance.
       | 
       | 2. Nice to see that the Debian build infra gets updated. I am a
       | bit surprised that Perl code gets introduced in 2024. Isn't that
       | a bit out of fashion, and for a reason? It's typically very hard
       | to read code written by others. (As I said, this is unrelated; it
       | did not cause the bug in TFA.)
        
         | Kwpolska wrote:
         | sbuild is written in Perl, so it's not surprising that new code
         | for it is written in Perl.
        
         | aragilar wrote:
         | I believe installing https://packages.debian.org/sid/sbuild-
         | debian-developer-setu... will get you most of the way (but I've
         | only ever done manual builds of packages, so I may be wrong).
        
         | chubot wrote:
         | Are you surprised that programs can live for 10 or 20 years?
         | 
         | Most people who want to rewrite things in a language that's
         | more fashionable don't have the experience maintaining working
         | systems over 20+ years, like Debian does
        
           | eyegor wrote:
           | I'll just pitch in as someone who's worked with several 40+
           | year old codebases and 3+ year old perl, the perl code is
           | significantly harder to maintain. The language lends itself
           | to unreadable levels of terseness and abuse of regex. Unless
           | you use perl every day, the amount of time trying to
           | understand the syntax vs the code logic is skewed too heavily
           | towards obscure syntax. Even the oldest fortran/c is much
           | easier to work with.
           | 
           | Except maybe arithmetic gotos designed to minimize the number
           | of functions. Those are straight evil and I'm glad no modern
           | language supports them.
        
             | temporallobe wrote:
             | I also maintain older codebases and this is how I feel
             | about Clojure. It's far too terse and the syntax is nigh
             | unreadable unless you are using it very often - add to that
             | there seems to be a culture that discourages comments. OTOH
             | so-called "C" languages seem much more readable even if I
             | haven't touched them in years or have never written a
             | line of code in them.
             | 
             | On another note, I have done a lot of things with Perl and
             | know it well, and I agree that it can be written in a very
             | unreadable way, which is of course enabled by its many
             | esoteric features and uncommon flexibility in both syntax
             | and methodology. It is simultaneously a high-level
             | scripting language and a low-level system tool. Indeed, I
             | have caught myself writing "clever" code only to come back
             | years later to regret it. Instead of "TMTOWTDI" it has
             | turned into "TTMWTDI" (there's too many ways to do it).
        
             | shermantanktop wrote:
             | > Even the oldest fortran/c is much easier to work with.
             | 
             | Could that be survivor bias? If that old Fortran was hard
             | to work with, maybe it would have been rewritten, and left
             | the set of "oldest code" in the process.
        
       | dwattttt wrote:
       | It's problems like these (mistaking a sentinel value for a valid
       | member of a type) that make me think sentinel values are just a
       | mistake. In-band signalling must have been responsible for so
       | many bugs at this point.
        
         | ajross wrote:
         | It's hard though. I mean, it's easy to say you shouldn't do
         | this. But that becomes isomorphic to "You shouldn't have NULL".
         | And... that's hard, because you really need NULLs for a lot of
         | problems. Or "You shouldn't have error code returns from
         | syscalls", when you really need those.
         | 
         | It's possible to split it out, but then everyone complains
         | about the result. We thought exceptions were going to fix the
         | error code problem, but then we collectively decided we hate
         | them. OSes and static analyzers and instrumentation tools like
         | valgrind/asan/msan have done a really good job of turning NULL
         | dereference into a built-in assertion everywhere, but people
         | won't use them.
         | 
         | Even Rust, which has probably come closest to a genuine
         | solution, has done so only at the cost of immense complexity
         | and somewhat fugly syntax. No free lunch.
         | 
         | I think what's notable here isn't the bug itself, which is
         | pretty routine. It's the fact that simple bugs can be buried so
         | deeply that they get exposed only in a six-hour-long test rig
         | on one particular Linux distribution's backend.
        
           | dwattttt wrote:
           | Rust's answer is its type system, but Rust is hardly the only
           | language that can do this.
           | 
           | If a syscall returns an fd, and on error it returns a
           | negative value, why not define it as a bitfield with one bit
           | a flag? Then in C there'd at least be a field name you could
           | grep for, or grep for it not being there.
        
           | paulddraper wrote:
           | > Even Rust, which has probably come closest to a genuine
           | solution
           | 
           | Haskell Maybe
           | 
           | Scala Option
           | 
           | Java Optional
           | 
           | These things have existed for over a decade (multiple, in the
           | case of Haskell).
        
             | crest wrote:
             | All of these didn't just give you something equivalent to C
             | malloc()/free() with a parameter and result type. Maybe
             | monads are useful in a lot of cases, but require a safe
             | type system that prevents accidentally/negligently ignoring
             | errors and language features to make it easy to unwrap the
             | enum variants, e.g. destructuring pattern matches and a
             | collection of functions to work with them. I know both
             | Rust and Haskell have those and suspect the other two
             | examples as well. Without them they would be a pain in the
             | ass to use.
        
           | crest wrote:
           | The answer is to make enums explicit e.g. malloc should
           | return either a valid allocation or an error, but you can't
           | easily do that in C. It takes a richer/better type system to
           | make this convenient to use and safe from accidents.
           | Depending on your language and compiler the in memory
           | representation for a nullable and a non-nullable pointer can
           | be the same, but a well designed type system can prevent
           | fallible human programmers from confusing one for the other.
           | Add some destructuring pattern matching to the language and
           | the usability improves instead of suffering. Rust added a
           | whole lot more which made the syntax very noisy.
        
       ___________________________________________________________________
       (page generated 2024-12-22 23:00 UTC)