[HN Gopher] The server chose violence
       ___________________________________________________________________
        
       The server chose violence
        
       Author : lukastyrychtr
       Score  : 231 points
       Date   : 2024-04-27 09:37 UTC (13 hours ago)
        
 (HTM) web link (cliffle.com)
 (TXT) w3m dump (cliffle.com)
        
       | pavlov wrote:
       | _> 'But REPLY_FAULT also provides a way to define and implement
       | new kinds of errors -- application-specific errors -- such as
       | access control rules. For instance, the Hubris IP stack assigns
       | IP ports to tasks statically. If a task tries to mess with
       | another task's IP port, the IP stack faults them. This gets us
       | the same sort of "fail fast" developer experience, with the
       | smaller and simpler code that results from not handling
       | "theoretical" errors that can't occur in practice.'_
       | 
       | This sounds good when the system is small and tight and
       | applications are written mostly by people who designed the whole
       | system.
       | 
       | But as an application developer, I'd be somewhat scared to
       | interface with third-party code over an IPC model where the other
       | service can at any time send back an instant death pill to my
       | process.
       | 
       | I guess I just don't trust other app developers that much. The
       | world is full of terrible drivers and background processes
       | written by stressed-out developers harassed by management.
       | They'll drop in a bunch of potentially unsuitable default
       | REPLY_FAULTs if it means they get to go home before 8pm.
        
         | skitter wrote:
         | > This sounds good when the system is small and tight and
         | applications are written mostly by people who designed the
         | whole system.
         | 
         | I think that's intentional because that's what Hubris is aimed
         | at.
        
           | mannykannot wrote:
           | ...and in that circumstance, the author reports finding,
           | apparently serendipitously, that it helped with development:
           | _"Initially I was concerned that I'd made the kernel too
           | aggressive, but in practice, this has meant that errors are
           | caught very early in development. A fault is hard to miss,
           | and literally cannot be ignored the way an error code might
           | be."_
        
         | mikaraento wrote:
         | Indeed, this happened with Symbian. An IPC server could panic
         | the client. As an application developer without access to the
         | OS source code this was pretty terrible. Not all preconditions
         | were easily understood and could vary between devices and OS
         | versions.
        
         | password4321 wrote:
         | > _not handling "theoretical" errors that can't occur in
         | practice_
         | 
         | The Dennis Nedry approach to counting dinosaurs in Jurassic
         | Park.
        
         | toast0 wrote:
         | > This sounds good when the system is small and tight and
         | applications are written mostly by people who designed the
         | whole system.
         | 
         | Swift death to deviance is a way to keep the system tight. The
         | designed scope probably keeps it small anyway. Scopes have a
         | way of creeping, but I don't think people will want to force
         | tasks into Hubris that would be better on the host rather than
         | in its embedded controllers.
        
         | ahepp wrote:
         | It seems like in an embedded environment, it's good to resolve
         | these misunderstandings immediately when they occur, regardless
         | of whose fault it is.
         | 
         | The server says "that client is bad!" so the kernel kills it.
         | The problem is really that the two didn't understand each
         | other.
        
         | aidenn0 wrote:
         | > But as an application developer, I'd be somewhat scared to
         | interface with third-party code over an IPC model where the
         | other service can at any time send back an instant death pill
         | to my process.
         | 
         | For service, think "OS interface". If you make a bogus kernel
         | call on a monolithic kernel, it would be reasonable for the OS
         | to kill you. Also note that when you say "process" it might be
         | different than you think because threads all share the same
         | address space on hubris.
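
The port-ownership rule quoted at the top of this subthread can be modeled in a few lines. This is only a toy sketch in Python, not Hubris code: `PORT_OWNERS`, `Kernel.reply_fault`, `TaskFaulted`, and `ip_stack_send` are all hypothetical names invented for illustration. It shows the shape of the idea: the server never returns an "access denied" error code, it asks the kernel to fault the offending client.

```python
from dataclasses import dataclass, field

# Hypothetical static table: which task owns which ports, fixed "at build time".
PORT_OWNERS = {"ntp_client": {123}, "dns_client": {53}}


class TaskFaulted(Exception):
    """Stands in for the kernel destroying a task. Not a Hubris API."""


@dataclass
class Kernel:
    faulted: list = field(default_factory=list)

    def reply_fault(self, task: str, reason: str) -> None:
        # Record the fault, then "kill" the client by raising.
        self.faulted.append((task, reason))
        raise TaskFaulted(f"{task}: {reason}")


def ip_stack_send(kernel: Kernel, task: str, port: int, payload: bytes) -> int:
    """Server-side handler: no error path for touching someone else's port."""
    if port not in PORT_OWNERS.get(task, set()):
        kernel.reply_fault(task, f"port {port} not owned")
    return len(payload)  # the only kind of reply a correct client ever sees
```

The handler has no error-code branch for the ownership check, which is the "smaller and simpler code" the quoted passage refers to. The cost pavlov worries about is visible too: the client has no say in the matter.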
        
       | rcarmo wrote:
       | Hubris and Humility (its debugger) are two pieces of tech I would
       | love to be deeply engrossed in if I had the time (or the
       | mandate). But alas, that is not possible.
        
       | autocole wrote:
       | Very enjoyable read, and this single supervisor is similar to how
       | I set up an application at a previous startup, where we unwrapped
       | everything. This reminds me of one of my favorite posts
       | https://medium.com/@mattklein123/crash-early-and-crash-often...
        
       | greenbit wrote:
       | Reminded me of the line from Errand of Mercy, "You will find
       | there are many rules and regulations. They will be posted.
       | Violation of the smallest of them will be punished by death."
        
       | e-dant wrote:
       | I'm wondering if this really is too aggressive.
       | 
       | On Linux, sure it's not possible to directly crash another
       | program you're talking to via a socket alone (ignoring bad data
       | on the socket).
       | 
       | But you can absolutely kill them. Anything running as root can
       | kill anything else. Can even reboot and bring down the whole
       | system.
       | 
       | Maybe a bit harder and a bit more unusual, but at least for
       | containers, root privileges are common. And yeah, sure, there's a
       | cgroup there and you're more limited. But you get the idea.
       | 
       | It's also a bit different from the (conventional?) wisdom about
       | being "liberal in what you accept, conservative in what you emit"
       | though that's a bit more tied to networked systems.
       | 
       | Though, maybe it's inevitable that a system _has_ to be liberal
       | in what it accepts.
       | 
       | How else can you change the api slightly without breaking
       | existing programs?
        
         | gary_0 wrote:
         | Hubris isn't a general-purpose OS, it runs on a low-level
         | processor inside the Oxide server rack. I believe Hubris
         | doesn't even allow new kinds of processes at runtime; all
         | possible executables must be determined at compile time.
        
           | steveklabnik wrote:
           | > I believe Hubris doesn't even allow new kinds of processes
           | at runtime; all possible executables must be determined at
           | compile time.
           | 
           | This is correct, yes.
        
             | gary_0 wrote:
             | "Perfection is achieved, not when there is nothing more to
             | add, but when there is nothing left to take away."
        
           | sillywalk wrote:
           | > I believe Hubris doesn't even allow new kinds of processes
           | at runtime; all possible executables must be determined at
           | compile time.
           | 
           | Correct. From [0]:
           | 
           | "Hubris is an aggressively static system. The configuration
           | file defines the full set of tasks that may ever be running
           | in the application. These tasks are assigned to sections of
           | address space by the build system, and they will forever
           | occupy those sections.
           | 
           | Hubris has no operations for creating or destroying tasks at
           | runtime. Task resource requirements are determined during the
           | build and are fixed once deployed. This takes the kernel out
           | of the resource allocation business. Memories are the most
           | visible resources we handle this way, but it applies to all
           | allocatable or routable resources, including hardware
           | interrupts and memory-mapped registers - all are explicitly
           | wired up at compile time and cannot be changed at runtime."
           | 
           | [0] https://cliffle.com/blog/on-hubris-and-humility/
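
The "assigned to sections of address space by the build system" idea in the quote above can be illustrated with a small allocator that runs once, ahead of time. This is a rough sketch only: `layout_tasks`, the task names, the sizes, and the 32-byte alignment are assumptions made for the example and say nothing about how the real Hubris build system lays out memory.

```python
def layout_tasks(requirements, base, limit, align=32):
    """Assign each task a fixed, non-overlapping [start, end) region.

    `requirements` maps task name -> bytes needed. `base` is assumed to be
    aligned already. Nothing here runs on the target; a failure is a build
    failure, not a runtime allocation error.
    """
    regions = {}
    cursor = base
    for name, size in requirements.items():
        rounded = -(-size // align) * align  # round up to a multiple of align
        end = cursor + rounded
        if end > limit:
            raise ValueError(f"task {name!r} does not fit; build fails")
        regions[name] = (cursor, end)
        cursor = end
    return regions
```

For instance, `layout_tasks({"net": 100, "spi": 64}, base=0x1000, limit=0x2000)` places `net` at `0x1000..0x1080` and `spi` at `0x1080..0x10C0`. Because the table is computed once, the running system needs no allocator at all.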
        
       | layer8 wrote:
       | Does REPLY_FAULT cascade? Meaning, if A is waiting in a SEND to
       | B, and B is waiting in a SEND to C, and C does REPLY_FAULT, does
       | A get killed along with B (and any further tasks that may be
       | waiting on A)? Because if not, a malicious task could just
       | delegate its experiments to a helper task. And if yes, that seems
       | rather brittle overall (without having any further familiarity
       | with Hubris). Furthermore, if SENDs can be circular/reciprocal, a
       | task may also inadvertently kill itself that way -- which (for
       | scenarios like B -> A -> B) may incentivize _not_ using
       | REPLY_FAULT.
        
         | samus wrote:
         | It seems that Hubris is not designed as a general-purpose
         | operating system. Processes are defined at build time.
         | 
         | The reason why servers can shoot back at their clients is
         | reliability, not security. Errors are thought to originate from
         | bugs, not from deliberate attacks. The extreme reaction of the
         | kernel ensures that developers find them as soon as possible.
         | 
         | Of course, there is an overlap with security, and this can be a
         | useful fallback measure in the event that a process tries to do
         | something that it isn't supposed to do.
        
           | steveklabnik wrote:
           | > It seems that Hubris is not designed as a general-purpose
           | operating system. Processes are defined at build time.
           | 
           | These are both correct.
           | 
           | Well, I mean, Hubris is general in the sense that, if you're
           | doing an embedded system and you can deal with the
           | constraints it has, like the latter, it can work for your
           | projects. But it's not trying to be anything other than a
           | good embedded OS, or to handle any project.
        
         | ironhaven wrote:
         | I think when B gets faulted A would get an error about a dead
         | server and would have the opportunity to resend the same
         | message to a newly reset server, not a cascading crash.
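
The non-cascading behavior ironhaven describes could look roughly like this from A's side. This is a sketch under assumptions: the generation counter, `ServerDied`, and `send_with_retry` are hypothetical names for the example, and the thread itself only says "I think" about the actual behavior.

```python
class ServerDied(Exception):
    """The peer we addressed was restarted; carries its new generation."""

    def __init__(self, new_generation):
        super().__init__(f"server restarted, now generation {new_generation}")
        self.new_generation = new_generation


class Server:
    """Task B in the A -> B -> C chain; restarted after being faulted."""

    def __init__(self):
        self.generation = 0

    def restart(self):
        self.generation += 1  # all state from the old incarnation is gone

    def handle(self, generation, msg):
        if generation != self.generation:
            raise ServerDied(self.generation)  # an error for A, not a fault
        return msg.upper()


def send_with_retry(server, generation, msg, max_attempts=2):
    """Client A: on a dead-server error, resend to the new incarnation."""
    for _ in range(max_attempts):
        try:
            return server.handle(generation, msg), generation
        except ServerDied as exc:
            generation = exc.new_generation
    raise RuntimeError("server kept restarting")
```

In this model the fault stops at B: A sees a recoverable error and chooses whether to retry, so only the task that made the bad request is destroyed.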
        
       | manishsharan wrote:
       | I recall server ABENDs in Novell NetWare. I think it was the OG
       | of server violence.
        
       | optimalsolver wrote:
       | Title sounds like it concerns a really fed-up waiter.
        
         | adonovan wrote:
         | In a sense, it does: waiting is one of the main jobs of an OS
         | kernel.
        
       | ahepp wrote:
       | > The Hubris IPC scheme is deliberately designed to work a lot
       | like a function call, at least from the perspective of the
       | client.
       | 
       | That's a bona fide remote procedure call, isn't it?
        
         | steveklabnik wrote:
         | In a sense, though most would think of "remote" as being "over
         | the network," and that's not the case here.
        
       | samus wrote:
       | I find Humility is a great name for a debugger. Many are the
       | programmers that refuse to use debuggers and just stare the code
       | down until it yields errors, under the assumption that "good"
       | code doesn't need debugging!
        
         | hinkley wrote:
         | I find more bugs with a debugger. There's typically the bug I
         | was looking for, and then smaller bugs that didn't technically
         | cause the problem but contributed, and may be involved in the
         | next issue. I want to fix those too, and sometimes first.
        
         | r2_pilot wrote:
         | I find this attitude bizarre. Just earlier today I used python
         | debugging to quickly figure out why an error was occurring.
         | Being able to see the state of the variables without having to
         | print each helped solve it instantly.
        
         | YZF wrote:
         | It's partly a religious thing, partly what you're used to and
         | partly using the right tool for the job. Some programmers use
         | debuggers as a crutch and some complex systems (e.g. that
         | involve multiple distributed components or are timing
         | dependent) can't be easily debugged using traditional
         | debuggers.
         | 
         | EDIT: yet another factor is sometimes you may not even have
         | access to the system you need to troubleshoot. Being able to
         | reason about code execution without observing it is a useful
         | skill (and still a debugger is a useful tool).
        
       | hinkley wrote:
       | I wonder if they're going to find this creates security issues.
       | 
       | Processes keep state to analyze abuse of various kinds, and
       | killing a process presumably wipes its memory. Unless there's
       | some way to retain state across restarts?
        
         | bcantrill wrote:
         | Yes, we have an _in situ_ dump facility, which Cliff mentioned
         | at the end of [0]; it's been essential for debugging these
         | issues when we hit them.
         | 
         | [0] https://cliffle.com/blog/who-killed-the-network-switch/
        
       | Animats wrote:
       | That's QNX-type interprocess communication. QNX doesn't offer
       | interprocess kill, though.
        
         | steveklabnik wrote:
         | The designer of Hubris (and several folks who work on it) are
         | familiar with QNX, for sure.
        
       | rjbwork wrote:
       | Reminds me of Vigil. https://github.com/munificent/vigil
        
       | crdrost wrote:
       | OK, we need to get this as an April Fools RFC for HTTP.
       | 
       | I propose HTTP 499 "Shame on you." A client receiving 499
       | (perhaps on a request that it must have originated with a
       | specific header like "Strict: true") must terminate, in a
       | language-dependent manner, the task which issued the request.
       | 
       | It perfectly balances the "WTF... But actually, hey" that one
       | sees in those contexts.
        
       | ahepp wrote:
       | It sounds like this may be similar to using signals for error
       | handling in a Unix system?
        
         | steveklabnik wrote:
         | In some sense, yes, this is kind of like the kernel sending
         | SIGKILL to a process.
        
       | loeg wrote:
       | > Take Unix for example. If you call close on a file descriptor
       | you never opened, you get an error code back. If you call open
       | and hand it a null pointer instead of a pathname? You get an
       | error code back. Both of these are violations of a system call's
       | preconditions, and both are handled through the same error
       | mechanism that handles "file not found" and other cases that can
       | happen in a correct program.
       | 
       | > On Hubris, if you break a system call's preconditions, your
       | task is immediately destroyed with no opportunity to do anything
       | else.
       | 
       | Oh, yeah. I've long thought EBADF and EINVALs (and EFAULT, I
       | guess) should basically always be fatal.
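
The contrast loeg draws can be seen directly from Python on a POSIX system. The first function shows the stock Unix behavior (a second `close` just yields `EBADF`). The second is a hypothetical wrapper, `strict_close`, that escalates the same condition to something much harder to ignore. `FatalPreconditionError` is an invented name, not part of any library.

```python
import errno
import os


def double_close_errno():
    """Close a descriptor twice and report the errno of the second close."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)
    os.close(read_fd)
    try:
        os.close(read_fd)  # precondition violation: fd is no longer open
    except OSError as exc:
        return exc.errno  # Unix reports it as an ordinary, ignorable error
    return None


class FatalPreconditionError(BaseException):
    """Hypothetical stand-in for a task fault. Deliberately not an Exception,
    so a blanket `except Exception` cannot swallow it."""


def strict_close(fd):
    """Close fd, escalating EBADF from an error code to a fault."""
    try:
        os.close(fd)
    except OSError as exc:
        if exc.errno == errno.EBADF:
            raise FatalPreconditionError(f"close({fd}): bad descriptor") from exc
        raise
```

Unlike a real Hubris fault, the Python version can still be caught by a determined caller; it only approximates "cannot be ignored".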
        
       | ezekiel68 wrote:
       | > There is no way to "fix" the problem and resume the task. This
       | was a conscious choice to avoid some subtle failure modes and
       | simplify reasoning about the system.
       | 
       | One of Einstein's famous quotes is, "...as simple as possible,
       | but no simpler." I'm pretty sure this design violates the latter
       | portion. I'm not interested in operating environments that can
       | tolerate no real-world chaos, and I'm not aware of any
       | commercially viable realms which would either. What -- push it
       | back to the init system to keep trying again? But by what
       | mechanism would that strategy be able to understand the fault
       | that occurred, in order to try again better?
       | 
       | Anyway, kudos for purity of conviction (I guess).
        
         | crote wrote:
         | > But by what mechanism would that strategy be able to
         | understand the fault that occurred, in order to try again
         | better?
         | 
         | I think the general idea is to apply this to problems which are
         | _clearly_ the result of an invalid program state, and therefore
         | not reasonably recoverable. They are either caused by bugs, an
         | attack, or corrupted hardware. In all cases you shouldn't
         | continue, because there's something _seriously_ wrong with the
         | caller. If the caller continues, it could only cause more
         | damage.
         | 
         | It sounds a bit like Erlang/OTP's "let it crash" philosophy.
         | Erlang is used in quite a bunch of mission-critical hardware
         | and is famous for its reliability, so it might not be such a
         | huge dealbreaker in practice.
        
           | sillywalk wrote:
           | > It sounds a bit like Erlang/OTP's "let it crash"
           | philosophy.
           | 
           | Which was based partly on ideas from Tandem Computers'
           | NonStop / Guardian. Hardware and software were fail-fast i.e.
           | they would work correctly or stop, so they couldn't corrupt
           | data. If there was a problem, the whole processor / process
           | would be stopped, and a backup took over, which seems
           | somewhat similar to the "supervisor" tasks in hubris.
           | 
           | Quite different use cases - an embedded OS for
           | microcontrollers vs large OLTP applications. They both could
           | be considered "mission critical", at least for the people who
           | own/make money with them.
        
             | vvanders wrote:
             | From a "system engineering"(not to be confused with
             | software engineering) perspective they seem quite similar,
             | in my view even something like a watchdog timer(which just
             | about every CPU/core has these days) is just a hardware
             | version of similar philosophies. This[1] is one of my
             | favorite overviews on Erlang and what drives some of those
             | design decisions. You can absolutely apply the same
             | systematic thinking to other domains/places without having
             | to bring OTP or even Erlang into the conversation.
             | 
             | [1] https://ferd.ca/the-zen-of-erlang.html
        
         | cryptoxchange wrote:
         | It's a 2000-line Rust embedded systems kernel that doesn't
         | support adding new tasks at runtime. It is written to go deep
         | in the guts of the Oxide server racks.
        
         | bcantrill wrote:
         | Hubris is not an academic exercise: it runs at the heart of
         | every element of the Oxide rack (compute sled, switch, power
         | shelf controller) -- and its design is informed by delivered
         | utility above all else. Indeed -- and as Cliff elaborated in
         | the blog -- REPLY_FAULT was something that he thought initially
         | perhaps too aggressive, but it was our own experience in
         | building, deploying, and (it must be said!) debugging the
         | system that gave him the confidence that it would make our
         | systems more robust, not capriciously faulty.
         | 
         | For more details on the thinking here and what it looks like in
         | practice, see (e.g.) [0] and [1].
         | 
         | [0] https://www.mattkeeter.com/blog/2024-03-25-packing/
         | 
         | [1] https://cliffle.com/blog/who-killed-the-network-switch/
        
         | vvanders wrote:
         | > that can tolerate no real-world chaos, and I'm not aware of
         | any commercially viable realms which would either.
         | 
         | Watchdog timers will happily kill/restart your processes that
         | don't poke them often enough. Even in my hobby exercises I've
         | seen I2C busses hang up often enough(and bring the whole system
         | down!) when some protocol bit goes wrong that I think the
         | design is actually quite inspired. As I understand it this
         | isn't talking about known error cases(that are handled) but
         | protocol mismatches and other things that shouldn't ever
         | happen.
         | 
         | Many other comments touched on it but it's a purpose built OS,
         | much in the same way I'm not going to build a UI in Erlang,
         | Hubris seems well positioned for the space that it occupies.
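
For readers who have not met one, the watchdog behavior described above reduces to a very small state machine. This is a software sketch with an injected clock so it is deterministic; the `Watchdog` class and its method names are made up for the example, and real watchdogs are hardware peripherals with their own register interfaces.

```python
class Watchdog:
    """Resets the system unless it is poked at least every `timeout` seconds."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_poke = now
        self.resets = 0

    def poke(self, now):
        """Called by healthy software to prove it is still making progress."""
        self.last_poke = now

    def tick(self, now):
        """Called by the timer. Returns True if it just forced a reset."""
        if now - self.last_poke > self.timeout:
            self.resets += 1
            self.last_poke = now  # the restarted system gets a fresh deadline
            return True
        return False
```

Like REPLY_FAULT, the watchdog makes no attempt to diagnose or repair anything: it only guarantees that a system in a bad state does not stay there.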
        
       | lloydatkinson wrote:
       | I'm really enjoying his posts on this
        
       | cryptoxchange wrote:
       | It's interesting how in a system where one team writes all the
       | code, nuking your clients from orbit when they look at you funny
       | can improve iteration speed.
       | 
       | It's funny to wake up and read this after falling asleep reading
       | about algebraic effects.
       | 
       | If you squint the right way, this is a kernel that lets a server
       | perform an effect that the client cannot handle.
       | 
       | I feel like this would make code reuse and composition much
       | harder, but provides a much simpler execution model. Definitely
       | the right trade off in a static embedded system. You can always
       | just vendor and modify a task if you need to reuse it.
        
         | theamk wrote:
         | I don't think this will make reuse much worse even in general
         | programs, as long as there is a good division between expected
         | errors (file not found) and unexpected (invalid operation
         | code). In fact, there are a lot of ignorable errors in Unix
         | which IMHO should have been raising a fatal signal instead, as
         | this would substantially improve general software quality.
         | 
         | As an example: trying to close() an invalid FD is a non-fatal
         | error which is very often ignored. But it is actually super
         | dangerous, especially in multi-threaded apps: closing the wrong
         | fd will harmlessly fail most of the time, but 1% of the time
         | you'll close a logging socket or a database lock file or some
         | unrelated IPC connection. That's how you get unreliable
         | software everyone hates.
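
The hazard theamk describes comes from descriptor numbers being reused. A minimal single-threaded reproduction on a POSIX system, where a pipe stands in for the "logging socket" (`stale_close_demo` is a name made up for this sketch):

```python
import os


def stale_close_demo():
    """Return (reused, victim_alive) after a buggy second close of a stale fd."""
    old_r, old_w = os.pipe()
    stale = old_r              # a copy of the number that outlives the close
    os.close(old_r)            # legitimate close

    # Stand-in for "a logging socket": POSIX hands out the lowest free
    # descriptor number, so the new read end normally reuses `stale`.
    log_r, log_w = os.pipe()
    reused = (log_r == stale)

    try:
        os.close(stale)        # the bug: closing a descriptor we no longer own
    except OSError:
        pass                   # the harmless EBADF case

    try:
        os.fstat(log_r)
        victim_alive = True
    except OSError:
        victim_alive = False   # the unrelated descriptor was closed

    # Clean up whatever is still open.
    for fd in (old_w, log_w):
        os.close(fd)
    if victim_alive:
        os.close(log_r)
    return reused, victim_alive
```

When the number is reused, the second `close` succeeds and silently takes out the unrelated descriptor. In a multi-threaded program whether that happens depends on timing, which is why the failure looks rare and random.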
        
       | jerf wrote:
       | I would advise the author to read up on "asynchronous exceptions"
       | and check out how many systems have had them at some point and
       | removed them.
       | 
       | I'm not saying that's because they're fundamentally impossible,
       | but because they have a track record of tripping up language
       | designers and it's good to cross check against the experiences.
       | 
       | Recommended languages are Java (ultimately a failure despite vast
       | effort), and Haskell and Erlang where they work, but a lot of
       | work of very different kinds was put in to make it work. I
       | definitely get Erlang vibes from this piece so it's possible the
       | preconditions for correct asynchronous exceptions are met or can
       | be met here. But they are _very_ subtle and have a tempestuous
       | history of working 99.9% but it being literally impossible to get
       | to 100%. This could be a big, big, big trap.
        
         | theamk wrote:
         | Huh? I do not see anything asynchronous in the author's work.
         | It's all synchronous, because IPCs in Hubris are synchronous
         | too.
        
         | steveklabnik wrote:
         | I am not familiar with what you're talking about (but Cliff may
         | already know), I'll have to look into it. But Hubris is a
         | synchronous system, and also, these faults aren't catchable, so
         | I'm not sure how directly relevant it is. What's the specific
         | issue you're worried about?
         | 
         | Your Erlang vibes are there for good reason, it's certainly an
         | influence.
        
           | jerf wrote:
           | Reaching out and nuking other... whatever you call them,
           | "execution contexts" is what I go for to be maximally generic
           | (thread/async task/continuation/generator/etc.), can
           | particularly cause problems if the context was going to do X,
           | Y, and Z and expected to be able to be guaranteed to run Z,
           | but Y killed it. The standard example is for X to be taking a
           | lock and Z to release it, but there are a lot of ways to get
           | into trouble and the obvious first solutions don't work.
           | 
           | Erlang solves it by locking what things it has that can have
           | that problem behind other execution contexts that don't get
           | killed when the main one dies, so they can still clean up.
           | ("Ports", in their terminology.) Haskell solves it by being a
           | functional language and beating the collective community's
           | head in it for several years. (Immutability helped a lot,
           | laziness took out back.)
           | 
           | If that sounds impossible... hey, great! Then I just pattern
           | matched on something that wasn't a match. If that doesn't
           | sound impossible, then it may be worth a look around.
           | 
           | Synchronousness may not really matter, I've kind of thought
           | that "asynchronous exception" is not a good name for the
           | issue for a while, but it's what it gets called. It's really
           | about one execution context lobbing errors/exceptions into
           | others. Although being synchronous would avoid the worst
           | timing issues.
        
             | steveklabnik wrote:
             | Ah, that problem in general I am familiar with, yes.
             | 
             | Tasks in Hubris are independently compiled programs, not
             | threads in a shared context. So I don't believe that it's
             | an issue. You don't share locks between tasks, you create a
             | task that holds the shared resource, and have the two tasks
             | that want to share it talk to that task, patterns like
             | that.
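
The "task that holds the shared resource" pattern can be sketched with message passing. This is a loose analogy rather than Hubris code: Python threads share an address space (Hubris tasks, per the comment above, are independently compiled programs), and `counter_task`, `call`, and `run_demo` are hypothetical names. The point is only that no lock is shared between clients.

```python
import queue
import threading


def counter_task(inbox):
    """Owner task: the only code that ever touches `count`."""
    count = 0
    while True:
        msg, reply = inbox.get()
        if msg == "incr":
            count += 1
            reply.put(count)
        elif msg == "stop":
            reply.put(count)
            return


def call(inbox, msg):
    """Synchronous send: block until the owner replies."""
    reply = queue.Queue(maxsize=1)
    inbox.put((msg, reply))
    return reply.get()


def run_demo(clients=2, per_client=100):
    inbox = queue.Queue()
    owner = threading.Thread(target=counter_task, args=(inbox,))
    owner.start()

    def client():
        for _ in range(per_client):
            call(inbox, "incr")

    workers = [threading.Thread(target=client) for _ in range(clients)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    total = call(inbox, "stop")
    owner.join()
    return total
```

If a client were killed mid-request, the counter would remain consistent, since no client ever holds a lock on it. That is the property that sidesteps the lock-release problem raised earlier in this subthread.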
        
         | beeeeerp wrote:
         | >Recommended languages are Java (ultimately a failure despite
         | vast effort)
         | 
         | Why is Java a failure? Recent JVMs have come a long way, and
         | GraalVM makes it somewhat comparable to Go-like languages.
         | 
         | I understand the historical hate and how Oracle bought it, but
         | it really isn't _that_ bad of a language if you're using modern
         | Java.
        
           | steveklabnik wrote:
           | They're referring to that specific feature of Java being a
           | failure, not the language in general.
        
           | j16sdiz wrote:
           | The parent was talking about async exceptions in Java -- like
           | the InterruptedException. They are hard to work with or
           | reason about.
        
       ___________________________________________________________________
       (page generated 2024-04-27 23:00 UTC)