[HN Gopher] The server chose violence
___________________________________________________________________
The server chose violence
Author : lukastyrychtr
Score : 231 points
Date : 2024-04-27 09:37 UTC (13 hours ago)
(HTM) web link (cliffle.com)
(TXT) w3m dump (cliffle.com)
| pavlov wrote:
| _> 'But REPLY_FAULT also provides a way to define and implement
| new kinds of errors -- application-specific errors -- such as
| access control rules. For instance, the Hubris IP stack assigns
| IP ports to tasks statically. If a task tries to mess with
| another task's IP port, the IP stack faults them. This gets us
| the same sort of "fail fast" developer experience, with the
| smaller and simpler code that results from not handling
| "theoretical" errors that can't occur in practice.'_
|
| This sounds good when the system is small and tight and
| applications are written mostly by people who designed the whole
| system.
|
| But as an application developer, I'd be somewhat scared to
| interface with third-party code over an IPC model where the other
| service can at any time send back an instant death pill to my
| process.
|
| I guess I just don't trust other app developers that much. The
| world is full of terrible drivers and background processes
| written by stressed-out developers harassed by management.
| They'll drop in a bunch of potentially unsuitable default
| REPLY_FAULTs if it means they get to go home before 8pm.
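The access-control rule quoted at the top of this comment can be modeled in a few lines. This is only a toy sketch in Python, not Hubris code: `Task`, `ToyIpStack`, `handle_send` and `reply_fault` are hypothetical names invented for illustration, and the real kernel mechanism is not shown in this thread. It assumes a fixed port-to-task table and a server that faults the caller instead of returning an error code.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    """Toy stand-in for a statically defined task (hypothetical model)."""
    name: str
    faulted: bool = False
    fault_reason: str = ""


def reply_fault(client: Task, reason: str) -> None:
    """Toy version of faulting a client: mark it dead and record why."""
    client.faulted = True
    client.fault_reason = reason


@dataclass
class ToyIpStack:
    """Toy server whose port ownership table is fixed up front."""
    port_owner: Dict[int, str]  # port number -> name of the owning task

    def handle_send(self, client: Task, port: int, payload: bytes) -> Optional[str]:
        # Precondition: a task may only use a port assigned to it.
        if self.port_owner.get(port) != client.name:
            # No error code is returned; the caller is faulted instead.
            reply_fault(client, f"port {port} is not owned by {client.name}")
            return None  # a faulted client never observes a reply
        return f"sent {len(payload)} bytes on port {port}"
```

In this model the client has no error path to write for "wrong port", which is the code-size argument made in the quote. The cost is the one this comment worries about: the server alone decides what counts as a fault.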
| skitter wrote:
| > This sounds good when the system is small and tight and
| applications are written mostly by people who designed the
| whole system.
|
| I think that's intentional because that's what Hubris is aimed
| at.
| mannykannot wrote:
| ...and in that circumstance, the author reports finding,
| apparently serendipitously, that it helped with development:
| _"Initially I was concerned that I'd made the kernel too
| aggressive, but in practice, this has meant that errors are
| caught very early in development. A fault is hard to miss,
| and literally cannot be ignored the way an error code might
| be."_
| mikaraento wrote:
| Indeed, this happened with Symbian. An IPC server could panic
| the client. As an application developer without access to the
| OS source code this was pretty terrible. Not all preconditions
| were easily understood and could vary between devices and OS
| versions.
| password4321 wrote:
| > _not handling "theoretical" errors that can't occur in
| practice_
|
| The Dennis Nedry approach to counting dinosaurs in Jurassic
| Park.
| toast0 wrote:
| > This sounds good when the system is small and tight and
| applications are written mostly by people who designed the
| whole system.
|
| Swift death to deviance is a way to keep the system tight. The
| designed scope probably keeps it small anyway. Scopes have a
| way of creeping, but I don't think people will want to force
| tasks into Hubris that would be better on the host rather than
| in its embedded controllers.
| ahepp wrote:
| It seems like in an embedded environment, it's good to resolve
| these misunderstandings immediately when they occur, regardless
| of whose fault it is.
|
| The server says "that client is bad!" so the kernel kills it.
| The problem is really that the two didn't understand each
| other.
| aidenn0 wrote:
| > But as an application developer, I'd be somewhat scared to
| interface with third-party code over an IPC model where the
| other service can at any time send back an instant death pill
| to my process.
|
| For service, think "OS interface". If you make a bogus kernel
| call on a monolithic kernel, it would be reasonable for the OS
| to kill you. Also note that when you say "process" it might be
| different than you think because threads all share the same
| address space on hubris.
| rcarmo wrote:
| Hubris and Humility (its debugger) are two pieces of tech I would
| love to be deeply engrossed in if I had the time (or the
| mandate). But alas, that is not possible.
| autocole wrote:
| Very enjoyable read, and this single supervisor is similar to how
| I set up an application at a previous startup, where we unwrapped
| everything. This reminds me of one of my favorite posts
| https://medium.com/@mattklein123/crash-early-and-crash-often...
| greenbit wrote:
| Reminded me of the line from Errand of Mercy, "You will find
| there are many rules and regulations. They will be posted.
| Violation of the smallest of them will be punished by death."
| e-dant wrote:
| I'm wondering if this really is too aggressive.
|
| On Linux, sure it's not possible to directly crash another
| program you're talking to via a socket alone (ignoring bad data
| on the socket).
|
| But you can absolutely kill them. Anything running as root can
| kill anything else. Can even reboot and bring down the whole
| system.
|
| Maybe a bit harder and a bit more unusual, but at least for
| containers, root privileges are common. And yeah, sure, there's a
| cgroup there and you're more limited. But you get the idea.
|
| It's also a bit different from the (conventional?) wisdom about
| being "liberal in what you accept, conservative in what you emit"
| though that's a bit more tied to networked systems.
|
| Though, maybe it's inevitable that a system _has_ to be liberal
| in what it accepts.
|
| How else can you change the api slightly without breaking
| existing programs?
| gary_0 wrote:
| Hubris isn't a general-purpose OS, it runs on a low-level
| processor inside the Oxide server rack. I believe Hubris
| doesn't even allow new kinds of processes at runtime; all
| possible executables must be determined at compile time.
| steveklabnik wrote:
| > I believe Hubris doesn't even allow new kinds of processes
| at runtime; all possible executables must be determined at
| compile time.
|
| This is correct, yes.
| gary_0 wrote:
| "Perfection is achieved, not when there is nothing more to
| add, but when there is nothing left to take away."
| sillywalk wrote:
| > I believe Hubris doesn't even allow new kinds of processes
| at runtime; all possible executables must be determined at
| compile time.
|
| Correct. From [0]:
|
| "Hubris is an aggressively static system. The configuration
| file defines the full set of tasks that may ever be running
| in the application. These tasks are assigned to sections of
| address space by the build system, and they will forever
| occupy those sections.
|
| Hubris has no operations for creating or destroying tasks at
| runtime. Task resource requirements are determined during the
| build and are fixed once deployed. This takes the kernel out
| of the resource allocation business. Memories are the most
| visible resources we handle this way, but it applies to all
| allocatable or routable resources, including hardware
| interrupts and memory-mapped registers - all are explicitly
| wired up at compile time and cannot be changed at runtime."
|
| [0] https://cliffle.com/blog/on-hubris-and-humility/
| layer8 wrote:
| Does REPLY_FAULT cascade? Meaning, if A is waiting in a SEND to
| B, and B is waiting in a SEND to C, and C does REPLY_FAULT, does
| A get killed along with B (and any further tasks that may be
| waiting on A)? Because if not, a malicious task could just
| delegate its experiments to a helper task. And if yes, that seems
| rather brittle overall (without having any further familiarity
| with Hubris). Furthermore, if SENDs can be circular/reciprocal, a
| task may also inadvertently kill itself that way -- which (for
| scenarios like B -> A -> B) may incentivize _not_ using
| REPLY_FAULT.
| samus wrote:
| It seems that Hubris is not designed as a general-purpose
| operating system. Processes are defined at build time.
|
| The reason why servers can shoot back at their clients is
| reliability, not security. Errors are thought to originate from
| bugs, not from deliberate attacks. The extreme reaction of the
| kernel ensures that developers find them as soon as possible.
|
| Of course, there is an overlap with security, and this can be a
| useful fallback measure in the event that a process tries to do
| something that it isn't supposed to do.
| steveklabnik wrote:
| > It seems that Hubris is not designed as a general-purpose
| operating system. Processes are defined at build time.
|
| These are both correct.
|
| Well, I mean, Hubris is general in the sense that, if you're
| doing an embedded system and you can deal with the
| constraints it has, like the latter, it can work for your
| projects. But it's not trying to be anything other than a
| good embedded OS, or to handle any project.
| ironhaven wrote:
| I think when B gets faulted A would get an error about a dead
| server and would have the opportunity to resend the same message
| to a newly reset server, so there's no cascading crash.
| manishsharan wrote:
| I recall server ABENDs in Novell NetWare. I think it was the OG
| of server violence.
| optimalsolver wrote:
| Title sounds like it concerns a really fed-up waiter.
| adonovan wrote:
| In a sense, it does: waiting is one of the main jobs of an OS
| kernel.
| ahepp wrote:
| > The Hubris IPC scheme is deliberately designed to work a lot
| like a function call, at least from the perspective of the
| client.
|
| That's a bona fide remote procedure call, isn't it?
| steveklabnik wrote:
| In a sense, though most would think of "remote" as being "over
| the network," and that's not the case here.
| samus wrote:
| I find Humility is a great name for a debugger. Many are the
| programmers that refuse to use debuggers and just stare the code
| down until it yields errors, under the assumptions that "good"
| code doesn't need debugging!
| hinkley wrote:
| I find more bugs with a debugger. There's typically the bug I
| was looking for, and then smaller bugs that didn't technically
| cause the problem but contributed, and may be involved in the
| next issue. I want to fix those too, and sometimes first.
| r2_pilot wrote:
| I find this attitude bizarre. Just earlier today I used python
| debugging to quickly figure out why an error was occurring.
| Being able to see the state of the variables without having to
| print each helped solve it instantly.
| YZF wrote:
| It's partly a religious thing, partly what you're used to and
| partly using the right tool for the job. Some programmers use
| debuggers as a crutch and some complex systems (e.g. that
| involve multiple distributed components or are timing
| dependent) can't be easily debugged using traditional
| debuggers.
|
| EDIT: yet another factor is sometimes you may not even have
| access to the system you need to troubleshoot. Being able to
| reason about code execution without observing it is a useful
| skill (and still a debugger is a useful tool).
| hinkley wrote:
| I wonder if they're going to find this creates security issues.
|
| Processes keep state to analyze abuse of various kinds, and
| killing a process presumably wipes its memory. Unless there's
| some way to retain state across restarts?
| bcantrill wrote:
| Yes, we have an _in situ_ dump facility, which Cliff mentioned
| at the end of [0]; it's been essential for debugging these
| issues when we hit them.
|
| [0] https://cliffle.com/blog/who-killed-the-network-switch/
| Animats wrote:
| That's QNX-type interprocess communication. QNX doesn't offer
| interprocess kill, though.
| steveklabnik wrote:
| The designer of Hubris (and several folks who work on it) are
| familiar with QNX, for sure.
| rjbwork wrote:
| Reminds me of Vigil. https://github.com/munificent/vigil
| crdrost wrote:
| OK, we need to get this as an April Fools RFC for HTTP.
|
| I propose HTTP 499 "Shame on you." A client receiving 499
| (perhaps on a request that it must have originated with a
| specific header like "Strict: true") must terminate, in a
| language-dependent manner, the task which issued the request.
|
| It perfectly balances the "WTF... But actually, hey" that one
| sees in those contexts.
| ahepp wrote:
| It sounds like this may be similar to using signals for error
| handling in a Unix system?
| steveklabnik wrote:
| In some sense, yes, this is kind of like the kernel sending
| SIGKILL to a process.
| loeg wrote:
| > Take Unix for example. If you call close on a file descriptor
| you never opened, you get an error code back. If you call open
| and hand it a null pointer instead of a pathname? You get an
| error code back. Both of these are violations of a system call's
| preconditions, and both are handled through the same error
| mechanism that handles "file not found" and other cases that can
| happen in a correct program.
|
| > On Hubris, if you break a system call's preconditions, your
| task is immediately destroyed with no opportunity to do anything
| else.
|
| Oh, yeah. I've long thought EBADF and EINVALs (and EFAULT, I
| guess) should basically always be fatal.
| ezekiel68 wrote:
| > There is no way to "fix" the problem and resume the task. This
| was a conscious choice to avoid some subtle failure modes and
| simplify reasoning about the system.
|
| One of Einstein's famous quotes is, "...as simple as possible,
| but no simpler." I'm pretty sure this design violates the latter
| portion. I'm not interested in operating environments that can
| tolerate no real-world chaos, and I'm not aware of any
| commercially viable realms which would either. What -- push it
| back to the init system to keep trying again? But by what
| mechanism would that strategy be able to understand the fault
| that occurred, in order to try again better?
|
| Anyway, kudos for purity of conviction (I guess).
| crote wrote:
| > But by what mechanism would that strategy be able to
| understand the fault that occurred, in order to try again
| better?
|
| I think the general idea is to apply this to problems which are
| _clearly_ the result of an invalid program state, and therefore
| not reasonably recoverable. They are either caused by bugs, an
| attack, or corrupted hardware. In all cases you shouldn't
| continue, because there's something _seriously_ wrong with the
| caller. If the caller continues, it could only cause more
| damage.
|
| It sounds a bit like Erlang/OTP's "let it crash" philosophy.
| Erlang is used in quite a bunch of mission-critical hardware
| and is famous for its reliability, so it might not be such a
| huge dealbreaker in practice.
| sillywalk wrote:
| > It sounds a bit like Erlang/OTP's "let it crash"
| philosophy.
|
| Which was based partly on ideas from Tandem Computers'
| NonStop / Guardian. Hardware and software were fail-fast i.e.
| they would work correctly or stop, so they couldn't corrupt
| data. If there was a problem, the whole processor / process
| would be stopped, and a backup took over, which seems
| somewhat similar to the "supervisor" tasks in hubris.
|
| Quite different use cases, though - an embedded OS for
| microcontrollers vs large OLTP applications. They both could
| be considered "mission critical", at least for the people who
| own/make money with them.
| vvanders wrote:
| From a "system engineering" (not to be confused with
| software engineering) perspective they seem quite similar.
| In my view even something like a watchdog timer (which just
| about every CPU/core has these days) is just a hardware
| version of similar philosophies. This [1] is one of my
| favorite overviews on Erlang and what drives some of those
| design decisions. You can absolutely apply the same
| systematic thinking to other domains/places without having
| to bring OTP or even Erlang into the conversation.
|
| [1] https://ferd.ca/the-zen-of-erlang.html
| cryptoxchange wrote:
| It's a 2000-line Rust embedded systems kernel that doesn't
| support adding new tasks at runtime. It is written to go deep
| in the guts of the 0xide server racks.
| bcantrill wrote:
| Hubris is not an academic exercise: it runs at the heart of
| every element of the Oxide rack (compute sled, switch, power
| shelf controller) -- and its design is informed by delivered
| utility above all else. Indeed -- and as Cliff elaborated in
| the blog -- REPLY_FAULT was something that he thought initially
| perhaps too aggressive, but it was our own experience in
| building, deploying, and (it must be said!) debugging the
| system that gave him the confidence that it would make our
| systems more robust, not capriciously faulty.
|
| For more details on the thinking here and what it looks like in
| practice, see (e.g.) [0] and [1].
|
| [0] https://www.mattkeeter.com/blog/2024-03-25-packing/
|
| [1] https://cliffle.com/blog/who-killed-the-network-switch/
| vvanders wrote:
| > that can tolerate no real-world chaos, and I'm not aware of
| any commercially viable realms which would either.
|
| Watchdog timers will happily kill/restart your processes that
| don't poke them often enough. Even in my hobby exercises I've
| seen I2C busses hang up often enough (and bring the whole system
| down!) when some protocol bit goes wrong that I think the
| design is actually quite inspired. As I understand it this
| isn't talking about known error cases (that are handled) but
| protocol mismatches and other things that shouldn't ever
| happen.
|
| Many other comments touched on it, but it's a purpose-built OS.
| Much in the same way that I'm not going to build a UI in Erlang,
| Hubris seems well positioned for the space that it occupies.
| lloydatkinson wrote:
| I'm really enjoying his posts on this
| cryptoxchange wrote:
| It's interesting how in a system where one team writes all the
| code, nuking your clients from orbit when they look at you funny
| can improve iteration speed.
|
| It's funny to wake up and read this after falling asleep reading
| about algebraic effects.
|
| If you squint the right way, this is a kernel that lets a server
| perform an effect that the client cannot handle.
|
| I feel like this would make code reuse and composition much
| harder, but provides a much simpler execution model. Definitely
| the right trade off in a static embedded system. You can always
| just vendor and modify a task if you need to reuse it.
| theamk wrote:
| I don't think this will make reuse much worse even in general
| programs, as long as there is a good division between expected
| errors (file not found) and unexpected (invalid operation
| code). In fact, there are a lot of ignorable errors in Unix
| which IMHO should have been raising a fatal signal instead, as
| this would substantially improve general software quality.
|
| As an example: trying to close() an invalid FD is a non-fatal
| error which is very often ignored. But it is actually super
| dangerous, especially in multi-threaded apps: closing the wrong
| fd will harmlessly fail most of the time, but 1% of the time
| you'll close a logging socket or a database lock file or some
| unrelated IPC connection. That's how you get unreliable
| software everyone hates.
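The failure mode described in this comment can be reproduced deterministically in a single thread. The following is a minimal sketch in Python, assuming a POSIX system where `open()` hands out the lowest free descriptor number. `stale_close_demo` and the file names are made up for the demonstration.

```python
import errno
import os
import tempfile


def stale_close_demo():
    """Show a stale close() landing on an unrelated, reused descriptor."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "log.txt")
        lock_path = os.path.join(tmp, "lock.txt")

        fd_log = os.open(log_path, os.O_WRONLY | os.O_CREAT)
        os.close(fd_log)   # the legitimate close
        stale = fd_log     # bug: the old number is kept around

        # The next open() is expected to reuse the number that was just freed.
        fd_lock = os.open(lock_path, os.O_WRONLY | os.O_CREAT)
        results["reused"] = (fd_lock == stale)

        try:
            os.close(stale)  # the buggy second close
            results["stale_close_failed"] = False
        except OSError:
            results["stale_close_failed"] = True

        try:
            os.write(fd_lock, b"held")
            results["lock_still_open"] = True
            os.close(fd_lock)  # clean up if the descriptor survived
        except OSError as exc:
            results["lock_still_open"] = False
            results["errno"] = exc.errno
    return results
```

When the number is reused, the second `close()` reports success and the later write to the "lock file" is the call that fails with `EBADF`, far from the actual bug. A fatal signal at the first invalid `close()` would point at the faulty line instead.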
| jerf wrote:
| I would advise the author to read up on "asynchronous exceptions"
| and check out how many systems have had them at some point and
| removed them.
|
| I'm not saying that's because they're fundamentally impossible,
| but because they have a track record of tripping up language
| designers and it's good to cross check against the experiences.
|
| Recommended languages are Java (ultimately a failure despite vast
| effort), and Haskell and Erlang where they work, but a lot of
| work of very different kinds was put in to make it work. I
| definitely get Erlang vibes from this piece so it's possible the
| preconditions for correct asynchronous exceptions are met or can
| be met here. But they are _very_ subtle and have a tempestuous
| history of working 99.9% but it being literally impossible to get
| to 100%. This could be a big, big, big trap.
| theamk wrote:
| huh? Do not see anything asynchronous in author's work. It's
| all synchronous, because IPCs in hubris are synchronous too.
| steveklabnik wrote:
| I am not familiar with what you're talking about (but Cliff may
| already know), I'll have to look into it. But Hubris is a
| synchronous system, and also, these faults aren't catchable, so
| I'm not sure how directly relevant it is. What's the specific
| issue you're worried about?
|
| Your Erlang vibes are there for good reason, it's certainly an
| influence.
| jerf wrote:
| Reaching out and nuking other... whatever you call them,
| "execution contexts" is what I go for to be maximally generic
| (thread/async task/continuation/generator/etc.), can
| particularly cause problems if the context was going to do X,
| Y, and Z and expected to be able to be guaranteed to run Z,
| but Y killed it. The standard example is for X to be taking a
| lock and Z to release it, but there are a lot of ways to get
| into trouble and the obvious first solutions don't work.
|
| Erlang solves it by locking what things it has that can have
| that problem behind other execution contexts that don't get
| killed when the main one dies, so they can still clean up.
| ("Ports", in their terminology.) Haskell solves it by being a
| functional language and beating the collective community's
| head in it for several years. (Immutability helped a lot,
| laziness took out back.)
|
| If that sounds impossible... hey, great! Then I just pattern
| matched on something that wasn't a match. If that doesn't
| sound impossible, then it may be worth a look around.
|
| Synchronousness may not really matter, I've kind of thought
| that "asynchronous exception" is not a good name for the
| issue for a while, but it's what it gets called. It's really
| about one execution context lobbing errors/exceptions into
| others. Although being synchronous would avoid the worst
| timing issues.
| steveklabnik wrote:
| Ah, that problem in general I am familiar with, yes.
|
| Tasks in Hubris are independently compiled programs, not
| threads in a shared context. So I don't believe that it's
| an issue. You don't share locks between tasks, you create a
| task that holds the shared resource, and have the two tasks
| that want to share it talk to that task, patterns like
| that.
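The pattern described here, one task that owns the shared resource and answers requests from the others, can be sketched roughly as follows. This is an illustration in Python, not Hubris code: Python threads share an address space, unlike separately compiled tasks, and the names `counter_server`, `call` and `run_demo` are hypothetical. The queues stand in for synchronous send and reply.

```python
import queue
import threading


def counter_server(requests: queue.Queue) -> None:
    """Sole owner of the counter: no other code touches it, so no lock is needed."""
    count = 0
    while True:
        op, reply_to = requests.get()
        if op == "incr":
            count += 1
        reply_to.put(count)  # every request gets exactly one reply
        if op == "stop":
            return


def call(requests: queue.Queue, op: str) -> int:
    """Synchronous send: block until the server replies."""
    reply_to = queue.Queue(maxsize=1)
    requests.put((op, reply_to))
    return reply_to.get()


def run_demo(clients: int = 4, increments: int = 100) -> int:
    """Start one server and several clients, return the final count."""
    requests = queue.Queue()
    server = threading.Thread(target=counter_server, args=(requests,))
    server.start()

    def client() -> None:
        for _ in range(increments):
            call(requests, "incr")

    workers = [threading.Thread(target=client) for _ in range(clients)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    total = call(requests, "stop")
    server.join()
    return total
```

Because a client never holds a lock on the counter, killing a client mid-request cannot leave the resource locked. The server simply handles the next message.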
| beeeeerp wrote:
| >Recommended languages are Java (ultimately a failure despite
| vast effort)
|
| Why is Java a failure? Recent JVMs have come a long way, and
| GraalVM makes it somewhat comparable to Go-like languages.
|
| I understand the historical hate and how Oracle bought it, but
| it really isn't _that_ bad of a language if you're using modern
| Java.
| steveklabnik wrote:
| They're referring to that specific feature of Java being a
| failure, not the language in general.
| j16sdiz wrote:
| The parent was talking about async exceptions in Java -- like
| InterruptedException. They are hard to work with or reason
| about.
___________________________________________________________________
(page generated 2024-04-27 23:00 UTC)