[HN Gopher] BSD kqueue is a mountain of technical debt (2021)
___________________________________________________________________
BSD kqueue is a mountain of technical debt (2021)
Author : thunderbong
Score : 109 points
Date : 2024-12-29 15:20 UTC (7 hours ago)
(HTM) web link (ariadne.space)
(TXT) w3m dump (ariadne.space)
| caboteria wrote:
| (2021)
| Joker_vD wrote:
| So the argument is that extending kqueue's interface to handle
| more and more events is worse than turning more and more events
| into xxxfd subsystems. Why is that worse, again? Like, not in
| "it's not a composable design" abstract sense but in concrete
| "you can't do this and that, which are useful things" sense?
| jsnell wrote:
| Yeah, the article really doesn't make a case as much as assert
| it.
|
| One way to think about this is whether all these non-file fds
| are useful for a variety of generic operations, or whether
| they're only used for epoll and one-off system calls or ioctls
| specific to that type of fd. If it's the latter, it seems hard
| to believe that there actually is some kind of composability
| advantage.
|
| So, what can you do with them?
|
| 1. You can use these fds with poll()/select(), not just epoll.
| That's not a big deal, since nobody really should be using
| either of them. But it does also mean that you should ideally
| be able to use them with any future event notification syscalls
| as well. And the invention of new mechanisms is not
| hypothetical, since Linux added io_uring. I'd be curious to
| know how well io_uring supports arbitrary fd types, and whether
| they got that support "for free" or if io_uring needs to be
| extended for every new fd type.
|
| 2. You can typically read() structured data from those fds
| describing the specific event. With kqueue all that data would
| need to be passed through struct kevent, which needs to be
| fixed size and generic enough to be shared between all the
| different event types.
|
| 3. You can pass individual fds between processes, either via
| UDS or fork(). I expect you would not be able to do that for
| individual kqueue filters.
|
| 4. You can close() the fd to release the underlying kernel
| resource. This doesn't seem interesting, just listing it for
| completeness.
|
| So there's probably enough smoke there that one could at least
| argue the case. It's too bad the author didn't.
| stevefolta wrote:
| > ...nobody really should be using either of them.
|
| What's wrong with poll(), at least for smaller values of
| nfds? And what should one use instead when writing POSIX-
| compatible portable code, where epoll() and kqueue() don't
| exist?
| somat wrote:
| Fun fact: on OpenBSD (at least; I don't know about FreeBSD),
| select and poll were reworked to use kqueue internally.
|
| The thread is an interesting read, as it sounds like the
| naive approach has negative performance implications.
| Knowing OpenBSD, I suspect the main motivation was to reduce
| maintenance burden, that is, to make it easier to reason about
| the internals of event-driven mechanisms by having only one
| such system to worry about.
|
| https://marc.info/?l=openbsd-tech&m=162444132017561&w=2
| ajross wrote:
| Do you have a third party driver that exposes a file descriptor
| for some oddball purpose? epoll can wait on that.
|
| It's absolutely true that _within the bounds_ of what the OS
| supports, both interfaces are complete. But that's sort of a
| specious point. One requires a giant tree of extra types be
| amended every time something changes, while the other exploits
| a nearly-five-decade old abstraction that already encapsulates
| the idea of "something on which you might want to wait".
| markjdb wrote:
| The same is true of kqueue/kevent though... the driver just
| needs to decide which filters it wants to implement. There's
| no need to extend kqueue when adding some custom driver or fd
| type. One just needs to define some semantics for the
| existing filters.
| ajross wrote:
| > the driver just needs to decide
|
| That's pretty much the definition of technical debt though.
| "This interface works fine for you if you take steps to
| handle it specially according to its requirements". It
| makes kqueue into a point of friction for everything in the
| system wanting to provide a file descriptor-like API.
| netbsdusers wrote:
| No it isn't. Letting files be poll/select/epoll'd isn't
| free either. They don't get support for that by magic. A
| poll operation has to be coded, and this is just as
| much a "point of friction" then as supporting kqueue. (It
| bears mentioning as well that on at least DragonFly BSD
| and OpenBSD, they have reimplemented poll()/select() to
| use kqueue's knotes mechanism, and so now you only have
| to write a VOP_KQFILTER implementation and not a VOP_POLL
| too.)
| Someone wrote:
| I would think it's the other way around. If you have separate
| subsystems for events of types A, B, C, etc. there's no easy
| way you can have a process wait for "an event of type A or C,
| whichever comes first".
|
| There also is the thing that, sometimes, different types of
| event are identical types, lower down in the stack. A simple
| example is keyboard and mouse events generated from USB
| devices.
|
| Why would the low-level USB stack have to do a switch on device
| type to figure out what subsystem to post an event to?
| IshKebab wrote:
| Because "wait for an event" is a good abstraction for many
| event types. The epoll interface is a generic wait interface
| whereas the kqueue one is a hardcoded list of things you can
| wait for.
|
| Think about async programming: you have a single `await` keyword
| that can await lots of different things.
| toast0 wrote:
| How many event types do you need though? And would you
| actually need to do less work to add support for a new one to
| a user program? It would be different work, for sure, but I
| don't think it would be meaningfully less work.
| baggy_trough wrote:
| I must have missed where the mountain of technical debt was. The
| article just says epoll has a more composable design.
| vacuity wrote:
| To be fair (?), the title seems like classic clickbait. The
| thesis seems to just be
|
| > While I agree that epoll doesn't come close, I think that's
| actually a feature that has lead to a much more flexible and
| composable design.
|
| You could read it as "kqueue is more prone to tech debt than
| epoll if events keep getting added".
| klabb3 wrote:
| While I don't doubt that the author has good reason for their
| opinion, the reason is handwaved here, no? Where is the tech
| debt? In the kernel only? Does it affect user space? For _who_ is
| it more or less composable?
|
| I personally think "async IO" (a term I don't love, but anyway),
| should be agreed upon and settled across OSs. Not necessarily
| exact syscall equivalence, but the major design choices need to
| be the same. For instance, who owns the buffers (kernel or user),
| and should it be completion or ready-based event systems?
|
| The reason I think this is so important to agree upon, is that
| language runtimes (and other custom runtimes like libuv) need to
| be simplified in order to allow for healthy growth. Any language
| that doesn't have cross-platform support for "async IO" is
| completely nerfed, increasing the barrier of entry for things
| that otherwise would be "graduate level" projects, into deep
| OS/kernel knowledge that is hard to find outside FAANG and a few
| other spaces.
| BobbyTables2 wrote:
| I fully agree.
|
| Synchronous I/O looks appealing for beginners and simple
| projects. It seems like the simplest path for things where I/O
| isn't the primary focus.
|
| Then one suddenly finds themselves juggling ephemeral threads
| to "background" tasks or such. The application quickly becomes
| a multi-threaded monster and still suffers responsiveness
| issues from synchronous use.
|
| The Tower of Babel created from third party runtimes further
| adds to the mess even for those who try to do better.
| worik wrote:
| > Synchronous I/O looks appealing for beginners and simple
| projects.
|
| Yes. And.
|
| Synchronous I/O is suitable for almost all projects
|
| When it is unsuitable it is catastrophically unsuitable.
| astrange wrote:
| Synchronous I/O is actually better than async for many
| programs, partly because almost all async systems were
| designed without thinking about priority inversions and
| can't resolve them properly. If you have futures, you have
| this problem.
| liontwist wrote:
| The entire unix operating system is designed to abstract away
| async problems and turn them sync so you can write sane code
| (if, else, for).
|
| In my experience simple blocking code is incredibly
| misunderstood. Most programmers stop their thinking at
| "blocking is slow and non blocking is fast!"
| mschuster91 wrote:
| > The entire unix operating system is designed to abstract
| away async problems and turn them sync so you can write
| sane code (if, else, for).
|
| That only works for workloads that can be independently
| parallelized. For stuff that cannot be - which is the
| majority of consumer workload - you're in for a world of
| pain with locking and whatnot other issues.
| liontwist wrote:
| No. It's almost all the time that your computer has more
| than one thing to do;
|
| - your browser uses a process per tab
| - your web server responds to multiple requests in parallel
| - your web app is going to launch other programs
| - your desktop ui is going to have a compositing server
|   separate from each application
| - your compiler is going to work on multiple source files
| jandrewrogers wrote:
| The differences in design reflect differences in use case
| assumptions, which in turn reflect that different OSes tend to
| be used for different workloads.
|
| Broadly speaking it is difficult to use async I/O well without
| understanding the implementation details. It is a sharp tool
| for experts, not magic "go faster" juice.
| bluetomcat wrote:
| The claim is silly and unfounded. If anything, kqueue is the
| interface that pollutes userspace less. You have a single
| kevent call that is used both for waiting and registering
| events represented by (filter, ident) tuples. All of the data
| related to the event is contained in a struct that's passed
| between the kernel and the user. For non-fd events
| (EVFILT_PROC, EVFILT_TIMER, EVFILT_SIGNAL), it is much more
| straightforward to use compared to the Linux way where you need
| to keep track of one more resource, read specific binary data
| from it, have specific flags, etc.
| somat wrote:
| The article reads more like PR damage control than an actual
| complaint. epoll has a reputation as a fragile interface that
| is hard to use correctly, and kqueue as relatively simple and
| sane. So the question always is: why didn't Linux just adopt
| the already existing kqueue instead?
|
| Sometimes it seems that Linux does end up with more than its
| fair share of technical excellence coupled to a bad
| interface. iptables, epoll and git come to mind.
| Aurornis wrote:
| > While I don't doubt that the author has good reason for their
| opinion, the reason is handwaved here, no?
|
| I came to the comments hoping to get more explanation. I was
| waiting for the full explanation of the technical debt but then
| the article just came to an abrupt end.
|
| I wonder if the article started as a short history and
| explanation of differences, but then the dramatic headline was
| chosen to drive more clicks? The article could have been
| interesting by itself without setting up the "mountain of
| technical debt" claim in the headline.
| mst wrote:
| I know ariadne well enough that I'm pretty confident that the
| headline will have been chosen out of annoyance rather than
| click maximisation.
|
| That doesn't mean I agree with the conclusion (I am
| ambivalent and would have to think rather more before having
| an opinion myself) but I'm reasonably sure of the motivation
| nonetheless.
| a-dub wrote:
| a good computer system would allow the application programmer
| to write their code in a synchronous and single threaded manner
| and all performance optimizations with respect to
| multiprocessing (and associated resource sharing) would be
| handled automatically.
| cempaka wrote:
| I'm not sure providing that level of complex logic and
| scheduling is really a proper task for OS kernels. It's
| essentially what you get from green threads / coroutines
| which are provided by numerous runtimes and frameworks, but
| those have to be translated to the various different async IO
| facilities in order to be portable between OSes.
| liontwist wrote:
| This is almost what we have in UNIX. The resource sharing and
| multithreading happens automatically if you start shelling
| out to programs to do work, or have a simple multi process
| architecture like NGINX, Postgres, and web apps.
| badgersnake wrote:
| If you come from the starting point that Linux is correct and
| everything different is wrong then you get blog posts like
| this.
| throw16180339 wrote:
| The author thinks that musl, Alpine, and Linux are right and
| everything else is wrong.
|
| * https://ariadne.space/2022/03/27/the-tragedy-of-
| gethostbynam... - musl's DNS resolver doesn't support TCP
| lookups, but it's your fault for expecting the standard DNS
| lookup functions to work. You should change your program to
| use a third-party library for DNS lookups. This post is a
| couple years old; musl now supports TCP lookups.
|
| * https://ariadne.space/2021/06/25/understanding-thread-
| stack-... - Alpine has a stack size that's substantially
| smaller (128k) than other widely used OSes (512k+). Your
| program may work everywhere else, but you're wrong to assume
| a reasonable stack size. Here's how to jump through a hoop
| and make it work on Alpine.
| blueflow wrote:
| Same person that both contributed the code of conduct to
| Alpine and later got caught bragging on twitter about
| having bullied people out of the project.
|
| The work of someone who did not ask themselves the
| necessary questions, but now it's some years later and
| things have changed.
| johnbellone wrote:
| Sounds like an asshole.
| a_t48 wrote:
| The second article is not saying that Alpine is right, only
| that it behaves differently. "Reasonable" stack size is
| pretty subjective - many workflows won't care about stack
| allocating megabytes at a time and could save RAM from
| having a smaller stack size. The article is pretty
| informative with workarounds for this situation. There's no
| need to attack the author about this, especially from a
| throwaway.
| marcodiego wrote:
| I remember that when io_uring was in its early stages several
| pointed to kqueue (also nt'wait_for_multiple_something or iocp
| and solaris event ports). By coming later, I think io_uring was
| able to better fit modern reality and also avoid problems from
| previous implementations.
|
| Hope security issues with it are solved and usage becomes mostly
| transparent through userspace libs, it looks like a high
| performance strategy for our current computing hardware.
| petesergeant wrote:
| > I think io_uring was able to better fit modern reality
|
| Genuinely would be fascinated to understand what that means
| nesarkvechnep wrote:
| I'm wondering what has fundamentally changed in computing
| between the creation of kqueue and io_uring so the latter is
| able to better fit modern reality.
| vacuity wrote:
| I don't know that there were relevant changes between the
| advent of each, so much as kqueue just didn't have existing
| features in mind. I assume GP is referring to the ring
| buffer design and/or completion-based processing, as part
| of the ability to batch syscall processing. This is
| reminiscent of external work like FlexSC and can be viewed
| as mechanical sympathy.
| mananaysiempre wrote:
| I don't know which things here are actually relevant to the
| differences between the two, but of course there have been
| changes. Core counts are much higher. You can stripe some
| enterprise SSDs and get bandwidth within an order of
| magnitude or so of your RAM. Yet clocks aren't that much
| higher, and user-supervisor transitions are comparatively
| much more expensive. There's a reason Lemire's talk on
| simdjson is called "Data engineering at the speed of your
| disk".
| adrian_b wrote:
| I have not looked at the io_uring implementation to see if
| it really has improvements from this point of view, but
| something that has changed during the quarter of a century
| from the design of kqueue until now is that currently it
| has become much more important than before to minimize the
| number of context switches caused by system calls in the
| programs that desire to reach a high performance for
| input/output.
|
| The reason is that with faster CPU cores, more cores
| sharing the memory, bigger CPU core states and relatively
| slower memory in comparison with the CPU cores, the time
| wasted by a context switch has become relatively greater in
| comparison with the time used by a CPU core to do useful
| work.
|
| So hopefully, implementing some I/O task using io_uring
| should require fewer system calls than when using kqueue or
| epoll. According to the io_uring man page, this should be
| true.
| oasisaimlessly wrote:
| In addition to the reasons you listed, context switches
| have also been significantly slowed down by
| Meltdown/Spectre speculative execution vulnerability
| mitigations.
| muststopmyths wrote:
| I haven't looked into io_uring except superficially, but
| Windows implemented Registered I/O in Windows 8 circa 2011.
| This is basically the same programming paradigm used in
| io_uring, except it is sockets-only.
|
| Talk here [1] speaks to the modern reality of 14 years ago
| :-).
|
| Since kqueue seems very similar to IOCP in paradigm, I guess
| some of the overheads are similar and hence a ring-buffer-
| based I/O system would be more performant.
|
| It's worth noting that NVME storage also seems to use a
| similar I/O pattern as RIO, so I assume we're "closer to the
| hardware" in this way.
|
| 1. https://learn.microsoft.com/en-us/shows/build-
| build2011/sac-...
| JackSlateur wrote:
| io_uring is not limited to networking/sockets
|
| io_uring is not even limited to IO
| muststopmyths wrote:
| I was talking about RIO
| JackSlateur wrote:
| And I was talking about the modern reality enabled by
| io_uring. Not really 14yo stuff, to be honest.
| p_ing wrote:
| Microsoft implemented I/O rings in an update to Windows 10
| with some differences and it is largely a copy-and-paste of
| io_uring.
|
| It's important to note that the NT kernel was built to
| leverage async I/O throughout. It was part of the original
| design documents and not an after-thought.
|
| https://learn.microsoft.com/en-
| us/windows/win32/api/ioringap...
|
| https://windows-internals.com/i-o-rings-when-one-i-o-
| operati...
|
| https://windows-internals.com/ioring-vs-io_uring-a-
| compariso...
|
| https://www.cs.fsu.edu/~zwang/files/cop4610/Fall2016/window
| s...
| muststopmyths wrote:
| Winsock Registered I/O or RIO predates io_uring and you
| could make the argument that io_uring "is largely a copy-
| paste of RIO" if you wanted to be childish.
|
| The truth is that they both use an obvious pattern of
| high-performance data transfer that's been around a long
| time. As I said, NVME devices have been doing that for a
| while and it's a common paradigm in any DMA-based
| transfer.
|
| io_uring seems more expansive and hence useful compared
| to the limited scope of RIO or even the NtIoRing stuff.
| p_ing wrote:
| > io_uring "is largely a copy-paste of RIO" if you wanted
| to be childish.
|
| This was unnecessary. I'm not denigrating I/O Rings or
| io_uring. The NT kernel is more advanced than the
| Linux/BSD/macOS kernels in certain ways. There should be
| a back-and-forth copying of the good
| ideas/implementations.
| Asmod4n wrote:
| syscalls which go from user to kernelspace got more expensive
| after the mitigations for vulnerabilities in intel and amd
| CPUs. io_uring solved that.
| Someone wrote:
| > io_uring was able to better fit modern reality and also avoid
| problems from previous implementations.
|
| Learning from the past can indeed lead to better designs.
|
| > Hope security issues with it are solved
|
| So, if I understand that correctly, it also introduced new
| problems? If so, do we know those new issues are solvable
| without inventing yet another API?
|
| If not, is it really better or just having a different set of
| issues?
| zamalek wrote:
| Here's a recent vulnerability:
| https://blog.exodusintel.com/2024/03/27/mind-the-patch-
| gap-e...
|
| It's just your typical multithreading woes (in an unsafe
| language). One presumes that these problems will be ironed
| out eventually. Unfortunately it seems as though various
| hardened distros turn it off for this reason (source was a HN
| comment I read a while back).
| Validark wrote:
| A good start to an article, but unfortunately it feels like it
| was cut off before the asserted argument was demonstrated to be
| true.
| jcalvinowens wrote:
| I'll save you some time if you haven't read this yet:
|
| > Hopefully, as you can see, epoll can automatically monitor any
| kind of kernel resource without having to be modified, due to its
| composable design, which makes it superior to kqueue from the
| perspective of having less technical debt.
|
| That's literally the entire argument.
|
| That's not what technical debt means. The author doesn't
| reference a single line of source code: this is vapid clickbait.
| steeleduncan wrote:
| > That's not what technical debt means
|
| Undoubtedly, I'm not sure why technical debt is in the title
|
| > this is vapid clickbait
|
| This seems unfair. It is a well written discussion of the
| difference between kqueue on BSDs and epoll on linux, and a
| historical overview of predecessor APIs. It just has nothing to
| do with technical debt, which, given the title, is admittedly
| odd
| codr7 wrote:
| I'm curious: What would be the difference between suboptimal
| decisions made earlier in the history of a code base and
| technical debt?
| StayTrue wrote:
| Technical debt has become a meaningless term.
| mwkaufma wrote:
| Making a mountain out of a molehill.
| nobodyandproud wrote:
| Is this correct? How I interpreted the author:
|
| - kqueue : Designed to work off of concrete event filters. New
| event types mean kqueue must be reengineered to support it (at the
| very least, a new event filter type).
|
| Versus
|
| - epoll : Designed to work off of kernel handles. The actual
| event type isn't the concern of epoll and therefore epoll itself
| remains unaffected if a new event type were introduced.
|
| I'm guessing composability is mentioned because of this
| decoupling? I'd think a better explanation would be single-
| responsibility, but I'm likely not understanding the author
| correctly here.
| adrian_b wrote:
| I have not looked at kqueue, but there is no reason to believe
| that it would be affected by the introduction of a new event
| filter more than epoll is affected by the introduction of a new
| system call that creates a new kind of pseudo-file handles that
| can be passed to epoll.
|
| The event filters must have a standard interface, so adding one
| more kind should not matter for kqueue more than the existence
| of a new kind of "file" handle matters for epoll.
|
| The epoll API makes opaque for the user the differences between
| the kinds of kernel handles, but at some point inside the
| kernel the handles must be disambiguated and appropriate
| actions for each kind of handle must be performed, exactly like
| in kqueue various kinds of event filters must be invoked.
| nobodyandproud wrote:
| Right. I've never worked in the kernel space, but I wanted to
| understand the author.
|
| As you mentioned, event handlers still only care for certain
| messages and ignore others, so the disambiguation must
| happen.
|
| Rereading it; what's also not clear to me is whether epoll is
| a single queue, or one queue for each subsystem and handle
| type.
|
| From a handler's perspective, it seems like an implementation
| detail?
|
| Again, I don't know about this stuff; just hoping someone
| knowledgeable can help clear it up.
| markjdb wrote:
| I don't think the article does a good job of arguing its
| premise, which I think is that kqueue is a less general
| interface than epoll.
|
| When adding a new descriptor type, one can define semantics for
| existing filters (e.g., EVFILT_READ) as one pleases.
|
| To give an example, FreeBSD has EVFILT_PROCDESC to watch for
| events on process descriptors, which are basically analogous to
| pidfds. Right now, using that filter kevent() can tell you that
| the process referenced by a procdesc has exited. That could
| have been defined using the EVFILT_READ filter instead of or in
| addition to adding EVFILT_PROCDESC. There was no specific need
| to introduce EVFILT_PROCDESC, except that the implementor
| presumably wanted to leave space to add additional event types,
| and it seemed cleaner to introduce a new EVFILT_PROCDESC
| filter. Process descriptors don't implement EVFILT_READ today,
| but there's no reason they can't.
|
| So if one wants to define a new event type using kevent(), one
| has the option of adding a new definition (new filter, new note
| type for an existing filter, etc.), or adding a new type of
| file descriptor which implements EVFILT_READ and other
| "standard" filters. kqueue doesn't really constrain you either
| way.
|
| In FreeBSD, most of the filter types correspond to non-fd-based
| events. But nothing stops one from adding new fd types for
| similar purposes. For instance, we have both EVFILT_TIMER (a
| non-fd event filter) and timerfd (which implements EVFILT_READ
| and in particular didn't need a new filter). Both are roughly
| equivalent; the latter behaves more like a regular file
| descriptor from kqueue's perspective, which might be better,
| but it'd be nice to see an example illustrating how.
|
| One could argue that the simultaneous existence of timerfds and
| EVFILT_TIMER is technical debt, but that's not really kqueue's
| fault. EVFILT_TIMER has been around forever, and timerfd was
| added to improve Linux compatibility.
|
| So, I think the article is misguided. In particular, the claim
| that "any time you want kqueue to do something new, you have to
| add a new type of event filter" is just wrong. I'm not arguing
| that there isn't technical debt here, but it's not really
| because of kqueue's design.
| nobodyandproud wrote:
| Thanks.
|
| Then it seems like there are more similarities than
| differences here: Both solve the same problem of "select" by
| being the central (kernel-level) events queue; though with
| different APIs.
|
| The other bit that caught my eye was the author saying epoll
| can do nearly everything kqueue can do.
|
| What is that slight bit that epoll can't do?
| oasisaimlessly wrote:
| AFAIK, just aio[1] (async file IO).
|
| [1]: https://man.freebsd.org/cgi/man.cgi?query=aio&sektion=
| 4&apro...
| markjdb wrote:
| I'm not sure. Maybe it's "wait for events that aren't tied
| to an fd."
|
| For instance, FreeBSD (and I think other BSDs) also has
| EVFILT_PROC, which lets you monitor a PID (not an fd) for
| events. One such event is NOTE_FORK, i.e., the monitored
| process just forked. Can you wait for such events with
| epoll? I'm not sure.
|
| More generally, suppose you wanted to automatically start
| watching all descendants of the process for events as well.
| If I was required to use a separate fd to monitor that
| child process, then upon getting the fork event I'd have to
| somehow obtain an fd for that child process and then tell
| epoll about it, and in that window I may have missed the
| child process forking off a grandchild.
|
| I'm not sure how to solve this kind of problem in the epoll
| world. I guess you could introduce a new fd type which
| represents a process and all of its descendants, and define
| some semantics for how epoll reports events on that fd
| type. In FreeBSD we can just have a dedicated EVFILT_PROC
| filter, no need for a new fd type. I'm not really sure
| whether that's better or worse.
| WesolyKubeczek wrote:
| I went to read the article hoping that it would shed light on
| some nitty-gritty internals of the Free(or any other)BSD kernel
| and tell a story.
|
| The story would be, of course, about how the brave and bold
| developers undertook a journey to add something to the kqueue
| mechanism, or fix a bug in it, or increase its performance, or
| improve its scalability, and how on their way there they hit the
| legendary, impenetrable Mountain of Technical Debt. How they
| tried to force their way over the mountain first, but how the
| dangers and the cold drove them to try to go in depth, via
| disused tunnels made by some hard-working fearless people many
| moons ago. In there, our brave team discovers long-decayed badly
| burned bodies, rusted tools, and fragmentary diaries made by some
| of those people.
|
| They learn that the hard-working tunnelling people undertook a
| quest not unlike their own; they also felt that the Mountain of
| Technical Debt could actually present an opportunity to enrich
| themselves, and while mining it for Bugfixes and Improvements
| here and there, they accidentally awoke an Ancient Evil of
| Regression-Rich Behavior, the Mother of All Heisenbugs.
|
| The hard-working tunnelling people were no strangers to the
| battle, and they faced the enemy, but it overpowered them with
| Crippling Regressions and Massive Burnout. Only a few survivors
| lived to tell the tale, and while what they told was confusing,
| the clear message was to avoid the Mountain and never approach
| the tunnels again. At least this was what everyone remembered,
| but almost everyone forgot what evil fought them and drove them
| out.
|
| Our heroes realize that their quest has suddenly become much more
| dangerous than they thought, but they don't want to stop halfway,
| and they can be hard-working and hard-fighting too, and their
| tools are far superior to those of their predecessors.
| They also think that they are much more resistant to burnout. Of
| course they go on, and eventually they meet the creature. Each
| attempt at improving kqueue yields such obscure regressions and
| bugs elsewhere in the system, that half of the team succumbs to
| Burnout unleashed on them despite youth, vigor, tools, and
| strength. Two brave souls, armed with the best debuggers,
| fuzzers, and in-depth code analyzers, manage to escape the Mother
| of Heisenbugs, wounding it, and damaging some of the Mountain
| itself. They describe their findings in a much more coherent way
| than their surviving predecessors have managed to, and conclude
| that the Mountain and the creature cannot be defeated on their
| home ground, not until a viable replacement arises which will be
| compatible with any and all the quirks the software grew to rely
| on. This replacement would starve the creature and the Mountain
| would erode.
|
| The story of the kqueue Replacement Building would be a separate
| one, in a different genre, with our brave heroes acting as wise
| advisors to the new cast of protagonists, detailing challenges
| they all run into, and how their combined wisdom, intelligence,
| and superb teamwork and camaraderie help them overcome hardships
| -- sometimes barely, but still.
|
| Needless to say, the article failed me greatly. And it didn't
| even make an attempt to describe where the debt was!
| ndesaulniers wrote:
| What was the AI prompt that generated that? LotR but
| programming?
| cardiffspaceman wrote:
| On the other hand, I enjoyed reading it.
| WesolyKubeczek wrote:
| Purely organic, no AI used whatsoever.
|
| (Lord of io_urings, oh my)
| sc68cal wrote:
| kqueue was, if not the first, then one of the first attempts to solve
| the issue of being more performant than select(). So this whole
| post boils down to "they didn't get it right the first time" to
| which I say, okay sure, but we only know that now after kqueue
| proved that event based queues were better than select.
|
| This post is ahistorical and anti-social.
| ryao wrote:
| I don't see how adding a new syscall to monitor something new via
| epoll is more desirable than adding a new BSD event filter.
| Either way, you are modifying the syscall interface. The initial
| epoll_create() had to be revised with epoll_create1(). It is even
| cited on slide 53 as being an example of a past design failure:
|
| https://www.man7.org/conf/lpc2014/designing_linux_kernel_API...
|
| Interestingly, the BSD kqueue made the same mistake and had to be
| revised for the same reason, which is why kqueue1() was made.
| However, by the points on the technical checklist there for
| syscall design, kqueue is an excellent design, minus the one
| mistake of not supporting O_CLOEXEC at the beginning.
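|
| For illustration, a minimal Python sketch of what the O_CLOEXEC
| fix buys you. It assumes CPython 3.4 or later, where select.epoll()
| and select.kqueue() are documented to return non-inheritable
| descriptors; queue_is_cloexec is a made-up helper name.

```python
import fcntl
import select


def queue_is_cloexec():
    """Create an event queue and report whether FD_CLOEXEC is set on it."""
    # Use epoll on Linux and kqueue on the BSDs/macOS, whichever exists.
    factory = getattr(select, "epoll", None) or getattr(select, "kqueue")
    queue = factory()
    try:
        flags = fcntl.fcntl(queue.fileno(), fcntl.F_GETFD)
        return bool(flags & fcntl.FD_CLOEXEC)
    finally:
        queue.close()


if __name__ == "__main__":
    print("close-on-exec set:", queue_is_cloexec())
```

| With the original epoll_create() or kqueue(), the flag had to be
| set by a separate fcntl() call, which leaves a window in which
| another thread can fork and exec and leak the descriptor.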
| ori_b wrote:
| Counterpoint -- epoll is fundamentally broken:
| https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-...
|
| You can end up with events on file descriptors that you don't
| even have open -- they leak across process boundaries. And that
| means that if the file descriptor gets reused, you can end up
| with events on the wrong file descriptor.
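|
| A rough, Linux-only Python sketch of the single-process version of
| this. It assumes select.epoll is available;
| stale_event_after_close is a made-up name. The registration
| follows the open file description, so it survives close() for as
| long as a duplicate descriptor keeps the description alive.

```python
import os
import select


def stale_event_after_close():
    """Return (closed_fd, events) reported by epoll after close()."""
    read_fd, write_fd = os.pipe()
    dup_fd = os.dup(read_fd)  # second descriptor, same open file description
    ep = select.epoll()
    try:
        ep.register(read_fd, select.EPOLLIN)
        os.close(read_fd)  # does NOT drop the registration: dup_fd remains
        closed_fd, read_fd = read_fd, -1
        os.write(write_fd, b"x")  # make the pipe readable
        return closed_fd, ep.poll(1.0)
    finally:
        ep.close()
        for fd in (read_fd, write_fd, dup_fd):
            if fd != -1:
                os.close(fd)


if __name__ == "__main__":
    fd, events = stale_event_after_close()
    print("closed fd", fd, "still reported:", events)
```

| If that descriptor number has been reused for something else by
| the time the event arrives, the event is attributed to the wrong
| file.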
| zzo38computer wrote:
| From looking at the man page, it looks like epoll does not
| return the file descriptor of the event; it returns a union
| containing user-defined data, although one of the union's
| fields is called "fd" because it is presumably intended to be
| used as the file descriptor.
|
| However, this is still subject to the problems you mention, as
| well as that you presumably can no longer control events for a
| file descriptor if you no longer have it, so it still seems
| a problem.
|
| Putting the file descriptor in the "struct epoll_event" instead
| of "epoll_data_t" would have avoided the problem of events on
| the wrong file descriptor, but that still might not be good
| enough. (It could specify -1 as the file descriptor for events
| of file descriptors that you do not have access to.)
|
| Some of this is just the problem with POSIX in general. I did
| consider such problems (of file descriptors and of event
| handling) in my own ideas of operating system design, which
| uses capabilities. I would want to avoid the mess that other
| systems are making, too. A capability in this case is
| fundamentally a different data type than numbers (and IPC
| treats them differently), although I am not sure yet how to
| handle this in a suitable way within a process (tagged memory
| might do, although not all computers use tagged memory). (Note
| that most I/O is done using IPC and not by system calls; the
| number of system calls is limited and is usually only used for
| managing the capabilities and event handling. This would also
| improve security as well as other things.)
| kev009 wrote:
| This is pretty facepalm. Has the author actually used any of
| these extended epoll APIs? inotify - deprecated. aio - complete
| footgun in linux. Whatever point is trying to be made here is
| lost in reality. I fail to see how having a bunch of additional
| function calls is superior to event filters, and there is no
| coherent argument laid out. How is kqueue not composable if it
| too has added event types (i.e. Mach ports they mention)? You're
| going to have to modify the kernel in either case.
| netbsdusers wrote:
| Five points in reply:
|
| 1. Signalfd is a mountain of technical debt. It's not like a file
| at all. It reads entirely different things depending on which
| process is reading from a single common open file description,
| and it interacts with epoll in a most bizarre way, in that the
| process that _added_ the signalfd to the poll set is the one
| whose signals will notify readiness - even if that process then
| closed the epoll and signalfd.
|
| 2. Signalfd is a mountain of technical debt because they built it
| over signals. Signals are of two forms: the one is asynchronous,
| the other synchronous. The one should be replaced with a generic
| message queue in a technical debt free design. The other is
| debatable. (Mach and also Fuchsia have an "Exception Port"
| instead, a message port someone can read from to deal with
| exceptions incurred by some thread.)
|
| 3. Regarding: > epoll can automatically monitor any kind of
| kernel resource without having to be modified, due to its
| composable design
|
| Well, so can kqueue - EVFILT_READ/EVFILT_WRITE will work just as
| well on any file, be it a socket, an eventfd, or even another
| kqueue (though I don't think anyone who isn't dealing with
| technical debt should be adding kqueues to kqueues, nor epolls to
| epolls.) No need to modify kqueue! But in either case, though,
| you've got to modify something. Whether that's trying to fit
| signals into a file and getting signalfd, or whether it's adding
| a signal event filter and getting EVFILT_SIGNAL, something has to
| be done.
|
| 4. FreeBSD's process descriptors predate Linux's pidfds. They are
| also better and less technically indebted because they are pure
| capabilities (they were invented as part of the Capsicum object-
| capability model based security architecture for FreeBSD) while
| pidfds are not: what you can do with a pidfd is decided based
| on the ambient authority of the process trying to, say, signal a
| pidfd, and not on rights intrinsic to the fact of having a pidfd.
| In fact, these pidfds are not even like traditional Unix open
| file descriptions, whose rights are based on the credential of
| who opened them. This makes privilege-separation architectures
| impossible with pidfds, but I digress.
|
| 5. The author ignored the actual points people argued against
| epoll with, viz. that 1) epoll's edge triggering behaviour is
| useless, and 2) that epoll's conflation of file descriptor with
| open file description is a terribly leaky abstraction:
|
| https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-...
| https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-...
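|
| To make point 3 concrete, a small Python sketch under stated
| assumptions: selectors.DefaultSelector picks kqueue on the BSDs and
| epoll on Linux, and readable_labels is a made-up name. The same
| generic read filter covers a pipe and a socket on either backend,
| with no per-type support in the queue itself.

```python
import os
import selectors
import socket


def readable_labels():
    """Watch a pipe and a socket with one read filter; return what fired."""
    sel = selectors.DefaultSelector()  # kqueue on BSD/macOS, epoll on Linux
    pipe_r, pipe_w = os.pipe()
    sock_a, sock_b = socket.socketpair()
    try:
        sel.register(pipe_r, selectors.EVENT_READ, data="pipe")
        sel.register(sock_a, selectors.EVENT_READ, data="socket")
        os.write(pipe_w, b"x")  # make the pipe readable
        sock_b.send(b"y")       # make the socket readable
        ready = sel.select(timeout=1.0)
        return sorted(key.data for key, _mask in ready)
    finally:
        sel.close()
        os.close(pipe_r)
        os.close(pipe_w)
        sock_a.close()
        sock_b.close()


if __name__ == "__main__":
    print(readable_labels())
```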
___________________________________________________________________
(page generated 2024-12-29 23:00 UTC)