[HN Gopher] BSD kqueue is a mountain of technical debt (2021)
       ___________________________________________________________________
        
       BSD kqueue is a mountain of technical debt (2021)
        
       Author : thunderbong
       Score  : 109 points
       Date   : 2024-12-29 15:20 UTC (7 hours ago)
        
 (HTM) web link (ariadne.space)
 (TXT) w3m dump (ariadne.space)
        
       | caboteria wrote:
       | (2021)
        
       | Joker_vD wrote:
       | So the argument is that extending kqueue's interface to handle
        | more and more events is worse than turning more and more events
       | into xxxfd subsystems. Why is that worse, again? Like, not in
       | "it's not a composable design" abstract sense but in concrete
       | "you can't do this and that, which are useful things" sense?
        
         | jsnell wrote:
         | Yeah, the article really doesn't make a case as much as assert
         | it.
         | 
         | One way to think about this is whether all these non-file fds
         | are useful for a variety of generic operations, or whether
         | they're only used for epoll and one-off system calls or ioctls
         | specific to that type of fd. If it's the latter, it seems hard
         | to believe that there actually is some kind of composability
         | advantage.
         | 
         | So, what can you do with them?
         | 
         | 1. You can use these fds with poll()/select(), not just epoll.
         | That's not a big deal, since nobody really should be using
         | either of them. But it does also mean that you should ideally
         | be able to use them with any future event notification syscalls
         | as well. And the invention of new mechanisms is not
         | hypothetical, since Linux added io_uring. I'd be curious to
         | know how well io_uring supports arbitrary fd types, and whether
         | they got that support "for free" or if io_uring needs to be
         | extended for every new fd type.
         | 
         | 2. You can typically read() structured data from those fds
         | describing the specific event. With kqueue all that data would
         | need to be passed through struct kevent, which needs to be
         | fixed size and generic enough to be shared between all the
         | different event types.
         | 
         | 3. You can pass individual fds between processes, either via
         | UDS or fork(). I expect you would not be able to do that for
         | individual kqueue filters.
         | 
         | 4. You can close() the fd to release the underlying kernel
         | resource. This doesn't seem interesting, just listing it for
         | completeness.
         | 
         | So there's probably enough smoke there that one could at least
         | argue the case. It's too bad the author didn't.
        
           | stevefolta wrote:
           | > ...nobody really should be using either of them.
           | 
           | What's wrong with poll(), at least for smaller values of
           | nfds? And what should one use instead when writing POSIX-
           | compatible portable code, where epoll() and kqueue() don't
           | exist?
        
             | somat wrote:
              | Fun fact: on OpenBSD (at least; I don't know about
              | FreeBSD), select and poll were reworked to use kqueue
              | internally.
              | 
              | The thread is an interesting read, as it sounds like the
              | naive approach has negative performance implications.
              | Knowing OpenBSD, I suspect the main motivation was to
              | reduce maintenance burden: that is, to make it easier to
              | reason about the internals of event-driven mechanisms by
              | having only one such system to worry about.
             | 
             | https://marc.info/?l=openbsd-tech&m=162444132017561&w=2
        
         | ajross wrote:
         | Do you have a third party driver that exposes a file descriptor
         | for some oddball purpose? epoll can wait on that.
         | 
         | It's absolutely true that _within the bounds_ of what the OS
          | supports, both interfaces are complete. But that's sort of a
          | specious point. One requires a giant tree of extra types to be
          | amended every time something changes, while the other exploits
          | a nearly-five-decade-old abstraction that already encapsulates
         | the idea of "something on which you might want to wait".
        
           | markjdb wrote:
           | The same is true of kqueue/kevent though... the driver just
           | needs to decide which filters it wants to implement. There's
           | no need to extend kqueue when adding some custom driver or fd
           | type. One just needs to define some semantics for the
           | existing filters.
        
             | ajross wrote:
             | > the driver just needs to decide
             | 
             | That's pretty much the definition of technical debt though.
             | "This interface works fine for you if you take steps to
             | handle it specially according to its requirements". It
             | makes kqueue into a point of friction for everything in the
             | system wanting to provide a file descriptor-like API.
        
               | netbsdusers wrote:
               | No it isn't. Letting files be poll/select/epoll'd isn't
               | free either. They don't get support for that by magic. A
                | poll operation has to be coded, and this is just as
                | much a "point of friction" as supporting kqueue. (It
               | bears mentioning as well that on at least DragonFly BSD
               | and OpenBSD, they have reimplemented poll()/select() to
               | use kqueue's knotes mechanism, and so now you only have
               | to write a VOP_KQFILTER implementation and not a VOP_POLL
               | too.)
        
         | Someone wrote:
         | I would think it's the other way around. If you have separate
         | subsystems for events of types A, B, C, etc. there's no easy
         | way you can have a process wait for "an event of type A or C,
         | whichever comes first".
         | 
          | There also is the thing that, sometimes, different types of
          | event are identical lower down in the stack. A simple
          | example is keyboard and mouse events generated from USB
          | devices.
         | 
         | Why would the low-level USB stack have to do a switch on device
         | type to figure out what subsystem to post an event to?
        
         | IshKebab wrote:
         | Because "wait for an event" is a good abstraction for many
         | event types. The epoll interface is a generic wait interface
         | whereas the kqueue one is a hardcoded list of things you can
         | wait for.
         | 
          | Think about async programming: you have a single `await` keyword
         | that can await lots of different things.
        
           | toast0 wrote:
           | How many event types do you need though? And would you
           | actually need to do less work to add support for a new one to
           | a user program? It would be different work, for sure, but I
           | don't think it would be meaningfully less work.
        
       | baggy_trough wrote:
       | I must have missed where the mountain of technical debt was. The
       | article just says epoll has a more composable design.
        
         | vacuity wrote:
         | To be fair (?), the title seems like classic clickbait. The
         | thesis seems to just be
         | 
         | > While I agree that epoll doesn't come close, I think that's
          | actually a feature that has led to a much more flexible and
         | composable design.
         | 
         | You could read it as "kqueue is more prone to tech debt than
         | epoll if events keep getting added".
        
       | klabb3 wrote:
       | While I don't doubt that the author has good reason for their
       | opinion, the reason is handwaved here, no? Where is the tech
       | debt? In the kernel only? Does it affect user space? For _who_ is
       | it more or less composable?
       | 
       | I personally think "async IO" (a term I don't love, but anyway),
       | should be agreed upon and settled across OSs. Not necessarily
       | exact syscall equivalence, but the major design choices needs to
       | be the same. For instance, who owns the buffers (kernel or user),
       | and should it be completion or ready-based event systems?
       | 
       | The reason I think this is so important to agree upon, is that
       | language runtimes (and other custom runtimes like libuv) need to
       | be simplified in order to allow for healthy growth. Any language
       | that doesn't have cross-platform support for "async IO" is
        | completely nerfed, turning what would otherwise be "graduate
        | level" projects into ones requiring deep OS/kernel knowledge
        | that is hard to find outside FAANG and a few other spaces.
        
         | BobbyTables2 wrote:
         | I fully agree.
         | 
         | Synchronous I/O looks appealing for beginners and simple
         | projects. It seems like the simplest path for things where I/O
         | isn't the primary focus.
         | 
         | Then one suddenly finds themselves juggling ephemeral threads
         | to "background" tasks or such. The application quickly becomes
         | a multi-threaded monster and still suffers responsiveness
         | issues from synchronous use.
         | 
         | The Tower of Babel created from third party runtimes further
         | adds to the mess even for those who try to do better.
        
           | worik wrote:
           | > Synchronous I/O looks appealing for beginners and simple
           | projects.
           | 
           | Yes. And.
           | 
           | Synchronous I/O is suitable for almost all projects
           | 
              | When it is unsuitable, it is catastrophically unsuitable.
        
             | astrange wrote:
             | Synchronous I/O is actually better than async for many
             | programs, partly because almost all async systems were
             | designed without thinking about priority inversions and
             | can't resolve them properly. If you have futures, you have
             | this problem.
        
           | liontwist wrote:
           | The entire unix operating system is designed to abstract away
           | async problems and turn them sync so you can write sane code
           | (if, else, for).
           | 
           | In my experience simple blocking code is incredibly
           | misunderstood. Most programmers stop their thinking at
           | "blocking is slow and non blocking is fast!"
        
             | mschuster91 wrote:
             | > The entire unix operating system is designed to abstract
             | away async problems and turn them sync so you can write
             | sane code (if, else, for).
             | 
             | That only works for workloads that can be independently
              | parallelized. For stuff that cannot be - which is the
              | majority of consumer workloads - you're in for a world of
              | pain with locking and assorted other issues.
        
               | liontwist wrote:
                | No. It's almost all the time that your computer has more
                | than one thing to do:
                | 
                | - your browser uses a process per tab
                | - your web server responds to multiple requests in
                | parallel
                | - your web app is going to launch other programs
                | - your desktop ui is going to have a compositing server
                | separate from each application
                | - your compiler is going to work on multiple source
                | files
        
         | jandrewrogers wrote:
          | The differences in design reflect differences in use case
          | assumptions, which in turn reflect that different OSes tend
          | to be used for different workloads.
         | 
         | Broadly speaking it is difficult to use async I/O well without
         | understanding the implementation details. It is a sharp tool
         | for experts, not magic "go faster" juice.
        
         | bluetomcat wrote:
         | The claim is silly and unfounded. If anything, kqueue is the
         | interface that pollutes userspace less. You have a single
         | kevent call that is used both for waiting and registering
         | events represented by (filter, ident) tuples. All of the data
         | related to the event is contained in a struct that's passed
         | between the kernel and the user. For non-fd events
         | (EVFILT_PROC, EVFILT_TIMER, EVFILT_SIGNAL), it is much more
         | straightforward to use compared to the Linux way where you need
         | to keep track of one more resource, read specific binary data
         | from it, have specific flags, etc.
        
           | somat wrote:
            | The article reads more like PR damage control than an actual
            | complaint. epoll has a reputation as a fragile interface
            | that is hard to use correctly, and kqueue as relatively
            | simple and sane. So the question always is: why didn't Linux
            | just adopt the already existing kqueue instead?
            | 
            | Sometimes it seems that Linux does end up with more than its
            | fair share of technical excellence coupled to a bad
            | interface. iptables, epoll and git come to mind.
        
         | Aurornis wrote:
         | > While I don't doubt that the author has good reason for their
         | opinion, the reason is handwaved here, no?
         | 
         | I came to the comments hoping to get more explanation. I was
         | waiting for the full explanation of the technical debt but then
         | the article just came to an abrupt end.
         | 
         | I wonder if the article started as a short history and
         | explanation of differences, but then the dramatic headline was
         | chosen to drive more clicks? The article could have been
         | interesting by itself without setting up the "mountain of
         | technical debt" claim in the headline.
        
           | mst wrote:
           | I know ariadne well enough that I'm pretty confident that the
           | headline will have been chosen out of annoyance rather than
           | click maximisation.
           | 
           | That doesn't mean I agree with the conclusion (I am
           | ambivalent and would have to think rather more before having
           | an opinion myself) but I'm reasonably sure of the motivation
           | nonetheless.
        
         | a-dub wrote:
         | a good computer system would allow the application programmer
         | to write their code in a synchronous and single threaded manner
         | and all performance optimizations with respect to
         | multiprocessing (and associated resource sharing) would be
         | handled automatically.
        
           | cempaka wrote:
           | I'm not sure providing that level of complex logic and
           | scheduling is really a proper task for OS kernels. It's
           | essentially what you get from green threads / coroutines
           | which are provided by numerous runtimes and frameworks, but
           | those have to be translated to the various different async IO
           | facilities in order to be portable between OSes.
        
           | liontwist wrote:
           | This is almost what we have in UNIX. The resource sharing and
           | multithreading happens automatically if you start shelling
           | out to programs to do work, or have a simple multi process
            | architecture like NGINX, Postgres, or a web app.
        
         | badgersnake wrote:
         | If you come from the starting point that Linux is correct and
         | everything different is wrong then you get blog posts like
         | this.
        
           | throw16180339 wrote:
           | The author thinks that musl, Alpine, and Linux are right and
           | everything else is wrong.
           | 
           | * https://ariadne.space/2022/03/27/the-tragedy-of-
           | gethostbynam... - musl's DNS resolver doesn't support TCP
           | lookups, but it's your fault for expecting the standard DNS
           | lookup functions to work. You should change your program to
           | use a third-party library for DNS lookups. This post is a
           | couple years old; musl now supports TCP lookups.
           | 
           | * https://ariadne.space/2021/06/25/understanding-thread-
           | stack-... - Alpine has a stack size that's substantially
           | smaller (128k) than other widely used OSes (512k+). Your
           | program may work everywhere else, but you're wrong to assume
           | a reasonable stack size. Here's how to jump through a hoop
           | and make it work on Alpine.
        
             | blueflow wrote:
             | Same person that both contributed the code of conduct to
             | Alpine and later got caught bragging on twitter about
             | having bullied people out of the project.
             | 
             | The work of someone who did not ask themselves the
              | necessary questions, but now it's some years later and
             | things have changed.
        
             | johnbellone wrote:
             | Sounds like an asshole.
        
             | a_t48 wrote:
             | The second article is not saying that Alpine is right, only
             | that it behaves differently. "Reasonable" stack size is
             | pretty subjective - many workflows won't care about stack
             | allocating megabytes at a time and could save RAM from
             | having a smaller stack size. The article is pretty
             | informative with workarounds for this situation. There's no
             | need to attack the author about this, especially from a
             | throwaway.
        
       | marcodiego wrote:
        | I remember that when io_uring was in its early stages, several
        | people pointed to kqueue (also NT's wait_for_multiple_something,
        | IOCP, and Solaris event ports). By coming later, I think
        | io_uring was able to better fit modern reality and also avoid
        | problems from previous implementations.
        | 
        | I hope its security issues are solved and usage becomes mostly
        | transparent through userspace libs; it looks like a
        | high-performance strategy for our current computing hardware.
        
         | petesergeant wrote:
         | > I think io_uring was able to better fit modern reality
         | 
         | Genuinely would be fascinated to understand what that means
        
           | nesarkvechnep wrote:
           | I'm wondering what has fundamentally changed in computing
           | between the creation of kqueue and io_uring so the latter is
           | able to better fit modern reality.
        
             | vacuity wrote:
             | I don't know that there were relevant changes between the
             | advent of each, so much as kqueue just didn't have existing
             | features in mind. I assume GP is referring to the ring
             | buffer design and/or completion-based processing, as part
             | of the ability to batch syscall processing. This is
             | reminiscent of external work like FlexSC and can be viewed
             | as mechanical sympathy.
        
             | mananaysiempre wrote:
             | I don't know which things here are actually relevant to the
             | differences between the two, but of course there have been
             | changes. Core counts are much higher. You can stripe some
             | enterprise SSDs and get bandwidth within an order of
             | magnitude or so of your RAM. Yet clocks aren't that much
             | higher, and user-supervisor transitions are comparatively
             | much more expensive. There's a reason Lemire's talk on
             | simdjson is called "Data engineering at the speed of your
             | disk".
        
             | adrian_b wrote:
             | I have not looked at the io_uring implementation to see if
             | it really has improvements from this point of view, but
             | something that has changed during the quarter of century
             | from the design of kqueue until now is that currently it
             | has become much more important than before to minimize the
             | number of context switches caused by system calls in the
             | programs that desire to reach a high performance for
             | input/output.
             | 
             | The reason is that with faster CPU cores, more cores
             | sharing the memory, bigger CPU core states and relatively
             | slower memory in comparison with the CPU cores, the time
             | wasted by a context switch has become relatively greater in
             | comparison with the time used by a CPU core to do useful
             | work.
             | 
             | So hopefully, implementing some I/O task using io_uring
              | should require fewer system calls than when using kqueue or
             | epoll. According to the io_uring man page, this should be
             | true.
        
               | oasisaimlessly wrote:
               | In addition to the reasons you listed, context switches
               | have also been significantly slowed down by
               | Meltdown/Spectre speculative execution vulnerability
               | mitigations.
        
           | muststopmyths wrote:
           | I haven't looked into io_uring except superficially, but
           | Windows implemented Registered I/O in Windows 8 circa 2011.
           | this is the basically the same programming paradigm used in
           | io_uring, except it is sockets-only.
           | 
           | Talk here [1] speaks to the modern reality of 14 years ago
           | :-).
           | 
           | Since kqueue seems very similar to IOCP in paradigm, I guess
           | some of the overheads are similar and hence a ring-buffer-
           | based I/O system would be more performant.
           | 
            | It's worth noting that NVMe storage also seems to use a
           | similar I/O pattern as RIO, so I assume we're "closer to the
           | hardware" in this way.
           | 
           | 1. https://learn.microsoft.com/en-us/shows/build-
           | build2011/sac-...
        
             | JackSlateur wrote:
             | io_uring is not limited to networking/sockets
             | 
             | io_uring is not even limited to IO
        
               | muststopmyths wrote:
               | I was talking about RIO
        
               | JackSlateur wrote:
               | And I was talking about the modern reality enabled by
               | iouring. Not really 14yo stuff, to be honest.
        
             | p_ing wrote:
             | Microsoft implemented I/O rings in an update to Windows 10
             | with some differences and it is largely a copy-and-paste of
             | io_uring.
             | 
             | It's important to note that the NT kernel was built to
             | leverage async I/O throughout. It was part of the original
             | design documents and not an after-thought.
             | 
             | https://learn.microsoft.com/en-
             | us/windows/win32/api/ioringap...
             | 
             | https://windows-internals.com/i-o-rings-when-one-i-o-
             | operati...
             | 
             | https://windows-internals.com/ioring-vs-io_uring-a-
             | compariso...
             | 
             | https://www.cs.fsu.edu/~zwang/files/cop4610/Fall2016/window
             | s...
        
               | muststopmyths wrote:
               | Winsock Registered I/O or RIO predates io_uring and you
               | could make the argument that io_uring "is largely a copy-
               | paste of RIO" if you wanted to be childish.
               | 
                | The truth is that they both use an obvious pattern of
               | high-performance data transfer that's been around a long
                | time. As I said, NVMe devices have been doing that for a
               | while and it's a common paradigm in any DMA-based
               | transfer.
               | 
               | io_uring seems more expansive and hence useful compared
               | to the limited scope of RIO or even the NtIoRing stuff.
        
               | p_ing wrote:
               | > io_uring "is largely a copy-paste of RIO" if you wanted
               | to be childish.
               | 
               | This was unnecessary. I'm not denigrating I/O Rings or
               | io_uring. The NT kernel is more advanced than the
               | Linux/BSD/macOS kernels in certain ways. There should be
               | a back-and-forth copying of the good
               | ideas/implementations.
        
           | Asmod4n wrote:
            | Syscalls, which cross from user space to kernel space, got
            | more expensive after the mitigations for vulnerabilities in
            | Intel and AMD CPUs. io_uring solved that.
        
         | Someone wrote:
         | > io_uring was able to better fit modern reality and also avoid
         | problems from previous implementations.
         | 
         | Learning from the past can indeed lead to better designs.
         | 
         | > Hope security issues with it are solved
         | 
         | So, if I understand that correctly, it also introduced new
         | problems? If so, do we know those new issues are solvable
         | without inventing yet another API?
         | 
          | If not, is it really better, or does it just have a different
          | set of issues?
        
           | zamalek wrote:
           | Here's a recent vulnerability:
           | https://blog.exodusintel.com/2024/03/27/mind-the-patch-
           | gap-e...
           | 
           | It's just your typical multithreading woes (in an unsafe
           | language). One presumes that these problems will be ironed
           | out eventually. Unfortunately it seems as though various
            | hardened distros turn it off for this reason (source was an HN
           | comment I read a while back).
        
       | Validark wrote:
       | A good start to an article, but unfortunately it feels like it
       | was cut off before the asserted argument was demonstrated to be
       | true.
        
       | jcalvinowens wrote:
       | I'll save you some time if you haven't read this yet:
       | 
       | > Hopefully, as you can see, epoll can automatically monitor any
       | kind of kernel resource without having to be modified, due to its
       | composable design, which makes it superior to kqueue from the
       | perspective of having less technical debt.
       | 
       | That's literally the entire argument.
       | 
       | That's not what technical debt means. The author doesn't
       | reference a single line of source code: this is vapid clickbait.
        
         | steeleduncan wrote:
         | > That's not what technical debt means
         | 
         | Undoubtedly, I'm not sure why technical debt is in the title
         | 
         | > this is vapid clickbait
         | 
         | This seems unfair. It is a well written discussion of the
         | difference between kqueue on BSDs and epoll on linux, and a
         | historical overview of predecessor APIs. It just has nothing to
         | do with technical debt, which, given the title, is admittedly
         | odd
        
           | codr7 wrote:
           | I'm curious: What would be the difference between suboptimal
           | decisions made earlier in the history of a code base and
           | technical debt?
        
       | StayTrue wrote:
       | Technical debt has become a meaningless term.
        
       | mwkaufma wrote:
       | Making a mountain out of a molehill.
        
       | nobodyandproud wrote:
       | Is this correct? How I interpreted the author:
       | 
       | - kqueue : Designed to work off of concrete event filters. New
        | event types mean kqueue must be re-engineered to support them
        | (at the very least, with a new event filter type).
       | 
       | Versus
       | 
       | - epoll : Designed to work off of kernel handles. The actual
       | event type isn't the concern of epoll and therefore epoll itself
       | remains unaffected if a new event type were introduced.
       | 
        | I'm guessing composability is mentioned because of this
       | decoupling? I'd think a better explanation would be single-
       | responsibility, but I'm likely not understanding the author
       | correctly here.
        
         | adrian_b wrote:
         | I have not looked at kqueue, but there is no reason to believe
         | that it would be affected by the introduction of a new event
         | filter more than epoll is affected by the introduction of a new
         | system call that creates a new kind of pseudo-file handles that
         | can be passed to epoll.
         | 
         | The event filters must have a standard interface, so adding one
         | more kind should not matter for kqueue more than the existence
         | of a new kind of "file" handle matters for epoll.
         | 
          | The epoll API hides from the user the differences between
          | the kinds of kernel handles, but at some point inside the
         | kernel the handles must be disambiguated and appropriate
         | actions for each kind of handle must be performed, exactly like
         | in kqueue various kinds of event filters must be invoked.
        
           | nobodyandproud wrote:
           | Right. I've never worked in the kernel space, but I wanted to
           | understand the author.
           | 
           | As you mentioned, event handlers still only care for certain
           | messages and ignore others, so the disambiguation must
           | happen.
           | 
           | Rereading it; what's also not clear to me is whether epoll is
           | a single queue, or one queue for each subsystem and handle
           | type.
           | 
           | From a handler's perspective, it seems like an implementation
           | detail?
           | 
           | Again, I don't know about this stuff; just hoping someone
           | knowledgeable can help clear it up.
        
         | markjdb wrote:
         | I don't think the article does a good job of arguing its
         | premise, which I think is that kqueue is a less general
         | interface than epoll.
         | 
         | When adding a new descriptor type, one can define semantics for
         | existing filters (e.g., EVFILT_READ) as one pleases.
         | 
         | To give an example, FreeBSD has EVFILT_PROCDESC to watch for
         | events on process descriptors, which are basically analogous to
         | pidfds. Right now, using that filter kevent() can tell you that
         | the process referenced by a procdesc has exited. That could
         | have been defined using the EVFILT_READ filter instead of or in
         | addition to adding EVFILT_PROCDESC. There was no specific need
         | to introduce EVFILT_PROCDESC, except that the implementor
         | presumably wanted to leave space to add additional event types,
         | and it seemed cleaner to introduce a new EVFILT_PROCDESC
         | filter. Process descriptors don't implement EVFILT_READ today,
         | but there's no reason they can't.
         | 
         | So if one wants to define a new event type using kevent(), one
         | has the option of adding a new definition (new filter, new note
         | type for an existing filter, etc.), or adding a new type of
         | file descriptor which implements EVFILT_READ and other
         | "standard" filters. kqueue doesn't really constrain you either
         | way.
         | 
         | In FreeBSD, most of the filter types correspond to non-fd-based
         | events. But nothing stops one from adding new fd types for
         | similar purposes. For instance, we have both EVFILT_TIMER (a
         | non-fd event filter) and timerfd (which implements EVFILT_READ
         | and in particular didn't need a new filter). Both are roughly
         | equivalent; the latter behaves more like a regular file
         | descriptor from kqueue's perspective, which might be better,
         | but it'd be nice to see an example illustrating how.
         | 
         | One could argue that the simultaneous existence of timerfds and
         | EVFILT_TIMER is technical debt, but that's not really kqueue's
         | fault. EVFILT_TIMER has been around forever, and timerfd was
         | added to improve Linux compatibility.
         | 
         | So, I think the article is misguided. In particular, the claim
         | that "any time you want kqueue to do something new, you have to
         | add a new type of event filter" is just wrong. I'm not arguing
         | that there isn't technical debt here, but it's not really
         | because of kqueue's design.
        
           | nobodyandproud wrote:
           | Thanks.
           | 
           | Then it seems like there are more similarities than
           | differences here: Both solve the same problem of "select" by
           | being the central (kernel-level) events queue; though with
           | different APIs.
           | 
           | The other bit that caught my eye was the author saying epoll
           | can do nearly everything kqueue can do.
           | 
           | What is that slight bit that epoll can't do?
        
             | oasisaimlessly wrote:
             | AFAIK, just aio[1] (async file IO).
             | 
             | [1]: https://man.freebsd.org/cgi/man.cgi?query=aio&sektion=
             | 4&apro...
        
             | markjdb wrote:
             | I'm not sure. Maybe it's "wait for events that aren't tied
             | to an fd."
             | 
             | For instance, FreeBSD (and I think other BSDs) also have
             | EVFILT_PROC, which lets you monitor a PID (not an fd) for
             | events. One such event is NOTE_FORK, i.e., the monitored
             | process just forked. Can you wait for such events with
             | epoll? I'm not sure.
             | 
             | More generally, suppose you wanted to automatically start
             | watching all descendants of the process for events as well.
             | If I was required to use a separate fd to monitor that
             | child process, then upon getting the fork event I'd have to
             | somehow obtain an fd for that child process and then tell
             | epoll about it, and in that window I may have missed the
             | child process forking off a grandchild.
             | 
             | I'm not sure how to solve this kind of problem in the epoll
             | world. I guess you could introduce a new fd type which
             | represents a process and all of its descendants, and define
             | some semantics for how epoll reports events on that fd
             | type. In FreeBSD we can just have a dedicated EVFILT_PROC
             | filter, no need for a new fd type. I'm not really sure
             | whether that's better or worse.
        
       | WesolyKubeczek wrote:
       | I went to read the article hoping that it would shed light on
       | some nitty-gritty internals of the Free(or any other)BSD kernel
       | and tell a story.
       | 
       | The story would be, of course, about how the brave and bold
       | developers undertook a journey to add something to the kqueue
       | mechanism, or fix a bug in it, or increase its performance, or
       | improve its scalability, and how on their way there they hit the
       | legendary, impenetrable Mountain of Technical Debt. How they
       | tried to force their way over the mountain first, but how the
       | dangers and the cold drove them to try to go in depth, via
       | disused tunnels made by some hard-working fearless people many
       | moons ago. In there, our brave team discovers long-decayed badly
       | burned bodies, rusted tools, and fragmentary diaries made by some
       | of those people.
       | 
       | They learn that the hard-working tunnelling people undertook a
       | quest not unlike their own; they also felt that the Mountain of
       | Technical Debt could actually present an opportunity to enrich
       | themselves, and while mining it for Bugfixes and Improvements
       | here and there, they accidentally awoke an Ancient Evil of
       | Regression-Rich Behavior, the Mother of All Heisenbugs.
       | 
       | The hard-working tunnelling people were no strangers to the
       | battle, and they faced the enemy, but it overpowered them with
       | Crippling Regressions and Massive Burnout. Only a few survivors
       | lived to tell the tale, and while what they told was confusing,
       | the clear message was to avoid the Mountain and never approach
       | the tunnels again. At least this was what everyone remembered,
       | but almost everyone forgot what evil fought them and drove them
       | out.
       | 
       | Our heroes realize that their quest has suddenly become much more
       | dangerous than they thought, but they don't want to stop halfway,
       | and they can be hard-working and hard-fighting too, and their
       | tools are far superior to those of their predecessors.
       | They also think that they are much more resistant to burnout. Of
       | course they go on, and eventually they meet the creature. Each
       | attempt at improving kqueue yields such obscure regressions and
       | bugs elsewhere in the system, that half of the team succumbs to
       | Burnout unleashed on them despite youth, vigor, tools, and
       | strength. Two brave souls, armed with the best debuggers,
       | fuzzers, and in-depth code analyzers, manage to escape the Mother
       | of Heisenbugs, wounding it, and damaging some of the Mountain
       | itself. They describe their findings in a much more coherent way
       | than their surviving predecessors have managed to, and conclude
       | that the Mountain and the creature cannot be defeated on their
       | home ground, not until a viable replacement arises which will be
       | compatible with any and all the quirks the software grew to rely
       | on. This replacement would starve the creature and the Mountain
       | would erode.
       | 
       | The story of the kqueue Replacement Building would be a separate
       | one, in a different genre, with our brave heroes acting as wise
       | advisors to the new cast of protagonists, detailing challenges
       | they all run into, and how their combined wisdom, intelligence,
       | and superb teamwork and camaraderie help them overcome hardships
       | -- sometimes barely, but still.
       | 
       | Needless to say, the article failed me greatly. And it didn't
       | even make an attempt to describe where the debt was!
        
         | ndesaulniers wrote:
         | What was the AI prompt that generated that? LotR but
         | programming?
        
           | cardiffspaceman wrote:
           | On the other hand, I enjoyed reading it.
        
           | WesolyKubeczek wrote:
           | Purely organic, no AI used whatsoever.
           | 
           | (Lord of io_urings, oh my)
        
       | sc68cal wrote:
       | kqueue was, if not the first, then one of the first attempts
       | to be more performant than select(). So this whole post boils
       | down to "they didn't get it right the first time", to which I
       | say: okay, sure, but we only know that now, after kqueue
       | proved that event-based queues were better than select.
       | 
       | This post is ahistorical and anti-social.
        
       | ryao wrote:
       | I don't see how adding a new syscall to monitor something new via
       | epoll is more desirable than adding a new BSD event filter.
       | Either way, you are modifying the syscall interface. The initial
       | epoll_create() had to be revised with epoll_create1(). It is even
       | cited on slide 53 as being an example of a past design failure:
       | 
       | https://www.man7.org/conf/lpc2014/designing_linux_kernel_API...
       | 
       | Interestingly, the BSD kqueue made the same mistake and had to be
       | revised for the same reason, which is why kqueue1() was made.
       | However, by the points on the technical checklist there for
       | syscall design, kqueue is an excellent design, minus the one
       | mistake of not supporting O_CLOEXEC at the beginning.
        
       | ori_b wrote:
       | Counterpoint -- epoll is fundamentally broken:
       | https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-...
       | 
       | You can end up with events on file descriptors that you don't
       | even have open -- they leak across process boundaries. And that
       | means that if the file descriptor gets reused, you can end up
       | with events on the wrong file descriptor.
        
         | zzo38computer wrote:
         | From looking at the man page, it looks like epoll does not
         | return the file descriptor of the event; it returns a union
         | containing user-defined data, although one of the union's
         | fields is called "fd" because it is presumably intended to
         | hold the file descriptor.
         | 
         | However, this is still subject to the problems you mention,
         | as well as the problem that you presumably can no longer
         | control events for a file descriptor once you no longer have
         | it, so it still seems problematic.
         | 
         | Putting the file descriptor in the "struct epoll_event" instead
         | of "epoll_data_t" would have avoided the problem of events on
         | the wrong file descriptor, but that still might not be good
         | enough. (It could specify -1 as the file descriptor for events
         | of file descriptors that you do not have access to.)
         | 
         | Some of this is just a problem with POSIX in general. I did
         | consider such problems (of file descriptors and of event
         | handling) in my own ideas of operating system design, which
         | uses capabilities. I would want to avoid the mess that other
         | systems have made, too. A capability in this case is
         | fundamentally a different data type than numbers (and IPC
         | treats them differently), although I am not sure yet how to
         | handle this in a suitable way within a process (tagged memory
         | might do, although not all computers use tagged memory).
         | (Note that most I/O is done using IPC and not by system
         | calls; the set of system calls is small and is usually used
         | only for managing capabilities and event handling. This would
         | also improve security, among other things.)
        
       | kev009 wrote:
       | This is pretty facepalm. Has the author actually used any of
       | these extended epoll APIs? inotify - deprecated. aio - a
       | complete footgun on Linux. Whatever point is being made here is
       | lost in reality; I fail to see how having a bunch of additional
       | function calls is superior to event filters, and there is no
       | coherent argument laid out. How is kqueue not composable if it
       | too has added event types (i.e. the Mach ports they mention)?
       | You're going to have to modify the kernel in either case.
        
       | netbsdusers wrote:
       | Five points in reply:
       | 
       | 1. Signalfd is a mountain of technical debt. It's not like a file
       | at all. It reads entirely different things depending on which
       | process is reading from a single common open file description,
       | and it interacts with epoll in a most bizarre way, in that the
       | process that _added_ the signalfd to the poll set is the one
       | whose signals will notify readiness - even if that process then
       | closed the epoll and signalfd.
       | 
       | 2. Signalfd is a mountain of technical debt because they built it
       | over signals. Signals are of two forms: the one is asynchronous,
       | the other synchronous. The one should be replaced with a generic
       | message queue in a technical debt free design. The other is
       | debatable. (Mach and also Fuchsia have an "Exception Port"
       | instead, a message port someone can read from to deal with
       | exceptions incurred by some thread.)
       | 
       | 3. Regarding: > epoll can automatically monitor any kind of
       | kernel resource without having to be modified, due to its
       | composable design
       | 
       | Well, so can kqueue - EVFILT_READ/EVFILT_WRITE will work just as
       | well on any file, be it a socket, an eventfd, or even another
       | kqueue (though I don't think anyone who isn't dealing with
       | technical debt should be adding kqueues to kqueues, nor epolls to
       | epolls.) No need to modify kqueue! But in either case, though,
       | you've got to modify something. Whether that's trying to fit
       | signals into a file and getting signalfd, or whether it's adding
       | a signal event filter and getting EVFILT_SIGNAL, something has to
       | be done.
       | 
       | 4. FreeBSD's process descriptors predate Linux's pidfds. They are
       | also better and less technically indebted because they are pure
       | capabilities (they were invented as part of the Capsicum object-
       | capability model based security architecture for FreeBSD) while
       | pidfds are not: what you can do with a pidfd is decided based
       | on the ambient authority of the process trying to, say, signal
       | the pidfd, and not on rights intrinsic to holding the pidfd.
       | In fact, these pidfds are not even like traditional Unix open
       | file descriptions, whose rights are based on the credential of
       | who opened them. This makes privilege-separation architectures
       | impossible with pidfds, but I digress.
       | 
       | 5. The author ignored the actual points people argued against
       | epoll with, viz. that 1) epoll's edge triggering behaviour is
       | useless, and 2) that epoll's conflation of file descriptor with
       | open file description is a terribly leaky abstraction:
       | 
       | https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-...
       | https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-...
        
       ___________________________________________________________________
       (page generated 2024-12-29 23:00 UTC)