[HN Gopher] Linux: What Can You Epoll?
___________________________________________________________________
Linux: What Can You Epoll?
Author : todsacerdoti
Score : 142 points
Date : 2022-10-22 16:22 UTC (6 hours ago)
(HTM) web link (darkcoding.net)
(TXT) w3m dump (darkcoding.net)
| sylware wrote:
| I wrote many of my own programs on elf/linux: I do epoll as much
| as I can.
|
| The only troubling thingy is the lack of classification of
| signals into those that are synchronous by nature and those
| that aren't. For instance, in a single-threaded application, a
| segfault won't be delivered via epoll...
|
| At the same time, it is still important to keep the
| asynchronous API for signals for lower latency, but then only
| the realtime behaviour should be kept, since that is really
| where latency does matter.
| emilfihlman wrote:
| Regular files not having a non-blocking mode is one of the
| biggest and gravest idiocies in Linux land.
|
| And there's one even worse: even having the concept of
| uninterruptible sleep (D).
| bitwize wrote:
| Why epoll when you can io_uring? In Rust?
| karthikmurkonda wrote:
| Yep
| tlsalmin wrote:
| Just skimmed through the article, since I'm just here to testify
| that the most important revelation for me on writing APIs was
| that you can put an epoll_fd in an epoll_fd. This allows the API
| to have e.g. a single epoll_fd that contains all the outbound
| connections, timers, signalfds and inotify fds mentioned in the
| article. Then e.g. a daemon using the APIs can have an epoll_fd
| per library it is using and just sit in the epoll_wait loop,
| ready to fire a library_x_process() call when events arrive.
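|
| A minimal sketch of that nesting (error handling omitted;
| library_x_process() stands in for the library's real entry
| point):
|
|       #include <sys/epoll.h>
|
|       void library_x_process(int epfd);  /* from the library */
|
|       int main(void) {
|           int lib_epfd = epoll_create1(EPOLL_CLOEXEC);
|           int top_epfd = epoll_create1(EPOLL_CLOEXEC);
|
|           /* An epoll fd is itself pollable: it reports EPOLLIN
|              when it has ready events, so it can be registered
|              like any other fd. */
|           struct epoll_event ev = { .events = EPOLLIN,
|                                     .data.fd = lib_epfd };
|           epoll_ctl(top_epfd, EPOLL_CTL_ADD, lib_epfd, &ev);
|
|           struct epoll_event out;
|           while (epoll_wait(top_epfd, &out, 1, -1) > 0) {
|               if (out.data.fd == lib_epfd)
|                   library_x_process(lib_epfd);
|           }
|       }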
| kentonv wrote:
| Another use case for this: Say you have a set of "jobs" each
| composed of many "tasks" (each waiting for some event). The
| "jobs" are able to run concurrently on different threads, but
| the "tasks" must not run concurrently with other tasks in the
| same job because they might share data structures without
| synchronization.
|
| (This is a pretty common pattern in a lot of big servers.)
|
| Now you want to make sure you utilize multiple cores
| effectively. The naive approaches are:
|
| 1. Create a thread per job, each waiting on its own epoll
| specific to the job. This may be expensive if there are many
| jobs, and could allow too much concurrency.
|
| 2. Have a single epoll and a pool of threads waiting on it.
| Each thread must lock a mutex for the job that owns the task
| it's going to run. But a thread could receive an event for a
| task belonging to a job that's already running on another
| thread, in which case it has to synchronize with that other
| thread somehow, which is a pain. Be careful not to create a
| situation where all threads are blocked on the mutex for one
| job while other jobs are starved.
|
| Epoll nesting presents a clean solution:
|
| 3. Create an epoll per job, plus an outer epoll that waits on
| other epolls. A pool of threads waits on the outer epoll, which
| signals when a per-job epoll becomes ready. The thread
| receiving that event then takes ownership of the per-job epoll
| until the event queue is empty.
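|
| A rough sketch of the worker side of option 3, assuming a
| struct job holding each job's epoll fd, registered on the
| outer epoll with EPOLLONESHOT so only one thread wakes for it
| at a time (run_ready_tasks() is a placeholder):
|
|       #include <sys/epoll.h>
|
|       struct job { int epfd; /* + per-job task state */ };
|       void run_ready_tasks(struct job *j);  /* placeholder */
|
|       /* Each job's epfd was added to 'outer' with
|          EPOLLIN|EPOLLONESHOT and ev.data.ptr = the job. */
|       void *worker(void *arg) {
|           int outer = *(int *)arg;
|           struct epoll_event ev;
|           while (epoll_wait(outer, &ev, 1, -1) > 0) {
|               struct job *j = ev.data.ptr; /* we own j now  */
|               run_ready_tasks(j);    /* drain j->epfd dry  */
|               ev.events = EPOLLIN | EPOLLONESHOT; /* re-arm */
|               ev.data.ptr = j;
|               epoll_ctl(outer, EPOLL_CTL_MOD, j->epfd, &ev);
|           }
|           return 0;
|       }
|
| The oneshot flag is what gives one thread exclusive ownership
| of a job until it explicitly re-arms the inner epoll.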
| Matthias247 wrote:
| > On Linux write to a regular file never blocks. Writing to a
| file copies data from our user space buffer to the kernel buffer
| and returns immediately. At some later point in time the kernel
| will send it to the disk. A regular file is hence always ready
| for writing and epoll wouldn't add anything.
|
| Is that true? If it were, the amount of data the kernel would
| need to buffer would be unbounded. I assumed there is a limit on
| the amount of buffered and not-yet-committed data, and that when
| it is crossed the call blocks until more data is flushed to
| disk. Which is kind of the same as what happens for TCP sockets.
| The `write()` call there doesn't really send data to the peer,
| it just submits data to the kernel's send buffer, from where it
| will be asynchronously transmitted.
|
| Edit: Actually I will answer my own question and say I know it
| will block. I have deployed IO-heavy applications in the past
| with instrumented read/write calls for IO operations in a
| threadpool. Even though typical IO times are well below 1ms,
| under extremely high load latencies of more than 1s could be
| observed, which is far from "not blocking".
| kentonv wrote:
| Yes, file I/O can block. However, there is an assumption that
| file I/O will never block "indefinitely" -- unless something is
| severely broken, the kernel will always finish the operation in
| finite time, probably measured in milliseconds at most. The same
| is not true of network communications, where you may be waiting
| for an event that never happens.
|
| There is a temptation to say that, well, milliseconds are a
| long time, so wouldn't we like to do this in a non-blocking way
| so we can work on other stuff in the meantime?
|
| But... consider this: Reads and writes of memory _also_ may
| block. If you really think about it, the only real difference
| between main memory blocking and disk blocking is the amount of
| time they may block. And with modern SSDs that time difference
| is not as large as it used to be.
|
| So do you want to be able to access memory in a non-blocking
| way? Well... you can make the same logical arguments as you do
| with file I/O, but in practice, almost no one tries to do this.
| Instead, you separate work into threads, and let the CPU switch
| (hyper)threads whenever it needs to wait for memory.
|
| In fact, memory reads may very well block on disk, if you use
| swap!
|
| Given all this, it stops being so clear that async file I/O
| really makes sense.
|
| Meanwhile, as it happens, the Linux kernel was never really
| designed for async file I/O in the first place. When you
| perform file I/O, the kernel may need to execute filesystem
| driver code, and it does so within the same thread that invoked
| the operation from userspace. That filesystem code is blocking.
| For the kernel to deliver true async file I/O, either all this
| code needs to be rewritten to be non-blocking (which would
| probably slow it down in most cases!), or the kernel needs to
| start a thread behind the scenes to perform the work.
|
| But... you can just as easily start a thread in userspace.
| So... maybe just do that?
|
| (Or, the modern answer: Use io_uring, which is explicitly
| designed to allow a userspace thread to request work performed
| on a separate kernel thread, and get notified of completion
| later.)
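|
| A minimal liburing sketch of that submit/complete split
| ("data.bin" is a placeholder; error checks omitted):
|
|       #include <liburing.h>
|       #include <fcntl.h>
|
|       int main(void) {
|           struct io_uring ring;
|           io_uring_queue_init(8, &ring, 0);
|
|           int fd = open("data.bin", O_RDONLY);
|           char buf[4096];
|
|           /* Submit: describe the read; a kernel worker may
|              carry it out while we do other things. */
|           struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|           io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
|           io_uring_submit(&ring);
|
|           /* Complete: reap the result when it's ready. */
|           struct io_uring_cqe *cqe;
|           io_uring_wait_cqe(&ring, &cqe);
|           /* cqe->res = bytes read, or -errno on failure */
|           io_uring_cqe_seen(&ring, cqe);
|
|           io_uring_queue_exit(&ring);
|       }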
| jeffbee wrote:
| io_uring just racked up another CVE, so I kinda feel that its
| severely under-designed nature will always haunt it. The idea
| that you can just hand off infinite amounts of work for the
| kernel to do on your behalf is pretty fundamentally broken.
| It is a concrete implementation of wishful thinking.
| tankenmate wrote:
| All "work" you want to do that interfaces with anything on
| an OS is handed off to the kernel; want to read a file?
| kernel, want to sleep for a while? kernel, etc. Besides
| things like network traffic is also asynchronous like
| io_uring (even if the socket() interfaces make it look
| somewhat synchronous). Outside of toy system asynchronicity
| is always a thing, especially when running on multiple
| cores.
|
| I kind of get where you are coming from, but at the same
| time the kernel always gets the last say, so as long as
| io_uring has a good design and implementation it will
| always be just as good or bad as the OS as a whole. Whether
| run-of-the-mill programmers are up to the task of properly
| conceptualising and using such an OS is probably not the
| same thing.
| jeffbee wrote:
| Yeah but it's not well-designed, that's my point. It has
| obliviously shrugged off the tricky question of object
| lifetime, that's why it has already collected 16
| different CVEs for things like use-after-free.
| Considering its short history, io_uring has already
| rocketed to the top of the list of dangerous kernel
| features.
| nathants wrote:
| with linux 6.0, lsm got the ability to filter io_uring.
| deny all and carry on.
| vlovich123 wrote:
| That analysis would seem smart but let's try a game of Mad
| Libs:
|
| The Linux Kernel just racked up another CVE, so I kinda
| feel that its severely under-designed nature will always
| haunt it.
|
| KDE just racked up another CVE, so I kinda feel that its
| severely under-designed nature will always haunt it.
|
| Firefox just racked up another CVE, so I kinda feel that
| its severely under-designed nature will always haunt it.
|
| Chrome just racked up another CVE, so I kinda feel that its
| severely under-designed nature will always haunt it.
|
| Windows just racked up another CVE, so I kinda feel that
| its severely under-designed nature will always haunt it.
|
| Photoshop just racked up another CVE, so I kinda feel that
| its severely under-designed nature will always haunt it.
|
| All CPUs just racked up another CVE, so I kinda feel that
| its severely under-designed nature will always haunt it.
|
| What's the theme? Racking up CVEs is something all software
| & hardware does. Mistakes can happen in design and in
| implementation and no one is immune. Using presence of CVEs
| as an indication of immaturity / fundamental design flaw
| isn't helpful. In fact, it's probably the opposite.
| Software that has no CVEs probably just means no one is
| paying attention to it. Sure, if in some theoretical case
| you've built a formal proof and translated it into a
| memory-safe language somehow (and you assume you've made no
| mistakes modelling your entire system in the proof), then
| maybe. However, that encompasses 0% of all software.
|
| > The idea that you can just hand off infinite amounts of
| work for the kernel to do on your behalf is pretty
| fundamentally broken. It is a concrete implementation of
| wishful thinking
|
| How is that any different from a file descriptor? The
| kernel is free to set up limits on how much work you can
| have outstanding at any given time (maybe those bits are
| missing right now, but it doesn't feel like an intractable
| problem).
| [deleted]
| loeg wrote:
| > For the kernel to deliver true async file I/O, either all
| this code needs to be rewritten to be non-blocking
|
| This is, I believe, the NT model.
| abiloe wrote:
| > If you really think about it, the only real difference
| between main memory blocking and disk blocking is the amount
| of time they may block.
|
| This is a somewhat confusing analysis you have here. A direct
| read/write from memory, for all intents and purposes, doesn't
| block. Why do you say that reads and writes may also block?
|
| The reason memory blocks is that it needs to page in or
| out from secondary storage - which makes the statement "the
| only real difference between main memory blocking and disk
| blocking is the amount of time they may block" not really
| true.
| tremon wrote:
| _Why do you say that reads and writes may also block?_
|
| Let's define "may block" first, perhaps? What do we mean
| when we say "network I/O may block"? Usually, this means
| that the kernel may see your network request and raise you
| a context switch while it waits for the network response on
| your behalf. In your last sentence you appear to argue that
| the reason _why_ the kernel performs a context switch is
| relevant in determining if an operation "may block", and
| the GP is arguing that that's a distinction without a
| difference.
|
| If the definition of "may block" is really just "the kernel
| may decide to context-switch away from your program", then
| yes, the GP's assertion that file I/O, memory I/O (mmap)
| and memory access (swap) are all operations that may block
| is correct -- the only difference is in degree: from
| microsecond delays for nvm-backed swap to multi-second
| delays for network transfers.
|
| Or, of course, I may have misunderstood the GP's train of
| thought.
| [deleted]
| jesboat wrote:
| >> If you really think about it, the only real difference
| between main memory blocking and disk blocking is the
| amount of time they may block.
|
| > This is a somewhat confusing analysis you have here.
| Direct read/write from memory for all intents and purposes
| doesn't block. Why do you say that reads and writes may
| also block?
|
| Reads and writes from actual, physical, hardware memory
| might block, depending on how you define "block", in the
| sense that some reads may miss CPU cache. But once you get
| to that point, you could argue that every branch might
| block if the branch misprediction causes a pipeline stall.
| This is not a useful definition of "block".
|
| The thing is, most programs are almost never low-level
| enough to be dealing with memory in that sense: they read
| and write _virtual_ memory. And virtual memory can block
| for any number of reasons, including some pretty
| non-obvious ones. For example:
|
| - the system is under memory pressure and that page is no
| longer in RAM because it got written to a swap file
|
| - the system is under memory pressure and that page is no
| longer in RAM because it was a read-only mapping from a
| file and could be purged
|
| -- e.g. it's part of your executable's code
|
| - this is your first access to a page of anonymous virtual
| memory and the kernel hadn't needed to allocate a physical
| page until now
|
| - you're in a VM and the VMM can do whatever it wants
|
| - the page is COW from another process
| kentonv wrote:
| > This is not a useful definition of "block".
|
| I think what I'm saying is that calling file I/O
| "blocking" is also not a useful definition of "block".
| Because I don't really see the fundamental difference
| between "we have to wait for main memory to respond" and
| "we have to wait for disk to respond".
|
| > this is your first access to a page of anonymous
| virtual memory and the kernel hadn't needed to allocate a
| physical page until now
|
| And said allocation could block on all sorts of things
| you might not expect. Once upon a time I helped debug a
| problem where memory allocation would block waiting for
| the XFS filesystem driver to flush dirty inodes to disk.
| Our system generated lots of dirty inodes, and we were
| seeing programs randomly hang on allocation for minutes
| at a time.
| abiloe wrote:
| > I think what I'm saying is that calling file I/O
| "blocking" is also not a useful definition of "block".
| Because I don't really see the fundamental difference
| between "we have to wait for main memory to respond" and
| "we have to wait for disk to respond".
|
| In addition to the point made elsewhere, you're sort of
| implicitly denying the magnitude of the differences here -
| the latency differences are on the order of 1000x.
|
| The other way of separating them is whether the OS (or some
| kind of software trap handler more generally) has to get
| involved. A main memory read to a non-faulting address
| doesn't involve the OS - i.e. it doesn't ever block.
| However, faulting reads, calls to "disk" IO, and networking
| IO (i.e. just I/O in general) involving the
| OS/monitor/what have you are all potentially blocking
| operations.
| dahfizz wrote:
| > Because I don't really see the fundamental difference
| between "we have to wait for main memory to respond" and
| "we have to wait for disk to respond".
|
| The difference, conservatively, is a factor of 1000.
|
| There are plenty of times in software engineering where
| scaling 1000x will force you to reconsider your
| architecture.
| kentonv wrote:
| > Direct read/write from memory for all intents and
| purposes doesn't block.
|
| Sure it does! Main memory is much slower than cache so on a
| cache miss the CPU has to stop and wait for main memory to
| respond. The CPU may even switch to executing some other
| thread in the meantime (that's what hyperthreading is). But
| if there isn't another hyperthread ready, the CPU sits
| idle, wasting resources.
|
| It's not a form of blocking implemented by the OS
| scheduler, but it's pretty similar conceptually.
|
| > The reason memory blocks is because it needs to page in
| or out from secondary storage
|
| Nope, that's not what I was referring to (other than in the
| line mentioning swap).
| bch wrote:
| With the utmost respect, I've never heard "blocking"
| described as "takes some measurable amount of time"
| (which is how I'm reading your above statement); by that
| definition, async blocks to a degree too.
|
| You're throwing traditional blocking/non-blocking
| distinctions on their ear.
| Volundr wrote:
| Blocking in this case refers to the CPU thread sitting idle
| whilst the operation is performed. This is what it means
| when you're blocked on a network request, blocked on a disk
| operation, or blocked on a memory request. It's all
| blocking.
|
| A cache miss and going to RAM is usually fast enough that
| we as software engineers don't care about it, and in fact
| our programming language of choice may not even give us a
| way of telling the difference between a piece of data
| coming from a CPU register or L1 cache vs going to RAM,
| but that doesn't mean the blocking isn't happening.
|
| EDIT: to maybe make this a little clearer for those who
| might not be aware: the CPU doesn't go fetch something
| from RAM. It puts in a request to the memory controller
| (handwaving modern architecture a bit here), then has to
| wait ~100-1000 CPU cycles before the controller gets back
| to it with the data. Depending on the circumstances, the
| kernel may let that core sit idle, or it may do a context
| switch to another thread. The only difference between
| this process and, say, a network request is how many CPU
| cycles pass before you get your results. In the meantime
| the thread isn't progressing and is blocked.
| bch wrote:
| > A cache miss and going to RAM is usually fast enough
| that we as software engineers don't care about it, and in
| fact our programming language of choice may not even give
| us a way of telling the difference between a piece of
| data coming from a CPU register or L1 cache vs going to
| RAM, but that doesn't mean the blocking isn't happening.
|
| Yes, this is the line being discussed, and I guess
| (historically) I've just considered it "a cost" without
| dragging "blocking" into the equation. We know that _not_
| accessing memory is cheaper than accessing it, and we can
| tune (pack our structs, mind thrashing the cache), but
| calling that blocking is still new to me. I'll have to
| consider what it means. And also, does it imply the
| existence of non-blocking memory (I don't think DMA is
| typically in a developer's toolkit, but...)?
| Volundr wrote:
| > And also, does it imply the existence of non-blocking
| memory
|
| Yes actually! If you know you're going to need a block of
| memory before you actually need it, you can put in a
| request to the memory controller ahead of time, then
| proceed to do some other work and check back in when you're
| ready for the data or when the memory controller signals
| you it's done. It's just that this kind of thing is
| usually the scope of compiler optimizations or
| hyper-optimized software like Varnish Cache, rather than
| something your average web developer thinks about. It's
| again conceptually the same as an async network request,
| but you bother with one while considering the other just
| "a cost" because of the different timescales.
| jmalicki wrote:
| > And also, does it imply the existence of non-blocking
| memory
|
| Prefetching instructions, to tell the processor to load
| before you use it!
|
| The first google hit [0] even calls it non-blocking
| memory access!
|
| In [1] you can see some of the available prefetching
| instructions, and in [2] some analysis on how they deal
| with TLB misses (another _extremely_ expensive way memory
| access can be blocking short of a page fault).
|
| Another thing not mentioned above is that accessing a
| page of newly allocated memory often causes a page fault,
| since allocation is often delayed until use of each page,
| for overcommitting behavior - same for writing to memory
| that is copy-on-write from a fork!
|
| [0] https://www.sciencedirect.com/topics/computer-science/prefet....
|
| [1] https://docs.oracle.com/cd/E36784_01/html/E36859/epmpw.html
|
| [2] https://stackoverflow.com/a/52377359/435796
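|
| For the curious, a hedged sketch of what software prefetch
| looks like with the GCC/Clang builtin (the lookahead
| distance of 8 is a made-up number you would have to tune):
|
|       /* Indirect sum: ask the memory system for a future
|          element, keep working, use it a few steps later. */
|       float sum_indirect(const int *idx, const float *data,
|                          int n) {
|           float acc = 0;
|           for (int i = 0; i < n; i++) {
|               if (i + 8 < n)  /* read hint, low reuse */
|                   __builtin_prefetch(&data[idx[i + 8]],
|                                      0, 1);
|               acc += data[idx[i]];
|           }
|           return acc;
|       }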
| [deleted]
| abiloe wrote:
| > Sure it does! Main memory is much slower than cache so
| on a cache miss the CPU has to stop and wait for main
| memory to respond. The CPU may even switch to executing
| some other thread in the meantime (that's what
| hyperthreading is).
|
| Cache is a memory. And which cache, by the way? Even L1
| cache on modern processors doesn't have 0 latency. And
| this is a rather poor way of describing hyperthreading -
| the CPU doesn't really "switch" - the context for the
| alternate process is already available, and the resource
| stealing can occur on any kind of stall (including cache
| loads), not just memory. Calling this a "switch",
| suggesting it is like a context switch, is very
| misleading. It's not similar conceptually.
|
| In any event, by this definition even a mispredicted
| branch or a divide becomes "blocking" - which sort of
| tortures any meaningful definition of blocking.
|
| The essential difference is that memory accesses to
| paged-in memory (and branch mispredictions, cache misses)
| are not something you typically or reasonably trap outside
| of debugging. mmaps, swaps, disk I/O and network accesses
| are all delegated to an OS - and at that point are orders
| of magnitude more expensive than even most NUMA memory
| accesses. I sort of see where you're coming from - but I
| don't think it's a useful point.
| kentonv wrote:
| None of this seems to contradict my point?
|
| My argument is that disk I/O is more like memory I/O than
| it is like network I/O, and so for concurrency purposes
| it may make more sense to treat it like you would memory
| I/O (use threads) than like you would network I/O (where
| you'd use non-blocking APIs and event queues).
| abiloe wrote:
| > My argument is that disk I/O is more like memory I/O
| than it is like network I/O
|
| It depends on your network and disk - and yes, SSDs and
| "slow" ethernet are the common case, but there is enough
| variation (say, a relatively sluggish embedded eMMC on one
| end and 100 GbE for the networking case) that there's no
| point in making some distinction about disk vs network
| latency - for a general concurrency abstraction they're
| both slow IO and you might as well have a common
| abstraction like IOCP or io_uring.
|
| > concurrency purposes it may make more sense to treat it
| like you would memory I/O (use threads) than like you
| would network I/O (where you'd use non-blocking APIs and
| event queues).
|
| No, case in point: Windows has had IOCP for years, such
| that you could use the same common abstraction for network
| and disk. The fact that the POSIX/UNIX world was far behind
| the times in getting its shit together doesn't mean much.
|
| And why, fundamentally, can you not use blocking APIs
| with threads for networking?
| p12tic wrote:
| It's complicated: memory accesses really can block for
| relatively long periods of time.
|
| Consider that regular memory access via cache takes around
| 1 nanosecond.
|
| If the data is not in top-level cache, then we're looking
| at roughly 10 nanoseconds access latency.
|
| If the data is not in cache at all, we are looking into
| 50-150 nanoseconds access latency.
|
| If the data is in memory, but that memory is attached to
| another CPU socket, it's even more latency.
|
| Finally, if the data access is via atomic instruction and
| there are many other CPUs accessing the same memory
| location, then the latency can be as high as 3000
| nanoseconds.
|
| It's not very hard to find NVMe attached storage that has
| latencies of tens of microseconds, which is not very far
| off memory access speeds.
| eloff wrote:
| I just want to add to your explanation that, even in the
| absence of hard paging from disk, you can have soft page
| faults, where the kernel modifies page table entries,
| assigns a memory page, copies a copy-on-write page, etc.
|
| In addition to the cache misses you mention there's also
| TLB misses.
|
| Memory is not actually random access; locality matters a
| lot. SSD reads, on the other hand, are much closer to
| random access, but much more expensive.
| caf wrote:
| The term "blocking" in UNIX-like OSes is jargon with a
| particular meaning. It means an interruptible wait.
|
| Disk files do not block - they may Disk Wait instead, which is
| an uninterruptible wait (this is what the 'D' process status
| stands for). Disk Wait doesn't interact with O_NONBLOCK,
| select(2), poll(2) etc.
|
| (Back in the bad old days it wasn't even possible for a Disk
| Waiting process to wake up to process a SIGKILL and die, which
| was the bane of system administrators everywhere when NFS
| introduced the idea of disks that could disappear when the
| network went down. Now it's common for OSes to make some kinds
| of Disk Waits at least killable).
| Snild wrote:
| > I assumed there is a limit on the amount of buffered and not
| yet committed data, and when that is crossed the call would
| block until more data is flushed to disk.
|
| There is. It's tunable through /proc/sys/vm/dirty_ratio. When
| there is that much write cache, application writes will start
| to write back synchronously.
|
| There is also dirty_background_ratio, which is the threshold at
| which writeback starts happening in the background (that is, in
| a kernel thread).
| throwaway09223 wrote:
| No, as you reasoned out it is absolutely incorrect. Write calls
| to regular files will block until they are complete, unless
| some kind of error situation is encountered.
|
| This effect is often particularly pronounced with NFS, where
| calls might block for _hours_ or even indefinitely if the
| underlying network filesystem goes away.
| tankenmate wrote:
| Just in case anyone isn't aware, there is a mount flag called
| "soft" that allows the NFS client (and some other network
| filesystems) to time out or be interrupted, i.e. the process
| won't get stuck in the 'D' (uninterruptible disk wait) state.
| inetknght wrote:
| > _This effect is often particularly pronounced with NFS,
| where calls might block for hours or even indefinitely if the
| underlying network filesystem goes away._
|
| I can't tell you how many times I've had to debug a stuck
| process and it turns out that the logs indicated the NFS had
| a hiccup a day or two ago during a file read or write and the
| process was never notified of a file error. It's f!@#ing
| frustrating. Worse, though, was CIFS.
| lanstin wrote:
| I routinely have to run file system scans on a giant NFS
| filer, and even without a hiccup, out of 100M stat or
| read calls, ten or so will just never finish. In Go, I have
| to wrap the call with a channel thing and a timeout, and
| hope I don't run out of threads before scanning all 400M
| files.
| kotlin2 wrote:
| The write call returns how many bytes were accepted:
| https://man7.org/linux/man-pages/man2/write.2.html
|
| > The number of bytes written may be less than count if, for
| example, there is insufficient space on the underlying physical
| medium, or the RLIMIT_FSIZE resource limit is encountered (see
| setrlimit(2)), or the call was interrupted by a signal handler
| after having written less than count bytes. (See also pipe(7).)
| wtallis wrote:
| That doesn't answer the question. Blocking isn't a matter of
| how much data is written, but a matter of when the system
| call completes. Other parts of that man page imply that
| write(2) may block, unless the fd was opened with O_NONBLOCK
| (in which case you'll get an EAGAIN error instead of it
| blocking).
| icedchai wrote:
| "It's complicated." Generally, with a regular file,
| write(2) will complete as soon as the data makes it to
| filesystem buffers/cache. The data is _probably_ not on
| disk when the call completes. This depends on how the file
| was opened (O_FSYNC, O_DIRECT, etc.) and the underlying
| filesystem itself. There are many other details at work,
| like actual file system, memory pressure (there may not be
| enough buffers), cache in the physical disk device or
| controller, etc. So the write call itself is "blocking",
| but the physical writes are (generally) not synchronous
| with the call.
| wtallis wrote:
| Yes, whether a write blocks is really about whether the
| application can do anything else while the write is
| processed; whether the application is told the write is
| done when it lands in a cache or when it is actually on
| stable storage is a separate question.
| throwaway09223 wrote:
| > (in which case you'll get an EAGAIN error instead of it
| blocking).
|
| You won't. O_NONBLOCK cannot be used with regular files.
| That part of the manpage is discussing other non-socket
| file types.
|
| Disk i/o via write(2) is always a blocking call. Always.
| 100% of the time, no exceptions.
| cout wrote:
| It is bounded by available memory. Writes to a socket go to a
| FIFO queue (the socket's write buffer), but writes to disk are
| different; they go through the page cache
| (https://www.kernel.org/doc/html/latest/admin-guide/mm/concep...):
|
| > The physical memory is volatile and the common case for
| getting data into the memory is to read it from files. Whenever
| a file is read, the data is put into the page cache to avoid
| expensive disk access on the subsequent reads. Similarly, when
| one writes to a file, the data is placed in the page cache and
| eventually gets into the backing storage device. The written
| pages are marked as dirty and when Linux decides to reuse them
| for other purposes, it makes sure to synchronize the file
| contents on the device with the updated data.
|
| There are many advantages to doing it this way. One is that
| multiple writes to the same page will result in a single
| physical write, if the page has not yet been flushed to disk.
|
| There are many reasons that you might have seen a write to a
| file block. One is that the number of dirty pages has reached
| the threshold (nr_dirty_threshold in /proc/vmstat). After that
| happens, any process doing disk IO will block.
|
| Another reason is memory pressure. Since all writes go through
| the page cache, the kernel must first allocate a page before
| the call to write(2) can be completed. If there are many pages
| in the page cache, this can take a long time (I once witnessed
| an old kernel bug cause all page allocations to result in
| kswapd attempting to reclaim pages, due to active pages being
| placed ahead of inactive pages in the LRU lists).
|
| In general, if you are writing a lot to disk but have no
| intention of reading it back in the near future, it is a good
| idea to call posix_fadvise(2) with POSIX_FADV_DONTNEED to
| ensure the pages will be reused for something else more
| quickly.
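|
| A small sketch of that advice (assuming a streaming writer
| that is finished with the byte range; error checks omitted):
|
|       #include <fcntl.h>
|       #include <unistd.h>
|
|       /* Drop our cached pages after a bulk write. DONTNEED
|          only discards clean pages, so flush them first. */
|       void drop_written_pages(int fd, off_t off, off_t len) {
|           fsync(fd);
|           posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
|       }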
| lanstin wrote:
| It is pretty easy to completely hork a large box with a very
| disk-intensive process; hit a local file system hard enough
| and you can get a majority of the processes into D state,
| uninterruptible IO disk wait. Maybe not from inside a
| container, haven't seen it, but definitely on a box with
| shared processes. Even just too much logging can harm
| unrelated processes that aren't even doing much with the
| disk.
| rwmj wrote:
| It's weird that (according to this document) you can epoll Unix
| domain sockets but not sockets created by socketpair(2). I
| thought socketpair created essentially two pre-connected Unix
| domain sockets.
| kentonv wrote:
| Hmm, I don't think that's what it says (unless they edited it
| since your post?). It mentions socketpair explicitly as
| something that _is_ epoll-friendly, and which you can use to
| communicate with another thread, in the case where you must
| create a thread to perform some blocking task but still want
| to get completion notification in the main thread via epoll.
| ajross wrote:
| Indeed, I am all but certain you can epoll on socketpairs. That
| sounds like a mistake in the article.
| kentonv wrote:
| I highly recommend that you do NOT use signalfd to get
| notification of signals through epoll. Instead, block (mask) the
| signal, set a signal handler, and use epoll_pwait() to atomically
| unblock it while you wait for events. Note that in this setup,
| your signal handler callback need not be async-signal-safe, since
| you know the precise state of the calling thread: it's invoking
| epoll_pwait(). This sidesteps most of the pain of using signals
| which might otherwise make you think you want signalfd.
|
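| A minimal sketch of that setup, using SIGUSR1 as the example
| signal (error handling omitted; it assumes the signal wasn't
| already blocked at startup):
|
|       #include <signal.h>
|       #include <sys/epoll.h>
|
|       static void on_sigusr1(int sig) {
|           /* Runs only "inside" epoll_pwait() on this thread,
|              so it may safely touch the thread's own state. */
|       }
|
|       int main(void) {
|           /* 1. Block the signal... */
|           sigset_t blocked, during_wait;
|           sigemptyset(&blocked);
|           sigaddset(&blocked, SIGUSR1);
|           pthread_sigmask(SIG_BLOCK, &blocked, &during_wait);
|
|           /* 2. ...install a handler anyway... */
|           struct sigaction sa = { .sa_handler = on_sigusr1 };
|           sigaction(SIGUSR1, &sa, NULL);
|
|           /* 3. ...and atomically unblock it only while
|              waiting. during_wait is the pre-block mask. */
|           int epfd = epoll_create1(EPOLL_CLOEXEC);
|           struct epoll_event ev;
|           for (;;) {
|               int n = epoll_pwait(epfd, &ev, 1, -1,
|                                   &during_wait);
|               if (n < 0)
|                   continue;  /* EINTR: the handler just ran */
|               /* ... dispatch ev ... */
|           }
|       }
|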
| Two reasons not to use signalfd:
|
| 1. signalfd has weird semantics that don't match what you'd
| normally expect from a file descriptor. When you read from a
| signalfd, it tells you signals queued on the thread that called
| read(), NOT the thread that created the signalfd. Worse, if you
| add signalfd to an epoll, the epoll will report readiness based
| on the thread that used epoll_ctl() to add the signalfd, which
| may be different from the thread that is reading from the epoll.
| So you might get a notification that the signalfd is ready, but
| then read the signalfd and find there are no signals, and then
| wait on the epoll again just to have it tell you again that this
| signalfd is ready.
|
| 2. It turns out that signalfd's implementation has some severe
| lock contention issues. I learned this through my own
| experimentation recently. In my experiment, I had 5000 threads
| each waiting on an epoll that included a signalfd. When
| delivering a thread-specific signal to each of the 5000 threads
| at once, the process spent 2+ MINUTES of CPU time spinning on
| spinlocks in the kernel before completing all the event
| deliveries. The time spent was O(n^2) to the number of threads.
| When I switched to an epoll_pwait()-based implementation, the
| same task took a few milliseconds.
|
| Here's the PR where I switched KJ's event loop (used in Cap'n
| Proto and Cloudflare Workers) to use epoll_pwait():
| https://github.com/capnproto/capnproto/pull/1511
| kelnos wrote:
| The big downside of using a traditional signal handler is that
| the only way to get your own data into the handler function is
| through global variables (or thread locals). While you can
| certainly make an exception just for that one thing, it feels
| gross to do so. And you can also just defer processing to your
| main loop by setting a flag or writing to a pipe, but those
| things still need to be global variables.
|
| I didn't know about signalfd's limitations before reading your
| post, and was happy that signalfd could eliminate the need for
| global variables when doing signal handling. Shame that's not
| really the case.
| kentonv wrote:
| In my case I use a thread_local pointer that I initialize
| right before epoll_pwait and set back to null immediately
| after. The pointer points to the same data structures that I
| would otherwise use to handle signalfd events. Yeah it's a
| little icky to use the global but I think it ends up
| semantically equivalent.
| [deleted]
| wahern wrote:
| Unfortunately, thread-local storage is not async-signal
| safe. You're relying (knowingly, I presume, but others
| should be warned) on implementation details.
|
| But, yeah, signalfd leaves much to be desired. *BSD kqueue
| EVFILT_SIGNAL has much saner semantics.
| kentonv wrote:
| > Unfortunately, thread-local storage is not async-signal
| safe.
|
| Doesn't matter, because the signal handler in this case
| is strictly called "during" invocation of epoll_pwait, so
| there's no risk of it interrupting the initialization of
| a TLS object. The usual rules about async signal safety
| do not need to be followed here; it's as if
| epoll_wait()'s implementation made a plain old function
| call to the signal handler.
|
| (Also, since we're talking about epoll, we can assume
| Linux, which means we can assume ELF, which means it's
| pretty easy to use thread_local in a way that requires no
| initialization by allocating it in the ELF TLS section.
| But yes, that's relying on implementation details I
| suppose.)
|
| > kqueue EVFILT_SIGNAL
|
| Having recently implemented kqueue support in my event loop
| I have to say I'm disappointed by EVFILT_SIGNAL. It does
| not play well with signals that target a specific thread
| (pthread_kill()) -- on FreeBSD, all threads will get the
| kqueue event, while on MacOS, none of them do.
| Fortunately EVFILT_USER provides a reasonable alternative
| for efficient inter-thread signaling.
|
| (I don't like using a pipe or socketpair as that involves
| allocating a whole two file descriptors and a kernel
| buffer, and it requires a redundant syscall on the
| receiving end to read the dummy byte out of the buffer.
| If you're just trying to tell another thread "hey I added
| something to your work queue, please wake up and check",
| that's a waste.)
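|
| For reference, the EVFILT_USER shape is roughly this (error
| checks omitted; the ident value 1 is arbitrary):
|
|       #include <sys/event.h>
|
|       /* Register once on the loop's kqueue; EV_CLEAR
|          resets the event after each delivery. */
|       void wakeup_register(int kq) {
|           struct kevent kev;
|           EV_SET(&kev, 1, EVFILT_USER, EV_ADD | EV_CLEAR,
|                  0, 0, NULL);
|           kevent(kq, &kev, 1, NULL, 0, NULL);
|       }
|
|       /* Called from another thread: poke the loop awake. */
|       void wakeup_trigger(int kq) {
|           struct kevent kev;
|           EV_SET(&kev, 1, EVFILT_USER, 0, NOTE_TRIGGER,
|                  0, NULL);
|           kevent(kq, &kev, 1, NULL, 0, NULL);
|       }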
| kelnos wrote:
| Makes sense, and is probably the "safest" you can get.
| Since, as you say, you know exactly the state of everything
| on that thread when you're in your handler, you can also
| know that your thread local was set properly before the
| epoll_pwait() call.
|
| It's probably code I'd want to isolate somewhere, with big
| warnings so any future reader understands why it is how it
| is, but I agree it's probably the safest way to do it.
| FPGAhacker wrote:
| You should do a write-up of item 2.
| tlsalmin wrote:
| I have to disagree here. Not recommending signalfd for the
| mentioned use cases might be reasonable, just as reasonable as
| it is to use threads for a specific use case. For a single-
| threaded client/server using non-blocking FDs, signalfd
| removes the risk of doing too much in the signal handler and
| brings signals nicely into the event loop. This just happens
| to be 99% of the functionality I have to write.
|
| I'd only use more than one signalfd if each signalfd catches
| only a specific signal, e.g. the main context handles SIGTERM
| and a background-process library handles SIGCHLD.
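|
| For that single-threaded case, the shape is roughly this
| (SIGTERM as the example; error checks omitted):
|
|       #include <signal.h>
|       #include <sys/epoll.h>
|       #include <sys/signalfd.h>
|       #include <unistd.h>
|
|       int main(void) {
|           sigset_t mask;
|           sigemptyset(&mask);
|           sigaddset(&mask, SIGTERM);
|           /* Block normal delivery before creating the fd. */
|           sigprocmask(SIG_BLOCK, &mask, NULL);
|
|           int sfd = signalfd(-1, &mask,
|                              SFD_NONBLOCK | SFD_CLOEXEC);
|           int epfd = epoll_create1(EPOLL_CLOEXEC);
|           struct epoll_event ev = { .events = EPOLLIN,
|                                     .data.fd = sfd };
|           epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);
|
|           struct epoll_event out;
|           while (epoll_wait(epfd, &out, 1, -1) > 0) {
|               if (out.data.fd != sfd)
|                   continue;  /* handle other fds here */
|               struct signalfd_siginfo si;
|               if (read(sfd, &si, sizeof si) == sizeof si &&
|                   si.ssi_signo == SIGTERM)
|                   return 0;  /* clean shutdown in the loop */
|           }
|       }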
| guenthert wrote:
| Thanks for the reminder that there is no non-blocking I/O for
| files residing on block devices.
| yxhuvud wrote:
| But there is, io_uring.
| m00dy wrote:
| io_uring, a magical keyword I used to use in job
| interviews...
| healthandsafety wrote:
| Care to elaborate?
| kortilla wrote:
| Everyone says it's better on paper but you rarely get to
| actually use it in real code.
| guenthert wrote:
| That is async I/O afaiu, and not classic Unix non-blocking
| I/O (O_NONBLOCK given to open(2)).
| yxhuvud wrote:
| Sure. But why does the difference matter? It is not as if
| epoll is classic Unix either.
| guenthert wrote:
| epoll might not be, but poll is (depending on how one
| would interpret 'classic').
|
| Anyhow, I wrongly assumed the difference mattered with
| respect to whether one could use io_uring in combination
| with epoll(). It turns out one can [1] or [2].
|
| [1] https://stackoverflow.com/questions/70132802/waiting-for-epo...
|
| [2] https://unixism.net/loti/tutorial/register_eventfd.html
| yxhuvud wrote:
| Having done my own share of uring bindings, I wish I had
| found workplaces that appreciated that.
| bfrog wrote:
| why epoll at all, the new hotness is io_uring, fire away your
| iovecs, check back later
| rwmj wrote:
| You can go from select/poll to epoll relatively easily, but
| I've found that to use io_uring you have to substantially
| rearchitect your whole program (if you want any performance
| benefit).
|
| Actually I'd love to be wrong about this, but I've not found a
| way to easily retrofit io_uring into programs/libraries that
| are already using either synchronous operations or poll(2).
| jasonzemos wrote:
| io_uring is basically a drop-in for epoll. It has an
| intrinsic performance benefit because multiple operations can
| be both submitted and completed in a single action.
| Rearchitecting only becomes necessary when going further and
| replacing standalone syscalls with io_uring operations. In
| the case of poll(2), I believe it should be no more difficult
| than refactoring for epoll.
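|
| In the drop-in style, readiness polls are just another
| operation: queue several in one go, then reap completions. A
| sketch, assuming the caller already ran io_uring_queue_init
| (note these polls are one-shot, like EPOLLONESHOT):
|
|       #include <liburing.h>
|       #include <poll.h>
|
|       void poll_batch(struct io_uring *ring,
|                       const int *fds, int nfds) {
|           for (int i = 0; i < nfds; i++) {
|               struct io_uring_sqe *sqe =
|                   io_uring_get_sqe(ring);
|               io_uring_prep_poll_add(sqe, fds[i], POLLIN);
|               io_uring_sqe_set_data(sqe,
|                   (void *)(long)fds[i]);
|           }
|           io_uring_submit(ring);  /* one syscall for all */
|
|           struct io_uring_cqe *cqe;
|           io_uring_wait_cqe(ring, &cqe);
|           int ready = (int)(long)io_uring_cqe_get_data(cqe);
|           (void)ready;  /* handle like an epoll event */
|           io_uring_cqe_seen(ring, cqe);
|       }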
| wahern wrote:
| With io_uring, _every_ line in an application that calls
| read/recv needs to be refactored, along with much of the
| surrounding context. io_uring doesn't replace poll/epoll;
| it effectively replaces typical event loop frameworks. You
| can integrate io_uring into pre-existing event loop
| frameworks, but the event loop framework will end up as a
| 99% superfluous wrapper, at least on Linux.
|
| Note that many applications don't use event loop
| frameworks. For simple applications they can be overkill.
| Even for more complex applications, it may be cleaner to
| use restartable semantics (i.e. the same semantics as read:
| just call me again), especially for libraries or components
| that want to be event loop agnostic.
| gavinray wrote:
| You can use userspace coroutine/fiber implementations to
| wire async io_uring into existing synchronous code and
| maintain the facade of the code still being synchronous.
|
| How easy/feasible this is depends on the language.
|
| In C++, Rust, Zig, Java (Loom fibers), and Kotlin I know for
| a fact it's doable.
|
| Other languages, I'm not sure what the experience is like.
| drpixie wrote:
| Does anyone feel that the Linux API (and so the kernel) is
| slowly getting more and more complex and cumbersome?
___________________________________________________________________
(page generated 2022-10-22 23:00 UTC)