[HN Gopher] Investigating Linux phantom disk reads
___________________________________________________________________
Investigating Linux phantom disk reads
Author : kamaraju
Score : 111 points
Date : 2023-05-02 20:25 UTC (2 hours ago)
(HTM) web link (questdb.io)
(TXT) w3m dump (questdb.io)
| sytse wrote:
| TLDR; "Ingestion of a high number of column files under memory
| pressure led to the kernel starting readahead disk read
| operations, which you wouldn't expect from a write-only load. The
| rest was as simple as using madvise in our code to disable the
| readahead in table writers."
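|
| A minimal sketch (not QuestDB's actual code, file name made up)
| of what disabling readahead on a mapped column file looks like:
|
|     #define _DEFAULT_SOURCE
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <sys/mman.h>
|     #include <sys/stat.h>
|     #include <unistd.h>
|
|     int main(void)
|     {
|         int fd = open("column.d", O_RDWR);   /* hypothetical */
|         if (fd < 0) { perror("open"); return 1; }
|
|         struct stat st;
|         fstat(fd, &st);
|
|         void *p = mmap(NULL, st.st_size,
|                        PROT_READ | PROT_WRITE,
|                        MAP_SHARED, fd, 0);
|         if (p == MAP_FAILED) { perror("mmap"); return 1; }
|
|         /* Advise the kernel not to read ahead around page
|          * faults on this mapping. */
|         if (madvise(p, st.st_size, MADV_RANDOM) != 0)
|             perror("madvise");
|
|         /* ... write column data through the mapping ... */
|
|         munmap(p, st.st_size);
|         close(fd);
|         return 0;
|     }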
| EE84M3i wrote:
| The article kind of dances around it, but AIUI the reason
| their "write-only load" caused reads (and thus readahead) is
| that they were writing to mapped pages that had already been
| evicted - so the kernel _was_ reading/faulting those pages
| back in, because it can only write in block/page-sized chunks.
|
| In some sense maybe this could be thought of as readahead in
| preparation for writing to those pages, which is undesirable in
| this case.
|
| However, what confused me about this article was: if the data
| files are append-only, how is there a "next" block to read
| ahead to? I guess maybe the files are pre-allocated, or the
| kernel is reading previous pages.
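|
| A small sketch of that mechanism (file name made up, and
| assuming the target block is really allocated on disk): if
| mincore() reports the page as non-resident, the store below
| forces the kernel to read it back in before the write can
| land - the "read" half of read-modify-write:
|
|     #define _DEFAULT_SOURCE
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <sys/mman.h>
|     #include <unistd.h>
|
|     int main(void)
|     {
|         long pagesz = sysconf(_SC_PAGESIZE);
|         int fd = open("column.d", O_RDWR);   /* hypothetical */
|         if (fd < 0) { perror("open"); return 1; }
|
|         char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
|                        MAP_SHARED, fd, 0);
|         if (p == MAP_FAILED) { perror("mmap"); return 1; }
|
|         unsigned char vec;
|         mincore(p, pagesz, &vec);
|         printf("resident before write: %d\n", vec & 1);
|
|         p[0] = 42;   /* major fault + disk read if the page
|                       * was not resident */
|
|         munmap(p, pagesz);
|         close(fd);
|         return 0;
|     }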
| [deleted]
| bremac wrote:
| Reading between the lines, it sounds as if they're using
| mmap. There is no "append" operation on a memory mapping, so
| the file would need to be preallocated before mapping it.
|
| If the preallocation is done using fallocate or just writing
| zeros, then by default it's backed by blocks on disk, and
| readahead must hit the disk since there is data there. On the
| other hand, preallocating with fallocate using
| FALLOC_FL_ZERO_RANGE or (often) with ftruncate() will just
| update the logical file length, and even if readahead is
| triggered it won't actually hit the disk.
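|
| A sketch of the two preallocation styles (file names and the
| size are illustrative):
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void)
|     {
|         off_t sz = 16 << 20;   /* 16 MiB, arbitrary */
|
|         /* (a) fallocate(2), mode 0: blocks are reserved on
|          *     disk behind the logical length. */
|         int a = open("prealloc.d", O_RDWR | O_CREAT, 0644);
|         if (fallocate(a, 0, 0, sz) != 0) perror("fallocate");
|
|         /* (b) ftruncate(2): only the logical length changes;
|          *     the whole range is a hole, so any readahead
|          *     into it is satisfied with zero pages, not disk
|          *     I/O. */
|         int b = open("sparse.d", O_RDWR | O_CREAT, 0644);
|         if (ftruncate(b, sz) != 0) perror("ftruncate");
|
|         close(a);
|         close(b);
|         return 0;
|     }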
| EE84M3i wrote:
| I understand the case where the file is entirely pre-
| allocated, but for the file-hole case I'm not sure why
| you'd get such high disk activity.
|
| If the index block also got evicted from the page cache,
| then could reading into a file hole still trigger a fault?
| Or is the "holiness" of a page for a mapping stored in the
| page table?
| pengaru wrote:
| The readahead is a bit of a readaround when I last checked,
| as in it'll pull in some stuff before the fault as well.
|
| There used to be a sys-wide tunable in /sys to control how
| large an area readahead would extend to, but I'm not seeing
| it anymore on this 6.1 laptop. I think there's been some work
| changing stuff to be more clever in this area in recent
| years. It used to be interesting to make that value small vs.
| large and see how things like uncached journalctl (heavy mmap
| user) were affected in terms of performance vs. IO generated.
| EE84M3i wrote:
| The article distinguishes "readaround" from a linearly
| predicted "readahead", but then says the output of blktrace
| indicates a "potential readahead", which is where I got
| confused.
|
| Does MADV_RANDOM disable both "readahead" and "readaround"?
| pengaru wrote:
| Going through mmap for bulk-ingest sucks because the kernel has
| to fault in the contents to make what's in-core reflect what's
| on-disk before your write access to the mapped memory occurs.
| It's basically a read-modify-write pattern even when all you
| intended to do was write the entire page.
|
| When you just use a write call you provide a unit of arbitrary
| size, and if you've done your homework that size is a multiple
| of the page size and the offset is page-aligned. Then there's
| no need for the kernel to load anything in for the written
| pages; you're providing everything in the single call. From
| there you go down the O_DIRECT rabbit hole every fast Linux
| database has historically gone down.
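|
| A sketch of that aligned-write path (file name, block size and
| the use of O_DIRECT are illustrative, not QuestDB's code):
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <string.h>
|     #include <unistd.h>
|
|     int main(void)
|     {
|         size_t blk = 4096;   /* assume 4 KiB pages/blocks */
|         int fd = open("column.d",
|                       O_WRONLY | O_CREAT | O_DIRECT, 0644);
|         if (fd < 0) { perror("open"); return 1; }
|
|         /* O_DIRECT wants the buffer, length and file offset
|          * all aligned. */
|         void *buf;
|         if (posix_memalign(&buf, blk, blk) != 0) return 1;
|         memset(buf, 'x', blk);
|
|         /* A whole aligned page is supplied in one call, so
|          * the kernel never reads the old contents in first. */
|         if (pwrite(fd, buf, blk, 0) != (ssize_t)blk)
|             perror("pwrite");
|
|         free(buf);
|         close(fd);
|         return 0;
|     }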
| davidhyde wrote:
| Seems like using memory-mapped files for a write-only load is
| the suboptimal choice. Maybe I'm mistaken, but surely using an
| append-only file handle would be simpler than changing how
| memory-mapped files are cached, as they did for their
| solution?
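|
| For reference, a minimal sketch of that append-only
| alternative (file name is made up):
|
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void)
|     {
|         int fd = open("column.d",
|                       O_WRONLY | O_CREAT | O_APPEND, 0644);
|         if (fd < 0) { perror("open"); return 1; }
|
|         const char row[] = "new row bytes";
|         /* Data goes straight to the end of the file via the
|          * page cache; no mapped page has to be faulted in. */
|         if (write(fd, row, sizeof row - 1) < 0)
|             perror("write");
|
|         close(fd);
|         return 0;
|     }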
| addisonj wrote:
| I am going to write this comment with a large preface: I don't
| think it is ever helpful to be an absolutist. For every best
| practice/"right way" to do things, there are circumstances when
| doing it another way makes sense. There can be a ton of reasons
| for that, be it technical, money/time, etc. The best engineering
| teams aren't those that blindly follow what others say is a best
| practice, but those that understand the options and make an
| informed choice. None of the following is commentary on QuestDB
| specifically; as they mention in the article, _many_ databases
| use similar tools.
|
| With that said, after reading the first paragraph I immediately
| searched the article for "mmap" and had a good sense of where
| the rest of this was going. Put simply, it is just really hard
| to predict what the OS is going to do in all situations when
| using mmap. Based on my experience, I would guess that a _ton_
| of people reading this comment have hit issues that, I would
| argue, are due to using mmap. (Particularly looking at you,
| Prometheus.)
|
| All told, this is a pretty innocuous incident of mmap causing
| problems, but I would encourage any aspiring DB engineers to
| read https://db.cs.cmu.edu/mmap-cidr2022 as it gives a great
| overview of the range of problems that can occur when using
| mmap.
|
| I think some would argue that mmap is "fine" for append-only
| workloads (and it is certainly more reasonable than for a DB
| with arbitrary updates), but even here, factors like metadata
| handling, scaling the number of tables, etc. will _eventually_
| bring you up against some fundamental problems with mmap.
|
| The interesting opportunity in my mind, especially with
| improvements in async IO (both at the FS level and in tools
| like Rust), is to build higher-level abstractions that bring
| the "simplicity" of mmap, but with more purpose-built semantics
| suited to databases.
| 0xbadcafebee wrote:
| There are other methods you can use to increase performance
| under memory pressure, but you'd end up handling I/O directly
| and maintaining your own index of memory and disk accesses,
| page-aligned reads/writes, etc. It would be easier to just
| require your users to buy more memory, but when there's a hack
| like this available, it seems preferable to implementing your
| own VMM and disk I/O subsystem.
___________________________________________________________________
(page generated 2023-05-02 23:00 UTC)