[HN Gopher] But how, exactly, do databases use mmap?
___________________________________________________________________
But how, exactly, do databases use mmap?
Author : brunoac
Score : 168 points
Date : 2021-01-23 13:06 UTC (9 hours ago)
(HTM) web link (brunocalza.me)
(TXT) w3m dump (brunocalza.me)
| rcgorton wrote:
| Some of the 'sizing' snippets in the example came across as
| disingenuous to me: if you KNOW the size of the file, mmap it
| initially using that size without the looping overhead. And you
| presumably know how much memory you have on a given system. The
| description (at least as I read the article) implies Bolt is a
| truly naive implementation of a key/value DB.
| perbu wrote:
| The author notices that Bolt doesn't use mmap for writes. The
| reason is surprisingly simple, once you know how it works. Say
| you want to overwrite a page at some location that isn't present
| in memory. You'd write to it and you'd think that is that. But
| when this happens the CPU triggers a page fault, the OS steps in
| and reads the underlying page into memory. It then relinquishes
| control back to the application. The application then continues
| to overwrite that page.
|
| So for each write that isn't mapped into memory you'll trigger a
| read. Bad.
|
| Early versions of Varnish Cache struggled with this, which is
| why they added a malloc-based backend instead. mmaps are great
| for reads, but you really shouldn't write through them.
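|
| Roughly, in C (just a sketch, not Bolt's or Varnish's code;
| error handling omitted): the store through the mapping can
| page-fault and block while the kernel reads the old page in,
| whereas pwrite goes through the write syscall path instead.
|
|   #include <string.h>
|   #include <sys/mman.h>
|   #include <unistd.h>
|
|   void overwrite_via_mmap(int fd, size_t filesize, off_t off,
|                           const char *buf, size_t len)
|   {
|       char *p = mmap(NULL, filesize, PROT_READ | PROT_WRITE,
|                      MAP_SHARED, fd, 0);
|       /* may fault: kernel reads the page before we can store */
|       memcpy(p + off, buf, len);
|       msync(p, filesize, MS_SYNC);
|       munmap(p, filesize);
|   }
|
|   void overwrite_via_pwrite(int fd, off_t off,
|                             const char *buf, size_t len)
|   {
|       /* kernel does the read-modify-write in the page cache */
|       pwrite(fd, buf, len, off);
|       fsync(fd);
|   }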
| tayo42 wrote:
| Is the trade-off in Varnish worth it? Workloads for a cache
| should be pretty read-heavy; writes should be infrequent unless
| it's being filled for the first time.
| KMag wrote:
| I think the main problem with mmap'd writes is that they're
| blocking and synchronous.
|
| I presume most database record writes are smaller than a page.
| In that case, other methods (unless you're using O_DIRECT,
| which adds its own difficulties) still have the kernel read a
| whole page of memory into the page cache before writing the
| selected bytes. So, unless you're using O_DIRECT for your
| writes, you're still triggering the exact same read-modify-
| write; it's just that with the file APIs you can use async I/O
| or select/poll/epoll/kqueue, etc. to keep these necessary reads
| from blocking your writer thread.
| cperciva wrote:
| There's an even better reason for databases to not write to
| memory mapped pages: Pages get synched out to disk at the
| kernel's leisure. This can be ok for a cache but it's
| definitely not what you want for a database!
| [deleted]
| eqvinox wrote:
| That's what msync() is for.
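|
| For what it's worth, a minimal sketch of flushing a range of a
| MAP_SHARED mapping (msync needs a page-aligned start address):
|
|   #include <sys/mman.h>
|   #include <unistd.h>
|
|   int flush_range(char *map, size_t off, size_t len)
|   {
|       size_t pg = (size_t)sysconf(_SC_PAGESIZE);
|       char  *start = map + (off & ~(pg - 1));
|       return msync(start, len + (off & (pg - 1)), MS_SYNC);
|   }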
| monocasa wrote:
| Right, but it can sync arbitrary ranges sooner, which is
| also awful for consistency.
| reader_mode wrote:
| Shouldn't your write strategy be resilient to that kind
| of stuff (e.g. a shutdown during a partial update)?
| gmueckl wrote:
| Don't you need exact guarantees on write ordering to
| achieve that?
| jorangreef wrote:
| Yes, for almost all databases, although there was a cool
| paper from the University of Wisconsin Madison a few
| years ago that showed how to design something that could
| work without write barriers, and under the assumption
| that disks don't always fsync correctly:
|
| "the No-Order File System (NoFS), a simple, lightweight
| file system that employs a novel technique called
| backpointer based consistency to provide crash
| consistency without ordering writes as they go to disk"
|
| http://pages.cs.wisc.edu/~vijayc/nofs.htm
| vlovich123 wrote:
| Does that generalize to databases? My understanding is
| that file systems are a restricted case of databases that
| don't necessarily support all operations (eg transactions
| are smaller, can't do arbitrary queries within a
| transaction, etc etc).
| bonzini wrote:
| You can do write/sync/write/sync in order to achieve
| that. It would be nicer to have FUA support in system
| calls (or you can open the same file to two descriptors,
| one with O_SYNC and one without).
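|
| A sketch of the two-descriptor idea (the file name and offsets
| here are made up, error handling omitted): ordering-critical
| writes go through the O_SYNC descriptor and are durable when
| pwrite returns, bulk writes go through the plain one.
|
|   #include <fcntl.h>
|   #include <unistd.h>
|
|   void commit_example(void)
|   {
|       int fd      = open("journal.db", O_RDWR);
|       int fd_sync = open("journal.db", O_RDWR | O_SYNC);
|       char data[4096] = {0}, commit[512] = {0};
|
|       pwrite(fd, data, sizeof data, 0);     /* buffered write */
|       fsync(fd);                            /* then sync      */
|       pwrite(fd_sync, commit, sizeof commit, 4096);
|
|       close(fd_sync);
|       close(fd);
|   }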
| dooglius wrote:
| I think you mean mlock
| cperciva wrote:
| If you're tracking what needs to be flushed to disk when,
| you might as well just be making explicit pwrite syscalls.
| cma wrote:
| Isn't there a way around this? When coding for graphics stuff
| writing to GPU mapped memory people usually take pains to turn
| off compiler optimizations that might XOR memory against itself
| to zero it out or AND it against 0 and cause a read, and other
| things like that.
|
| https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-...
|
| > Even the following C++ code can read from memory and trigger
| the performance penalty because the code can expand to the
| following x86 assembly code.
|
| C++ code: *((int*)MappedResource.pData) = 0;
|
| x86 assembly code: AND DWORD PTR [EAX],0
|
| > Use the appropriate optimization settings and language
| constructs to help avoid this performance penalty. For example,
| you can avoid the xor optimization by using a volatile pointer
| or by optimizing for code speed instead of code size.
|
| I guess mmapped files may still need a read to know whether to
| do copy-on-write, whereas GPU-mapped memory in that case is
| specifically marked upload-only and flagged so writes go
| through regardless of whether anything changed -- but maybe
| mmap has something similar?
|
| (edit: this seems to say nothing similar is possible with mmap
| on x86 https://stackoverflow.com/questions/31014515/write-only-
| mapp...
|
| but how does it work for GPUs? Something to do with fixed pci-e
| support on the cpu (base address register
| https://en.wikipedia.org/wiki/PCI_configuration_space)?
| ww520 wrote:
| I believe GPU solves this by having read only and write only
| buffers in the rendering pipeline.
| alaties wrote:
| The answer is that it works pretty similarly, but GPUs
| usually do this in specialized hardware whereas mmap'ing of
| files for DMA-style access is implemented mostly in software.
|
| https://insujang.github.io/2017-04-27/gpu-architecture-
| overv... has a pretty good visual of what's doing what for
| GPU DMA. You can imagine much of what happens here is almost
| pure software for mmap'd files.
| monocasa wrote:
| As others have said, you need hardware support to do this
| similarly to how GPUs do it.
|
| That being said, that hardware support exists with NVDIMMs.
| remram wrote:
| You'd need a way to indicate when you start and end
| overwriting the page, and to ensure the page isn't swapped
| out mid-overwrite and then read back in. You'd also pay a
| penalty for zeroing it when it gets mapped before the
| overwrite. The map primitives are just not meant for this.
| rini17 wrote:
| I think on Linux there's madvise syscall with "remove" flag,
| which you can issue on memory pages you intend to completely
| overwrite. I have no idea on performance or other practical
| issues.
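|
| A sketch of that (assuming the mapping is MAP_SHARED and the
| underlying filesystem supports hole punching; I haven't
| measured whether it actually helps):
|
|   #define _DEFAULT_SOURCE
|   #include <sys/mman.h>
|
|   /* addr and len must be page aligned; the range reads back
|      as zeroes afterwards, so only do this before a full
|      rewrite of those pages. */
|   int discard_before_overwrite(void *addr, size_t len)
|   {
|       return madvise(addr, len, MADV_REMOVE);
|   }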
| icedchai wrote:
| Yes, this can definitely be a problem. I worked on a
| transaction processing system that was entirely based on an in-
| house memory mapped database. All reads and writes went through
| mmap. At startup, it read through all X gigabytes of data to
| "ensure" everything was hot, in memory, and also built the in
| memory indexes.
|
| This actually worked fine in production, since the systems were
| properly sized and dedicated to this. On dev systems with low
| memory and often running into swap, you'd run into cases with
| crazy delays... sometimes a second or two for something that
| would normally be a few milliseconds.
| ramoz wrote:
| Perhaps a part 2 could dive a bit deeper into OS caching and
| hardware (SSDs, their interfaces, etc.).
| shoo wrote:
| See also: sublime HQ blog about complexities of shipping a
| desktop application using mmap [1] and corresponding 200+ comment
| HN thread [2]:
|
| > When we implemented the git portion of Sublime Merge, we chose
| to use mmap for reading git object files. This turned out to be
| considerably more difficult than we had first thought. Using mmap
| in desktop applications has some serious caveats [...]
|
| > you can rewrite your code to not use memory mapping. Instead of
| passing around a long lived pointer into a memory mapped file all
| around the codebase, you can use functions such as pread to copy
| only the portions of the file that you require into memory. This
| is less elegant initially than using mmap, but it avoids all the
| problems you're otherwise going to have.
|
| > Through some quick benchmarks for the way Sublime Merge reads
| git object files, pread was around 2/3 as fast as mmap on
| linux. In hindsight it's difficult to justify using mmap over
| pread, but now the beast has been tamed and there's little reason
| to change any more.
|
| [1] https://www.sublimetext.com/blog/articles/use-mmap-with-care
| [2] https://news.ycombinator.com/item?id=19805675
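|
| Roughly the pattern the post describes, in C (error handling
| mostly omitted): copy just the slice you need instead of
| keeping a long-lived pointer into a mapping.
|
|   #include <fcntl.h>
|   #include <unistd.h>
|
|   ssize_t read_slice(const char *path, off_t off,
|                      void *buf, size_t len)
|   {
|       int fd = open(path, O_RDONLY);
|       if (fd < 0)
|           return -1;
|       /* short reads are possible; loop in real code */
|       ssize_t n = pread(fd, buf, len, off);
|       close(fd);
|       return n;
|   }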
| minitoar wrote:
| Interana mmaps the heck out of stuff. I've found that relying on
| the file cache works great. Though our access patterns are
| admittedly pretty simple.
| 29athrowaway wrote:
| malloc is implemented using mmap.
|
| You map memory manually when you need very low level control over
| memory.
| jeffbee wrote:
| `malloc` is not one thing. Some mallocs use mmap and others use
| brk. Some implementations use both.
| kevin_thibedeau wrote:
| Some use neither.
| PaulHoule wrote:
| I like mmap and I don't.
|
| It is incompatible with non-blocking I/O since your process will
| be stopped if it tries to access part of the file that is not
| mapped -- this isn't a syscall blocking (which you might work
| around) but rather any attempt to access mapped memory.
|
| I like mmap for tasks like seeking into ZIP files, where you can
| look at the back 1% of the file, then locate and extract one of
| the subfiles; the trouble there is that the really fun case is to
| do this over the network with http (say to solve Python
| dependencies, to extract the metadata from wheel files) in which
| case this method doesn't work.
| Sesse__ wrote:
| mmap is great for rapid prototyping. For anything I/O-heavy,
| it's a mess. You have zero control over how large your I/Os are
| (you're very much at the mercy of heuristics that are optimized
| for loading executables), readahead is spotty at best
| (practical madvise implementation is a mess), async I/O doesn't
| exist, you can't interleave compression in the page cache,
| there's no way of handling errors (I/O error = SIGBUS/SIGSEGV),
| and write ordering is largely inaccessible. Also, you get
| issues such as page table overhead for very large files, and
| address space limitations for 32-bit systems.
|
| In short, it's a solution that looks so enticing at first, but
| rapidly costs much more than it's worth. As systems grow more
| complex, they almost inevitably have to throw out mmap.
| rapsey wrote:
| Process will be stopped or thread?
| ithkuil wrote:
| Thread
| codetrotter wrote:
| > the trouble there is that the really fun case is to do this
| over the network with http (say to solve Python dependencies,
| to extract the metadata from wheel files) in which case this
| method doesn't work
|
| If the web server can tell you the total size of the file by
| responding to a HEAD request, and it supports range requests,
| then it would be possible.
|
| https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...
|
| Or am I missing something?
| johndough wrote:
| You are correct, this works. There even is a file system
| built around this idea: https://github.com/fangfufu/httpdirfs
| remram wrote:
| You can't do this with mmap though, you can't instruct _the
| OS_ to grab pages via HTTP range requests.
| kccqzy wrote:
| Write a FUSE layer.
| amelius wrote:
| > It is incompatible with non-blocking I/O since your process
| will be stopped if it tries to access part of the file that is
| not mapped
|
| Yeah, but the same problem occurs in normal memory when the OS
| has swapped out the page.
|
| So perhaps non-blocking I/O (and cooperative multitasking) is
| the problem here.
| loeg wrote:
| > Yeah, but the same problem occurs in normal memory when the
| OS has swapped out the page.
|
| I'd argue that swapping is an orthogonal problem which can be
| solved in a number of ways: disable swap at the OS level,
| mlock() in the application, maybe others.
|
| mmap is really a bad API for IO -- it hides synchronous IO
| and doesn't produce useful error statuses at access.
|
| > So perhaps non-blocking I/O (and cooperative multitasking)
| is the problem here.
|
| I'm not sure how non-blocking IO is "the problem." It's
| something Windows has had forever, and unix-y platforms have
| wanted for quite a long time. (Long history of poll, epoll,
| kqueue, aio, and now io_uring.)
| amelius wrote:
| > it hides synchronous IO and doesn't produce useful error
| statuses at access.
|
| You can trap IO errors if necessary. E.g. you can raise
| signals just like segfaults generate signals.
|
| > I'm not sure how non-blocking IO is "the problem."
|
| The point is that non-blocking IO wants to abstract away
| the hardware, but the abstraction is leaky. Most programs
| which use non-blocking IO actualy want to implement
| multitasking without relying threads. But that turns out to
| be the wrong approach.
| loeg wrote:
| > The point is that non-blocking IO wants to abstract
| away the hardware, but the abstraction is leaky.
|
| Why do you say it doesn't match hardware? Basically all
| hardware is asynchronous -- submit a request, get a
| completion interrupt, completion context has some success
| or failure status. Non-blocking IO is fundamentally a
| good fit for hardware. It's blocking IO that is a poor
| abstraction for hardware.
|
| > Most programs which use non-blocking IO actualy want to
| implement multitasking without relying threads. But that
| turns out to be the wrong approach.
|
| Why is that the wrong approach? Approximately every high-
| performance httpd for the last decade or two has used a
| multitasking, non-blocking network IO model rather than
| thread-per-request. The overhead of threads is just very
| high. They would like to use the same model for non-
| network IO, but Unix and unix-alikes have historically
| not exposed non-blocking disk IO to applications.
| io_uring is a step towards a unified non-blocking IO
| interface for applications, and also very similar to how
| the operating system interacts with most high-performance
| devices (i.e., a bunch of queues).
| amelius wrote:
| > Why do you say it doesn't match hardware?
|
| Because the CPU itself can block. In this case on memory
| access. Most (all?) async software assumes the CPU can't
| block. A modern CPU has a pipelining mechanism, where
| parts can simply block, waiting for e.g. memory to
| return. If you want to handle this all nicely, you have
| to respect the api of this process which happens to go
| through the OS. So for example, while waiting for your
| memory page to be loaded, the OS can run another thread
| (which it can't in the async case because there isn't any
| other thread).
| quotemstr wrote:
| You use mmap whether you want to or not: the system executes
| your program by mmapping your executable and jumping into it!
| You can always take a hard fault at any time because the kernel
| is allowed to evict your code pages on demand even if you
| studiously avoid mmap for your data files. And it can do this
| eviction even if you have swap turned off.
|
| If you want to guarantee that your program doesn't block, you
| need to use mlockall.
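|
| For reference, that call looks like this (it typically needs a
| raised RLIMIT_MEMLOCK or CAP_IPC_LOCK, and it pins everything,
| so use with care):
|
|   #include <sys/mman.h>
|
|   int pin_all_memory(void)
|   {
|       /* lock current mappings and anything mapped later */
|       return mlockall(MCL_CURRENT | MCL_FUTURE);
|   }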
| geofft wrote:
| This is technically true, but the use case we're talking
| about is programs that are much smaller than their data.
| Postgres, for instance, is under 50 MB, but is often used to
| handle databases in the gigabytes or terabytes range. You
| can mlockall() the binary if you want, but you probably can't
| actually fit the entire database into RAM even if you wanted
| to.
|
| Also, when processing a large data file (say you're walking a
| B-tree or even just doing a search on an unindexed field),
| the code you're running tends to be a small loop, within the
| same few pages, so it might not even leave the CPU's cache,
| let alone get swapped out of RAM, but you need to access a
| very large amount of data, so it's much more likely the data
| you want could be swapped out. If you know some things about
| the data structure (e.g., there's an index or lookup table
| somewhere you care about, but you're traversing each node
| once), you can use that to optimize which things are flushed
| from your cache and which aren't.
| jorangreef wrote:
| But that's a different order of magnitude problem: control
| plane vs data plane.
|
| At some point, we could also say that the line fill buffer
| blocks our programs (more often than we realize).
|
| All of this is accurate, but at different scales.
| PaulHoule wrote:
| Also many systems in 2021 have a lot of RAM and hardly ever
| swap.
| loeg wrote:
| You're not wrong. Applications and libraries that want to be
| non-blocking should mlock their pages and avoid mmap for
| further data access. ntpd does this, for example.
|
| After application startup, you _can_ avoid _additional_ mmap.
| amelius wrote:
| This is one area where Rust, a modern systems language, has
| disappointed me. You can't allocate data structures inside
| mmap'ed areas, and expect them to work when you load them again
| (i.e., the mmap'ed area's base address might have changed). I
| hope that future languages take this usecase into account.
| simias wrote:
| I'm not sure I see the issue. This approach (putting raw binary
| data into files) is filled with footguns. What if you add,
| remove or reorder fields? What if your file was externally
| modified and now doesn't match the expected layout? What if the
| data contains things like file descriptors or pointers that
| can't meaningfully be mapped that way? Even changing the
| compilation flags can produce binary incompatibilities.
|
| I'm not saying that it's not sometimes very useful but it's
| tricky and low level enough that some unsafe low level plumbing
| is, I think, warranted. You have to know what you're doing if
| you decide to go down that route, otherwise you're much better
| off using something like Serde to explicitly handle
| serialization. There's some overhead of course, but 99% of the
| time it's the right thing to do.
| amelius wrote:
| The footguns can be solved in part by the type-system
| (preventing certain types from being stored), and (if
| necessary) by cooperation with the OS (e.g. to guarantee that
| a file is not modified between runs).
|
| How else would you lazy-load a database of (say) 32GB into
| memory, almost instantly?
|
| And why require everybody to write serialization code when
| just allocating the data inside a mmap'ed file is so much
| easier? We should be focusing on new problems rather than
| reinventing the wheel all the time. Persistence has been an
| issue in computing since the start, and it's about time we
| put it behind us.
| simias wrote:
| >How else would you lazy-load a database of (say) 32GB into
| memory, almost instantly?
|
| By using an existing database engine that will do it for
| me. If you need to deal with that amount of data and
| performance is really important you have a lot more to
| worry about than having to use unsafe blocks to map your
| data structures.
|
| Maybe we just have different experiences and work on
| different types of projects but I feel like being able to
| seamlessly dump and restore binary data transparently is
| both very difficult to implement reliably and quite niche.
|
| Note in particular that machine representation is not
| necessarily the most optimal way to store data. For
| instance any kind of Vec or String in rust will use 3 usize
| to store length, capacity and the data pointer which on 64
| bit architectures is 24 bytes. If you store many small
| strings and vectors it adds up to a huge amount of waste.
| Enum variants are also 64 bits on 64 bit architectures if I
| recall correctly.
|
| For instance I use bincode with serde to serialize data
| between instances of my application, bincode maps almost
| 1:1 the objects with their binary representation. I noticed
| that by implementing a trivial RLE encoding scheme on top
| of bincode for running zeroes I can divide the average
| message size by a factor of 2 to 3. And bincode only encodes
| length, not capacity.
|
| My point being that I'm not sure that 32GB of memory-mapped
| data would necessarily load faster than <16GB of lightly
| serialized data. Of course in some cases it might, but
| that's sort of my point, you really need to know what
| you're doing if you decide to do this.
| burntsushi wrote:
| > How else would you lazy-load a database of (say) 32GB
| into memory, almost instantly?
|
| That's what the fst crate[1] does. It's likely working at a
| lower level of abstraction than you intend. But the point
| is that it works, is portable and doesn't require any
| cooperation from the OS other than the ability to memory
| map files. My imdb-rename tool[2] uses this technique to
| build an on-disk database for instantaneous searching. And
| then there is the regex-automata crate[3] that permits
| deserializing a regex instantaneously from any kind of
| slice of bytes.[4]
|
| I think you should maybe provide some examples of what
| you're suggesting to make it more concrete.
|
| [1] - https://crates.io/crates/fst
|
| [2] - https://github.com/BurntSushi/imdb-rename
|
| [3] - https://crates.io/crates/regex-automata
|
| [4] - https://docs.rs/regex-
| automata/0.1.9/regex_automata/#example...
| geofft wrote:
| I had a use case recently for serializing C data structures
| in Rust (i.e., being compatible with an existing protocol
| defined as "compile this C header, and send the structs down
| a UNIX socket"), and I was a little surprised that the
| straightforward way to do it is to unsafely cast a #[repr(C)]
| structure to a byte-slice, and there isn't a Serde serializer
| for C layouts. (Which would even let you serialize C layouts
| for a different platform!)
|
| I think you could also do something Serde-ish that handles
| the original use case where you can derive something on a
| structure as long as it contains only plain data types (no
| pointers) or nested such structures. Then it would be safe to
| "serialize" and "deserialize" the structure by just
| translating it into memory (via either mmap or direct
| reads/writes), without going through a copy step.
|
| The other complication here is multiple readers - you might
| want your accessor functions to be atomic operations, and you
| might want to figure out some way for multiple processes
| accessing the same file to coordinate ordering updates.
|
| I kind of wonder what Rust's capnproto and Arrow bindings do,
| now....
| burntsushi wrote:
| It's likely that the "safe transmute" working group[1] will
| help facilitate this sort of thing. They have an RFC[2].
| See also the bytemuck[3] and zerocopy[4] crates which
| predate the RFC, where at least the latter has 'derive'
| functionality.
|
| [1] - https://github.com/rust-lang/project-safe-transmute
|
| [2] - https://github.com/jswrenn/project-safe-
| transmute/blob/rfc/r...
|
| [3] - https://docs.rs/bytemuck/1.5.0/bytemuck/
|
| [4] - https://docs.rs/zerocopy/0.3.0/zerocopy/index.html
| comonoid wrote:
| Yes, you can.
|
| You cannot with standard data structures, but you can with your
| custom ones.
|
| That's all about trade-offs, anyway, there is no magic bullet.
| the8472 wrote:
| Work on custom allocators is underway, some of the std data
| structures already support them on nightly.
|
| https://github.com/rust-lang/wg-allocators/issues/7
| remram wrote:
| What about Rust makes this more difficult than doing the same
| thing in C++?
| quotemstr wrote:
| You can't do that in C++ or any language. You need to do your
| own relocations and remember enough information to do them. You
| can't count on any particular virtual address being available
| on a modern system, not if you want to take advantage of ASLR.
|
| The trouble is that we have to mark relocated pages dirty
| because the kernel isn't smart enough to understand that it can
| demand fault and relocate on its own. Well, either that, or do
| the relocation anew on each access.
| whimsicalism wrote:
| I don't see what the issue in doing this is in C++.
|
| The only thing that'll break will be the pointers and
| references to things outside of the mmap'd area.
| simias wrote:
| By that logic you can do it in unsafe Rust as well then.
| Obviously in safe Rust having potentially dangling
| "pointers and references to things outside of the mmap'd
| area" is a big no-no.
|
| And note that even intra-area pointers would have to be
| offset if the base address changes. Unless you go through
| the trouble of only storing relative offsets to begin with,
| but the performance overhead might be significant.
| Hello71 wrote:
| libsigsegv (slow) or userfaultfd (less slow) can be used for
| this purpose.
| secondcoming wrote:
| It works with C++ if you use boost::interprocess. Its data
| structures use offset_ptr internally rather than assuming
| every pointer is on the heap.
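|
| The same idea in C, as a sketch (types and names made up):
| store offsets from the start of the region instead of raw
| addresses, so the data stays valid if the file is mapped at a
| different base address next time.
|
|   #include <stdint.h>
|
|   struct node {
|       uint64_t next_off;  /* 0 means "null", else offset */
|       int      value;
|   };
|
|   static inline struct node *node_at(void *base, uint64_t off)
|   {
|       return off ? (struct node *)((char *)base + off) : 0;
|   }
|
|   static inline uint64_t off_of(void *base, struct node *n)
|   {
|       return n ? (uint64_t)((char *)n - (char *)base) : 0;
|   }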
| quotemstr wrote:
| Sure. But that counts as "doing your own relocations".
| Unsafe Rust could do the same, yes?
| whimsicalism wrote:
| What is being relocated?
| ithkuil wrote:
| If you use offsets instead of pointers you're doing
| relocations "on the fly"
| secondcoming wrote:
| I don't know enough about Rust to say. If it doesn't have
| the concept of a 'fancy pointer' then I assume no, you'd
| have to essentially reproduce what boost::interprocess
| does.
| amelius wrote:
| That introduces different data-types, rather than using the
| existing ones (instantiated with different pointer-types).
| secondcoming wrote:
| Indeed. I don't know if there's a plan for the standard
| type to move to offset-ptr, or if there's even a
| std::offset_ptr, but it would be great if there was.
|
| For us, some of the 'different data type' pain was
| alleviated with transparent comparators. YMMV.
|
| Edit: It seems C++11 has added some form of support for
| it... 'fancy pointers'
|
| https://en.cppreference.com/w/cpp/named_req/Allocator#Fan
| cy_...
| jnwatson wrote:
| There's no placement new in Rust? That's disappointing.
| steveklabnik wrote:
| Not in stable yet, no. It's desired, but has taken a while to
| design, as there have been higher priority things for a
| while. We'll get there!
| turminal wrote:
| This is impossible without significant performance impact. No
| language can change that.
|
| Edit: except theoretically for data structures that have
| certain characteristics known in advance
| amelius wrote:
| Well, one approach is to parameterize your data-types such
| that they are fast in the usual case, but become perhaps
| slightly slower (but still on par with hand-written code) in
| the more versatile case.
| waynesonfire wrote:
| Thanks for diving into this DB! I find it interesting that many
| databases share such similar architectural principles. NIH. It's
| super fun to build a database so why not.
|
| Also, don't beat yourself up over how deep you'll be diving into
| the design. Why apologize for this? Those that want a deeper
| exposition would quickly move on anyway.
| rossmohax wrote:
| mmap is not as free as people think. The VM subsystem is full
| of inefficient locks. Here is a very good writeup of a problem
| the BBC encountered with Varnish:
| https://www.bbc.co.uk/blogs/internet/entries/17d22fb8-cea2-4...
| jeffbee wrote:
| Apparently in a way that the author of the article, and probably
| the authors of bolt, do not really understand.
| bonzini wrote:
| The right answer is that they shouldn't. A database has much more
| information than the operating system about what, how and when to
| cache information. Therefore the database should handle its own
| I/O caching using O_DIRECT on Linux or the equivalent on Windows
| or other Unixes.
|
| The article at https://www.scylladb.com/2017/10/05/io-access-
| methods-scylla... is a bit old (2017) but it explains the trade-
| offs
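|
| A minimal sketch of the O_DIRECT side on Linux (the 512-byte
| alignment is an assumption; real code should query the device's
| logical block size, and error handling is abbreviated):
|
|   #define _GNU_SOURCE
|   #include <fcntl.h>
|   #include <stdlib.h>
|   #include <unistd.h>
|
|   ssize_t direct_read(const char *path, off_t off, size_t len)
|   {
|       void *buf;
|       /* buffer, offset and length must all be aligned */
|       if (posix_memalign(&buf, 512, len) != 0)
|           return -1;
|       int fd = open(path, O_RDONLY | O_DIRECT);
|       if (fd < 0) { free(buf); return -1; }
|       ssize_t n = pread(fd, buf, len, off); /* bypasses cache */
|       close(fd);
|       free(buf);
|       return n;
|   }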
| quotemstr wrote:
| Yep. Every mature, high performing, non-embedded database
| evolves towards getting the underlying operating system out of
| the way as much as possible.
| natmaka wrote:
| > A database has much more information than the operating
| system about what, how and when to cache information
|
| Yes, on a dedicated server. However many DB engines instances
| run on non-dedicated servers, for example along a web server
| flanked with various processes sometimes reading the local
| filesystem or using RAM (Varnish, memcached...), and often-run
| tasks (tempfiles purge, log aggregation, monitoring probes,
| MTA...). In such a case, letting the DB engine use too much RAM
| limits the size of the OS buffer cache and reduces global
| efficiency; it may (all other things being equal) imply more
| 'read' operations, reducing overall performance.
| sradman wrote:
| Great point. Selecting the RDBMS page cache size is a key
| performance parameter that is near impossible to get right on
| a mixed-use host, both non-dedicated servers and client
| desktop/laptop. SQL Anywhere, which emphasizes zero-admin,
| has long supported _Dynamic Cache Sizing_ [1] specifically
| for this mixed-use case which is /was its bread-and-butter. I
| don't know if any other RDBMSes do the same (MS SQL?).
|
| As a side note, Apache Arrow's main use case is similar, a
| column oriented data store shared by one-or-more client
| processes (Python, R, Julia, Matlab, etc.) on the same
| general purpose host. This is also now a key distinction
| between the Apple M1 and its big.LITTLE ARM SoC vs. Amazon
| Graviton built for server-side virtualized/containerized
| instances. We should not conflate the two use-cases and
| understand that the best solution for one use case may not be
| the best for the other.
|
| [1] http://dcx.sybase.com/1200/en/dbusage/perform-
| bridgehead-405...
| jorangreef wrote:
| Yes, and it's not only about performance, but also safety
| because O_DIRECT is the only safe way to recover from the
| journal after fsync failure (when the page cache can no longer
| be trusted by the database to be coherent with the disk):
| https://www.usenix.org/system/files/atc20-rebello.pdf
|
| From a safety perspective, O_DIRECT is now table stakes.
| There's simply no control over the granularity of read/write
| EIO errors when your syscalls only touch memory and where you
| have no visibility into background flush errors.
| formerly_proven wrote:
| Around four years ago I was working on a transactional data
| store and ran into these issues; virtually no one tells you
| how durable I/O is supposed to work. There were very few
| articles on the internet that went beyond some of the basic
| stuff (e.g. create file => fsync directory) and perhaps one
| article explaining what needs to be considered when using
| sync_file_range. Docs and POSIX were useless. I noticed that
| there seemed to be inherent problems with I/O error handling
| when using the page cache, i.e. whenever something that
| wasn't the app itself caused write I/O you really didn't know
| any more if all the data got there.
|
| Some two years later fsyncgate happened and since then I/O
| error handling on Linux has finally gotten at least some
| attention and people seemed to have woken up to the fact that
| this is a genuinely hard thing to do.
| sradman wrote:
| O_DIRECT prevents file double buffering by the OS and DBMS page
| cache. MMAP removes the need for the DBMS page cache and relies
| on the OS's paging algorithm. The gain is zero memory copy and
| the ability for multiple processes to access the same data
| efficiently.
|
| Apache Arrow takes advantage of mmap to share data across
| different language processes and enables fast startup for short
| lived processes that re-access the same OS cached data.
| geofft wrote:
| Yes, but the claim is that the buffer you should remove is
| the OS's one, not the DBMS's one, because for the DBMS use
| case (one very large file with deep internal structure,
| generally accessed by one long-running process), the DBMS has
| information the OS doesn't.
|
| Arrow is a different use case, for which mmap makes sense.
| For something like a short-lived process that stores config
| or caches in SQLite, it probably is actually closer to Arrow
| than to (e.g.) Postgres, so mmap likely also makes sense for
| that. (Conversely, if you're not relying on Arrow's sharing
| properties and you have a big Python notebook that's doing
| some math on an extremely large data file on disk in a single
| process, you might actually get better results from O_DIRECT
| than mmap.)
|
| In particular, "zero memory copy" only applies if you are
| accessing the same data from multiple processes (either at
| once or sequentially). If you have a single long-running
| database server, you have to copy the data from disk to RAM
| _anyway_. O_DIRECT means there's one copy, from disk to a
| userspace buffer; mmap means there's one copy, from disk to a
| kernel buffer. If you can arrange for a long-lived userspace
| buffer, there's no performance advantage to using the kernel
| buffer.
| sradman wrote:
| > but the claim is that the buffer you should remove is the
| OS's one
|
| I was not trying to minimize O_DIRECT, I was trying to
| emphasize the key advantage succinctly and also explain the
| Apache Arrow use case of mmap which the article does not
| discuss.
| masklinn wrote:
| > Therefore the database should handle its own I/O caching
| using O_DIRECT on Linux or the equivalent on Windows or other
| Unixes.
|
| That's not wrong, but at the same time it adds complexity and
| requires effort which can't be spent elsewhere unless you've
| got someone who really only wants to DIO and wouldn't work on
| anything else anyway.
|
| Postgres has never used DIO, and while there have been rumblings
| about moving to DIO (especially following the fsync mess) as
| Andres Freund noted:
|
| > efficient DIO usage is a metric ton of work, and you need a
| large amount of differing logic for different platforms. It's
| just not realistic to do so for every platform. Postgres is
| developed by a small number of people, isn't VC backed etc. The
| amount of resources we can throw at something is fairly
| limited. I'm hoping to work on adding linux DIO support to pg,
| but I'm sure as hell not going to do be able to do the same on
| windows (solaris, hpux, aix, ...) etc.
| jorangreef wrote:
| I have found that planning for DIO from the start makes for a
| better, simpler design when designing storage systems,
| because it keeps the focus on logical/physical sector
| alignment, latent sector error handling, and caching from the
| beginning. And even better to design data layouts to work
| with block devices.
|
| Retrofitting DIO onto a non-DIO design and doing this cross-
| platform is going to be more work, but I don't think that's
| the fault of DIO (when you're already building a database
| that is).
| jandrewrogers wrote:
| PostgreSQL has two main challenges with direct I/O. The basic
| one is that it adversely impacts portability, as mentioned,
| and is complicated in implementation because file system
| behavior under direct I/O is not always consistent.
|
| The bigger challenge is that PostgreSQL is not architected
| like a database engine designed to use direct I/O
| effectively. Adding even the most rudimentary support will be
| a massive code change and implementation effort, and the end
| result won't be comparable to what you would expect from a
| modern database kernel designed to use direct I/O. This
| raises questions about return on investment.
| api wrote:
| You can also mount a file system in synchronous mode on most
| OSes, which may make sense for a DB storage volume (but not
| other parts of the system).
| jnwatson wrote:
| In theory that's true. In practice, utilizing the highly-
| optimized already-in-kernel-mode page cache can produce
| tremendous performance. LMDB, for example, is screaming fast,
| and doesn't use DIO.
| the8472 wrote:
| There was a patch set (introducing the RWF_UNCACHED flag) to
| get buffered IO with most of the benefits of O_DIRECT and
| without its drawbacks, but it looks like it hasn't landed.
|
| There also are new options to give the kernel better page cache
| hints via the new MADV_COLD or MADV_PAGEOUT flags. These ones
| did land.
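|
| For reference, a sketch of the new hints (Linux 5.4+; the
| constants may need a recent glibc, and addr/len must be page
| aligned):
|
|   #include <sys/mman.h>
|
|   void hint_cold(void *addr, size_t len)
|   {
|       /* mark as a good eviction candidate, keep the contents */
|       madvise(addr, len, MADV_COLD);
|       /* or reclaim right away: */
|       /* madvise(addr, len, MADV_PAGEOUT); */
|   }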
| nullsense wrote:
| I think that, of the major database vendors, only Postgres uses
| mmap; everyone else does their own I/O caching management.
___________________________________________________________________
(page generated 2021-01-23 23:00 UTC)