[HN Gopher] Are you sure you want to use MMAP in your database m...
___________________________________________________________________
Are you sure you want to use MMAP in your database management
system? [pdf]
Author : Aissen
Score : 103 points
Date : 2022-01-14 16:03 UTC (6 hours ago)
(HTM) web link (db.cs.cmu.edu)
(TXT) w3m dump (db.cs.cmu.edu)
| [deleted]
| assface wrote:
| RavenDB's response to this paper:
| https://ayende.com/blog/196161-C/re-are-you-sure-you-want-to...
| 10000truths wrote:
| You don't _want_ the OS to take care of reading from disk and
| page caching/eviction. You want the DB itself to have explicit
| control over that, because the DB has information on access
| patterns and table format that the OS is not aware of. It is
| better equipped than the OS to anticipate what portions of
| tables/indices need to be cached in memory. It is better
| equipped to calculate when/where/what/how much to prefetch from
| disk. It is better equipped to determine when to buffer writes
| and when to flush to disk.
|
| Sure, it might be more work than using mmap. But it's also more
| correct, forces you to handle edge cases, and much more
| amenable to platform-specific improvements a la
| kqueue/io_uring.
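|
| For illustration, a minimal sketch of the explicit-control
| approach (file name, page size, and function name are made up;
| error handling trimmed). O_DIRECT bypasses the kernel page
| cache, so the DB's own buffer pool decides what stays resident:
|
|   /* sketch: explicit read into a DB-managed buffer, bypassing
|      the kernel page cache (Linux; O_DIRECT needs alignment) */
|   #define _GNU_SOURCE
|   #include <fcntl.h>
|   #include <stdlib.h>
|   #include <unistd.h>
|
|   enum { PAGE_SZ = 4096 };
|
|   int read_page(const char *path, off_t page_no, void **out) {
|       int fd = open(path, O_RDONLY | O_DIRECT);
|       if (fd < 0) return -1;
|       void *buf;
|       if (posix_memalign(&buf, PAGE_SZ, PAGE_SZ) != 0) {
|           close(fd);
|           return -1;
|       }
|       /* the buffer pool, not the kernel, chose this page */
|       ssize_t n = pread(fd, buf, PAGE_SZ, page_no * PAGE_SZ);
|       close(fd);
|       if (n != PAGE_SZ) { free(buf); return -1; }
|       *out = buf;
|       return 0;
|   }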
| matheusmoreira wrote:
| > the DB has information on access patterns and table format
| that the OS is not aware of
|
| Aren't system calls such as madvise supposed to allow user
| space to let the kernel know precisely that information?
| jandrewrogers wrote:
| The madvise() functions and similar are a blunt and
| imprecise instrument. The kernel is free to ignore them,
| and frequently does in practice. It also does not prevent
| the kernel from proactively doing things you don't want it
| to do with your buffer pool at the worst possible time.
|
| A user space buffer pool gives you precise and
| deterministic control of many of these behaviors.
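|
| To make that concrete, a sketch of what the hinting looks like
| (hypothetical file; both calls are requests the kernel is free
| to ignore):
|
|   /* sketch: madvise() on a mapped file -- advisory only */
|   #include <fcntl.h>
|   #include <sys/mman.h>
|   #include <unistd.h>
|
|   int hint_example(void) {
|       int fd = open("table.dat", O_RDONLY); /* hypothetical */
|       if (fd < 0) return -1;
|       size_t len = 1 << 20;                 /* assume >= 1 MiB */
|       char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
|       close(fd);
|       if (p == MAP_FAILED) return -1;
|
|       /* "we'll need this soon" -- a request, not a command */
|       madvise(p, len / 2, MADV_WILLNEED);
|       /* "done with this" -- the kernel may drop it now,
|          later, or not at all */
|       madvise(p + len / 2, len / 2, MADV_DONTNEED);
|
|       munmap(p, len);
|       return 0;
|   }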
| ayende wrote:
| That is an interesting statement when discussing what
| "you" want to do.
|
| In this case, there is the question of who "you" is.
|
| For the database in isolation, maybe the OS is not ideal.
|
| For a system where the DB and the app run on the same
| machine? The OS can make sure they stay on friendly terms
| and don't fight.
|
| Same for trying to SSH into a busy server: the OS can
| balance things out.
| masklinn wrote:
| > precisely
|
| Madvise is discussed in the paper, and it notes
| specifically that:
|
| * madvise is not _precise_
|
| * madvise is... advice, which the system is completely
| free to disregard
|
| * madvise is error-prone: providing the wrong hint can have
| dire consequences
| sltkr wrote:
| The counterargument to this is that the kernel can make
| decisions based on nonlocal information about the system.
|
| If your database server is the only process in the system
| that is using significant memory, then sure, you might as
| well manage it yourself. But if there are multiple processes
| competing for memory, the kernel is better equipped to decide
| which processes' pages should be paged out or kept into
| memory.
| electricshampo1 wrote:
| Generally for perf critical use cases you dedicate the
| machine to the database. This simplifies many things
| (avoiding having to reason about sharing, etc.).
| munchler wrote:
| This makes me wonder whether there would be value in an
| OS that is also a DBMS (or vice versa). In other words,
| if the DBMS has total control over the hardware, perhaps
| performance can be maximized without too much additional
| complexity.
| sedachv wrote:
| This is a bad idea from the 1960s: IBM TPF, MUMPS, Pick.
| As soon as the hardware changes it becomes slower and
| more complicated.
| ayende wrote:
| I'm the author (well, one of them) of RavenDB.
|
| You are correct to an extent, but there are a few things to
| note.
|
| * you can design your system so the access pattern that the
| OS is optimized for matches your needs
|
| * you can use madvise() to give some useful hints
|
| * the amount of complexity you don't have to deal with is
| staggering
| tytso wrote:
| OTOH, if you care about that last 5 percent or so of
| performance there is the complexity that what the OS has
| optimized for might be different between different OS's
| (e.g., macOS, Linux, FreeBSD, etc.) and indeed, might
| change between different versions of Linux, or even, in the
| case of buffered writeback, between different _filesystems_
| on the same version of Linux. This is probably historically
| one of the most important reasons why enterprise databases
| like Oracle DB, DB2, etc., have used direct I/O, and not
| buffered I/O or mmap.
|
| Speaking as an OS developer, we're not going to try to
| optimize buffered I/O for a particular database. We'll be
| using benchmarks like compilebench and postmark to optimize
| our I/O, and if your write patterns, or readahead patterns,
| or caching requirements, don't match those workloads,
| well.... sucks to be you.
|
| I'll also point out that those big companies that actually
| pay the salaries of us file system developers (e.g.,
| Oracle, Google, etc.) for the most part use Direct I/O for
| our performance critical workloads. If database companies
| that want to use mmap want to hire file system developers
| and contribute benchmarks and performance patches for ext4,
| xfs, etc., speaking as the ext4 maintainer, I'll welcome
| that, and we do have a weekly video conference where I'd
| love to have your engineers join to discuss your
| contributions. :-)
| ayende wrote:
| The key from my perspective is that I CAN design my
| access patterns to match what you've optimized for.
|
| Another aspect to remember is that mmap being even
| possible for databases as the primary mechanism is quite
| new.
|
| Go back 15 years and you are in 32-bit land. That rules
| out mmap as your approach.
|
| At this point, I might as well skip the OS and go direct
| IO.
|
| As for differing OS behavior, I generally find that they
| all roughly optimize for the same things.
|
| I need the best perf on Linux and Windows. On other
| systems I can get away with just being pretty good.
| tyingq wrote:
| The really key part seems to be this:
|
| _" If you aren't using mmap, on the other hand, you still need
| to handle of all those issues"_
|
| Which seems like a reasonable statement. Is it less work to
| make your own top-to-bottom buffer pool, and would that
| necessarily avoid similar issues? Or is it less work to use
| mmap(), but address the issues?
| bluestreak wrote:
| Questdb's author here. I do share Ayende's sentiment. There
| are things that the OP paper doesn't mention, which can help
| mitigate some of the disadvantages:
|
| - single-threaded calls to 'fallocate' will help avoid
| sparse files and SIGBUS during memory writes (see the
| sketch below)
|
| - over-allocating, caching memory addresses, and minimizing
| OS calls
|
| - transactional safety can be implemented via a shared
| memory model
|
| - hugetlb can minimize TLB shootdowns
|
| I personally do not have any regrets about using mmap,
| because of all the benefits it provides.
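|
| The fallocate point as a sketch (function and sizes are
| illustrative, not QuestDB code):
|
|   /* sketch: reserve blocks before writing through the mapping,
|      so a store can't SIGBUS on a failed delayed allocation */
|   #include <fcntl.h>
|   #include <string.h>
|   #include <sys/mman.h>
|   #include <unistd.h>
|
|   int grow_and_write(int fd, off_t old_size, off_t new_size) {
|       /* a single writer thread reserves real disk blocks */
|       if (posix_fallocate(fd, old_size, new_size - old_size) != 0)
|           return -1;
|       char *p = mmap(NULL, (size_t)new_size,
|                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
|       if (p == MAP_FAILED) return -1;
|       /* safe now: the backing blocks already exist on disk */
|       memset(p + old_size, 0, (size_t)(new_size - old_size));
|       return munmap(p, (size_t)new_size);
|   }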
| AdamProut wrote:
| I suppose. Some problems with mmap() are a bit hard to fix
| from user land though. You will hit contention on locks
| inside the kernel (mmap_sem) if the database does concurrent
| high throughput mmap()/munmap(). I don't follow Linux kernel
| development closely enough to know if this has been improved
| recently, but it was easy to reproduce 4-5 years ago.
| ayende wrote:
| Almost no one is going to have a lot of map calls.
|
| You map the file once, then fault it in.
| tyingq wrote:
| That makes sense. I wasn't going right to the conclusion
| that working around mmap() issues was easier, but it didn't
| seem to be explored much. Is the contention around having
| one file mmap()ed, or is it reduced if you use more files?
| dboreham wrote:
| When I worked on/with BerkeleyDB in the late 90s we came to
| the conclusion that the various OS mmap() implementations had
| been tweaked/fixed to the point where they worked for the
| popular high profile applications (in those days: Oracle). So
| it can appear like everything is fine, but that probably
| means your code behaves the same way as <popular database du
| jour>.
| tytso wrote:
| Um... Oracle (and other enterprise databases like DB2)
| don't use mmap. They use Direct I/O. Oracle does have
| anonymous (non-file-backed) memory which is mmap'ed and
| shared across various Oracle processes, called the Shared
| Global Area (SGA), but it's not used for I/O.
| ayende wrote:
| Yes, isn't that wonderful?
|
| You get to take advantage of literally decades of
| experience
|
| What is more, if you can match the profile of the
| optimization, you can benefit even more
| jandrewrogers wrote:
| Some issues with mmap() can be avoided entirely if you have
| your own buffer pool. Others are easier to handle because
| they are made explicit and more buffer state is exposed to
| the program logic. That's the positive side.
|
| The downside is that writing an excellent buffer pool is not
| trivial, especially if you haven't done it before. There are
| many cross-cutting design concerns that have to be accounted
| for. In my experience, an excellent C++ implementation tends
| to be on the order of 2,000 lines of code -- someone has to
| write that. It also isn't simple code, the logic is
| relatively dense and subtle.
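|
| For a sense of shape, the surface area of such a pool is
| small even though the implementation isn't (the names here
| are illustrative, not from any particular engine):
|
|   /* sketch: a user-space buffer pool interface; the ~2,000
|      dense lines live behind these few calls */
|   #include <stdbool.h>
|   #include <stdint.h>
|
|   typedef uint64_t page_id_t;
|
|   typedef struct frame {
|       page_id_t page;      /* on-disk page held by this frame  */
|       void     *data;      /* in-memory copy                   */
|       int       pin_count; /* > 0 blocks eviction              */
|       bool      dirty;     /* needs write-back before eviction */
|   } frame_t;
|
|   typedef struct buffer_pool buffer_pool_t;
|
|   /* pin a page, reading from disk on a miss -- state that
|      mmap hides (resident? dirty? pinned?) is explicit here */
|   frame_t *bp_pin(buffer_pool_t *bp, page_id_t page);
|   void     bp_unpin(buffer_pool_t *bp, frame_t *f, bool dirty);
|
|   /* deterministic write-back: it completes or returns an
|      error, unlike advisory msync/madvise hints */
|   int      bp_flush(buffer_pool_t *bp, page_id_t page);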
| tptacek wrote:
| From that article: the whole fsyncgate thing seems like a
| pretty strong counterargument to "mmap adds more complexity
| than it removes":
|
| https://danluu.com/fsyncgate/
| ayende wrote:
| That actually doesn't matter. This is orthogonal to mmap.
| ikawe wrote:
| > Off the top of my head, most embedded databases implement a
| single writer model. LMDB, Voron (RavenDB's storage engine),
| LevelDB, Lucene
|
| And let's not forget sqlite!
|
| > There can only be a single writer at a time to an SQLite
| database.
|
| (from https://www.sqlite.org/isolation.html)
| moonbug wrote:
| why settle for errno when you can have a segfault.
| kabdib wrote:
| Yup. Why use the operating system's async I/O system when you
| can simply burn a thread and do blocking I/O? </snark>
|
| Been down that primrose path, have the road rash to prove it.
| mmap() is great until you realize that pretty much all you've
| avoided is some buffer management that you probably need to do
| anyway. The OS just doesn't have the information it needs to do
| a great (or even correct) job of caching database pages.
| wahern wrote:
| > Why use the operating system's async I/O system when you
| can simply burn a thread and do blocking I/O? </snark>
|
| mmap isn't non-blocking; page faults are blocking, no
| different from a read or write to a (non-direct I/O) file
| using a syscall.
|
| Until recently io_uring literally burned a thread (from a
| thread pool) for every read or write regular file operation,
| too. Though now it finally has hooks into the buffer cache so
| it can opportunistically perform the operation from the same
| thread that dequeued the command, pushing it to a worker
| thread if it would need to wait for a cache fault.[1]
|
| [1] Technically the same behavior could be implemented in
| user space using userfaultfd, but the latency would likely be
| higher on faults.
| randbox wrote:
| A user process doesn't have the information it needs to do a
| good job of coordinating updates from multiple writers to
| database pages and indices. With MMAP, writers have access to
| shared atomics which they can update using compare-exchange
| operations to prevent data races which would be common when
| using read() and write() without locks.
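|
| A sketch of the pattern (C11 atomics on a shared file-backed
| mapping; the file name is made up and durability across
| crashes is a separate concern):
|
|   /* sketch: cross-process lock-free increment via CAS on a
|      counter stored in a shared mapping */
|   #include <fcntl.h>
|   #include <stdatomic.h>
|   #include <stdint.h>
|   #include <sys/mman.h>
|   #include <unistd.h>
|
|   int shared_increment(const char *path) {
|       int fd = open(path, O_RDWR | O_CREAT, 0644);
|       if (fd < 0) return -1;
|       if (ftruncate(fd, sizeof(_Atomic uint64_t)) != 0) return -1;
|       _Atomic uint64_t *ctr =
|           mmap(NULL, sizeof *ctr, PROT_READ | PROT_WRITE,
|                MAP_SHARED, fd, 0);
|       close(fd);
|       if (ctr == MAP_FAILED) return -1;
|
|       uint64_t old = atomic_load(ctr);
|       /* retry if another process raced us; a failed
|          compare-exchange refreshes 'old' */
|       while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
|           ;
|       return munmap(ctr, sizeof *ctr);
|   }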
| jstimpfle wrote:
| Are you saying that without mmap() there will be data
| races??
| randbox wrote:
| icedchai wrote:
| I worked at a company that developed its own proprietary database
| for a financial application. The entire database, several
| gigabytes, not large by today's standards, was mmap'd and read at
| startup to warm the page cache. We also built in-memory "indexes"
| at startup.
|
| This was back in the early 2000's, when having 4 gigabytes of RAM
| would be considered large. The "database server" was single
| threaded, and all changes were logged to a WAL-ish file before
| updating the mmap'd database. It was fun stuff to work on. It
| worked well, but it wasn't a general purpose DB.
| jasone wrote:
| Check out the creative use of emoji in the running header!
| [deleted]
| jnwatson wrote:
| I have a great deal of experience in running very large memory-
| mapped databases using LMDB.
|
| The default Linux settings dealing with memory mapped files are
| pretty horrible. The observed poor performance is directly
| related to not configuring several very important kernel
| parameters.
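|
| For illustration, these are the kinds of knobs involved. The
| names are real Linux sysctls, but the values below are
| placeholders, not recommendations -- the right settings depend
| on the workload and on how much RAM the box has:
|
|   # illustrative sysctl tuning for a large mmap'd database
|   vm.swappiness = 1                 # resist swapping mapped pages
|   vm.dirty_background_ratio = 5     # start writeback earlier
|   vm.dirty_ratio = 20               # cap dirty pages before stalls
|   vm.dirty_expire_centisecs = 3000  # flush dirty pages sooner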
| rectang wrote:
| One possible advantage of using mmap over a buffer pool can be
| programmer ergonomics.
|
| Reading data into a buffer pool in process RAM takes time to warm
| up, and the pool can only be accessed by a single process. In
| contrast, for an mmap-backed data structure, assuming that files
| are static once written (which can be the case for a multi-
| version concurrency control (MVCC) architecture), you can open a
| read-only mmap connection from any process and, so long as the
| data is already in the OS cache, get fast reads instantly. This
| makes managing database connections much easier, since
| connections are cheap and the programmer can just open as many as
| they want whenever and wherever they want.
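|
| Concretely, a "connection" in this model is little more than
| the following (hypothetical segment file; error handling
| trimmed):
|
|   /* sketch: each process maps an immutable segment read-only;
|      warm reads come straight from the shared OS page cache */
|   #include <fcntl.h>
|   #include <sys/mman.h>
|   #include <sys/stat.h>
|   #include <unistd.h>
|
|   const char *open_segment(const char *path, size_t *len) {
|       int fd = open(path, O_RDONLY);
|       if (fd < 0) return NULL;
|       struct stat st;
|       if (fstat(fd, &st) != 0) { close(fd); return NULL; }
|       const char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
|                            MAP_SHARED, fd, 0);
|       close(fd);             /* the mapping survives the close */
|       if (p == MAP_FAILED) return NULL;
|       *len = (size_t)st.st_size;
|       return p;              /* reading p[i] needs no syscall */
|   }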
|
| It is true that the cache eviction strategy used by the OS is likely
| to be suboptimal. So if you're in a position to only run a single
| database process, you might decide to make different tradeoffs.
| apavlo wrote:
| This link should be to this page that includes more info and the
| accompanying video:
|
| https://db.cs.cmu.edu/mmap-cidr2022/
| dsiegel2275 wrote:
| Thank you for sharing your DB course videos on YouTube.
| I'm a CMU staff member (Open Learning Initiative) who would
| likely never be able to enroll on-site, given my lower priority
| for getting a seat, but watching your videos online has been
| fantastic.
| lmwnshn wrote:
| Since you have a CMU ID, AFAIK you should be able to enroll
| in Piazza in addition to following along with the
| assignments/projects (if you wanted to).
| jandrewrogers wrote:
| The pragmatic consideration that usually influences the decision
| to use mmap() is the large discontinuity in skill and expertise
| required to replace it. Writing your own alternative to mmap()
| can be significantly superior in terms of performance and
| functionality, and often lends itself to a cleaner database
| architecture. However, this presumes a sufficiently sophisticated
| design for an mmap() replacement. The learning curve is steep and
| the critical nuances of sophisticated and practical designs are
| poorly explored in readily available literature, providing little
| in the way of "how-to" guides that you can lean on.
|
| As a consequence, early attempts to replace mmap() are often
| quite poor. You don't know what you don't know, and details of
| the implementation that are often glossed over turn out to be
| critical in practice. For example, most people eventually figure
| out that LRU cache replacement is a bad idea, but many of the
| academic alternatives cause CPU cache thrashing in real systems,
| replacing one problem with another. There are clever and non-
| obvious design elements that can greatly mitigate this but they
| are treated as implementation details in most discussions of
| cache replacement and largely not discoverable if you are writing
| one for the first time.
|
| While mmap() is a mediocre facility for a database, I think we
| also have to be cognizant that replacing it competently is not a
| trivial ask for most software engineers. If their learning curve
| is anything like mine, I went from mmap() to designing obvious
| alternatives with many poorly handled edge cases, and eventually
| figuring out how to design non-obvious alternatives that could
| smoothly handle very diverse workloads. That period of "poor
| alternatives" in the middle doesn't produce great databases but
| it almost feels necessary to properly grok the design problem.
| Most people would rather spend their time working on other parts
| of a database.
| stingraycharles wrote:
| When you say "replacing mmap()", could you elaborate a bit on
| it? The way you write it sounds like you're describing a
| reimplementation of mmap() with the same API, while I believe
| the actual goal would be to completely rewrite the persistence
| and caching layer to be like a "real" database.
| jandrewrogers wrote:
| The implementation is essentially a complete replacement for
| the kernel page cache and I/O scheduler, much of the behavior
| of which is hidden behind the mmap() functions. It is never a
| drop-in replacement and you wouldn't want it to be but it is
| functionally quite similar.
|
| For example, while the storage will usually have a linear
| address space, the "pointer" to that address space won't be a
| literal pointer even though it may behave much like one.
| There may be stricter invariants around write back and page
| fault behavior, and madvise()-like calls have deterministic
| effects. You often have cheap visibility into details of
| page/buffer state that you don't get with mmap() that can be
| used in program logic. And so on. Different but similar.
| lmilcin wrote:
| The task is deceptively simple.
|
| You have a file and a bunch of memory and you need to make
| sure data is being moved from file to memory when needed and
| from memory to file when needed.
|
| mmap() is one algorithm to do it, and the idea is that it is
| not necessarily the best one.
|
| Knowing more about your data and application and needs should
| theoretically enable you to design an algorithm that will be
| more efficient at moving data back and forth.
| dicroce wrote:
| I have written a couple of mmap() based time series databases.
| In my case, these were databases for holding video. For my
| uses, mmap() has been great. I strongly agree with your
| comment. Maybe mmap() isn't the greatest, but it has worked for
| me.
| tmountain wrote:
| The original version of MongoDB used mmap, and I worked at a
| company that had a ton of issues with cache warmup and the
| cache getting trashed by competing processes. Granted this was
| a long time ago, but the main issue was the operating system's
| willingness to reallocate large swaths of memory from the
| address space to whatever process was asking for memory right
| now.
|
| Once the working set got trashed, performance would go through
| the floor, and our app would slow to a crawl while the cache
| went through the warmup cycle.
|
| Long story short, with that model, Mongo couldn't "own" the
| memory it was using, and this led to chronic problems.
| Wiredtiger fixed this completely, but I still think this is a
| cautionary tale for anyone considering building a DB without a
| dedicated memory manager.
| whatshisface wrote:
| One might say that to rely on your operating system is to
| rely on your system operators.
| pengaru wrote:
| Choosing mmap() gets you something that works sooner than later.
|
| But then you have a pile of blocking-style synchronous code
| likely exploiting problematic assumptions to rewrite when you
| realize you want something that doesn't just work, but works
| _well_.
| tptacek wrote:
| Interesting parallels in this work to Tanenbaum's "RPC Considered
| Harmful"+; in both cases, you've got an abstraction that papers
| over a huge amount of complexity, and it ends up burning you
| because a lot of that complexity turns out to be pretty important
| and the abstraction has cost you control over it.
|
| +
| _https://www.cs.vu.nl/~ast/Publications/Papers/euteco-1988.pd..._
| ncmncm wrote:
| Yes.
|
| Whenever you need better performance and reliability, identify
| an abstraction beloved of CS professors, and bypass it.
|
| When I last checked, libtorrent was utterly failing to use
| O_DIRECT semantics. I started making a patch, but there are
| several places that do file ops, and the main one was more
| complicated than I could afford to dive into at the time.
| PaulHoule wrote:
| Most of the times I used mmap I wasn't happy in the end.
|
| I went through a phase when I thought it was fun to do extreme
| random access on image files, archives and things like that. At
| some point I'd think, "I want to do this for a file I fetch over
| the network," and that would need a rewrite.
| randbox wrote:
| How do you implement lockless atomic updates for multiple writers
| across multiple threads & processes without mmap?
|
| With mmap it is straightforward for processes to open persistent
| arrays of atomics as a file, and use compare-and-exchange
| operations to prevent data races when multiple threads or
| processes update the same page without any file locks, advisory
| locks, or mutexes.
|
| With manual read() and write() calls, the data may be overwritten
| by another writer before the update is committed.
| jstimpfle wrote:
| Why do you need lockless atomic updates to a file-backed memory
| area? Genuinely curious.
| pizlonator wrote:
| This is a great write-up!
|
| Makes me wonder if there is an alternative universe in which
| there is a syscall with semantics similar to mmap that avoids
| these pitfalls. It's not like mmap's semantics are the only
| semantics that we could have for memory-mapped IO.
| sprachspiel wrote:
| This would be exactly the kind of innovation we would need in
| computer science. Instead we often get stuck in local minima
| (in this case a 40-year-old POSIX interface) without realizing
| how much pain this causes.
| dboreham wrote:
| Do they mean: mmap() ?
| dang wrote:
| Url changed from https://db.cs.cmu.edu/mmap-cidr2022/, which has
| the abstract and a link to this video:
|
| https://www.youtube.com/watch?v=1BRGU_AS25c
|
| and this code: https://github.com/viktorleis/mmapbench
| gnabgib wrote:
| I personally prefer the abstract to jumping straight into a
| full paper, especially since it's quite rich (not one of those
| two line entries like some arXiv paper abstracts). After
| reading the abstract I did end up opening the PDF.. but I'm
| hesitant to pay the PDF tax early. Is this one of those
| "original source" type decisions?
| dang wrote:
| Yes. I hear you about the downside, but the downside of the
| more superficially accessible 'home page' is that people will
| not read any further, and instead simply respond generically.
___________________________________________________________________
(page generated 2022-01-14 23:00 UTC)