[HN Gopher] Why mmap is faster than system calls
___________________________________________________________________
Why mmap is faster than system calls
Author : vinnyglennon
Score : 226 points
Date : 2021-01-09 16:53 UTC (6 hours ago)
(HTM) web link (sasha-f.medium.com)
(TXT) w3m dump (sasha-f.medium.com)
| layoutIfNeeded wrote:
| >Further, since it is unsafe to directly dereference user-level
| pointers (what if they are null -- that'll crash the kernel!) the
| data referred to by these pointers must be copied into the
| kernel.
|
| False. If the file was opened with O_DIRECT, then the kernel uses
| the user-space buffer directly.
|
| From man write(2):
|
| O_DIRECT (Since Linux 2.4.10) Try to minimize cache effects of
| the I/O to and from this file. In general this will degrade
| performance, but it is useful in special situations, such as when
| applications do their own caching. File I/O is done directly
| to/from user-space buffers. The O_DIRECT flag on its own makes an
| effort to transfer data synchronously, but does not give the
| guarantees of the O_SYNC flag that data and necessary metadata
| are transferred. To guarantee synchronous I/O, O_SYNC must be
| used in addition to O_DIRECT. See NOTES below for further
| discussion.
| wtallis wrote:
| I don't think O_DIRECT makes any guarantees about zero-copy
| operation. It merely disallows kernel-level caching of that
| data. But the kernel may make a private copy that isn't
| caching.
| layoutIfNeeded wrote:
| Who said it was guaranteed to be zero-copy?
|
| The original article said that the data _must_ be copied
| based on some bogus handwavy argument, and I've pointed out
| that the manpage of write(2) contradicts this when it says
| the following:
|
| >File I/O is done directly to/from user-space buffers.
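|
| For illustration, a rough sketch of an O_DIRECT read (assuming
| Linux; the exact buffer/offset/length alignment requirement
| depends on the filesystem and device, and direct_read is just a
| made-up helper name):
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <stdlib.h>
|     #include <unistd.h>
|
|     /* Read len bytes from the start of path into a freshly
|        allocated, 4096-byte-aligned buffer, bypassing the page
|        cache.  len should also be a multiple of the alignment.
|        Returns bytes read or -1; the caller frees *out. */
|     ssize_t direct_read(const char *path, void **out, size_t len)
|     {
|         void *buf;
|         if (posix_memalign(&buf, 4096, len))
|             return -1;
|         int fd = open(path, O_RDONLY | O_DIRECT);
|         if (fd < 0) { free(buf); return -1; }
|         ssize_t n = pread(fd, buf, len, 0);
|         close(fd);
|         *out = buf;
|         return n;
|     }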
| jstimpfle wrote:
| > Why can't the kernel implementation use AVX? Well, if it did,
| then it would have to save and restore those registers on each
| system call, and that would make domain crossing even more
| expensive. So this was a conscious decision in the Linux kernel.
|
| I don't follow. So a syscall that could profit from AVX can't use
| it because then _all_ syscalls would have to restore AVX
| registers? Why can't the restoring just happen specifically in
| those syscalls that make use of AVX?
| xymostech wrote:
| I think by "each system call" she meant it like "every time it
| calls read()", since it would be read() that was using the AVX
| registers. Since the example program just calls read() over and
| over, this could add a significant amount of overhead.
| PaulDavisThe1st wrote:
| It's not just syscalls. It's every context switch. If the
| process is in the midst of using AVX registers in kernel code,
| but is suddenly descheduled, those registers have to be
| saved/restored. You can't know if the task is using AVX or not,
| so you have to either always save/restore them, or adopt the
| policy that these registers are not saved/restored.
| jabl wrote:
| I vaguely recall that the Linux kernel has used lazy
| save/restore of FP registers since way back when.
| jeffbee wrote:
| You'd have to have a static analysis of which syscalls can
| transitively reach which functions, which is probably not
| possible because linux uses tables of function pointers for
| many purposes. Also if thread 1 enters the kernel, suspends
| waiting for some i/o, and the kernel switches to thread 2, how
| would it know it needed to restore thread 2's registers because
| of AVX activity of thread 1? And if it did how would it have
| known to save them?
| jstimpfle wrote:
| Not a kernel person, but how about a flag for the thread data
| structure?
| jeffbee wrote:
| Yeah actually now that I'm part way through that first cup
| of coffee, the 2nd part of my comment doesn't make sense,
| the kernel already has to do a heavier save of a task's
| register state when it switches tasks.
| CyberRabbi wrote:
| I believe if you turn PTI off the syscall numbers for sequential
| copies would be a lot higher.
| CodesInChaos wrote:
| Memory mapped files are very tricky outside the happy path. In
| particular recovery from errors and concurrent modification
| leading to undefined behaviour. It's a good choice for certain
| use-cases, such as reading assets shipped with the application,
| where no untrusted process can write to the file and errors can
| be assumed to not happen.
|
| For high performance code I'd use io_uring.
| jws wrote:
| Summary: mostly, syscalls and mmap do the same things, just
| substituting a page fault for a syscall to get to kernel mode,
| but... in user space her code is using AVX-optimized memory copy
| instructions which are not accessible in kernel mode, yielding a
| significant speed-up.
|
| Bonus summary: She didn't use the mmapped data in place in order
| to make a more apples-to-apples comparison. If you can use the
| data in place then you will get even better performance.
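|
| For the curious, a minimal sketch of the in-place variant (error
| handling omitted; assumes a regular, non-empty file that fits in
| the address space):
|
|     #include <fcntl.h>
|     #include <stdint.h>
|     #include <sys/mman.h>
|     #include <sys/stat.h>
|     #include <unistd.h>
|
|     uint64_t sum_file(const char *path)
|     {
|         int fd = open(path, O_RDONLY);
|         struct stat st;
|         fstat(fd, &st);
|         /* Map once, then read the bytes directly: no per-chunk
|            read() calls and no copy into a user buffer. */
|         const uint8_t *p = mmap(NULL, st.st_size, PROT_READ,
|                                 MAP_PRIVATE, fd, 0);
|         uint64_t sum = 0;
|         for (off_t i = 0; i < st.st_size; i++)
|             sum += p[i];
|         munmap((void *)p, st.st_size);
|         close(fd);
|         return sum;
|     }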
| spockz wrote:
| Why doesn't the kernel have access to AVX optimised memory copy
| instructions?
| jws wrote:
| The size of the state required to be saved and restored on
| each system call makes it a losing proposition.
| PaulDavisThe1st wrote:
| Each context switch, not syscall.
| topspin wrote:
| The kernel does have access to these instructions. It is a
| deliberate choice by kernel developers not to use them in the
| case discussed here. In other cases the kernel does use such
| instructions.
| anaisbetts wrote:
| *She, not he
| jws wrote:
| Thanks. Curse this language. I just want to refer to people!
| It's simple encapsulation and abstraction. I shouldn't have
| to care about implementation details irrelevant to the
| context.
| damudel wrote:
| Don't worry about it. Some people lose their marbles
| because they think females get erased when male language is
| used. Just erase both genders and you'll be fine. Use
| singular they.
| [deleted]
| ryanianian wrote:
| "They" is an acceptable gender-neutral pronoun.
| FentanylFloyd wrote:
| it's an idiotic newspeakism, lol
|
| it's well enough that 'you' can be both singular and
| plural, we don't need another one
| kortilla wrote:
| Acceptable to some, still not frequent enough though to
| be normalized.
| lolc wrote:
| I don't even notice it anymore.
| itamarst wrote:
| It's been used since the time of Jane Austen (by Jane
| Austen, in fact), it's perfectly normal:
| https://pemberley.com/janeinfo/austheir.html
| [deleted]
| cpach wrote:
| How about "they"...?
| throw_away wrote:
| singular they is the generalization you're looking for
| jws wrote:
| I'm old enough that "they" is not singular, it is a
| grammatical error punishable by red ink and deducted
| points.
| Spivak wrote:
| But the usage of they to refer to a single person is older
| than anyone alive, if that makes you feel better about
| sticking it to your picky grade school teachers.
| jfk13 wrote:
| So is the use of "he" to refer to an individual of
| unspecified gender.
|
| (The OED quotations for sense 2(b) "In anaphoric
| reference to a singular noun or pronoun of undetermined
| gender" go back to at least 1200AD.)
| jfim wrote:
| You can use "the author" or refer to the article or paper.
|
| [name of paper] mentions that X is faster than Y. The
| author suggests the cause of the speed up is Z, while we
| believe it is W.
| usrnm wrote:
| Just a nitpick: "Alexandra" is the female version of the name
| "Alexander" in Russian, so it's a "she", not "he".
| andi999 wrote:
| And 'Sasha' is the nickname for 'Alexander'.... who comes up
| with this? It's like calling 'Richard' 'Dick'.
| tucnak wrote:
| Sasha is universally applied to both males and females,
| although to be fair, in Russian, it's culturally much more
| acceptable to call Alexander Sasha in any context
| whatsoever, whereas Sasha as in the female Alexandra is
| reserved for informal communication.
|
| Disclaimer: I speak Russian.
| LudwigNagasena wrote:
| Sasha is an informal version for both genders, I don't
| think there is any difference.
|
| Source: I am Russian
| FpUser wrote:
| Second that. Was born in Russia as well
| whidden wrote:
| As someone who grew up in a former Russian territory that
| speaks no Russian, even I knew that.
| enedil wrote:
| In Polish, Aleksandra (also female version) is shortened up
| to Ola, good luck with that ;)
| eps wrote:
| Sasha is derived from Alexander via its diminutive, but
| obsolete, form - Aleksashka - shortened to Sashka, further
| simplified to Sasha, as per the established name format of Masha
| (Maria), Dasha (Daria), Pasha (Pavel, Paul), Glasha
| (Glafira), Natasha (Natalia), etc.
| dmytroi wrote:
| Did some research on the topic of high-bandwidth/high-IOPS file
| access (some of my conclusions could be wrong), and as I
| discovered, modern NVMe drives need some queue pressure on
| them to perform at advertised speeds: at the hardware level they
| are essentially a separate CPU running in the background that
| services command queue(s). They also need requests aligned with
| the flash memory hierarchy to perform at advertised speeds. That
| puts a quite finicky limitation on your access patterns: 64-256kB
| aligned blocks, 8+ accesses in parallel. To see this, just run
| CrystalDiskMark with queue depth at 1-2 and/or a small block
| size, like 4kB, and watch your random speed plummet.
|
| So given the limitations on the access pattern, if you just mmap
| your file and memcpy from the pointer, you'll get ~1 access
| request in flight if I understand right. And since the default
| page size is 4kB, that will be a 4kB request size. Your mmap also
| relies on IRQs to get completion notifications (instead of
| polling the device state), somewhat limiting your IOPS. Sure,
| prefetching helps, but it relies on a lot of heuristic machinery
| to guess the access pattern correctly, and it sometimes fails.
|
| As 7+GB/s drives and 10+GbE networks become more and more
| mainstream, the main place where people will notice these
| requirements is file copying: for example, Windows Explorer
| struggles to copy files at 10-25GbE+ rates simply because of how
| its file access architecture is designed. And hopefully then we
| will be better equipped to reason about "mmap" vs "read" (really,
| it should be pread here to avoid the offset semaphore in the
| kernel).
| wtallis wrote:
| Yep, mmap is really bad for performance on modern hardware
| because you can only fault on one page at a time (per thread),
| but SSDs require a high queue depth to deliver the advertised
| throughput. And you can't overcome that limitation by using
| more threads, because then you spend all your time on context
| switches. Hence, io_uring.
| kccqzy wrote:
| Can't you just use MAP_POPULATE which asks the system to
| populate the entire mapped address range, which is kind of
| like page-faulting on every page simultaneously?
| astrange wrote:
| If you're reading sequentially this shouldn't be a problem
| because the VM system can pick up hints, or you can use
| madvise.
|
| If you're reading randomly this is true and you want some
| kind of async I/O or multiple read operation.
|
| mmap is also dangerous because there's no good way to return
| errors if the I/O fails, like if the file is resized or is on
| an external drive.
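|
| Something like this, for instance (Linux-specific flags; both
| are best-effort hints, and map_sequential is just a made-up
| helper name):
|
|     #include <sys/mman.h>
|
|     /* Map a whole file read-only and hint that it will be read
|        sequentially.  MAP_POPULATE asks the kernel to prefault
|        the mapping (causing read-ahead on the file);
|        MADV_SEQUENTIAL widens read-ahead.  Neither is a
|        guarantee. */
|     void *map_sequential(int fd, size_t len)
|     {
|         void *p = mmap(NULL, len, PROT_READ,
|                        MAP_PRIVATE | MAP_POPULATE, fd, 0);
|         if (p != MAP_FAILED)
|             madvise(p, len, MADV_SEQUENTIAL);
|         return p;
|     }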
| jandrewrogers wrote:
| Even if you use madvise() for a large sequential read, the
| kernel will often restrict its behavior to something
| suboptimal with respect to performance on modern hardware.
| im3w1l wrote:
| If I _read_ with a huge block size, say 100MB, will the OS
| request things in a sane way?
| foota wrote:
| Typically reviews of drives publish rates at different queue
| depths, or at least specify the queue depths tested.
| silvestrov wrote:
| Why don't Intel CPUs implement a modern version of the Z80's LDIR
| instruction (a memmove in a single instruction)?
|
| Then the kernel wouldn't have to save any registers. (I'd really
| like if she had documented exactly which CPU/system she used for
| benchmarking).
| beagle3 wrote:
| It's called REP MOVSB (or MOVSW, MOVSD, maybe also MOVSQ?). It
| has existed since the 8086 days; and for reasons I don't know,
| it supposedly works well for big blocks these days (>1K or so)
| but is supposedly slower than register moves for small blocks.
| JoshTriplett wrote:
| > it supposedly works well for big blocks these days (>1K or
| so) but is supposedly slower than register moves for small
| blocks.
|
| On current processors with Fast Short REP MOVSB (FSRM), REP
| MOVSB is the fastest method for all sizes. On processors
| without FSRM, but with ERMS, REP MOVSB is faster for anything
| longer than ~128 bytes.
| beagle3 wrote:
| Thanks! Is there a simple rule of thumb about when one can
| rely on FSRM?
| JoshTriplett wrote:
| You should check the corresponding CPUID bit, but in
| general, Ice Lake and newer.
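|
| For example, with GCC/Clang's <cpuid.h> (bit positions quoted
| from memory - ERMS is leaf 7 EBX bit 9, FSRM is leaf 7 EDX bit
| 4 - so double-check against the SDM; has_erms/has_fsrm are just
| made-up helper names):
|
|     #include <cpuid.h>
|     #include <stdbool.h>
|
|     static bool has_erms(void)
|     {
|         unsigned a, b, c, d;
|         if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
|             return false;
|         return b & (1u << 9);   /* ERMS: enhanced REP MOVSB/STOSB */
|     }
|
|     static bool has_fsrm(void)
|     {
|         unsigned a, b, c, d;
|         if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
|             return false;
|         return d & (1u << 4);   /* FSRM: fast short REP MOVSB */
|     }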
| sedatk wrote:
| LDIR is slower than unrolling multiple LDI instructions by the
| way.
| jeffbee wrote:
| Intel CPUs have REP MOVS, which is basically the same thing.
| aleden wrote:
| The boost.interprocess library provides the capability to keep
| data structures (std::list, std::vector, ...) in shared memory
| (i.e. a memory-mapped file) - "offset pointers" are key to that. I
| can think of no other programming language that can pull this
| off with such grace.
| justin_ wrote:
| I'm not sure the conclusion that vector instructions are
| responsible for the speed-up is correct. Both implementations
| seem to use ERMS (using REP MOVSB instructions)[0]. Looking at
| the profiles, the syscall implementation spends time in the [xfs]
| driver (even in the long test), while the mmap implementation
| does not. It appears the real speed-up is related to how memory-
| mapped pages interact with the buffer cache.
|
| I might be misunderstanding things. What is really going on here?
|
| [0] Lines 56 and 180 here:
| http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...
| petters wrote:
| I thought this strange as well. The author even directly links
| to the source code where REP MOVSB is used.
| pjmlp wrote:
| On the context of Linux.
| [deleted]
| jFriedensreich wrote:
| Made me think about LMMD
| (https://en.m.wikipedia.org/wiki/Lightning_Memory-Mapped_Data...)
| and wonder why mmap didn't seem to have caught on more in
| storage engines
| ricardo81 wrote:
| *LMDB
|
| I use it a bit. The transactional aspect of it requires a bit of
| consideration, but generally the performance is good. I'd
| originally used libJudy in a bunch of places for fast lookups
| but the init time for programs was being slowed by having to
| preload everything. Using an mmap/LMDB is a decent middle
| ground.
| jandrewrogers wrote:
| For storage engines that prioritize performance and
| scalability, mmap() is a poor choice. Not only is it slower and
| less scalable than alternatives but it also has many more edge
| cases and behaviors you have to consider. Compared to a good
| O_DIRECT/io_submit storage engine design, which is a common
| alternative, it isn't particularly close. And now we have
| io_uring as an alternative too.
|
| If your use case is quick-and-dirty happy path code then mmap()
| works fine. In more complex and rigorous environments, like
| database engines, mmap() is not well-behaved.
| utopcell wrote:
| Last year we were migrating part of YouTube's serving to a new
| system and we were observing unexplained high tail latency. It
| was eventually attributed to mlock()ing some mmap()ed files,
| which ended up freezing the whole process for significant amounts
| of time.
|
| Be wary of powerful abstractions.
| AshamedCaptain wrote:
| Claiming "mmap is faster than system calls" is dangerous.
|
| I once worked for a company where they also heard someone say
| "mmap is faster than read/write" and as a consequence rewrote
| their while( read() ) loop into the following monstrosity:
|
| 1. mmap a 4KB chunk of the file
|
| 2. memcpy it into the destination buffer
|
| 3. munmap the 4KB chunk
|
| 4. repeat until eof
|
| This is different from the claim in the article -- the above
| monstrosity is individually mmaping each 4KB block, while I
| presume the article's benchmark is mmaping the entire file in
| memory at once, which makes much more sense.
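|
| Roughly, in C (a sketch of the steps above, not the actual code;
| consume() stands in for whatever the program did with each
| block):
|
|     #include <string.h>
|     #include <sys/mman.h>
|
|     /* One mmap + memcpy + munmap per 4KB block, i.e. three
|        syscalls plus a page fault where a single read() would
|        have done. */
|     static void copy_out(int fd, off_t file_size,
|                          void (*consume)(const char *, size_t))
|     {
|         static char dst[4096];
|         for (off_t off = 0; off < file_size; off += 4096) {
|             void *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE,
|                            fd, off);
|             memcpy(dst, p, 4096);
|             munmap(p, 4096);
|             consume(dst, sizeof dst);
|         }
|     }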
|
| After I claimed the "monstrosity" was absurdly stupid, someone
| pointed to a benchmark they made and found that the "monstrosity"
| version was actually faster. To me, this made no sense. The
| monstrosity has triple the syscall overhead of the read()
| version, requires manipulating page tables for every 4KB block,
| and as a consequence takes several page faults for each 4KB block
| of the file. Yet it was true: their benchmarks showed the
| monstrosity version to be slightly faster.
|
| The idealist in me couldn't stand this and I reverted this
| change, using for my own (unrelated) experiments a binary which
| used the older, classic, read() loop instead of mmap.
|
| Eventually I noticed I was getting results much faster using my
| build on my single-socket Xeon than they were getting on their
| $$$ server farms. Despite what the benchmark said.
|
| Turns out, the "monstrosity" was indeed faster, but if you had
| several of these binaries running concurrently on the same
| machine, they would all slow down each other, as if the kernel
| was having scale issues with multiple processes constantly
| changing their page tables. The thing would slow down to single-
| core levels of performance.
|
| I still have no idea why the benchmark was apparently slightly
| faster, but obviously they were benchmarking it either in
| isolation or on machines where the other processes were running
| read() loops. I
| guess that by wasting more kernel CPU time on yourself you may
| starve other processes in the system leaving more user time for
| yourself. But once every process does it, the net result is still
| significantly lowered performance for everyone.
|
| Just yet another anecdote for the experience bag...
| piyh wrote:
| Out of curiosity, what was the use case where they were trying
| to get these marginal gains out of their program?
| jabl wrote:
| On Linux, mmap_sem contention is a well-known concurrency
| bottleneck; you may have been hitting that. Multiple efforts
| over the years have failed to fix it, IIRC. I guess one day
| they'll find a good solution, but until then, take care.
| beached_whale wrote:
| mmap is a more ergonomic interface than read, too. How often are
| people copying a file to a local buffer, or the whole file a
| chunk at a time, in order to use the file like an array of
| bytes? mmap gives us a range of bytes right from the start. Even
| if not optimal, the simplicity of usage often means less room
| for bugs.
| alexchamberlain wrote:
| The kernel can, in theory, estimate what page you're going to
| load next, so loading 4KB at a time may not have the page
| faults you'd expect.
| searealist wrote:
| Your anecdote doesn't follow your warning.
|
| Using mmap in an unusual way (to read chunks) on presumably
| legacy hardware doesn't generalize to using it in the obvious
| way (mmap entire files or at least larger windows) on modern
| hardware.
| Mathnerd314 wrote:
| I think the story is to always benchmark first, and also to
| make sure your benchmarks reflect real-world use. What's
| dangerous is assuming something is faster without
| benchmarking.
| searealist wrote:
| I think many people reading that anecdote may come away
| with the idea that mmap is bad (and a monstrosity even) and
| read is good rather than your interpretation that you
| should benchmark better.
|
| I dislike this kind of muddying the waters and I hope my
| comment provides another perspective for readers.
| anonunivgrad wrote:
| Best place to start is to have a good mental model of how
| things work and why they would be performant or not for a
| particular use case. Otherwise you're just taking shots in
| the dark.
| AshamedCaptain wrote:
| Indeed, my warning is about being cautious when making
| generalized claims.
| searealist wrote:
| If someone claimed running was faster than walking and then
| I told a story about how I once saw someone running in snow
| shoes on the grass and it was slower than walking then that
| would just be muddying the waters.
| segfaultbuserr wrote:
| I remember a quote, I cannot find the source for now, but
| it basically says "A book can either be completely
| correct, or be readable, but not both."
| pletnes wrote:
| It seems a more apples-to-apples comparison would be to open
| a file, seek(), read() a block, then close() the file. Just
| as bizarre as the repeated mmap, of course.
| segfaultbuserr wrote:
| Regardless of how bizarre it is, I've seen this in real
| code in embedded applications before. It's a workaround for
| buggy serial port drivers (flow control or buffering is
| probably broken): you open the port, read/write a line,
| close it, and open it again...
| craftinator wrote:
| Hah I came here to say pretty much the same thing!
| Recently ran into it and coding that workaround on a
| resource constrained system felt absolutely bonkers.
| stanfordkid wrote:
| Isn't the whole point of mmap to randomly access the data
| needed in memory? Did they think memcpy is a totally free
| operation or something, without any side effects?
| labawi wrote:
| Was it perhaps a multi-threaded task? Because that would almost
| definitely crawl.
|
| In general, unmapping is expensive, much more expensive than
| mapping memory, because you need to do a TLB
| shootdown/flush/whatever to make sure a cached version of the
| old mapping is not used. A read/write does a copy, so there's no
| need to mess with mappings and TLBs, hence it can scale very well.
| CyberDildonics wrote:
| If someone hears "mmap is faster than system calls" and then
| mmaps and munmaps 4KB chunks at a time in a loop, not realizing
| that mmap and munmap are actually system calls and that the
| benefit is not about calling those functions as much as
| possible, there is no saving them.
|
| That's not the fault of a 'dangerous' claim, that's the fault
| of people who go head-first into something without taking 20
| minutes to understand what they are doing or 20 minutes to
| profile after.
| 411111111111111 wrote:
| You'd need significantly more time than 20 minutes to form an
| informed opinion on the topic if you don't already know
| basically everything about it.
|
| The only thing you could do in that timespan is read a
| single summary on the topic and hope that it includes all
| relevant information. Which is unlikely, and the reason why
| things are often mistakenly taken out of context.
|
| And as the original comment mentioned: they _did_ benchmark
| and it showed an improvement. They just didn't stress-test
| it, but that's unlikely to be doable within 20 minutes either.
| CyberDildonics wrote:
| In 20 minutes you can read what mmap does and see that you
| can map a file and copy it like memory.
|
| In another 20 minutes you can compile and run a medium
| sized program.
|
| Neither of those is enough time for someone to go deep into
| something, but you can look up the brand new thing you're
| using and see where your bottlenecks are.
| anonunivgrad wrote:
| Yep, there's no substitute for a qualitative understanding of
| the system.
| tucnak wrote:
| Am I right to assume that Alexandra is well-known in the field?
| I've never heard the name.
| lionsdan wrote:
| https://www.ece.ubc.ca/faculty/alexandra-sasha-fedorova
| stingraycharles wrote:
| Apparently she's a researcher and MongoDB consultant.
|
| https://www.ece.ubc.ca/~sasha/
| [deleted]
| bzb6 wrote:
| It's an ad.
| noncoml wrote:
| The meat of it is:
|
| the functions (for copying data) used for syscall and mmap are
| very different, and not only in the name.
|
| __memmove_avx_unaligned_erms, called in the mmap experiment, is
| implemented using Advanced Vector Extensions (AVX) (here is the
| source code of the functions that it relies on).
|
| The implementation of copy_user_enhanced_fast_string, on the
| other hand, is much more modest. That, in my opinion, is the huge
| reason why mmap is faster. Using wide vector instructions for
| data copying effectively utilizes the memory bandwidth, and
| combined with CPU pre-fetching makes mmap really really fast.
|
| Why can't the kernel implementation use AVX? Well, if it did,
| then it would have to save and restore those registers on each
| system call, and that would make domain crossing even more
| expensive. So this was a conscious decision in the Linux kernel
| fangyrn wrote:
| I'm a bit of an idiot: when I think of AVX I think of something
| that speeds up computation (specifically matrix stuff), not
| memory access. How wrong am I?
| aliceryhl wrote:
| AVX is useful for both.
| jstimpfle wrote:
| It's a set of SIMD (single instruction, multiple data)
| extensions to the amd64 instruction set. They allow you to
| operate on larger chunks of data with a single instruction -
| for example, do 16 integer multiplications in parallel, etc.
| jeffbee wrote:
| Its registers are just larger. The way x86 moves memory is
| through registers, register-to-register or register-to/from-
| memory. The AVX registers move up to 64 bytes in one move. A
| general purpose register moves at most 8 bytes.
| cwzwarich wrote:
| Wouldn't REP MOVSB be as fast as an AVX memcpy for 4 KB sizes
| on recent Intel CPUs?
| adzm wrote:
| It should be, I think, though it's a complicated question
| whose answer depends on so much: CPU architecture and how it
| is used. There is a great discussion on it here, too.
|
| https://stackoverflow.com/questions/43343231/enhanced-rep-
| mo...
| justin_ wrote:
| The glibc implementation[0] uses Enhanced REP MOVSB when the
| array is long enough. It takes a few cycles to start up the
| ERMS feature, so it's only used on longer arrays.
|
| Edit: Wait a minute... if this is true, then how can AVX be
| responsible for the speed up? Is it related to the size of
| the buffers being copied internally?
|
| [0] Line 48 here: http://sourceware.org/git/?p=glibc.git;a=bl
| ob;f=sysdeps/x86_...
| JoshTriplett wrote:
| > The glibc implementation[0] uses Enhanced REP MOVSB when
| the array is long enough. It takes a few cycles to start up
| the ERMS feature, so it's only used on longer arrays.
|
| That isn't true anymore either, on sufficiently recent
| processors with "Fast Short REP MOVSB (FSRM)". If the FSRM
| bit is set (which it is on Ice Lake and newer), you can
| just always use REP MOVSB.
| jabl wrote:
| Still waiting for the "Yes, This Time We Really Mean It
| Fast REP MOVSB" (YTTWRMIFRM) bit.
|
| More seriously, if REP MOVSB can be counted on always
| being the fastest method that's fantastic. One thing that
| easily gets forgotten in microbenchmarking is I$
| pollution by those fancy unrolled SIMD loops with 147
| special cases.
| amluto wrote:
| This is a poor explanation and poor benchmarking. Let's see:
|
| copy_user_enhanced_fast_string uses a CPU feature that
| (supposedly) is very fast. Benchmarking it against AVX could be
| interesting, but it would need actual benchmarking instead of
| handwaving. It's worth noting that using AVX at all carries
| overhead, and it's not always the right choice even if it's
| faster in a tight loop.
|
| Page faults, on x86_64, are much slower than syscalls. KPTI and
| other mitigations erode this difference to some extent. But
| surely the author should have compared the number of page faults
| to the number of syscalls. Perf can do this.
|
| Finally, munmap() is very, very expensive, as is discarding a
| mapped page. This is especially true on x86. Workloads that do a
| lot of munmapping need to be careful, especially in multithreaded
| programs.
| bigdict wrote:
| Hold up. Isn't mmap a system call?
| chrisseaton wrote:
| > Hold up. Isn't mmap a system call?
|
| That's not what they mean. You set up a memory map with the
| mmap system call, yes, but that's not the point.
|
| The point is then that you can read and write mapped files by
| reading and writing memory addresses directly - you do not have
| to use a system call to perform each read and write.
| bzb6 wrote:
| So like DMA?
| ndesaulniers wrote:
| DMA is more like a bulk memory transfer operation usually
| facilitated by specific hardware that generally is
| asynchronous and requires manual synchronization. Usually
| hardware devices perform DMAs of memory regions, like a
| memcpy() but between physical memories.
|
| Memory mappings established via mmap() instead set up the
| kernel to map in pages when faults on accesses occur. In
| this case you're not explicitly calling into the kernel; the
| MMU raises a page fault when you go to read an address
| referring to memory that's not yet paged in, which the
| kernel then handles before restoring control flow to userspace
| without userspace being any wiser (unless userspace is
| keeping track of time). Handling page faults is faster than
| the syscalls involved in read() calls, it would seem.
| _0ffh wrote:
| I think that comparison would be more confusing than
| helpful.
| [deleted]
| jabl wrote:
| As a word of warning, mmap is fine if the semantics match the
| application.
|
| mmap is not a good idea for a general purpose read()/write()
| replacement, e.g. as advocated in the 1994 "alloc stream
| facility" paper by Krieger et al. I worked with an I/O library
| that followed this strategy, and we had no end of trouble figuring
| out how to robustly deal with resizing files, and also how to do
| the windowing in a good way (this was at a time when we needed to
| care about systems with 32-bit pointers, VM space getting tight,
| but still needed to care about files larger than 2 GB). And then
| we needed the traditional read/write fallback path anyway, in
| order to deal with special files like tty's, pipes etc. In the
| end I ripped out the mmap path, and we saw a perf improvement in
| some benchmark by x300.
| searealist wrote:
| What year / hardware / kernel version are you talking about?
| jabl wrote:
| Oh uh, IIRC 2004/2005 or thereabouts. Personally I was using
| PC HW running an up to date Linux distro, as I guess was the
| vast majority of the userbase, but there was a long tail of
| all kinds of weird and wonderful targets where the software
| was deployed.
| [deleted]
| iforgotpassword wrote:
| Also error handling. read and write can return errors, but what
| happens when you write to a mmaped pointer and the underlying
| file system has some issue? Assigning a value to a variable
| cannot return an error.
|
| So you get a fine SIGBUS to your application and it crashes.
| Just the other day I used imagemagick and it always crashed
| with a SIGBUS and just when I started googling the issue I
| remembered mmap, noticed that the partition ran out of space,
| freed up some more and the issue was gone.
|
| So you might want to set up a handler for that signal, but now
| the control flow suddenly jumps to another function if an error
| occurs, and you have to somehow figure out where in your
| program the error occurred and then what? Then you remember
| that longjmp exists and you end up with a steaming pile of
| garbage code.
|
| Only use mmap if you absolutely must. Don't just "mmap all teh
| files" as it's the new cool thing you learned about.
| chrchang523 wrote:
| Yeah, this is the biggest reason I stay the hell away from
| mmap now. Signal handlers are a much worse minefield than
| error handling in any standard file I/O API I've seen.
| klodolph wrote:
| You don't have to longjmp, you can remap the memory and set a
| flag, return from the signal handler, handle the error later,
| if you like.
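|
| A sketch of that approach (Linux; note that mmap isn't on the
| formal list of async-signal-safe functions, even though on Linux
| it is a plain syscall, and the function names here are made up):
|
|     #define _GNU_SOURCE
|     #include <signal.h>
|     #include <stdint.h>
|     #include <sys/mman.h>
|     #include <unistd.h>
|
|     static volatile sig_atomic_t io_error;
|     static long page_size;
|
|     /* On SIGBUS, plug the faulting page with an anonymous zero
|        page and set a flag, so the interrupted load can retry and
|        the caller notices the error afterwards.  Real code should
|        check that si_addr falls inside the mapping it cares
|        about. */
|     static void on_sigbus(int sig, siginfo_t *si, void *ctx)
|     {
|         (void)sig; (void)ctx;
|         void *page = (void *)((uintptr_t)si->si_addr &
|                               ~((uintptr_t)page_size - 1));
|         mmap(page, page_size, PROT_READ,
|              MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
|         io_error = 1;
|     }
|
|     static void install_sigbus_handler(void)
|     {
|         page_size = sysconf(_SC_PAGESIZE);
|         struct sigaction sa = {0};
|         sigemptyset(&sa.sa_mask);
|         sa.sa_sigaction = on_sigbus;
|         sa.sa_flags = SA_SIGINFO;
|         sigaction(SIGBUS, &sa, NULL);
|     }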
| jabl wrote:
| Indeed. The issue with file resizing I mentioned was mostly
| related to error handling (what if another
| process/thread/file descriptor/ truncates the file, etc.).
| But yes, there are of course other errors as well, like the
| fs running out of space you mention.
| justin66 wrote:
| There's nothing wrong with using a read only mmap in
| conjunction with another method for writes.
| iforgotpassword wrote:
| You have exactly the same problem on a read error.
| justin66 wrote:
| Not the problem you described in your second paragraph.
| nlitened wrote:
| Is it still the case in 64-bit systems?
| jabl wrote:
| Except for running out of VM space, all the other issues are
| still there. And even if you have (for the time being)
| practically unlimited VM space, you may still not want to
| mmap a file of unbounded size, since setting up all those
| mappings takes quite a lot of time if you're using the
| default 4 kB page size. So you probably want to do some kind
| of windowing anyway. But then if the access pattern is random
| and the file is large, you have to continually shift the
| window (munmap + mmap) and performance goes down the drain.
| So I don't think going to 64-bit systems tilts the balance in
| favor of mmap.
| pocak wrote:
| Linux allocates page tables lazily, and fills them lazily.
| The only upfront work is to mark the virtual address range
| as valid and associated with the file. I'd expect mapping
| giant files to be fast enough to not need windowing.
| jabl wrote:
| Good point, scratch that part of my answer.
|
| There are still some cases where you'd not want unlimited
| VM mapping, but those are getting a bit esoteric and at
| least the most obvious ones are in the process of getting
| fixed.
| Matthias247 wrote:
| The 4-16kB buffer sizes are all rather tiny and inefficient for
| high throughput use-cases, which makes those results not that
| relevant. Something between 64kB to 1MB seems more applicable.
| aloknnikhil wrote:
| Previous discussion:
| https://news.ycombinator.com/item?id=24842648
| baybal2 wrote:
| There is a third option on the table: using a DMA controller.
|
| People ask: what's the difference between copying memory with the
| CPU and with a DMA controller? The difference is exactly the
| following.
|
| You rarely need it, but in some cases:
|
| 1. You do very long copies and want to keep full CPU
| performance.
|
| 2. You do very long copies and don't want the caches to be
| polluted during them.
|
| 3. You care about power consumption, as a DMA controller may let
| the CPU core enter a low-power mode sooner.
|
| 4. On some CPU architectures, you can gather wildly spread-out
| pages quicker than with the CPU/software.
| beagle3 wrote:
| The DMA needs to cooperate with the MMU which is on the CPU
| these days (and has been for almost 3 decades now). It's a lot
| of work to set up DMA correctly given physical<->logical memory
| mapping - so it's only worth it if you have a really big chunk
| to copy.
| GeorgeTirebiter wrote:
| This is quite interesting. This, to me, seems like a systems
| bug. In the Embedded World, it is exceedingly common to use
| DMA for all high-speed transfers -- it's effectively a
| specialized parallel hardware 'MOV' instruction. Also, I have
| never had an occasion on modern PC hw to need mmap; read()
| lseek() are clean and less complex overall. Maybe I lack
| imagination.
| astrange wrote:
| mmap is being used by libraries underneath you; it's useful for
| files on internal drives that won't be deleted, that you want
| to access randomly, and that you don't want to allocate
| buffers to copy things out of.
|
| For instance, anytime you call a shared library it's been
| mmapped by dyld/ld.so.
| xoo1 wrote:
| It's all very benchmark-chasing and theoretical. In practice
| performance is more complicated, and mmap is this weird corner-
| case, over-engineered, inconsistent optimization that often
| wastes or even "leaks" memory which could be used for caches that
| actually matter for performance; it's also awful at error
| handling, and so on. I literally had to patch LevelDB to disable
| mmap on amd64 once, which eliminated OOMs on those servers,
| allowed me to run way more LevelDB instances, and overall improved
| performance so significantly that I had to write this comment.
| jstimpfle wrote:
| Yup, I don't like using mmap() for the reason alone that it
| means giving up a lot of control.
| BikiniPrince wrote:
| One mechanism we developed was to build a variant of our
| storage node that could run in isolation. This meant that
| synthetic testing would give us some optimal numbers for
| hardware vetting and performance changes.
|
| I proved quite quickly that our application was thread-poor and
| that the cost of fixing it was well worth it, using other
| synthetic benchmarks to compare what the systems were capable
| of.
|
| I was gone before that was finished, but it was quite an
| improvement. It also allowed cold volumes to exist in an over
| subscription model.
|
| None of this is a substitute for good real-world telemetry and
| evaluation of your outliers.
| CalChris wrote:
| Neither _mmap()_ nor _read()/write()_ leak memory.
| jstimpfle wrote:
| But they might "leak" it. What the parent meant is that as an
| mmap() user you have no control over how much of a mapping takes
| actual memory while you're visiting random memory-mapped file
| locations. Is that documented somewhere?
| vlovich123 wrote:
| madvise gives you pretty good control over the paging, no?
| Generally I think you can use MADV_DONTNEED to page out
| content if you need to do it more aggressively. The
| benefit is that the kernel understands this enough that it
| can evict those page buffers, something it can't do when those
| buffers live in user-space.
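|
| Rough sketch (drop_range is a made-up helper; for a read-only,
| file-backed mapping this just discards the cached pages, which
| re-fault from the file if touched again):
|
|     #include <sys/mman.h>
|
|     /* Drop the resident pages backing a slice of a read-only,
|        file-backed mapping once it has been processed. */
|     static int drop_range(char *base, size_t off, size_t len,
|                           size_t pagesz)
|     {
|         size_t start = off & ~(pagesz - 1);  /* round down */
|         return madvise(base + start, len + (off - start),
|                        MADV_DONTNEED);
|     }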
| jandrewrogers wrote:
| No, madvise() does not give good control over paging
| behavior. As the name indicates, it is merely a
| suggestion. The kernel is free to ignore it and
| frequently does. This non-determinism makes it nearly
| useless for many types of page scheduling optimizations.
| For some workloads, the kernel consistently makes poor
| paging choices and there is no way to force it to make
| good paging choices. You have much better visibility into
| the I/O behavior of your application than the kernel
| does.
|
| In my experience, at least for database-y workloads, if
| you care enough about paging behavior to use madvise(),
| you might as well just use any number of O_DIRECT
| alternatives that offer deterministic paging control. It
| is much easier than trying to cajole the kernel into
| doing what you need via madvise().
| rowanG077 wrote:
| That's called a space leak not a memory leak.
| AnotherGoodName wrote:
| The paging system will only page in what's being used right
| now though and paging out has zero cost. Old data will
| naturally be paged out. To put it directly: each mmapped file
| needs only 1 page of physical memory (the area currently being
| read/written). There may be old pages
| left around, since there's no reason for the OS to page
| anything out unless some other application asks for the
| memory. But if one does, the mapping will shrink back to 1 page
| just fine, and there's zero cost to paging out.
|
| I feel mmap gets a bad reputation when people look at
| memory usage tools that look at total virtual memory
| allocated.
|
| I can mmap 100GB of files, use ~0 physical memory, and a
| lot of memory usage tools will report 100GB of memory usage
| of a certain type (virtual memory allocated). You then get
| articles about application X using GBs of memory. Anyone
| trying to correct this is ignored.
|
| Google Chrome is somewhat unfairly hit by this. All those
| articles along the lines of "Why is Chrome using 4GB with
| no tabs after I viewed some large PDFs". The answer is that
| it reserved 4GB of 'addresses' that it has mapped to files.
| If another application wants to use that memory there's
| zero cost to paging out those files from memory. The OS is
| designed to do this and it's what mmap is for.
| labawi wrote:
| > paging out has zero cost
|
| Paging out, as in removing a mapping, can be surprisingly
| costly, because you need to invalidate any cached TLB
| entries, possibly even in other CPUs.
|
| > each mmap file will need 1 page of physical memory
|
| Technically, a lower limit would be about 2 or so usable
| pages, because you can't use more than that
| simultaneously. However unmaps are expensive, so the
| system won't be too eager to page out.
|
| Also, for pages to be accessible, they need to be
| specified in the page table (actually tree, of virtual ->
| physical mappings). A random address may require about
| 1-3 pages for page table aside from the 1 page of actual
| data (but won't need more page tables for the next MB).
|
| > application X using GB of memory
|
| I think there is a difference between reserved, allocated
| and file-backed mmapped memory. Address space, file-
| backed mmapped memory is easily paged-out, not sure what
| different types of reserved addresses/memory are, but
| chrome probably doesn't have lots of mmapped memory that
| can be paged out. If it's modified, then it must be
| swapped, otherwise it's just reserved and possibly
| mapped, but _never_ used memory.
| AnotherGoodName wrote:
| I'd argue the costs with paging out are already accounted
| for by the other process paging in though. The other
| process that paged in and led to the need to page out had
| already led to the need to change the page table and
| flush cache.
| labawi wrote:
| Paging in free memory (adding a mapping) is cheap (no
| need to flush). Removing a mapping is expensive (need to
| flush). Also, processes have their own (mostly)
| independent page tables.
|
| I don't think it would be reasonable accounting, when
| paging-in is cheap, but only if there is no need to page
| out (available free memory). Especially when trying to
| argue that paging out is zero-cost.
| silon42 wrote:
| mmap has less deterministic memory pressure and more complex
| interactions with overcommit (if enabled).
| jeffbee wrote:
| LevelDB is kinda like a single-tablet bigtable, but because of
| that its mmap i/o is not a result of battle hardening in
| production systems. bigtable doesn't use local unix i/o for any
| purpose at all, so I'm not surprised to hear that leveldb's
| local i/o subsystem is half baked.
| btown wrote:
| Curious now - were you running an unconventional workload that
| stressed LevelDB, or do you think some version of this advice
| could be applicable to typical workloads?
| einpoklum wrote:
| The author says that in userspace memcpy, AVX is used, but
|
| > The implementation of copy_user_enhanced_fast_string, on the
| other hand, is much more modest.
|
| Why is that? I mean, if you compiled your kernel for a wide range
| of machines, then fine, but if you compiled targeting your actual
| CPU, why would the kernel functions not use AVX?
| lrossi wrote:
| From the mouth of Linus:
|
| https://marc.info/?l=linux-kernel&m=95496636207616&w=2
|
| It's a bit old, but it should still apply. I remember that in
| general he was annoyed when seeing people recommend mmap instead
| of read/write for basic I/O usecases.
|
| In general, it's almost always better to use the specialized API
| (read, write etc.) instead of reinventing the wheel on your own.
| beagle3 wrote:
| LMDB (and its modern fork MDBX), and kdb+/shakti make
| incredibly good use of mmap - I suspect it is possible to get
| similar performance from read(), but probably at 10x the
| implementation complexity.
| nabla9 wrote:
| Yes. If you do lots of sequential/local reads you can reduce
| the number of context switches dramatically if you do something
| like:
|
|     /* enlarge stdio's buffer to reduce context switches */
|     static const int bsize = 16 * BUFSIZ;
|
|     FILE *fopen_bsize(const char *filename)
|     {
|         FILE *fp = fopen(filename, "r");
|         if (fp)
|             setvbuf(fp, NULL, _IOFBF, bsize);
|         return fp;
|     }
___________________________________________________________________
(page generated 2021-01-09 23:00 UTC)