[HN Gopher] Adding 16 kb page size to Android
___________________________________________________________________
Adding 16 kb page size to Android
Author : mikece
Score : 196 points
Date : 2024-08-23 17:14 UTC (5 hours ago)
(HTM) web link (android-developers.googleblog.com)
(TXT) w3m dump (android-developers.googleblog.com)
| monocasa wrote:
| I wonder how much help they had by asahi doing a lot of the
| kernel and ecosystem work anablibg 16k pages.
|
| RISC-V being fixed to 4k pages seems to be a bit of an oversight
| as well.
| ashkankiani wrote:
| It's pretty cool that I can read "anablibg" and know that means
| "enabling." The brain is pretty neat. I wonder if LLMs would
| get it too. They probably would.
| mrob wrote:
| LLMs are at a great disadvantage here because they operate on
| tokens, not letters.
| platelminto wrote:
| I remember reading somewhere that LLMs are actually
| fantastic at reading heavily mistyped sentences! Mistyped
| to a level where humans actually struggle.
|
| (I will update this comment if I find a source)
| thanatropism wrote:
| Tihs probably refers to comon mispelllings an typo's.
| HeatrayEnjoyer wrote:
| It's actually not. You can scramble every letter within
| words and it can mostly unscramble it. Keep the first
| letter and it recovers almost 100%.
| mrbuttons454 wrote:
| Until I read your comment I didn't even notice...
| evilduck wrote:
| Question I wrote:
|
| > I encountered the typo "anablibg" in the sentence "I wonder
| how much help they had by asahi doing a lot of the kernel and
| ecosystem work anablibg 16k pages." What did they actually
| mean?
|
| GPT-4o and Sonnet 3.5 understood it perfectly. This isn't
| really a problem for the large models.
|
| For local small models:
|
| * Gemma2 9b did not get it and thought it meant "analyzing".
|
| * Codestral (22b) did not get it and thought it meant
| "allocating".
|
| * Phi3 Mini failed spectacularly.
|
| * Phi3 14b and Qwen2 did not get it and thought it was
| "annotating".
|
| * Mistral-nemo thought it was a portmanteau "anabling" as a
| combination of "an" and "enabling". Partial credit for being
| close and some creativity?
|
| * Llama3.1 got it perfectly.
| jandrese wrote:
| Seems like there is a bit of a roll of the dice there. The
| ones that got it right may have just been lucky.
| HeatrayEnjoyer wrote:
| Ran it a few times in new sessions, 0 failures so far.
| slaymaker1907 wrote:
| I wonder how much of a test this is for the LLM vs whatever
| tokenizer/preprocessing they're doing.
| Retr0id wrote:
| fwiw I failed to figure it out as a human, I had to check
| the replies.
| Alifatisk wrote:
| Is there any task Gemma is better at compared to others?
| treyd wrote:
| I wonder if they'd do better if there was the context that
| it's in a thread titled "Adding 16 kb page size to
| Android"? The "analyzing" interpretation is plausible if
| you don't know what 16k pages, kernels, Asahi, etc are.
| im3w1l wrote:
| I asked chatgpt and it did get it.
|
| Personally, when I read the comment my brain kinda skipped
| over the word; since it contained the part "lib", I assumed it
| was some obscure library that I didn't care about. It doesn't
| fit grammatically, but I didn't give it enough thought to
| notice.
| IshKebab wrote:
| Probably wouldn't be too hard to add a 16 kB page size
| extension. But I think the Svnapot extension is their solution
| to this problem. If you're not familiar it lets you mark a set
| of pages as being part of a contiguously mapped 64 kB region.
| No idea how the performance characteristics vary. It relieves
| TLB pressure, but you still have to create 16 4kB page table
| entries.
| monocasa wrote:
| Svnapot is a poor solution to the problem.
|
| On one hand it means that each page table entry takes up
| half a cache line for the 16KB case, and two whole cache
| lines in the 64KB case. This really cuts down on the page
| walker hardware's ability to effectively prefetch TLB
| entries, leading to basically the same issues as in this
| classic discussion of why tree-based page tables are
| generally more effective than hash-based page tables
| (shifted forward in time to today's gate counts):
| https://yarchive.net/comp/linux/page_tables.html This is why
| ARM shifted from an Svnapot-like solution to the "translation
| granule queryable and partially selectable at runtime"
| solution.
|
| Another issue is that a big reason to switch to 16KB or
| even 64KB pages is to allow for more address range for VIPT
| caches. You want to allow high performance implementations
| to look up the cache line while performing the TLB lookup
| in parallel, then compare the tag with the result of the
| TLB lookup. This means that practically only the
| untranslated bits of the address can be used by the set
| selection portion of the cache lookup. Twelve untranslated
| bits in an address, combined with 64-byte cachelines, gives
| you 64 sets; multiply that by 8 ways and you get the 32KB
| L1 caches very common in systems with 4KB page sizes
| (sometimes with some heroic effort to throw a ton of
| transistors/power at the problem to make a 64KB cache by
| essentially duplicating large parts of the cache lookup
| hardware for that extra bit of address). What you really
| want is for the arch to be able to disallow 4KB pages, like
| on Apple silicon; that's the main piece that allows their
| giant 128KB and 192KB L1 caches.
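|
| A back-of-the-envelope sketch of that set arithmetic (minimal
| C; it assumes the usual VIPT constraint that the index bits
| must come from the untranslated page offset):
|
|     /* Max VIPT L1 size = sets * ways * line_size, where sets
|      * is capped at page_size / line_size, so the maximum is
|      * simply page_size * ways. */
|     static long max_vipt_l1(long page_size, long line_size,
|                             long ways)
|     {
|         long sets = page_size / line_size; /* offset bits only */
|         return sets * ways * line_size;    /* page_size * ways */
|     }
|
|     /* max_vipt_l1(4096, 64, 8)   ->  32 KB
|      * max_vipt_l1(16384, 64, 8)  -> 128 KB
|      * max_vipt_l1(16384, 64, 12) -> 192 KB, the figures above */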
| aseipp wrote:
| > What you really want is for the arch to be able to
| disallow 4KB pages, like on Apple silicon; that's the main
| piece that allows their giant 128KB and 192KB L1 caches.
|
| Minor nit, but they do allow 4k pages. Linux doesn't support
| 16k and 4k pages at the same time; macOS does, but is just
| very particular about 4k pages being used for scenarios
| like Rosetta processes or virtual machines (e.g. Parallels
| uses it for Windows-on-ARM, I think). Windows will probably
| never support non-4k pages, I'd guess.
|
| But otherwise, you're totally right. I wish RISC-V had gone
| with the configurable granule approach like ARM did. Major
| missed opportunity but maybe a fix will get ratified at
| some point...
| saagarjha wrote:
| Probably very little, since the Android ecosystem is quite
| divorced from the Linux one.
| nabla9 wrote:
| The Android kernel is a mainstream Linux kernel with
| additional drivers and other functionality.
| temac wrote:
| The linux kernel already works perfectly fine with various
| base page sizes.
| twoodfin wrote:
| A little additional background: iOS has used 16KB pages since the
| 64-bit transition, and ARM Macs have inherited that design.
| arghwhat wrote:
| A more relevant bit of background is that 4KB pages lead to
| quite a lot of overhead due to the sheer number of mappings
| needing to be configured and cached. Using larger pages
| reduces overhead, in particular TLB misses, as fewer entries
| are needed to describe the same memory range.
|
| While x86 chips mainly support 4K, 2M and 1G pages, ARM
| chips tend to support more practical 16K page sizes - a nice
| balance between performance and wasting memory due to lower
| allocation granularity.
|
| Nothing in particular to do with Apple and iOS.
| jsheard wrote:
| Makes me wonder how much performance Windows is leaving on
| the table with its primitive support for large pages. It does
| support them, but it doesn't coalesce pages transparently
| like Linux does, and explicitly allocating them requires
| special permissions and is very likely to fail due to
| fragmentation if the system has been running for a while. In
| practice it's scarcely used outside of server software which
| immediately grabs a big chunk of large pages at boot and
| holds onto them forever.
| arghwhat wrote:
| Quite a bit, but 2M is an annoying size and the transparent
| handling is suboptimal. Without userspace cooperating, the
| kernel might end up having to split the pages at random due
| to an unfortunate unaligned munmap/madvise from an
| application not realizing it was being served 2M pages.
|
| Having Intel/AMD add 16-128K page support, or making it
| common for userspace to explicitly ask for 2M pages for
| their heap arenas is likely better than the page merging
| logic. Less fragile.
|
| 1G pages are practically useless outside specialized server
| software as it is very difficult to find 1G contiguous
| memory to back it on a "normal" system that has been
| running for a while.
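|
| A minimal sketch of that "ask explicitly" approach on Linux (a
| 2M-aligned anonymous arena plus MADV_HUGEPAGE; the arena size
| and helper name are invented, and error handling is omitted):
|
|     #include <stddef.h>
|     #include <sys/mman.h>
|
|     #define HUGE_2M (2UL << 20)
|     #define ARENA   (64UL << 20)  /* 64M heap arena */
|
|     void *arena_alloc(void)
|     {
|         /* Over-map by 2M, then round the start up to a 2M
|          * boundary so THP can back the arena without having
|          * to split pages. (Slack is left mapped for brevity.) */
|         char *p = mmap(NULL, ARENA + HUGE_2M,
|                        PROT_READ | PROT_WRITE,
|                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|         if (p == MAP_FAILED)
|             return NULL;
|         char *a = (char *)(((unsigned long)p + HUGE_2M - 1)
|                            & ~(HUGE_2M - 1));
|         madvise(a, ARENA, MADV_HUGEPAGE); /* opt this range in */
|         return a;
|     }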
| jsheard wrote:
| Would a reasonable compromise be to change the base
| allocation granularity to 2MB, and transparently sub-
| allocate those 2MB blocks into 64KB blocks (the current
| Windows allocation granularity) when normal pages are
| requested? That feels like it should keep 2MB page
| fragmentation to a minimum without breaking existing
| software, but given they haven't done it there's probably
| some caveat I'm overlooking.
| lanigone wrote:
| you might find this interesting
|
| https://www.hudsonrivertrading.com/hrtbeat/low-latency-optim...
| bewaretheirs wrote:
| Intel's menu of page sizes is an artifact of its page
| table structure.
|
| On x86 in 64-bit mode, page table entries are 64 bits
| each; the lowest level in the hierarchy (L1) is a 4K page
| containing 512 64-bit PTEs, which in total map 2M of
| memory, which is not coincidentally the large page size.
|
| The L1 page table pages are themselves found via a PTE in
| an L2 page table; one L2 page table page maps 512*2M = 1G
| of virtual address space, which is, again not
| coincidentally, the huge page size.
|
| Large pages are mapped by an L2 PTE (sometimes called a
| PDE, "page directory entry") with a particular bit set
| indicating that the PTE points at the large page rather
| than a PTE page. The hardware page table walker just
| stops at that point.
|
| And huge pages are similarly mapped by an L3 PTE with a
| bit set indicating that the L3 PTE is a huge page.
|
| Shoehorning an intermediate size would complicate page
| table updates or walks or probably both.
|
| Note that an OS can, of its own accord and independent of
| the hardware, maintain allocations at a coarser granularity
| and sometimes get some savings out of this. For one
| historic example, the VAX had a tiny 512-byte page size;
| IIRC, BSD Unix pretended it had a 1K page size and always
| updated PTEs in pairs.
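|
| A quick sketch of that arithmetic (minimal C; 512 entries per
| level, as in x86-64 long mode):
|
|     #include <stdio.h>
|
|     int main(void)
|     {
|         unsigned long long page = 4096;     /* base page, 2^12 */
|         unsigned long long l1 = 512 * page; /* one L1 table    */
|         unsigned long long l2 = 512 * l1;   /* one L2 table    */
|         printf("large page: %llu MB, huge page: %llu GB\n",
|                l1 >> 20, l2 >> 30);         /* 2 MB and 1 GB   */
|         return 0;
|     }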
| andai wrote:
| A lot of low-level stuff is a lot slower on Windows, let
| alone the GUI. There are also entire blogs cataloging an
| abundance of pathological performance issues.
|
| The one I notice the most is the filesystem. Running Linux
| in VirtualBox, I got _7x_ the host speed for many small
| file operations. (On top of that Explorer itself has its
| own random lag.)
|
| I think a better question is how much performance are they
| leaving on the table by bloating the OS so much. Like they
| could have just not touched Explorer for 20 years and it
| would be 10x snappier now.
|
| I think the number is closer to 100x actually. Explorer on
| XP opens (fully rendered) after a single video frame...
| also while running virtualized inside Win10.
|
| Meanwhile Win10 Explorer opens after a noticeable delay,
| and then spends the next several hundred milliseconds
| painting the UI elements one by one...
| saagarjha wrote:
| None of this has to do with page size.
| pantalaimon wrote:
| Death by 1000 cuts
| Const-me wrote:
| > The one I notice the most is the filesystem
|
| I'm not sure it's the file system per se; I believe the
| main reason is the security model.
|
| The NT kernel has rather sophisticated security. The
| securable objects have security descriptors with many
| access control entries and auditing rules, which inherit
| over file system and other hierarchies according to some
| simple rules, e.g. allow+deny=deny. Trustees are members
| of multiple security groups, and security groups can
| include other security groups, so it's not just a list,
| it's a graph.
|
| This makes access checks in NT relatively expensive. The
| kernel needs to perform an access check every time a process
| creates or opens a file; that's why the CreateFile API
| function is relatively slow.
| temac wrote:
| I've been trying to use auditing rules for a usage that
| seems completely in scope and obvious to prioritize from
| a security point of view (tracing access to EFS files
| and/or the keys allowing the access), and my conclusion
| was that you basically can't: the doc is garbage, the
| implementation is probably ad-hoc with lots of holes, and
| MS probably hasn't prioritised the maintenance of this
| feature in decades (too busy adding ads to the start
| menu, I guess).
|
| The NT security descriptors are also so complex that they
| are probably a little useless in practice, because they're
| too hard to use correctly. On top of that, the associated
| Win32 API is also too hard to use correctly, to the point
| that I found an important bug in the usage model
| described in MSDN, meaning that the doc writer did not
| know how the function actually works (in tons of cases
| you probably won't hit this, but if you start digging
| into all internal and external users, who knows what you
| could find...).
|
| NT was full of good ideas, but the execution is often
| quite poor.
| nullindividual wrote:
| > The one I notice the most is the filesystem.
|
| This is due to the extensible file system filter model in
| place; I'm not aware of another OS that implements this
| feature. It's primarily used for antivirus, but can be
| used by any developer for any purpose.
|
| It applies to all file systems on Windows.
|
| DevDrive[0] is Microsoft's current solution to this.
|
| > Meanwhile Win10 Explorer opens after a noticeable delay
|
| This could be, again, largely due to 3rd party hooks (or
| 1st party software that doesn't ship with Windows) into
| Explorer.
|
| [0] https://devblogs.microsoft.com/visualstudio/devdrive/
| andai wrote:
| I'm glad you mentioned that. I noticed when running a
| "Hello world" C program on Windows 10 that Windows
| performs over 100 reads of the Registry before running
| the program. Same thing when I right-click a file...
|
| A few of those are 3rd party, but most are not.
| nullindividual wrote:
| Remember that Win32 process creation is expensive[0]. And
| on NT, processes don't run, _threads do_.
|
| The strategy of applications like olde-tymey Apache
| using multiple processes to handle incoming connections
| is fine on UN*X, but terrible on Windows.
|
| [0] https://fourcore.io/blogs/how-a-windows-process-is-created-p...
| redleader55 wrote:
| > I'm not aware of another OS that implements this
| feature
|
| I'm not sure this is exactly what you mean, but Linux has
| inotify and all sorts of BPF hooks for filtering various
| syscalls, for example file operations.
| rincebrain wrote:
| FSFilters are basically a custom kernel module that can
| and will do anything they want on any filesystem access.
| (There's also network filters, which is how things like
| WinPcap get implemented.)
|
| So yes, you could implement something similar in Linux,
| but there's not, last I looked, a prebuilt toolkit and
| infrastructure for them, just the generic interfaces you
| can use to hook anything.
|
| (Compare the difference between writing a BPF module to
| hook all FS operations, and the limitations of eBPF, to
| having an InterceptFSCalls struct that you define in your
| custom kernel module to run your own arbitrary code on
| every access.)
| hinkley wrote:
| > The one I notice the most is the filesystem. Running
| Linux in VirtualBox, I got 7x the host speed for many
| small file operations. (On top of that Explorer itself
| has its own random lag.)
|
| That's a very old problem. In the early days of
| Subversion, the metadata for every directory existed in
| the directory. The rationale was that you could check out
| just a directory in svn. It was disastrously slow on
| Windows, and the Subversion maintainers had no answer for
| it, except insulting ones like "turn off virus scanning".
| Telling a Windows user to turn off virus scanning is
| equivalent to telling someone to play freeze tag in
| traffic. You might as well just tell them, "go fuck
| yourself with a rusty chainsaw"
|
| Someone reorganized the data so it all happened at the
| root directory and the CLI just searched upward until it
| found the single metadata file. If memory serves, that
| made large checkouts and updates about 2-3 times faster
| on Linux and 20x faster on Windows.
| tedunangst wrote:
| I've lost count of how many blog posts about poor
| performance ended with the punchline "so then we turned off
| page coalescing".
| daghamm wrote:
| IIRC, 64-bit ARM can do 4K, 16K, 64K and 2M pages. But there
| are some special rules for the last one.
|
| https://documentation-service.arm.com/static/64d5f38f4a92140...
| HumblyTossed wrote:
| How is this "additional background"? This was a post by Google
| regarding Android.
| a1o wrote:
| > The very first 16 KB enabled Android system will be made
| available on select devices as a developer option. This is so you
| can use the developer option to test and fix
|
| > once an application is fixed to be page size agnostic, the same
| application binary can run on both 4 KB and 16 KB devices
|
| I am curious about this. When could an app NOT be agnostic to
| this? Like, what must an app be doing for this to be
| noticeable?
| mlmandude wrote:
| If you use mmap/munmap directly within your application you
| could probably get into trouble by hardcoding the page size.
| vardump wrote:
| For example, use mmap and just assume 4 kB pages.
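|
| A minimal sketch of doing it right, querying the size at run
| time instead of baking in 4096:
|
|     #include <stdio.h>
|     #include <sys/mman.h>
|     #include <unistd.h>
|
|     int main(void)
|     {
|         long page = sysconf(_SC_PAGESIZE); /* e.g. 16384 */
|         void *p = mmap(NULL, (size_t)page,
|                        PROT_READ | PROT_WRITE,
|                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|         if (p == MAP_FAILED)
|             return 1;
|         printf("page size %ld, mapped one page at %p\n",
|                page, p);
|         munmap(p, (size_t)page);
|         return 0;
|     }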
| dmytroi wrote:
| Also ELF segment alignment, which defaults to 4k.
| bri3d wrote:
| Only on Android, for what it's worth; most "vanilla" Linux
| aarch64 linkers chose 64K defaults several years ago. But
| yes, most Android applications with native (NDK) binaries
| will need to be rebuilt with the new 16kb max-page-size.
| edflsafoiewq wrote:
| jemalloc bakes in page size assumptions, see eg
| https://github.com/jemalloc/jemalloc/issues/467.
| sweeter wrote:
| Wine doesn't work with a 16 KB page size, among other things.
| mananaysiempre wrote:
| This seems especially peculiar given Windows has a 64K
| mapping granularity.
| tredre3 wrote:
| Windows uses 4KB pages.
| nullindividual wrote:
| 4K, 2M ("large page"), or 1G ("huge page") on x86-64. A
| single allocation request can consist of multiple page
| sizes. From _Windows Internals, 7th Edition, Part 1_:
| On Windows 10 version 1607 x64 and Server 2016 systems,
| large pages may also be mapped with huge pages, which are
| 1 GB in size. This is done automatically if the
| allocation size requested is larger than 1 GB, but it
| does not have to be a multiple of 1 GB. For example, an
| allocation of 1040 MB would result in using one huge page
| (1024 MB) plus 8 "normal" large pages (16 MB divided by 2
| MB).
| mananaysiempre wrote:
| Right (on x86-32 and -64, because you can't have 64KB
| pages there, though larger page sizes do exist and get
| used). You still cannot (e.g.) MapViewOfFile() on an
| address not divisible by 64KB, because Alpha[1]. As far
| as I understand, Windows is mostly why the docs for the
| Blink emulator[2] (a companion project of Cosmopolitan
| libc) tell you any programs under it need to use
| sysconf(_SC_PAGESIZE) [aka getpagesize() aka
| getauxval(AT_PAGESZ)] instead of assuming 4KB.
|
| [1] https://devblogs.microsoft.com/oldnewthing/20031008-00/?p=42...
|
| [2] https://github.com/jart/blink/blob/master/README.md#compilin...
| o11c wrote:
| The fundamental problem is that system headers don't provide
| enough information. In particular, many programs need both "min
| runtime page size" and "max runtime page size" (and by this I
| mean non-huge pages).
|
| If you call `mmap` without constraint, you need to assume the
| result will be aligned to at least "min runtime page size". In
| practice it is _probably_ safe to assume 4K for this for
| "normal" systems, but I've seen it down to 128 bytes on some
| embedded systems, and I don't have much breadth there (this
| will break many programs though, since there are more errno
| values than that). I don't know enough about SPARC binary
| compatibility to know if it's safe to push this up to 8K for
| certain targets.
|
| But if you want to call `mmap` (etc.) with full constraint, you
| must work in terms of "max runtime page size". This is known to
| be up to at least 64K in the wild (aarch64), but some
| architectures have "huge" pages not much beyond that so I'm not
| sure (256K, 512K, and 1M; beyond that is almost certainly going
| to be considered huge pages).
|
| Besides a C macro, these values also need to be baked into the
| object file, and the linker needs to prevent incompatible
| assumptions (just in case a new microarchitecture changes them).
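|
| A sketch of that "max runtime page size" discipline (minimal
| C; the 64K bound and all names here are assumptions, not an
| existing API):
|
|     #include <assert.h>
|     #include <stdint.h>
|     #include <unistd.h>
|
|     /* Conservative compile-time bound. The real page size is
|      * read at startup and must divide it, or the binary's
|      * layout assumptions are wrong. */
|     #define MAX_RT_PAGE 65536UL
|
|     static unsigned long rt_page;
|
|     void page_size_init(void)
|     {
|         long ps = sysconf(_SC_PAGESIZE);
|         assert(ps > 0 && MAX_RT_PAGE % (unsigned long)ps == 0);
|         rt_page = (unsigned long)ps;
|     }
|
|     /* Align fixed-placement mappings to the max, so the same
|      * layout still works under a larger base page size. */
|     static uintptr_t align_up_max(uintptr_t x)
|     {
|         return (x + MAX_RT_PAGE - 1)
|                & ~(uintptr_t)(MAX_RT_PAGE - 1);
|     }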
| dotancohen wrote:
| Yes, but the context here is Java or Kotlin running on
| Android, not embedded C.
|
| Or do some Android applications run embedded C with only a
| Java UI? I'm not an Android dev.
| saagarjha wrote:
| Android apps can call into native code via JNI, which the
| platform supports.
| ignoramous wrote:
| Wonder if Android apps can also be fully native (C++)?
| orf wrote:
| Yes, Android apps can and do have native libraries.
| Sometimes this is part of an SDK, or otherwise out of
| the developer's control.
| warkdarrior wrote:
| Apps written in Flutter/Dart and React Native/Javascript
| both compile to native code with only shims to interface
| with the Java UI framework.
| david_allison wrote:
| The Android Native Development Kit (NDK) allows building
| native code libraries for Android (typically C/C++, but
| this can include Rust). These can then be loaded and
| accessed via JNI on the Java/Kotlin side.
|
| * Brief overview of the NDK:
| https://developer.android.com/ndk/guides
|
| * Guide to supporting 16KB page sizes with the NDK
| https://developer.android.com/guide/practices/page-sizes
| fpoling wrote:
| Chrome browser on Android uses the same code base as Chrome
| on desktop, including the multi-process architecture. But its
| UI is in Java, communicating with C++ using JNI.
| lanigone wrote:
| you can also do 2M and 1G huge pages on x86; it gets kind of
| silly fast.
| ShroudedNight wrote:
| 1G huge pages had (have?) performance benefits on managed
| runtimes for certain scenarios (Both the JIT code cache and
| the GC space saw uplift on the SpecJ benchmarks if I recall
| correctly)
|
| If using relatively large quantities of memory 2M should
| enable much higher TLB hit rates assuming the CPU doesn't
| do something silly like only having 4 slots for pages
| larger than 4k !.!
| ignoramous wrote:
| What? Any pointers on how 1G speeds things up? I'd have
| thought a bigger page size would wreak havoc on process
| scheduling and the filesystem.
| saagarjha wrote:
| Page sizes are often important to code that relies on low-
| level details of the environment it's running in, like
| language runtimes. They might do things like mark some
| sections of code as writable or executable, and thus need
| to know what the granularity of those requests can be. It's
| also of importance to things like allocators that hand out
| memory backed by mmap pages: if they have, say, a bit field
| for each 16-byte region of a page, that bit field will
| change in size in ways they can detect.
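|
| A sketch of that last case (minimal C; the struct and names
| are invented for illustration):
|
|     #include <stdlib.h>
|     #include <unistd.h>
|
|     /* One bit per 16-byte chunk of a page: a 4 KB page needs
|      * 256 bits (32 bytes), a 16 KB page needs 1024 bits (128
|      * bytes). Sizing this at compile time for 4 KB is exactly
|      * the kind of assumption that breaks on a 16 KB device. */
|     struct page_bitmap {
|         unsigned char *bits;
|         size_t nbits;
|     };
|
|     struct page_bitmap bitmap_for_page(void)
|     {
|         struct page_bitmap b;
|         b.nbits = (size_t)sysconf(_SC_PAGESIZE) / 16;
|         b.bits = calloc(b.nbits / 8, 1);
|         return b;
|     }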
| growse wrote:
| If you use a database library that does mmap to create a db
| file with SC_PAGE_SIZE (4KB) pages, and then upgrade your
| device to a 16KB one and backup/restore the app, now your data
| isn't readable.
| nox101 wrote:
| I don't know if this fits but I've seen code that allocated say
| 32 bytes from a function that allocated 1meg under the hood.
| Not knowing that's what was happening the app quickly ran out
| of memory. It arguably was not the app's fault. The API it was
| calling into was poorly designed and poorly named, such that
| the fact that you might need to know the block size to use the
| function was in no way indicated by the name of the function
| nor the names of any of its parameters.
| lostmsu wrote:
| Not entirely related (except the block size), but I am
| considering making and standardizing a system-wide content-based
| cache with default block size 16KB.
|
| The idea is that you'd have a system-wide (or not) service that
| can do two or three things:
|
| - read 16KB block by its SHA256 (also return length that can be
| <16KB), if cached
|
| - write a block to cache
|
| - maybe pin a block (e.g. make it non-evictable)
|
| It would be like a block-level file content dedup + eviction
| to keep the size limited.
|
| Should reduce storage used by various things due to dedup
| functionality, but may require internet for corresponding apps to
| work properly.
|
| With a peer-to-peer sharing system on top, it may
| significantly reduce storage requirements.
|
| The only disadvantage is the same as with shared website caches
| prior to cache isolation introduction: apps can poke what you
| have in your cache and deduce some information about you from it.
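|
| Roughly the interface I have in mind (a C sketch; every name
| here is hypothetical):
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     #define CACHE_BLOCK 16384
|
|     typedef struct { uint8_t bytes[32]; } sha256_t;
|
|     /* Returns the block's actual length (< CACHE_BLOCK for a
|      * short tail block), or -1 if it is not cached. */
|     int cache_read(const sha256_t *key,
|                    uint8_t out[CACHE_BLOCK]);
|
|     /* Stores up to one block; the service hashes the data and
|      * returns the key. */
|     int cache_write(const uint8_t *data, size_t len,
|                     sha256_t *key_out);
|
|     /* Optionally pin a block so eviction skips it. */
|     int cache_pin(const sha256_t *key);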
| monocasa wrote:
| I'd probably pick a size greater than 16KB for that. Windows
| doesn't expose mappings at less than 64KB granularity in its
| version of mmap, and internally its file cache works in
| increments of 256KB. And those were numbers they picked back
| in the 90s.
| treyd wrote:
| I would go for higher than 16K. I believe BitTorrent's default
| minimum chunk size is 64K, for example. It really depends on
| the use case in question though, if you're doing random writes
| then larger chunk sizes quickly waste a ton of bandwidth,
| especially if you're doing recursive rewrites of a tree
| structure.
|
| Would a variable chunk size be acceptable for whatever it is
| you're building?
| devit wrote:
| Seems pretty dubious to do this without adding support for having
| both 4KB and 16KB processes at once to the Linux kernel, since it
| means all old binaries break and emulators which emulate normal
| systems with 4KB pages (Wine, console emulators, etc.) might
| dramatically lose performance if they need to emulate the MMU.
|
| Hopefully they don't actually ship a 16KB default before
| supporting 4KB pages as well in the same kernel.
|
| Also it would probably be reasonable, along with making the Linux
| kernel change, to design CPUs where you can configure a 16KB
| pagetable entry to map at 4KB granularity and pagefault after the
| first 4KB or 8KB (requires 3 extra bits per PTE or 2 if coalesced
| with the invalid bit), so that memory can be saved by allocating
| 4KB/8KB pages when 16KB would have wasted padding.
| username81 wrote:
| Shouldn't there be some kind of setting to change the page size
| per program? AFAIK AMD64 CPUs can do this.
| fouronnes3 wrote:
| Could they upstream that or would that require a fork?
| mgaunard wrote:
| why does it break userland? if you need to know the page size,
| you should query sysconf SC_PAGESIZE.
| akdev1l wrote:
| Assumptions in the software.
|
| Jemalloc is infamous for this:
| https://github.com/sigp/lighthouse/issues/5244
| Dwedit wrote:
| Emulating a processor with 4K pages performs much better if
| you can use real addresses directly.
| fweimer wrote:
| It should not break userland. GNU/Linux (not necessarily
| Android though) has supported 64K pages pretty much from the
| start, because that was the original page size chosen for
| server-focused kernels and distributions. But there are some
| things that need to be worked around.
|
| Certain build processes determine the page size at compile
| time and assume it's the same at run time, and fail if it is
| not: https://github.com/jemalloc/jemalloc/issues/467
|
| Some memory-mapped files formats have assumptions about page
| granularity:
| https://bugzilla.redhat.com/show_bug.cgi?id=1979804
|
| The file format issue applies to ELF as well. Some people
| patch their toolchains (or use suitable linker options,
| e.g. -Wl,-z,max-page-size=4096) to produce slightly smaller
| binaries that can only be loaded if the page size is 4K,
| even though the ABI is pretty clear that you should link
| for compatibility with up to 64K pages.
| ndesaulniers wrote:
| Ossification.
|
| If the page size has been 4k for decades on most OSes and
| architectures, people get sloppy and hard-code that literal
| value rather than query for it.
| phh wrote:
| Google/Android doesn't care much about backward compatibility
| and broke programs released on the Pixel 3 by the Pixel 7
| (the ban on 32-bit-only apps on the Play Store dates to 2019
| and the Pixel 7 is the first 64-bit-only device, while Google
| still released 32-bit-only devices in 2023...). They quite
| regularly break apps in new Android versions (despite their
| infrastructure for handling backward compatibility), and app
| developers are used to bracing themselves around Android &
| Pixel releases.
| reissbaker wrote:
| Generally I've found Google to care much more about not
| breaking old apps compared to Apple, which often expects
| developers to rebuild apps for OS updates or else the apps
| stop working entirely (or buy entirely new machines to get OS
| updates at all, e.g. the Intel/Apple Silicon transition).
| Google isn't on the level of Windows "we will watch for
| specific binaries and re-introduce bugs in the kernel
| specifically for those binaries that they depend on" in terms
| of backwards compatibility, but I wouldn't go so far as to
| say they don't care. I'm not sure whether that's better or
| worse: there's definitely merit to Apple's approach, since it
| keeps them able to iterate quickly on UX and performance by
| dropping support for the old stuff.
| Veserv wrote:
| Having both 4KB and 16KB simultaneously is either easy or hard
| depending on which hardware feature they are using for 16KB
| pages.
|
| If they are using the configurable granule size, then that is a
| system-wide hardware configuration option. You literally can
| not map at smaller granularity while that bit is set.
|
| You might be able to design a CPU that allows your idea of
| partial pages, but there be dragons.
|
| If they are not configuring the granule size, instead opting
| for software enforcement in conjunction with always using the
| contiguous hint bit, then it might be possible.
|
| However, I am pretty sure they are talking about hardware
| granule size, since the contiguous hint is most commonly used
| to support 16 contiguous entries (though the CPU designer is
| technically allowed to do whatever grouping they want), which
| would be 64KB.
| stingraycharles wrote:
| I'm a total idiot; how exactly is page size a CPU issue
| rather than a kernel issue? Is it about memory channel
| protocols / communication?
|
| Disks have been slowly migrating away from the 512-byte
| sector size; is this the same thing going on? That you need
| the actual drive to support it, because of internal
| structuring (i.e. how exactly the CPU aligns things in RAM),
| and on some super low level 4kb / 16kb being the smallest
| unit of memory you can allocate?
|
| And does that then mean that there's less overhead in all
| kinds of memory (pre)fetchers in the CPU, because more can be
| achieved in fewer clock cycles?
| IshKebab wrote:
| The CPU has hardware that does a page table walk
| automatically when you access an address for which the
| translation is not cached in the TLB. Otherwise virtual
| memory would be really slow.
|
| Since the CPU hardware itself is doing the page table walk
| it needs to understand page tables and page table entries
| etc. including how big pages are.
|
| Also you need to know how big pages are for the TLB itself.
|
| The value of 4kB itself is pretty much arbitrary. It has to
| be a small enough number that you don't waste a load of
| memory by mapping memory that isn't used (e.g. if you ask
| for 4.01kB you're actually going to get 8kB), but a large
| enough number that you aren't spending all your time
| managing tiny pages.
|
| That's why increasing the page size makes things faster but
| wastes more memory.
|
| 4kB arguably isn't optimal anymore since we have way more
| memory now than when it was de facto standardised so it
| doesn't matter as much if we waste a bit. Maybe.
| quotemstr wrote:
| As an aside, it's a shame that hardware page table walking
| won out over software-filled TLBs, as some older
| computers had. I wonder what clever and wonderful hacks
| we might have been able to invent had we not needed to
| give the CPU a raw pointer to a data structure whose
| layout is fixed forever.
| IshKebab wrote:
| Yeah maybe, though in practice I think it would be just
| too slow.
| Denvercoder9 wrote:
| Page table layout isn't really fixed forever; x86 has
| changed it multiple times.
| fpoling wrote:
| Samsung SSDs still report to the system that their logical
| sector size is 512 bytes. In fact, one of the recent models
| even removed the option to reconfigure the disk to use 4K
| logical sectors. Presumably, since the physical sector is
| much larger and they need complex mapping of logical sectors
| in any case, Samsung figured they'd skip the 4K option and
| stick with 512 bytes.
| pwg wrote:
| > I'm a total idiot, how exactly is page size a CPU issue
| rather than a kernel issue?
|
| Because the size of a page is a hardware-defined size for
| Intel and ARM CPUs (well, more modern Intel and ARM CPUs
| give the OS a choice of sizes from a small set of options).
|
| It (the page size) is baked into the CPU hardware.
|
| > And does that then mean that there's less overhead in all
| kinds of memory (pre)fetchers in the CPU, because more can
| be achieved in less clock cycles?
|
| For the same size TLB (Translation Look-aside Buffer -- the
| CPU hardware that stores the "referencing info" for the
| currently active set of pages being used by the code
| running on the CPU), a larger page size allows more total
| memory to be accessible before taking a TLB miss and
| having to replace one or more of the entries in the TLB. So
| yes, it means less overhead, because CPU cycles are not
| used up replacing as many TLB entries as often.
| s_tec wrote:
| Each OS process has its own virtual address space, which is
| why one process cannot read another's memory. The CPU
| implements these address spaces in hardware, since
| literally every memory read or write needs to have its
| address translated from virtual to physical.
|
| The CPU's address translation process relies on tables that
| the OS sets up. For instance, one table entry might say
| that the 4K memory chunk with virtual address
| 0x21000-0x21fff maps to physical address 0xf56e3000, and is
| both executable and read-only. So yes, the OS sets up the
| tables, but the hardware implements the protection.
|
| Since memory protection is a hardware feature, the hardware
| needs to decide how fine-grained the pages are. It's
| possible to build a CPU with byte-level protection, but
| this would be crazy-inefficient. Bigger pages mean less
| translation work, but they can also create more wasted
| space. Sizes in the 4K-64K range seem to offer good
| tradeoffs for everyday workloads.
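|
| A tiny sketch of that split (minimal C): only the offset
| bits, 12 for 4 KB pages or 14 for 16 KB, bypass translation,
| which is also the VIPT constraint mentioned upthread.
|
|     #include <inttypes.h>
|     #include <stdint.h>
|     #include <stdio.h>
|
|     static void split(uint64_t va, unsigned page_shift)
|     {
|         uint64_t vpn = va >> page_shift;        /* translated */
|         uint64_t off = va & ((1ULL << page_shift) - 1);
|         printf("va 0x%" PRIx64 " -> vpn 0x%" PRIx64
|                " offset 0x%" PRIx64 "\n", va, vpn, off);
|     }
|
|     int main(void)
|     {
|         split(0x21fff, 12); /* 4 KB:  vpn 0x21, off 0xfff  */
|         split(0x21fff, 14); /* 16 KB: vpn 0x8,  off 0x1fff */
|         return 0;
|     }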
| sweetjuly wrote:
| Hmm, I'm not sure that's quite right. ARMv8 supports per TTBR
| translation granules [1] and so you can have 4K and 16K user
| processes coexisting under an arbitrary page size kernel by
| just context switching TCR.TG0 at the same time as TTBR0.
| There is no such thing as a global granule size.
|
| [1]: https://arm.jonpalmisc.com/2023_09_sysreg/AArch64-tcr_el2#fi...
| lxgr wrote:
| > all old binaries break and emulators which emulate normal
| systems with 4KB pages
|
| Would it actually affect the kind of emulators present on
| Android, i.e. largely software-only ones, as opposed to
| hardware virtualizers making use of a CPU's vTLB?
|
| Wine is famously not an emulator and as such doesn't really
| exist/make sense on (non-x86) Android (as it would only be able
| to execute ARM binaries, not x86 ones).
|
| For the downvote: Genuinely curious here on which type of
| emulator this could affect.
| Zefiroj wrote:
| The support for mTHP exists in upstream Linux, but the swap
| story is not quite there yet. THP availability also needs work
| and there are a few competing directions.
|
| Supporting multiple page sizes well transparently is non-
| trivial.
|
| For a recent summary on one of the approaches, TAO (THP
| Allocation Optimization), see this lwn article:
| https://lwn.net/Articles/974636/
| eyalitki wrote:
| RHEL tried that in the past with 64KB on AArch64; it led to
| MANY bugs all across the software stack, and they eventually
| reverted it - https://news.ycombinator.com/item?id=27513209.
|
| I'm impressed by the effort on Google's side, yet I'll be
| surprised if it pays off.
| nektro wrote:
| Apple's M-series chips use a 16kb page size by default, so
| the state of things has improved significantly as software
| moved to support Asahi and other related endeavors.
| kcb wrote:
| Nvidia is pushing 64KB pages on their Grace-Hopper system.
| rincebrain wrote:
| I didn't realize they had reverted it; I used to run RHEL
| builds on Pi systems to test for 64k page bugs, because it's
| not like there's a POWER SBC I could buy for this.
| daghamm wrote:
| Can someone explain those numbers to me?
|
| A 5-10% performance boost sounds huge. Wouldn't we have much
| larger TLBs if page walks were really this expensive?
|
| On the other hand, a 9% increase in memory usage also sounds
| huge. How did this change affect memory usage that much?
| scottlamb wrote:
| > 5-10% performance boost sounds huge. Wouldn't we have much
| larger TLBd if page walk was really this expensive?
|
| It's pretty typical for large programs to spend 15+% of their
| "CPU time" waiting for the TLB. [1] So larger pages really
| help, including changing the base 4 KiB -> 16 KiB (4x reduction
| in TLB pressure) and using 2 MiB huge pages (512x reduction
| where it works out).
|
| I've also wondered why the TLB isn't larger.
|
| > On the other hand 9% increase in memory usage also sounds
| huge. How did this affect memory usage that much?
|
| This is the granularity at which physical memory is assigned,
| and there are a lot of reasons most of a page might be wasted:
|
| * The heap allocator will typically cram many things together
| in a page, but it might, say, only use a given page for
| allocations in a certain size range, so not all allocations
| will snuggle in next to each other.
|
| * Program stacks each use at least one distinct page of
| physical RAM because they're placed in distinct virtual address
| ranges with guard pages between. So if you have 1,024 threads,
| they use at least 4 MiB of RAM with 4 KiB pages, 16 MiB of RAM
| with 16 KiB pages.
|
| * Anything from the filesystem that is cached in RAM ends up in
| the page cache, and true to the name, it has page granularity.
| So caching a 1-byte file would take 4 KiB before, 16 KiB after.
|
| [1] If you have an Intel CPU, toplev is particularly nice for
| pointing this kind of thing out.
| https://github.com/andikleen/pmu-tools
| 95014_refugee wrote:
| > I've also wondered why the TLB isn't larger.
|
| Fast CAMs are (relatively) expensive, is the excuse I always
| hear.
| ein0p wrote:
| No mention of Apple on the page. Apple has been using 16K pages
| for years now.
| dboreham wrote:
| Time to grab some THP popcorn...
| taeric wrote:
| I see they have measured improvements in the performance of some
| things. In particular, the camera app starts faster. Small
| percentage, but still real.
|
| Curious if there are any other changes you could make based on
| some of those learnings? The camera app, in particular, seems
| like a good one to optimize to start instantly. Especially so
| with the "double power key" shortcut that many phones/people
| have set up.
|
| Specifically, I would expect you should be able to do something
| like the Lisp norm of "dumping an image"? Startup should then
| largely be loading the image, not executing much if any
| initialization code. (Honestly, I mostly assume this already
| happens?)
| quotemstr wrote:
| Good. It's about time. 4KB pages come down to us from 32-bit time
| immemorial. We didn't bump the page size when we doubled the
| sizes of pointers and longs for the 64-bit transition. 4KB has
| been way too small for ages, and I'm glad we're biting the minor
| compatibility bullet and adopting a page size more suited to
| modern computing.
| lxgr wrote:
| Now I wonder: Does increased page size have any negative impacts
| on I/O performance or flash lifetime, e.g. for writebacks of
| dirty pages of memory-mapped files where only a small part was
| changed?
|
| Or is the write granularity of modern managed flash devices (such
| as eMMCs as used in Android smartphones) much larger than either
| 4 or 16 kB anyway?
| tadfisher wrote:
| Flash controllers expose blocks of 512B or 4096B, but the
| actual NAND chips operate in terms of "erase blocks" which
| range from 1MB to 8MB (or really anything); in these blocks,
| an individual bit can be programmed from "1" to "0" once, and
| flipping any bit back to "1" requires erasing the entire
| block [0].
|
| All of this is hidden from the host by the NAND controller, and
| SSDs employ many strategies (including DRAM caching,
| heterogeneous NAND dies, wear-leveling and garbage-collection
| algorithms) to avoid wearing the storage NAND. Effectively you
| must treat flash storage devices as block devices of their
| advertised block size because you have no idea where your data
| ends up physically on the device, so any host-side algorithm is
| fairly worthless.
|
| [0]: https://spdk.io/doc/ssd_internals.html
| lxgr wrote:
| Writes on NAND happen at the page, not the erase-block,
| level, though. I believe the ratio between the two is
| usually something like 1:8 or so.
|
| Even write pages might still be larger than 4KB, but if
| they're not, presumably a NAND controller could allow such
| smaller writes to avoid write amplification?
|
| The mapping between physical and logical block address is
| complex anyway because of wear leveling and bad block
| management, so I don't think there's a need for write
| granularity to be the erase block/page or even write block
| size.
___________________________________________________________________
(page generated 2024-08-23 23:00 UTC)