[HN Gopher] Adding 16 kb page size to Android
       ___________________________________________________________________
        
       Adding 16 kb page size to Android
        
       Author : mikece
       Score  : 196 points
       Date   : 2024-08-23 17:14 UTC (5 hours ago)
        
 (HTM) web link (android-developers.googleblog.com)
 (TXT) w3m dump (android-developers.googleblog.com)
        
       | monocasa wrote:
       | I wonder how much help they had by asahi doing a lot of the
       | kernel and ecosystem work anablibg 16k pages.
       | 
       | RISC-V being fixed to 4k pages seems to be a bit of an oversight
       | as well.
        
         | ashkankiani wrote:
         | It's pretty cool that I can read "anablibg" and know that means
         | "enabling." The brain is pretty neat. I wonder if LLMs would
         | get it too. They probably would.
        
           | mrob wrote:
           | LLMs are at a great disadvantage here because they operate on
           | tokens, not letters.
        
             | platelminto wrote:
             | I remember reading somewhere that LLMs are actually
             | fantastic at reading heavily mistyped sentences! Mistyped
             | to a level where humans actually struggle.
             | 
             | (I will update this comment if I find a source)
        
               | thanatropism wrote:
               | Tihs probably refers to comon mispelllings an typo's.
        
               | HeatrayEnjoyer wrote:
               | It's actually not. You can scramble every letter within
               | words and it can mostly unscramble it. Keep the first
               | letter and it recovers almost 100%.
        
           | mrbuttons454 wrote:
           | Until I read your comment I didn't even notice...
        
           | evilduck wrote:
           | Question I wrote:
           | 
           | > I encountered the typo "anablibg" in the sentence "I wonder
           | how much help they had by asahi doing a lot of the kernel and
           | ecosystem work anablibg 16k pages." What did they actually
           | mean?
           | 
           | GPT-4o and Sonnet 3.5 understood it perfectly. This isn't
           | really a problem for the large models.
           | 
           | For local small models:
           | 
           | * Gemma2 9b did not get it and thought it meant "analyzing".
           | 
            | * Codestral (22b) did not get it and thought it meant
           | "allocating".
           | 
           | * Phi3 Mini failed spectacularly.
           | 
           | * Phi3 14b and Qwen2 did not get it and thought it was
           | "annotating".
           | 
           | * Mistral-nemo thought it was a portmanteau "anabling" as a
           | combination of "an" and "enabling". Partial credit for being
           | close and some creativity?
           | 
           | * Llama3.1 got it perfectly.
        
             | jandrese wrote:
             | Seems like there is a bit of a roll of the dice there. The
             | ones that got it right may have just been lucky.
        
               | HeatrayEnjoyer wrote:
               | Ran it a few times in new sessions, 0 failures so far.
        
             | slaymaker1907 wrote:
             | I wonder how much of a test this is for the LLM vs whatever
             | tokenizer/preprocessing they're doing.
        
             | Retr0id wrote:
             | fwiw I failed to figure it out as a human, I had to check
             | the replies.
        
             | Alifatisk wrote:
             | Is there any task Gemma is better at compared to others?
        
             | treyd wrote:
             | I wonder if they'd do better if there was the context that
             | it's in a thread titled "Adding 16 kb page size to
             | Android"? The "analyzing" interpretation is plausible if
             | you don't know what 16k pages, kernels, Asahi, etc are.
        
           | im3w1l wrote:
           | I asked chatgpt and it did get it.
           | 
            | Personally, when I read the comment my brain kinda
            | skipped over the word; since it contained the part "lib",
            | I assumed it was some obscure library that I didn't care
            | about. It doesn't fit grammatically but I didn't give it
            | enough thought to notice.
        
         | IshKebab wrote:
         | Probably wouldn't be too hard to add a 16 kB page size
         | extension. But I think the Svnapot extension is their solution
          | to this problem. If you're not familiar, it lets you mark a
          | set of pages as being part of a contiguously mapped 64 kB
          | region.
         | No idea how the performance characteristics vary. It relieves
         | TLB pressure, but you still have to create 16 4kB page table
         | entries.
        
           | monocasa wrote:
           | Svnapot is a poor solution to the problem.
           | 
            | On one hand it means that each page table entry takes up
           | half a cache line for the 16KB case, and two whole cache
           | lines in the 64KB case. This really cuts down on the page
           | walker hardware's ability to effectively prefetch TLB
           | entries, leading to basically the same issues as this classic
           | discussion about why tree based page tables are generally
           | more effective than hash based page tables (shifted forward
           | in time to today's gate counts).
           | https://yarchive.net/comp/linux/page_tables.html This is why
           | ARM shifted from a Svnapot like solution to the "translation
           | granule queryable and partially selectable at runtime"
           | solution.
           | 
           | Another issue is the fact that a big reason to switch to 16KB
           | or even 64KB pages is to allow for more address range for
           | VIPT caches. You want to allow high performance
           | implementations to be able to look up the cache line while
           | performing the TLB lookup in parallel, then compare the tag
           | with the result of the TLB lookup. This means that
           | practically only the untranslated bits of the address can be
            | used by the set selection portion of the cache lookup.
            | When you have 12 untranslated bits in an address, combined
            | with 64-byte cachelines, that gives you 64 sets; multiply
            | that by 8 ways and you get the 32KB L1 caches very common
            | in systems with 4KB page sizes (sometimes with heroic
            | effort to throw a ton of transistors/power at the problem
            | and make a 64KB cache by essentially duplicating large
            | parts of the cache lookup hardware for that extra bit of
            | address). What you really want is for the arch to be able
            | to disallow 4KB pages like on Apple silicon, which is the
            | main piece that allows its giant 128KB and 192KB L1
            | caches.
        
             | aseipp wrote:
              | > What you really want is for the arch to be able to
              | disallow 4KB pages like on Apple silicon, which is the
              | main piece that allows its giant 128KB and 192KB L1
              | caches.
             | 
             | Minor nit but they allow 4k pages. Linux doesn't support
             | 16k and 4k pages at the same time; macOS does but is just
             | very particular about 4k pages being used for scenarios
              | like Rosetta processes or virtual machines; e.g. Parallels
             | uses it for Windows-on-ARM, I think. Windows will probably
             | never support non-4k pages I'd guess.
             | 
             | But otherwise, you're totally right. I wish RISC-V had gone
             | with the configurable granule approach like ARM did. Major
             | missed opportunity but maybe a fix will get ratified at
             | some point...
        
         | saagarjha wrote:
         | Probably very little, since the Android ecosystem is quite
         | divorced from the Linux one.
        
           | nabla9 wrote:
           | Android kernel is a mainstream Linux kernel, with additional
           | drivers, and other functionality.
        
             | temac wrote:
             | The linux kernel already works perfectly fine with various
             | base page sizes.
        
       | twoodfin wrote:
       | A little additional background: iOS has used 16KB pages since the
       | 64-bit transition, and ARM Macs have inherited that design.
        
         | arghwhat wrote:
          | A more relevant bit of background is that 4KB pages lead to
          | quite a lot of overhead due to the sheer number of mappings
          | that need to be configured and cached. Using larger pages
          | reduces overhead, in particular TLB misses, as fewer
          | entries are needed to describe the same memory range.
          | 
          | While x86 chips mainly support 4K, 2M, and 1G pages, ARM
          | chips tend to support the more practical 16K page size - a
          | nice balance between performance and memory wasted due to
          | lower allocation granularity.
         | 
         | Nothing in particular to do with Apple and iOS.
        
           | jsheard wrote:
           | Makes me wonder how much performance Windows is leaving on
           | the table with its primitive support for large pages. It does
           | support them, but it doesn't coalesce pages transparently
           | like Linux does, and explicitly allocating them requires
           | special permissions and is very likely to fail due to
           | fragmentation if the system has been running for a while. In
           | practice it's scarcely used outside of server software which
           | immediately grabs a big chunk of large pages at boot and
           | holds onto them forever.
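            | 
            | For illustration, a minimal sketch of that explicit path
            | (standard Win32 calls; a sketch, not production code):
            | 
            |     #include <windows.h>
            |     #include <stdio.h>
            |     
            |     int main(void) {
            |         // Large pages need SeLockMemoryPrivilege and must
            |         // be committed up front, in multiples of the
            |         // minimum large page size.
            |         SIZE_T large = GetLargePageMinimum(); // often 2 MB
            |         if (large == 0) return 1;             // unsupported
            |     
            |         void *p = VirtualAlloc(NULL, large,
            |                                MEM_RESERVE | MEM_COMMIT |
            |                                MEM_LARGE_PAGES,
            |                                PAGE_READWRITE);
            |         if (p == NULL) {
            |             // Commonly ERROR_PRIVILEGE_NOT_HELD, or a
            |             // failure once physical memory is fragmented.
            |             printf("error %lu\n", GetLastError());
            |             return 1;
            |         }
            |         VirtualFree(p, 0, MEM_RELEASE);
            |         return 0;
            |     }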
        
             | arghwhat wrote:
             | Quite a bit, but 2M is an annoying size and the transparent
             | handling is suboptimal. Without userspace cooperating, the
             | kernel might end up having to split the pages at random due
             | to an unfortunate unaligned munmap/madvise from an
             | application not realizing it was being served 2M pages.
             | 
             | Having Intel/AMD add 16-128K page support, or making it
             | common for userspace to explicitly ask for 2M pages for
             | their heap arenas is likely better than the page merging
             | logic. Less fragile.
             | 
             | 1G pages are practically useless outside specialized server
             | software as it is very difficult to find 1G contiguous
             | memory to back it on a "normal" system that has been
             | running for a while.
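              | 
              | A minimal sketch of that userspace-cooperation approach
              | on Linux (MADV_HUGEPAGE is only a hint, and a real
              | allocator would also align the arena to 2M):
              | 
              |     #include <stddef.h>
              |     #include <sys/mman.h>
              |     
              |     // Ask the kernel to back a heap arena with
              |     // transparent huge pages up front, instead of
              |     // relying on after-the-fact coalescing.
              |     void *arena_alloc(size_t len) {
              |         void *p = mmap(NULL, len,
              |                        PROT_READ | PROT_WRITE,
              |                        MAP_PRIVATE | MAP_ANONYMOUS,
              |                        -1, 0);
              |         if (p == MAP_FAILED) return NULL;
              |         madvise(p, len, MADV_HUGEPAGE);
              |         return p;
              |     }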
        
               | jsheard wrote:
               | Would a reasonable compromise be to change the base
               | allocation granularity to 2MB, and transparently sub-
               | allocate those 2MB blocks into 64KB blocks (the current
               | Windows allocation granularity) when normal pages are
               | requested? That feels like it should keep 2MB page
               | fragmentation to a minimum without breaking existing
               | software, but given they haven't done it there's probably
               | some caveat I'm overlooking.
        
               | lanigone wrote:
               | you might find this interesting
               | 
               | https://www.hudsonrivertrading.com/hrtbeat/low-latency-
               | optim...
        
               | bewaretheirs wrote:
               | Intel's menu of page sizes is an artifact of its page
               | table structure.
               | 
                | On x86 in 64-bit mode, page table entries are 64 bits
                | each; the lowest level in the hierarchy (L1) is a 4K
                | page containing 512 64-bit PTEs, which in total map 2M
                | of memory - not coincidentally the large page size.
               | 
               | The L1 page table pages are themselves found via a PTE in
               | a L2 page table; one L2 page table page maps 512*2M = 1G
               | of virtual address space, which is again, not
               | coincidentally, the huge page size.
               | 
               | Large pages are mapped by a L2 PTE (sometimes called a
               | PDE, "page directory entry") with a particular bit set
               | indicating that the PTE points at the large page rather
               | than a PTE page. The hardware page table walker just
               | stops at that point.
               | 
               | And huge pages are similarly mapped by an L3 PTE with a
               | bit set indicating that the L3 PTE is a huge page.
               | 
               | Shoehorning an intermediate size would complicate page
               | table updates or walks or probably both.
               | 
                | Note that an OS can, of its own accord and independent
                | of the hardware, maintain allocations at a coarser
                | granularity and sometimes get some savings out of
                | this. For one historic example, the VAX had a tiny
                | 512-byte page size; IIRC, BSD Unix pretended it had a
                | 1K page size and always updated PTEs in pairs.
        
             | andai wrote:
             | A lot of low level stuff is a lot slower on Windows, let
             | alone the GUI. There's also entire blogs cataloging an
             | abundance of pathological performance issues.
             | 
             | The one I notice the most is the filesystem. Running Linux
             | in VirtualBox, I got _7x_ the host speed for many small
             | file operations. (On top of that Explorer itself has its
             | own random lag.)
             | 
             | I think a better question is how much performance are they
             | leaving on the table by bloating the OS so much. Like they
             | could have just not touched Explorer for 20 years and it
             | would be 10x snappier now.
             | 
             | I think the number is closer to 100x actually. Explorer on
             | XP opens (fully rendered) after a single video frame...
             | also while running virtualized inside Win10.
             | 
             | Meanwhile Win10 Explorer opens after a noticeable delay,
             | and then spends the next several hundred milliseconds
             | painting the UI elements one by one...
        
               | saagarjha wrote:
               | None of this has to do with page size.
        
               | pantalaimon wrote:
               | Death by 1000 cuts
        
               | Const-me wrote:
               | > The one I notice the most is the filesystem
               | 
                | I'm not sure it's the file system per se; I believe
                | the main reason is the security model.
               | 
               | NT kernel has rather sophisticated security. The
               | securable objects have security descriptors with many
               | access control entries and auditing rules, which inherit
               | over file system and other hierarchies according to some
               | simple rules e.g. allow+deny=deny. Trustees are members
               | of multiple security groups, and security groups can
               | include other security groups so it's not just a list,
               | it's a graph.
               | 
               | This makes access checks in NT relatively expensive. The
               | kernel needs to perform access check every time a process
               | creates or opens a file, that's why CreateFile API
               | function is relatively slow.
        
               | temac wrote:
                | I've been trying to use auditing rules for a usage
                | that seems completely in scope and obvious to
                | prioritize from a security point of view (tracing
                | access to EFS files and/or the keys allowing the
                | access), and my conclusion was that you basically
                | can't: the doc is garbage, the implementation is
                | probably ad-hoc with lots of holes, and MS probably
                | hasn't prioritised the maintenance of this feature for
                | decades (too busy adding ads to the start menu, I
                | guess).
                | 
                | The NT security descriptors are also so complex that
                | they are probably a little useless in practice,
                | because they're too hard to use correctly. On top of
                | that, the associated Win32 API is also too hard to use
                | correctly, to the point that I found an important bug
                | in the usage model described in MSDN, meaning the doc
                | writer did not know how the function actually works
                | (in tons of cases you probably won't hit this, but if
                | you start digging into all internal and external
                | users, who knows what you could find...)
                | 
                | NT was full of good ideas but the execution is often
                | quite poor.
        
               | nullindividual wrote:
               | > The one I notice the most is the filesystem.
               | 
                | This is due to the extensible file system filter model
                | in place; I'm not aware of another OS that implements
                | this feature. It's primarily used for antivirus, but
                | can be used by any developer for any purpose.
               | 
               | It applies to all file systems on Windows.
               | 
               | DevDrive[0] is Microsoft's current solution to this.
               | 
               | > Meanwhile Win10 Explorer opens after a noticeable delay
               | 
               | This could be, again, largely due to 3rd party hooks (or
               | 1st party software that doesn't ship with Windows) into
               | Explorer.
               | 
               | [0] https://devblogs.microsoft.com/visualstudio/devdrive/
        
               | andai wrote:
                | I'm glad you mentioned that. I noticed when running a
                | "Hello world" C program on Windows 10 that Windows
               | performs over 100 reads of the Registry before running
               | the program. Same thing when I right click a file...
               | 
               | A few of those are 3rd party, but most are not.
        
               | nullindividual wrote:
               | Remember that Win32 process creation is expensive[0]. And
               | on NT, processes don't run, _threads do_.
               | 
                | The strategy of applications, like olde-tymey Apache,
                | using multiple processes to handle incoming
                | connections is fine on UN*X, but terrible on Windows.
               | 
               | [0] https://fourcore.io/blogs/how-a-windows-process-is-
               | created-p...
        
               | redleader55 wrote:
               | > I'm not aware of another OS that implements this
               | feature
               | 
               | I'm not sure this is exactly what you mean, but Linux has
               | inotify and all sorts of BPF hooks for filtering various
               | syscalls, for example file operations.
        
               | rincebrain wrote:
               | FSFilters are basically a custom kernel module that can
               | and will do anything they want on any filesystem access.
               | (There's also network filters, which is how things like
               | WinPcap get implemented.)
               | 
               | So yes, you could implement something similar in Linux,
               | but there's not, last I looked, a prebuilt toolkit and
               | infrastructure for them, just the generic interfaces you
               | can use to hook anything.
               | 
               | (Compare the difference between writing a BPF module to
               | hook all FS operations, and the limitations of eBPF, to
               | having an InterceptFSCalls struct that you define in your
               | custom kernel module to run your own arbitrary code on
               | every access.)
        
               | hinkley wrote:
               | > The one I notice the most is the filesystem. Running
               | Linux in VirtualBox, I got 7x the host speed for many
               | small file operations. (On top of that Explorer itself
               | has its own random lag.)
               | 
                | That's a very old problem. In the early days of
                | subversion, the metadata for every directory existed
                | in the
               | directory. The rationale was that you could check out
               | just a directory in svn. It was disastrously slow on
               | Windows and the subversion maintainers had no answer for
               | it, except insulting ones like "turn off virus scanning".
               | Telling a windows user to turn off virus scanning is
               | equivalent to telling someone to play freeze tag in
               | traffic. You might as well just tell them, "go fuck
               | yourself with a rusty chainsaw"
               | 
               | Someone reorganized the data so it all happened at the
               | root directory and the CLI just searched upward until it
               | found the single metadata file. If memory serves that
               | made large checkouts and updates about 2-3 times faster
               | on Linux and 20x faster on windows.
        
             | tedunangst wrote:
             | I've lost count of how many blog posts about poor
             | performance ended with the punchline "so then we turned off
             | page coalescing".
        
           | daghamm wrote:
           | IIRC, 64-bit ARM can do 4K, 16K, 64K and 2M pages. But there
           | are some special rules for the last one.
           | 
           | https://documentation-
           | service.arm.com/static/64d5f38f4a92140...
        
         | HumblyTossed wrote:
         | How is this "additional background"? This was a post by Google
         | regarding Android.
        
       | a1o wrote:
       | > The very first 16 KB enabled Android system will be made
       | available on select devices as a developer option. This is so you
       | can use the developer option to test and fix
       | 
       | > once an application is fixed to be page size agnostic, the same
       | application binary can run on both 4 KB and 16 KB devices
       | 
       | I am curious about this. When could an app NOT be agnostic to
       | this? Like what an app must be doing to cause this to be
       | noticeable?
        
         | mlmandude wrote:
         | If you use mmap/munmap directly within your application you
         | could probably get into trouble by hardcoding the page size.
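          | 
          | A minimal sketch of the portable pattern (query the size at
          | runtime instead of assuming 4096):
          | 
          |     #include <sys/mman.h>
          |     #include <unistd.h>
          |     
          |     void *map_one_page(void) {
          |         // 4096 on today's devices, 16384 on a 16 KB device.
          |         long ps = sysconf(_SC_PAGESIZE);
          |         return mmap(NULL, (size_t)ps,
          |                     PROT_READ | PROT_WRITE,
          |                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          |     }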
        
         | vardump wrote:
         | For example use mmap and just assume 4 kB pages.
        
         | dmytroi wrote:
          | Also ELF segment alignment, which defaults to 4k.
        
           | bri3d wrote:
           | Only on Android, for what it's worth; most "vanilla" Linux
           | aarch64 linkers chose 64K defaults several years ago. But
           | yes, most Android applications with native (NDK) binaries
           | will need to be rebuilt with the new 16kb max-page-size.
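            | 
            | The relevant switch (documented in Android's 16 KB page
            | size guide) is the linker's max-page-size; roughly:
            | 
            |     # Align ELF load segments so the same binary loads on
            |     # both 4 KB and 16 KB kernels.
            |     LDFLAGS += -Wl,-z,max-page-size=16384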
        
         | edflsafoiewq wrote:
         | jemalloc bakes in page size assumptions, see eg
         | https://github.com/jemalloc/jemalloc/issues/467.
        
         | sweeter wrote:
          | Wine doesn't work with a 16 KB page size, among other
          | things.
        
           | mananaysiempre wrote:
           | This seems especially peculiar given Windows has a 64K
           | mapping granularity.
        
             | tredre3 wrote:
             | Windows uses 4KB pages.
        
               | nullindividual wrote:
               | 4K, 2M ("large page"), or 1G ("huge page") on x86-64. A
                | single allocation request can consist of multiple page
                | sizes. From _Windows Internals, 7th Edition, Part 1_:
               | On Windows 10 version 1607 x64 and Server 2016 systems,
               | large pages may also be mapped with huge pages, which are
               | 1 GB in size. This is done automatically if the
               | allocation size requested is larger than 1 GB, but it
               | does not have to be a multiple of 1 GB. For example, an
               | allocation of 1040 MB would result in using one huge page
               | (1024 MB) plus 8 "normal" large pages (16 MB divided by 2
               | MB).
        
               | mananaysiempre wrote:
               | Right (on x86-32 and -64, because you can't have 64KB
               | pages there, though larger page sizes do exist and get
               | used). You still cannot (e.g.) MapViewOfFile() on an
               | address not divisible by 64KB, because Alpha[1]. As far
               | as I understand, Windows is mostly why the docs for the
               | Blink emulator[2] (a companion project of Cosmopolitan
               | libc) tell you any programs under it need to use
               | sysconf(_SC_PAGESIZE) [aka getpagesize() aka
               | getauxval(AT_PAGESZ)] instead of assuming 4KB.
               | 
               | [1] https://devblogs.microsoft.com/oldnewthing/20031008-0
               | 0/?p=42...
               | 
               | [2] https://github.com/jart/blink/blob/master/README.md#c
               | ompilin...
        
         | o11c wrote:
         | The fundamental problem is that system headers don't provide
         | enough information. In particular, many programs need both "min
         | runtime page size" and "max runtime page size" (and by this I
         | mean non-huge pages).
         | 
         | If you call `mmap` without constraint, you need to assume the
         | result will be aligned to at least "min runtime page size". In
         | practice it is _probably_ safe to assume 4K for this for
         | "normal" systems, but I've seen it down to 128 bytes on some
         | embedded systems, and I don't have much breadth there (this
         | will break many programs though, since there are more errno
         | values than that). I don't know enough about SPARC binary
         | compatibility to know if it's safe to push this up to 8K for
         | certain targets.
         | 
         | But if you want to call `mmap` (etc.) with full constraint, you
         | must work in terms of "max runtime page size". This is known to
         | be up to at least 64K in the wild (aarch64), but some
         | architectures have "huge" pages not much beyond that so I'm not
         | sure (256K, 512K, and 1M; beyond that is almost certainly going
         | to be considered huge pages).
         | 
         | Besides a C macro, these values also need to be baked into the
         | object file and the linker needs to prevent incompatible
          | assumptions (just in case a new microarchitecture changes them).
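          | 
          | A sketch of the "max runtime page size" discipline, using
          | 64K as the conservative constant per the aarch64 case above:
          | 
          |     #include <stdint.h>
          |     
          |     // Largest base page size this binary claims support for.
          |     #define MAX_RT_PAGE_SIZE 65536u
          |     
          |     // Round lengths/offsets up so layout decisions baked
          |     // into files or object code stay valid on any kernel
          |     // with a base page size up to MAX_RT_PAGE_SIZE.
          |     static inline uint64_t round_up_page(uint64_t n) {
          |         return (n + MAX_RT_PAGE_SIZE - 1)
          |                & ~(uint64_t)(MAX_RT_PAGE_SIZE - 1);
          |     }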
        
           | dotancohen wrote:
           | Yes, but the context here is Java or Kotlin running on
           | Android, not embedded C.
           | 
           | Or do some Android applications run embedded C with only a
           | Java UI? I'm not an Android dev.
        
             | saagarjha wrote:
             | Android apps can call into native code via JNI, which the
             | platform supports.
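              | 
              | For example, a minimal (hypothetical) JNI binding, with
              | the Java side declaring
              | "public static native long pageSize();":
              | 
              |     #include <jni.h>
              |     #include <unistd.h>
              |     
              |     // Called from Java as com.example.Native.pageSize()
              |     JNIEXPORT jlong JNICALL
              |     Java_com_example_Native_pageSize(JNIEnv *env,
              |                                      jclass cls) {
              |         return (jlong)sysconf(_SC_PAGESIZE);
              |     }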
        
               | ignoramous wrote:
               | Wonder if Android apps can also be fully native (C++)?
        
             | orf wrote:
             | Yes, Android apps can and do have native libraries.
             | Sometimes this can be part of a SDK, or otherwise out of
             | the developers control.
        
             | warkdarrior wrote:
             | Apps written in Flutter/Dart and React Native/Javascript
             | both compile to native code with only shims to interface
             | with the Java UI framework.
        
             | david_allison wrote:
             | The Android Native Development Kit (NDK) allows building
             | native code libraries for Android (typically C/C++, but
             | this can include Rust). These can then be loaded and
             | accessed by JNI on the Java/Kotlin side
             | 
             | * Brief overview of the NDK:
             | https://developer.android.com/ndk/guides
             | 
             | * Guide to supporting 16KB page sizes with the NDK
             | https://developer.android.com/guide/practices/page-sizes
        
             | fpoling wrote:
             | Chrome browser on Android uses the same code base as Chrome
              | on desktop, including the multi-process architecture.
              | But its UI is in Java, communicating with C++ via JNI.
        
           | lanigone wrote:
           | you can also do 2M and 1G huge pages on x86, it gets kind of
           | silly fast.
        
             | ShroudedNight wrote:
             | 1G huge pages had (have?) performance benefits on managed
             | runtimes for certain scenarios (Both the JIT code cache and
             | the GC space saw uplift on the SpecJ benchmarks if I recall
             | correctly)
             | 
             | If using relatively large quantities of memory 2M should
             | enable much higher TLB hit rates assuming the CPU doesn't
             | do something silly like only having 4 slots for pages
             | larger than 4k !.!
        
             | ignoramous wrote:
              | What? Any pointers on how 1G speeds things up? I'd have
              | expected a bigger page size to wreak havoc on process
              | scheduling and the filesystem.
        
         | saagarjha wrote:
          | Page sizes are often important to code that relies on low-
         | level details of the environment it's running in, like language
         | runtimes. They might do things like mark some sections of code
         | as writable or executable and thus would need to know what the
         | granularity of those requests can be. It's also of importance
          | to things like allocators that hand out memory backed by
          | mmap pages: if they keep, say, a bit field for each 16-byte
          | region of a page that has been used, that bit field will
          | change in size in ways they can detect.
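          | 
          | A sketch of that last case (sizes are illustrative):
          | 
          |     #include <stddef.h>
          |     #include <unistd.h>
          |     
          |     // One bit per 16-byte slot: a 4 KB page has 256 slots
          |     // (32-byte bitmap), a 16 KB page has 1024 slots
          |     // (128-byte bitmap). Derive it from the runtime page
          |     // size, not a baked-in 4096.
          |     size_t slot_bitmap_bytes(void) {
          |         size_t slots = (size_t)sysconf(_SC_PAGESIZE) / 16;
          |         return slots / 8;  // 8 bits per byte
          |     }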
        
         | growse wrote:
          | If you use a database library that mmaps a db file with
          | SC_PAGE_SIZE (4KB) pages, and then upgrade your device to a
          | 16KB one and backup/restore the app, your data is no longer
          | readable.
        
         | nox101 wrote:
         | I don't know if this fits but I've seen code that allocated say
         | 32 bytes from a function that allocated 1meg under the hood.
         | Not knowing that's what was happening the app quickly ran out
         | of memory. It arguably was not the app's fault. The API it was
         | calling into was poorly designed and poorly named, such that
         | the fact that you might need to know the block size to use the
         | function was in no way indicated by the name of the function
         | nor the names of any of its parameters.
        
       | lostmsu wrote:
       | Not entirely related (except the block size), but I am
       | considering making and standardizing a system-wide content-based
       | cache with default block size 16KB.
       | 
       | The idea is that you'd have a system-wide (or not) service that
       | can do two or three things:
       | 
        | - read a 16KB block by its SHA256 (also returning the length,
        | which can be <16KB), if cached
       | 
       | - write a block to cache
       | 
       | - maybe pin a block (e.g. make it non-evictable)
       | 
        | It would be like a block-level file content dedup + eviction
        | to keep the size limited.
       | 
       | Should reduce storage used by various things due to dedup
       | functionality, but may require internet for corresponding apps to
       | work properly.
       | 
        | With a peer-to-peer sharing system on top, it may
        | significantly reduce storage requirements.
       | 
        | The only disadvantage is the same as with shared website
        | caches before cache isolation was introduced: apps can probe
        | what's in your cache and deduce some information about you
        | from it.
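        | 
        | A hypothetical sketch of the service interface (names and
        | types are mine, not a real API):
        | 
        |     #include <stdbool.h>
        |     #include <stddef.h>
        |     #include <stdint.h>
        |     
        |     #define CAS_BLOCK_SIZE 16384
        |     
        |     typedef struct { uint8_t bytes[32]; } sha256_t;
        |     
        |     // Read a cached block by hash; *len receives the actual
        |     // length (<= CAS_BLOCK_SIZE). Returns false on a miss.
        |     bool cas_read(const sha256_t *key, void *buf, size_t *len);
        |     
        |     // Store a block and return its hash.
        |     bool cas_write(const void *buf, size_t len, sha256_t *key);
        |     
        |     // Pin or unpin a block (make it non-evictable).
        |     bool cas_pin(const sha256_t *key, bool pinned);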
        
         | monocasa wrote:
         | I'd probably pick a size greater than 16KB for that. Windows
         | doesn't expose translations less than 64KB in their version of
         | mmap, and internally their file cache works in increments of
         | 256KB. And these were numbers they picked back in the 90s.
        
         | treyd wrote:
         | I would go for higher than 16K. I believe BitTorrent's default
         | minimum chunk size is 64K, for example. It really depends on
         | the use case in question though, if you're doing random writes
         | then larger chunk sizes quickly waste a ton of bandwidth,
         | especially if you're doing recursive rewrites of a tree
         | structure.
         | 
         | Would a variable chunk size be acceptable for whatever it is
         | you're building?
        
       | devit wrote:
       | Seems pretty dubious to do this without adding support for having
       | both 4KB and 16KB processes at once to the Linux kernel, since it
       | means all old binaries break and emulators which emulate normal
       | systems with 4KB pages (Wine, console emulators, etc.) might
       | dramatically lose performance if they need to emulate the MMU.
       | 
       | Hopefully they don't actually ship a 16KB default before
       | supporting 4KB pages as well in the same kernel.
       | 
       | Also it would probably be reasonable, along with making the Linux
       | kernel change, to design CPUs where you can configure a 16KB
       | pagetable entry to map at 4KB granularity and pagefault after the
       | first 4KB or 8KB (requires 3 extra bits per PTE or 2 if coalesced
       | with the invalid bit), so that memory can be saved by allocating
       | 4KB/8KB pages when 16KB would have wasted padding.
        
         | username81 wrote:
         | Shouldn't there be some kind of setting to change the page size
         | per program? AFAIK AMD64 CPUs can do this.
        
         | fouronnes3 wrote:
         | Could they upstream that or would that require a fork?
        
         | mgaunard wrote:
         | why does it break userland? if you need to know the page size,
         | you should query sysconf SC_PAGESIZE.
        
           | akdev1l wrote:
           | Assumptions in the software.
           | 
           | Jemalloc is infamous for this:
           | https://github.com/sigp/lighthouse/issues/5244
        
           | Dwedit wrote:
            | Emulating a processor with 4K pages becomes much faster
            | if you can use real addresses directly.
        
           | fweimer wrote:
           | It should not break userland. GNU/Linux (not necessarily
           | Android though) has supported 64K pages pretty much from the
            | start, because that was the page size originally chosen
            | for server-focused kernels and distributions. But there
            | are some things that need to be worked around.
           | 
           | Certain build processes determine the page size at compile
           | time and assume it's the same at run time, and fail if it is
           | not: https://github.com/jemalloc/jemalloc/issues/467
           | 
           | Some memory-mapped files formats have assumptions about page
           | granularity:
           | https://bugzilla.redhat.com/show_bug.cgi?id=1979804
           | 
           | The file format issue applies to ELF as well. Some people
           | patch their toolchains (or use suitable linker options) to
           | produce slightly smaller binaries that can only be loaded if
           | the page size is 4K, even though the ABI is pretty clear in
           | that you should link for compatibility with up to 64K pages.
        
           | ndesaulniers wrote:
           | Ossification.
           | 
           | If the page size has been 4k for decades for most OS' and
           | architectures, people get sloppy and hard code that literal
           | value, rather than query for it.
        
         | phh wrote:
          | Google/Android doesn't care much about backward
          | compatibility and broke programs released for the Pixel 3 on
          | the Pixel 7 (the Play Store ban on 32-bit-only apps came in
          | 2019 and the Pixel 7 is the first 64-bit-only device, while
          | Google still released a 32-bit-only device in 2023...). They
          | quite regularly break apps in new Android versions (despite
          | their infrastructure to handle backward compatibility), and
          | app developers are used to bracing themselves around Android
          | & Pixel releases.
        
           | reissbaker wrote:
           | Generally I've found Google to care much more about not
           | breaking old apps compared to Apple, which often expects
           | developers to rebuild apps for OS updates or else the apps
           | stop working entirely (or buy entirely new machines to get OS
           | updates at all, e.g. the Intel/Apple Silicon transition).
           | Google isn't on the level of Windows "we will watch for
           | specific binaries and re-introduce bugs in the kernel
           | specifically for those binaries that they depend on" in terms
           | of backwards compatibility, but I wouldn't go so far as to
           | say they don't care. I'm not sure whether that's better or
           | worse: there's definitely merit to Apple's approach, since it
           | keeps them able to iterate quickly on UX and performance by
           | dropping support for the old stuff.
        
         | Veserv wrote:
         | Having both 4KB and 16KB simultaneously is either easy or hard
         | depending on which hardware feature they are using for 16KB
         | pages.
         | 
         | If they are using the configurable granule size, then that is a
          | system-wide hardware configuration option. You literally
          | cannot map at a smaller granularity while that bit is set.
         | 
         | You might be able to design a CPU that allows your idea of
         | partial pages, but there be dragons.
         | 
         | If they are not configuring the granule size, instead opting
         | for software enforcement in conjunction with always using the
         | contiguous hint bit, then it might be possible.
         | 
         | However, I am pretty sure they are talking about hardware
         | granule size, since the contiguous hint is most commonly used
          | to support 16 contiguous entries (though the CPU designer is
         | technically allowed to do whatever grouping they want) which
         | would be 64KB.
        
           | stingraycharles wrote:
           | I'm a total idiot, how exactly is page size a CPU issue
           | rather than a kernel issue? Is it about memory channel
           | protocols / communication?
           | 
            | Disks have been slowly migrating away from the 4kb sector
            | size; is this the same thing going on? That you need the
            | actual drive to support it, because of internal
            | structuring (i.e. how exactly the CPU aligns things in
            | RAM), with 4kb / 16kb being, at some super low level, the
            | smallest unit of memory you can allocate?
           | 
            | And does that then mean that there's less overhead in all
            | kinds of memory (pre)fetchers in the CPU, because more can
            | be achieved in fewer clock cycles?
        
             | IshKebab wrote:
             | The CPU has hardware that does a page table walk
             | automatically when you access an address for which the
             | translation is not cached in the TLB. Otherwise virtual
             | memory would be really slow.
             | 
             | Since the CPU hardware itself is doing the page table walk
             | it needs to understand page tables and page table entries
             | etc. including how big pages are.
             | 
             | Also you need to know how big pages are for the TLB itself.
             | 
             | The value of 4kB itself is pretty much arbitrary. It has to
             | be a small enough number that you don't waste a load of
             | memory by mapping memory that isn't used (e.g. if you ask
             | for 4.01kB you're actually going to get 8kB), but a large
             | enough number that you aren't spending all your time
             | managing tiny pages.
             | 
              | That's why increasing the page size makes things faster
              | but wastes more memory.
             | 
             | 4kB arguably isn't optimal anymore since we have way more
             | memory now than when it was de facto standardised so it
             | doesn't matter as much if we waste a bit. Maybe.
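              | 
              | A quick sketch of the waste arithmetic, using the 4.01kB
              | example above:
              | 
              |     #include <stddef.h>
              |     
              |     // Bytes lost to internal fragmentation when n bytes
              |     // are backed by whole pages of the given size.
              |     size_t waste(size_t n, size_t page) {
              |         return (page - n % page) % page;
              |     }
              |     // waste(4107, 4096)  == 4085   (rounds up to 8 KB)
              |     // waste(4107, 16384) == 12277  (16 KB pages)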
        
               | quotemstr wrote:
                | As an aside, it's a shame that hardware page table
                | walking won out over software-filled TLBs, as some
                | older computers had. I wonder what clever and
                | wonderful hacks we might have been able to invent had
                | we not needed to give the CPU a raw pointer to a data
                | structure whose layout is fixed forever.
        
               | IshKebab wrote:
               | Yeah maybe, though in practice I think it would be just
               | too slow.
        
               | Denvercoder9 wrote:
                | Page table layout isn't really fixed forever; x86 has
                | changed its layout multiple times.
        
             | fpoling wrote:
              | Samsung SSDs still report to the system that their
              | logical sector size is 512 bytes. In fact one of the
              | recent models
             | even removed the option to reconfigure the disk to use 4k
              | logical sectors. Presumably Samsung figured that since
              | the physical sector is much larger and they need complex
              | mapping of logical sectors in any case, they decided not
              | to support the 4K option and to stick with 512 bytes.
        
             | pwg wrote:
             | > I'm a total idiot, how exactly is page size a CPU issue
             | rather than a kernel issue?
             | 
              | Because the page size is a hardware-defined size for
              | Intel and ARM CPUs (well, more modern Intel and ARM CPUs
              | give the OS a choice of sizes from a small set of
              | options).
             | 
             | It (page size) is baked into the CPU hardware.
             | 
             | > And does that then mean that there's less overhead in all
             | kinds of memory (pre)fetchers in the CPU, because more can
             | be achieved in less clock cycles?
             | 
             | For the same size TLB (Translation Look-aside Buffer -- the
             | CPU hardware that stores the "referencing info" for the
             | currently active set of pages being used by the code
             | running on the CPU) a larger page size allows more total
             | memory to be accessible before taking a page fault and
             | having to replace one or more of the entries in the TLB. So
             | yes, it means less overhead, because CPU cycles are not
             | used up in replacing as many TLB entries as often.
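              | 
              | A back-of-the-envelope sketch (the 1024-entry TLB is a
              | made-up but plausible figure):
              | 
              |     #include <stdio.h>
              |     
              |     int main(void) {
              |         // TLB "reach" = entries * page size.
              |         const long entries = 1024;
              |         printf("4 KB pages:  %ld MB\n",
              |                entries * 4096 / (1024 * 1024));  // 4
              |         printf("16 KB pages: %ld MB\n",
              |                entries * 16384 / (1024 * 1024)); // 16
              |         return 0;
              |     }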
        
             | s_tec wrote:
             | Each OS process has its own virtual address space, which is
             | why one process cannot read another's memory. The CPU
             | implements these address spaces in hardware, since
             | literally every memory read or write needs to have its
             | address translated from virtual to physical.
             | 
             | The CPU's address translation process relies on tables that
             | the OS sets up. For instance, one table entry might say
             | that the 4K memory chunk with virtual address
             | 0x21000-0x21fff maps to physical address 0xf56e3000, and is
             | both executable and read-only. So yes, the OS sets up the
             | tables, but the hardware implements the protection.
             | 
             | Since memory protection is a hardware feature, the hardware
             | needs to decide how fine-grained the pages are. It's
             | possible to build a CPU with byte-level protection, but
             | this would be crazy-inefficient. Bigger pages mean less
             | translation work, but they can also create more wasted
             | space. Sizes in the 4K-64K range seem to offer good
             | tradeoffs for everyday workloads.
        
           | sweetjuly wrote:
           | Hmm, I'm not sure that's quite right. ARMv8 supports per TTBR
           | translation granules [1] and so you can have 4K and 16K user
           | processes coexisting under an arbitrary page size kernel by
           | just context switching TCR.TG0 at the same time as TTBR0.
           | There is no such thing as a global granule size.
           | 
           | [1]: https://arm.jonpalmisc.com/2023_09_sysreg/AArch64-tcr_el
           | 2#fi...
        
         | lxgr wrote:
         | > all old binaries break and emulators which emulate normal
         | systems with 4KB pages
         | 
         | Would it actually affect the kind of emulators present on
         | Android, i.e. largely software-only ones, as opposed to
         | hardware virtualizers making use of a CPU's vTLB?
         | 
         | Wine is famously not an emulator and as such doesn't really
         | exist/make sense on (non-x86) Android (as it would only be able
         | to execute ARM binaries, not x86 ones).
         | 
         | For the downvote: Genuinely curious here on which type of
         | emulator this could affect.
        
         | Zefiroj wrote:
         | The support for mTHP exists in upstream Linux, but the swap
         | story is not quite there yet. THP availability also needs work
         | and there are a few competing directions.
         | 
         | Supporting multiple page sizes well transparently is non-
         | trivial.
         | 
         | For a recent summary on one of the approaches, TAO (THP
         | Allocation Optimization), see this lwn article:
         | https://lwn.net/Articles/974636/
        
       | eyalitki wrote:
        | RHEL tried this in the past with 64KB pages on AArch64; it led
        | to MANY bugs all across the software stack, and they
        | eventually reverted it -
        | https://news.ycombinator.com/item?id=27513209.
        | 
        | I'm impressed by the effort on Google's side, yet I'll be
        | surprised if this effort pays off.
        
         | nektro wrote:
         | apple's m-series chips use a 16kb page size by default so the
         | state of things has improved significantly with software
         | wanting to support asahi and other related endeavors
        
         | kcb wrote:
         | Nvidia is pushing 64KB pages on their Grace-Hopper system.
        
         | rincebrain wrote:
         | I didn't realize they had reverted it, I used to run RHEL
         | builds on Pi systems to test for 64k page bugs because it's not
         | like there's a POWER SBC I could buy for this.
        
       | daghamm wrote:
       | Can someone explain those numbers to me?
       | 
        | 5-10% performance boost sounds huge. Wouldn't we have much
        | larger TLBs if page walks were really this expensive?
       | 
       | On the other hand 9% increase in memory usage also sounds huge.
       | How did this affect memory usage that much?
        
         | scottlamb wrote:
          | > 5-10% performance boost sounds huge. Wouldn't we have much
          | larger TLBs if page walks were really this expensive?
         | 
         | It's pretty typical for large programs to spend 15+% of their
         | "CPU time" waiting for the TLB. [1] So larger pages really
         | help, including changing the base 4 KiB -> 16 KiB (4x reduction
         | in TLB pressure) and using 2 MiB huge pages (512x reduction
         | where it works out).
         | 
         | I've also wondered why the TLB isn't larger.
         | 
         | > On the other hand 9% increase in memory usage also sounds
         | huge. How did this affect memory usage that much?
         | 
         | This is the granularity at which physical memory is assigned,
         | and there are a lot of reasons most of a page might be wasted:
         | 
          | * The heap allocator will typically cram many things
          | together in a page, but it might, say, only use a given page
          | for allocations in a certain size range, so not all
          | allocations will snuggle in next to each other.
         | 
         | * Program stacks each use at least one distinct page of
         | physical RAM because they're placed in distinct virtual address
          | ranges with guard pages between. So if you have 1,024
          | threads, they use at least 4 MiB of RAM with 4 KiB pages and
          | 16 MiB of RAM with 16 KiB pages.
         | 
         | * Anything from the filesystem that is cached in RAM ends up in
         | the page cache, and true to the name, it has page granularity.
         | So caching a 1-byte file would take 4 KiB before, 16 KiB after.
         | 
         | [1] If you have an Intel CPU, toplev is particularly nice for
         | pointing this kind of thing out.
         | https://github.com/andikleen/pmu-tools
        
           | 95014_refugee wrote:
           | > I've also wondered why the TLB isn't larger.
           | 
           | Fast CAMs are (relatively) expensive, is the excuse I always
           | hear.
        
       | ein0p wrote:
       | No mention of Apple on the page. Apple has been using 16K pages
       | for years now.
        
       | dboreham wrote:
       | Time to grab some THP popcorn...
        
       | taeric wrote:
       | I see they have measured improvements in the performance of some
       | things. In particular, the camera app starts faster. Small
       | percentage, but still real.
       | 
       | Curious if there are any other changes you could do based on some
       | of those learnings? The camera app, in particular, seems like a
       | good one to optimize to start instantly. Especially so with the
        | shortcut "double power key" that many phones/people have set
        | up.
       | 
       | Specifically, I would expect you should be able to do something
       | like the lisp norm of "dump image?" Startup should then largely
       | be loading the image, not executing much if any initialization
       | code? (Honestly, I mostly assume this already happens?)
        
       | quotemstr wrote:
       | Good. It's about time. 4KB pages come down to us from 32-bit time
       | immemorial. We didn't bump the page size when we doubled the
       | sizes of pointers and longs for the 64-bit transition. 4KB has
       | been way too small for ages, and I'm glad we're biting the minor
       | compatibility bullet and adopting a page size more suited to
       | modern computing.
        
       | lxgr wrote:
       | Now I wonder: Does increased page size have any negative impacts
       | on I/O performance or flash lifetime, e.g. for writebacks of
       | dirty pages of memory-mapped files where only a small part was
       | changed?
       | 
       | Or is the write granularity of modern managed flash devices (such
       | as eMMCs as used in Android smartphones) much larger than either
       | 4 or 16 kB anyway?
        
         | tadfisher wrote:
          | Flash controllers expose blocks of 512B or 4096B, but the
         | actual NAND chips operate in terms of "erase blocks" which
         | range from 1MB to 8MB (or really anything); in these blocks, an
          | individual bit can be flipped from "1" to "0" once, and
          | flipping any bit back to "1" requires erasing the entire
          | block [0].
         | 
         | All of this is hidden from the host by the NAND controller, and
         | SSDs employ many strategies (including DRAM caching,
         | heterogeneous NAND dies, wear-leveling and garbage-collection
         | algorithms) to avoid wearing the storage NAND. Effectively you
         | must treat flash storage devices as block devices of their
         | advertised block size because you have no idea where your data
         | ends up physically on the device, so any host-side algorithm is
         | fairly worthless.
         | 
         | [0]: https://spdk.io/doc/ssd_internals.html
        
           | lxgr wrote:
           | Writes on NAND happen at the block, not the page level,
           | though. I believe the ratio between the two is usually
           | something like 1:8 or so.
           | 
           | Even blocks might still be larger than 4KB, but if they're
           | not, presumably a NAND controller could allow such smaller
           | writes to avoid write amplification?
           | 
           | The mapping between physical and logical block address is
           | complex anyway because of wear leveling and bad block
           | management, so I don't think there's a need for write
           | granularity to be the erase block/page or even write block
           | size.
        
       ___________________________________________________________________
       (page generated 2024-08-23 23:00 UTC)