[HN Gopher] Metal shader converter and the missing device-scoped...
___________________________________________________________________
Metal shader converter and the missing device-scoped barrier
Author : raphlinus
Score : 25 points
Date : 2023-06-12 18:40 UTC (1 day ago)
(HTM) web link (raphlinus.github.io)
(TXT) w3m dump (raphlinus.github.io)
| tedunangst wrote:
| So how [well] does MoltenVK work? The prevailing attitude I've
| seen is basically "just target Vulkan for everything because it
| just works", but I'm not sure how much experience is reflected in
| such claims.
| raphlinus wrote:
| If you're doing advanced compute work (including lock-free data
| structures), then it's best effort.
|
| https://github.com/linebender/vello/issues/42 is an issue from
| when Vello (then piet-gpu) had a single-pass prefix sum
| algorithm. Looking back, I'm fairly confident that it's a
| shader translation issue and that it wouldn't work with
| MoltenVK either, but we stopped investigating when we moved to
| a more robustly portable approach.
| bronxbomber92 wrote:
| I believe this post is referring to device-scoped _memory_
| barriers - also sometimes called fences - as opposed to
| _execution_ barriers.
|
| The former is a mechanism to ensure memory accesses follow a
| well-defined order (e.g. it'd be bad if the memory accesses
| executed inside a critical section could be reordered before the
| lock call or after the unlock call).
|
| The latter is a mechanism that ensures all threads (within
| some scope, perhaps all threads running on the "device") reach
| the same point in the program before any are allowed to proceed.
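
To make the distinction above concrete, here is a minimal CPU-side sketch in Rust. It is an analogy only, not GPU code: std::sync::atomic::fence stands in for a memory barrier and std::sync::Barrier stands in for an execution barrier. The function names publish_with_fence and sum_after_barrier are hypothetical, chosen for illustration.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Memory barrier (fence): orders the payload write before the flag write.
/// It does not make any thread wait; the consumer polls the flag itself.
fn publish_with_fence(value: u32) -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));
    let (d, r) = (Arc::clone(&data), Arc::clone(&ready));
    let producer = thread::spawn(move || {
        d.store(value, Ordering::Relaxed);
        fence(Ordering::Release); // payload may not be reordered after the flag
        r.store(true, Ordering::Relaxed);
    });
    while !ready.load(Ordering::Relaxed) {
        std::hint::spin_loop();
    }
    fence(Ordering::Acquire); // payload read may not be reordered before the flag
    let seen = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    seen
}

/// Execution barrier: no thread proceeds past `wait` until all `n` arrive,
/// so every thread can then read every other thread's slot.
fn sum_after_barrier(n: usize) -> u32 {
    assert!(n > 0);
    let barrier = Arc::new(Barrier::new(n));
    let slots: Arc<Vec<AtomicU32>> =
        Arc::new((0..n).map(|_| AtomicU32::new(0)).collect());
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let (b, s) = (Arc::clone(&barrier), Arc::clone(&slots));
            thread::spawn(move || {
                s[i].store(i as u32 + 1, Ordering::Relaxed);
                b.wait(); // rendezvous point for all n threads
                s.iter().map(|x| x.load(Ordering::Relaxed)).sum::<u32>()
            })
        })
        .collect();
    // Every thread should compute the same total; take the minimum as a check.
    handles.into_iter().map(|h| h.join().unwrap()).min().unwrap()
}
```

In this sketch the fence version never blocks the producer, while the barrier version blocks every thread until the last one arrives. Calling sum_after_barrier(4) should give each thread the full total 1 + 2 + 3 + 4.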
| raphlinus wrote:
| That's correct, it's the _memory scope_ that I expect to be
| device-scoped. GPUs tend not to have execution barriers in the
| shader language beyond workgroup scope; generally the next
| coarser granularity for synchronization is a separate dispatch.
| However, single-pass prefix sum algorithms, including decoupled
| look-back, can function just fine with device-scoped memory
| barriers, and do not require execution barriers with coarser
| scope than workgroup.
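
As a rough illustration of the point above, the following Rust sketch simulates a decoupled look-back prefix sum on the CPU. It is a sketch under assumptions, not Vello's shader: worker threads stand in for workgroups, release/acquire atomics stand in for device-scoped memory barriers, and the names PartitionState, worker, and prefix_sum_lookback are hypothetical. Note that no barrier spanning all workers is used; partitions only publish and poll flags.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::thread;

const FLAG_NOT_READY: u32 = 0;
const FLAG_AGGREGATE: u32 = 1; // this partition's local sum is published
const FLAG_PREFIX: u32 = 2; // inclusive prefix through this partition is published

struct PartitionState {
    flag: AtomicU32,
    aggregate: AtomicU32,
    prefix: AtomicU32,
}

fn worker(
    input: &[u32],
    part_size: usize,
    state: &[PartitionState],
    next_part: &AtomicUsize,
) -> Vec<(usize, Vec<u32>)> {
    let mut done = Vec::new();
    loop {
        // Claim partitions in increasing order, so every predecessor is
        // already owned by some running worker (needed for forward progress).
        let part = next_part.fetch_add(1, Ordering::Relaxed);
        if part >= state.len() {
            return done;
        }
        let start = part * part_size;
        let chunk = &input[start..(start + part_size).min(input.len())];

        // Local inclusive scan of this partition.
        let mut aggregate = 0u32;
        let mut local: Vec<u32> = chunk
            .iter()
            .map(|&x| {
                aggregate = aggregate.wrapping_add(x);
                aggregate
            })
            .collect();

        // Publish the aggregate; the Release store orders the payload first.
        state[part].aggregate.store(aggregate, Ordering::Relaxed);
        state[part].flag.store(FLAG_AGGREGATE, Ordering::Release);

        // Look back over predecessors until an inclusive prefix is found.
        let mut exclusive = 0u32;
        let mut look = part;
        while look > 0 {
            let prev = &state[look - 1];
            match prev.flag.load(Ordering::Acquire) {
                FLAG_PREFIX => {
                    exclusive = exclusive.wrapping_add(prev.prefix.load(Ordering::Relaxed));
                    break;
                }
                FLAG_AGGREGATE => {
                    exclusive = exclusive.wrapping_add(prev.aggregate.load(Ordering::Relaxed));
                    look -= 1;
                }
                _ => std::hint::spin_loop(), // predecessor not ready yet; poll again
            }
        }

        // Publish the inclusive prefix so later partitions can stop here.
        state[part].prefix.store(exclusive.wrapping_add(aggregate), Ordering::Relaxed);
        state[part].flag.store(FLAG_PREFIX, Ordering::Release);

        for v in &mut local {
            *v = v.wrapping_add(exclusive);
        }
        done.push((part, local));
    }
}

pub fn prefix_sum_lookback(input: &[u32], part_size: usize, workers: usize) -> Vec<u32> {
    assert!(part_size > 0 && workers > 0);
    let n_parts = (input.len() + part_size - 1) / part_size;
    let state: Vec<PartitionState> = (0..n_parts)
        .map(|_| PartitionState {
            flag: AtomicU32::new(FLAG_NOT_READY),
            aggregate: AtomicU32::new(0),
            prefix: AtomicU32::new(0),
        })
        .collect();
    let next_part = AtomicUsize::new(0);

    let pieces: Vec<(usize, Vec<u32>)> = thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|_| s.spawn(|| worker(input, part_size, &state, &next_part)))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });

    // Stitch the per-partition results back into one output vector.
    let mut output = vec![0u32; input.len()];
    for (part, scanned) in pieces {
        let start = part * part_size;
        output[start..start + scanned.len()].copy_from_slice(&scanned);
    }
    output
}
```

The CPU version relies on the OS scheduler to let a spinning worker's predecessor make progress; whether a given GPU offers a comparable guarantee, and whether its shader translation preserves the memory ordering, is exactly what is at issue in the article.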
| Animats wrote:
| Apple having to Think Different means we need about two more
| layers in portable games.
| richdodd wrote:
| Does the M1/M2 use ARM designs in the GPU as well as the CPU? If
| so, it might be possible to work out what could be implemented by
| looking at the
| [arm docs](https://developer.arm.com/documentation/102203/0100/Valhall-...).
| richdodd wrote:
| Hmm, OK, according to the documentation they designed the GPU
| themselves, so there's no public information on them.
| nicoburns wrote:
| No, they have a custom GPU design originally derived from
| Imagination Technologies PowerVR GPUs.
| raphlinus wrote:
| The most complete documentation is in the applegpu repo[1] by
| dougallj, which shows a great deal of recent activity (including
| by alyssarosenzweig). Last I checked, the documentation of barrier
| instructions wasn't complete enough to tell whether these
| device-scoped barriers are possible. (Note: on RDNA2, they're
| accomplished by DLC and GLC flags on memory accesses, combined
| with cache flush instructions such as S_GL1_INV).
|
| There's also a lot of great material, accessibly written, on
| Alyssa's blog[2], see in particular the posts titled
| "Dissecting the Apple M1 GPU, part ${I}".
|
| [1]: https://github.com/dougallj/applegpu
|
| [2]: https://rosenzweig.io/
| DeRock wrote:
| Apple doesn't use ARM IP for either, and hasn't for many years.
___________________________________________________________________
(page generated 2023-06-13 23:01 UTC)