Post ASRdEVhLH5SWRP5Kds by promovicz@chaos.social
(DIR) Post #ASRah61PWAI5LJPXtY by mjg59@nondeterministic.computer
2023-02-07T18:08:30Z
0 likes, 0 repeats
Supporting AVX-512 increases the amount of data you need to store across context switches, which increases per-thread overhead. Apple took an innovative approach to this - disable AVX-512 by default, wait for a thread to hit an illegal instruction, enable AVX-512 for that thread, and replay the instruction, so you only take the overhead for threads that *use* AVX-512. But that works badly for apps that follow Intel's guide to detecting whether AVX-512 is available. https://github.com/golang/go/issues/43089
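A minimal sketch of why the documented detection sequence clashes with enabling AVX-512 on first use. It models the check as a pure function over register values rather than executing CPUID/XGETBV, so it runs anywhere. The bit positions follow Intel's SDM; the function name and the sample register values are hypothetical, and this is not the code from the linked Go issue.

```python
# XCR0 state-component bits (Intel SDM).
XCR0_SSE = 1 << 1
XCR0_AVX = 1 << 2
XCR0_OPMASK = 1 << 5      # K0-7
XCR0_ZMM_HI256 = 1 << 6   # upper halves of ZMM0-15
XCR0_HI16_ZMM = 1 << 7    # ZMM16-31
XCR0_AVX512_STATE = XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM

CPUID1_ECX_OSXSAVE = 1 << 27   # OS has enabled XSAVE/XGETBV
CPUID7_EBX_AVX512F = 1 << 16   # CPU implements AVX-512 Foundation


def avx512_usable(cpuid1_ecx: int, cpuid7_ebx: int, xcr0: int) -> bool:
    """Hypothetical helper mirroring the usual three-step check."""
    if not (cpuid1_ecx & CPUID1_ECX_OSXSAVE):
        return False
    needed = XCR0_SSE | XCR0_AVX | XCR0_AVX512_STATE
    if (xcr0 & needed) != needed:
        # The OS has not (yet) enabled the AVX-512 state components.
        return False
    return bool(cpuid7_ebx & CPUID7_EBX_AVX512F)


# Sample inputs (assumed values, not read from real hardware).
ecx, ebx = CPUID1_ECX_OSXSAVE, CPUID7_EBX_AVX512F
xcr0_before_first_use = 0x07   # x87 + SSE + AVX only
xcr0_after_first_use = 0xE7    # plus opmask, ZMM_Hi256, Hi16_ZMM

print(avx512_usable(ecx, ebx, xcr0_before_first_use))
print(avx512_usable(ecx, ebx, xcr0_after_first_use))
```

With the AVX-512 bits of XCR0 left clear until a thread first faults, a program that runs this check at startup concludes AVX-512 is unavailable and never issues the instruction that would have enabled it.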
(DIR) Post #ASRavIoQl8Nn4B77Tc by kkarhan@mstdn.social
2023-02-07T18:11:13Z
0 likes, 0 repeats
@mjg59 I've yet to see the advantages of #AVX512 unless I can rely on its presence. Otherwise supporting it seems to be rather pointless.
(DIR) Post #ASRb3TBe4TbnYhrnua by mjg59@nondeterministic.computer
2023-02-07T18:12:11Z
0 likes, 0 repeats
Apple took this a step further by trying to optimise for whether they needed to restore the AVX-512 registers by simply checking whether ZMM0-31 were all 0 or not. Unfortunately it's legitimate to have ZMM0-31 all be 0 and still have state in K0-7, which then blows up https://github.com/golang/go/issues/49233
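A toy model of the shortcut described above, to show where it goes wrong. The class and function names are hypothetical and this is not Apple's kernel code; it only illustrates that "all ZMM registers are zero" does not imply "no AVX-512 state to restore".

```python
from dataclasses import dataclass, field


@dataclass
class Avx512State:
    # 32 vector registers and 8 opmask registers, modelled as integers.
    zmm: list = field(default_factory=lambda: [0] * 32)
    k: list = field(default_factory=lambda: [0] * 8)


def needs_restore_zmm_only(state: Avx512State) -> bool:
    """Shortcut as described in the post: look at ZMM0-31 only."""
    return any(state.zmm)


def needs_restore_zmm_and_k(state: Avx512State) -> bool:
    """Also treat live opmask registers K0-7 as state worth restoring."""
    return any(state.zmm) or any(state.k)


# A thread that has zeroed its vector registers but holds a live mask in K1.
thread_state = Avx512State()
thread_state.k[1] = 0b1010_1010

print(needs_restore_zmm_only(thread_state))   # shortcut skips the restore
print(needs_restore_zmm_and_k(thread_state))  # mask registers are noticed
```

In this model the ZMM-only check reports that nothing needs restoring even though K1 holds a value the thread expects to get back.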
(DIR) Post #ASRc8PxJbzb6wK5TNY by promovicz@chaos.social
2023-02-07T18:24:44Z
0 likes, 0 repeats
@mjg59 Isn't all of this supposed to be optimized through xsave/xrstor? Or are we past that for some reason?
(DIR) Post #ASRcRUe4Os7YzDBJ7g by mjg59@nondeterministic.computer
2023-02-07T18:28:24Z
0 likes, 0 repeats
@promovicz xsave will handle it, but you need a bigger buffer to store the state depending on which features are enabled
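A rough sketch of how the buffer grows with the enabled feature set, for the standard (non-compacted) XSAVE layout. The offsets and sizes in the table are values commonly seen on Intel parts and are an assumption here; real code should read them from CPUID leaf 0DH instead of hard-coding them. The function name is hypothetical.

```python
# Legacy x87/SSE region (512 bytes) plus the XSAVE header (64 bytes).
LEGACY_AND_HEADER = 512 + 64

# XCR0 bit -> (offset, size) in the standard-format XSAVE area.
# Assumed typical values; the authoritative source is CPUID leaf 0DH.
TYPICAL_COMPONENTS = {
    2: (576, 256),     # AVX: upper halves of YMM0-15
    5: (1088, 64),     # AVX-512 opmask K0-7
    6: (1152, 512),    # AVX-512 ZMM_Hi256
    7: (1664, 1024),   # AVX-512 Hi16_ZMM
}


def xsave_area_size(xcr0: int, components=TYPICAL_COMPONENTS) -> int:
    """Bytes needed to save every state component enabled in xcr0."""
    size = LEGACY_AND_HEADER
    for bit, (offset, length) in components.items():
        if xcr0 & (1 << bit):
            size = max(size, offset + length)
    return size


print(xsave_area_size(0x07))  # x87 + SSE + AVX
print(xsave_area_size(0xE7))  # plus the three AVX-512 components
```

Under these assumed offsets, enabling the AVX-512 components roughly triples the per-thread save area compared with AVX alone, which is the overhead the thread is discussing.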
(DIR) Post #ASRcsdL8BLyrlqGRlI by rotopenguin@mastodon.social
2023-02-07T18:33:05Z
0 likes, 0 repeats
@mjg59 doesn't using AVX-512 also require a drawn-out transition period for the processor to change clocking and voltages? I thought that you would have to explicitly ask the OS for AVX512 anyway, since it impacts the whole chip and for longer than your timeslice.
(DIR) Post #ASRd1y9o9XxGwOQ4zw by hyc@mastodon.social
2023-02-07T18:34:55Z
0 likes, 0 repeats
@mjg59 not really innovative, people have been doing this trick for decades. I did the same thing in a kernel for the Atari Falcon 030, to skip save/restore of 68882 floating point registers on context switch, back in late 1980s.
(DIR) Post #ASRdEVhLH5SWRP5Kds by promovicz@chaos.social
2023-02-07T18:37:11Z
0 likes, 0 repeats
@mjg59 I thought that this is standard at this point. I see the conundrum with regards to support flags though. This architecture is getting messy... 🙂
(DIR) Post #ASRdqRM3s3xByVDY0m by xlerb@sfba.social
2023-02-07T18:43:47Z
0 likes, 0 repeats
@mjg59 Lazy FPU restore in general is a pretty old and well-known technique, isn't it? This seems like a more or less straightforward extension of the idea; I wonder if Intel didn't foresee it, or had some other reason for being less than ideally helpful here. Meanwhile, trying to search for “lazy FPU restore” to find the history, I see it was involved in yet another Spectre variant (https://en.wikipedia.org/wiki/Lazy_FP_state_restore). Sigh.
(DIR) Post #ASRf16OeEXoa764izg by wordshaper@weatherishappening.network
2023-02-07T18:57:10Z
0 likes, 0 repeats
@mjg59 Oh, that is awesomely clever on Apple's part. No wonder it fails weirdly.
(DIR) Post #ASS5ocJ8yUWo59luLI by broonie@mastodon.social
2023-02-07T23:57:14Z
0 likes, 0 repeats
@mjg59 huh, that’s how we’ve always handled SVE and now SME for arm64 Linux, but then for us the architecture allows us to trap all usage including all discovery mechanisms so it all works transparently.
(DIR) Post #ASSSoJcElkNJ3VWmcy by TomF@mastodon.gamedev.place
2023-02-08T04:15:01Z
0 likes, 0 repeats
@mjg59 Yeah, well, they're holding it wrong.
(DIR) Post #ASSUWmecGZTboQPOHg by bsmaalders@mas.to
2023-02-08T04:34:18Z
0 likes, 0 repeats
@mjg59 Solaris did the same thing w/ FP in general, although the utility was reduced by the presence of FP variables in the default doprnt (iirc) routines. The kernel was off-limits for FP for speed reasons as well; later on, routines that wanted to use extended register save behavior enabled it explicitly only when needed (bcopy, crypto, etc.).
(DIR) Post #ASXfUvuuvVEVFQ9Xoe by resuna@ohai.social
2023-02-10T16:30:44Z
0 likes, 0 repeats
@mjg59 Since Intel Mac is a lame duck, and it should only be a performance issue, this should probably not be a priority.