[HN Gopher] TIL: Go's CompareAndSwap is not always Compare-and-swap
___________________________________________________________________
TIL: Go's CompareAndSwap is not always Compare-and-swap
Author : enz
Score : 15 points
Date : 2024-01-08 17:39 UTC (5 hours ago)
(HTM) web link (lu.sagebl.eu)
(TXT) w3m dump (lu.sagebl.eu)
| kevmo314 wrote:
| Related anecdote, a coworker suggested I use
| https://pkg.go.dev/math#FMA to optimize a multiply and add which
| surprised me quite a bit: why would there be an opt-in to fused
| multiply and add? Indeed, if you dive into the code
| (https://cs.opensource.google/go/go/+/refs/tags/go1.21.5:src/...)
| it's quite a bit more complicated than your normal a*x+b syntax,
| so how could this possibly yield a performance improvement?
|
| It turns out, with some more research
| (https://github.com/golang/go/issues/25819), that the function
| was added not to guarantee performance but to guarantee
| _precision_: a fused multiply-add rounds only once, so it yields
| higher precision than doing the operations stepwise, and in
| certain situations you'd like to guarantee that. Which is cool,
| but absolutely not what I would've guessed on first read, and the
| first commenter also closed the issue with the same take!
|
| So I was able to successfully argue against using math.FMA() as a
| performance optimization, and my small personal takeaway is to
| not optimize unless I really know what the thing is doing.
| dominikh wrote:
| Note that the source you're seeing there is the fallback
| implementation, which is only used if there is no instruction
| for FMA in the architecture you're compiling for. On AMD64, for
| example, the call to math.FMA will be replaced by the
| VFMADD231SD instruction.
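
A minimal sketch of the precision point made in this sub-thread, assuming inputs picked so that the exact product lands halfway between two float64 values. The helper names `inputs`, `stepwise` and `fused` are hypothetical and only serve the illustration. With these inputs the stepwise version loses the low bits when the product is rounded and returns 0, while math.FMA keeps them and returns -2^-54.

```go
package main

import (
	"fmt"
	"math"
)

// inputs returns x = 1 + 2^-27 and y = 1 - 2^-27. Their exact product,
// 1 - 2^-54, is not representable as a float64.
func inputs() (x, y float64) {
	eps := math.Ldexp(1, -27)
	return 1 + eps, 1 - eps
}

// stepwise rounds the product to float64 before subtracting. The explicit
// conversion keeps the compiler from fusing the two operations.
func stepwise() float64 {
	x, y := inputs()
	return float64(x*y) - 1
}

// fused computes x*y - 1 with a single rounding at the end.
func fused() float64 {
	x, y := inputs()
	return math.FMA(x, y, -1)
}

func main() {
	fmt.Println("stepwise:", stepwise())
	fmt.Println("fused:   ", fused())
}
```

The result should be the same whether math.FMA is lowered to a hardware instruction or runs the software fallback, since both are specified to round once.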
| wahern wrote:
| AFAIU, LL/SC is the more generic, powerful primitive. In theory
| LL/SC can be used as the hardware primitive for a much broader
| range of lock-free algorithms, as well as for software
| transactional memory generally. CAS algorithms are more commonly
| seen because CAS is the lowest common denominator, and the best x86
| offered. But because of the limited number of addresses that can
| be monitored in hardware without sacrificing performance or
| efficiency, in practice LL/SC implementations are weak and only
| slightly more useful than [double] CAS.
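
One way to see why LL/SC is considered the more general primitive is the classic ABA case, sketched below with Go's sync/atomic. The function name `abaCAS` is hypothetical, and the "other thread" is simulated inline in a single goroutine to keep the interleaving deterministic. CAS compares values only, so it succeeds even though the location was written twice in between; an idealized store-conditional would fail here because the monitored location was written after the load. Real LL/SC hardware differs in how precisely it tracks this.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// abaCAS simulates, in a single goroutine, the interleaving where a value
// changes from A to B and back to A between a load and a compare-and-swap.
// It reports whether the final CAS succeeded and the value left behind.
func abaCAS() (swapped bool, final int64) {
	var v atomic.Int64
	v.Store(1) // A

	old := v.Load() // "our" thread observes A

	// Stand-in for another thread's writes: A -> B -> A.
	v.Store(2)
	v.Store(1)

	// CAS only checks that the current value equals old, so it cannot
	// tell that v was modified in the meantime.
	swapped = v.CompareAndSwap(old, 3)
	return swapped, v.Load()
}

func main() {
	swapped, final := abaCAS()
	fmt.Println("swapped:", swapped, "final:", final)
}
```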
| perryizgr8 wrote:
| // Check support for LSE atomics
| MOVBU internal/cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4
| CBZ R4, load_store_loop
|
| Why is this a runtime decision? Shouldn't the compiler know if
| the target machine supports the instruction or not?
| enz wrote:
| I believe Go wants to support an "ARM64" target that just works
| across heterogeneous fleets of Arm machines (from Raspberry Pi to
| Graviton EC2 instances), so the programmer doesn't have to worry
| about it. GCC seems to do the same, according to the article
| linked on "MySQL on ARM": it emits code that dynamically
| decides to use either CAS or LL/SC depending on LSE
| support.
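
The load-flag-then-branch shape of that assembly can be sketched in plain Go. This is only an illustration of the dispatch pattern, not the runtime's code: `hasFastAtomics` is a hypothetical flag set by hand rather than detected from the CPU, `addInt64` is a made-up helper, and the fallback is a CAS retry loop standing in for the LL/SC loop.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hasFastAtomics is a hypothetical stand-in for a CPU feature flag such as
// ARM64HasATOMICS; here it is set by hand rather than detected.
var hasFastAtomics = true

// addInt64 adds delta to *p atomically and returns the new value, choosing
// the code path at run time from the flag.
func addInt64(p *int64, delta int64) int64 {
	if hasFastAtomics {
		// Stand-in for the single-instruction path.
		return atomic.AddInt64(p, delta)
	}
	// Stand-in for the fallback: retry until no other writer interfered.
	for {
		old := atomic.LoadInt64(p)
		if atomic.CompareAndSwapInt64(p, old, old+delta) {
			return old + delta
		}
	}
}

func main() {
	var n int64
	for _, fast := range []bool{true, false} {
		hasFastAtomics = fast
		fmt.Println("fast path:", fast, "value:", addInt64(&n, 5))
	}
}
```

Both paths give the same result, so one binary can run on machines with and without the feature at the cost of a flag check per call.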
___________________________________________________________________
(page generated 2024-01-08 23:01 UTC)