[HN Gopher] TIL: Go's CompareAndSwap is not always Compare-and-swap
       ___________________________________________________________________
        
       TIL: Go's CompareAndSwap is not always Compare-and-swap
        
       Author : enz
       Score  : 15 points
       Date   : 2024-01-08 17:39 UTC (5 hours ago)
        
 (HTM) web link (lu.sagebl.eu)
 (TXT) w3m dump (lu.sagebl.eu)
        
       | kevmo314 wrote:
       | Related anecdote, a coworker suggested I use
       | https://pkg.go.dev/math#FMA to optimize a multiply and add which
       | surprised me quite a bit: why would there be an opt-in to fused
       | multiply and add? Indeed, if you dive into the code
       | (https://cs.opensource.google/go/go/+/refs/tags/go1.21.5:src/...)
       | it's quite a bit more complicated than your normal a*x+b syntax,
       | so how could this possibly yield a performance improvement?
       | 
       | It turns out, with some more research
       | (https://github.com/golang/go/issues/25819), that the function
       | was added not to guarantee performance but to guarantee
       | _precision_, namely that fused multiply and add yields higher
       | precision than doing the operations stepwise and in certain
       | situations you'd like to guarantee precision. Which is cool, but
       | absolutely not what I would've guessed on first read, and the
       | first commenter also closed the issue with the same take!
       | 
       | So I was able to successfully argue against using math.FMA() as
       | a performance optimization, with a small personal takeaway: do
       | not optimize unless I really know what the thing is doing.
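
The precision point can be seen in a minimal sketch, assuming inputs chosen so that the exact product does not fit in a float64. The helper names `unfused` and `fused` are made up for illustration; `math.FMA` and the rounding behavior of an explicit `float64(...)` conversion are standard Go.

```go
package main

import (
	"fmt"
	"math"
)

// unfused computes x*y + z with the product rounded to float64 first.
// The explicit float64 conversion prevents the compiler from fusing the
// multiply and the add on architectures where it is otherwise allowed to.
func unfused(x, y, z float64) float64 {
	return float64(x*y) + z
}

// fused computes x*y + z with a single rounding at the very end.
func fused(x, y, z float64) float64 {
	return math.FMA(x, y, z)
}

func main() {
	x := 1 + math.Ldexp(1, -30) // 1 + 2^-30
	y := 1 - math.Ldexp(1, -30) // 1 - 2^-30
	z := -1.0

	// The exact product is 1 - 2^-60, which rounds to 1 in float64.
	fmt.Println(unfused(x, y, z)) // product rounded first, so the 2^-60 term is lost
	fmt.Println(fused(x, y, z))   // the exact result, -2^-60, survives
}
```

With these inputs the stepwise version returns exactly 0 while the fused version returns -2^-60, which is the kind of difference the linked issue is about.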
        
         | dominikh wrote:
         | Note that the source you're seeing there is the fallback
         | implementation, which is only used if there is no instruction
         | for FMA in the architecture you're compiling for. On AMD64, for
         | example, the call to math.FMA will be replaced by the
         | VFMADD231SD instruction.
        
       | wahern wrote:
       | AFAIU, LL/SC is the more generic, powerful primitive. In theory
       | LL/SC can be used as the hardware primitive for a much broader
       | range of lock-free algorithms, as well as for software
       | transactional memory generally. CAS algorithms are more commonly
       | seen because CAS is the lowest common denominator, and the best x86
       | offered. But because of the limited number of addresses that can
       | be monitored in hardware without sacrificing performance or
       | efficiency, in practice LL/SC implementations are weak and only
       | slightly more useful than [double] CAS.
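
For readers who have not written one, a typical CAS-based algorithm is a load, compute, compare-and-swap retry loop. The following is a minimal sketch using Go's `sync/atomic`; `storeMax` and `maxOf` are hypothetical helpers, not anything from the article. Note that CAS only compares values, so it cannot tell whether the word was changed and then changed back in between, which is something an ideal LL/SC pair would notice.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// storeMax raises *m to v if v is larger, using a CAS retry loop.
func storeMax(m *atomic.Int64, v int64) {
	for {
		cur := m.Load()
		if v <= cur {
			return // a value at least as large is already stored
		}
		if m.CompareAndSwap(cur, v) {
			return // m still held cur, so the swap went through
		}
		// Another goroutine changed m in between: reload and retry.
	}
}

// maxOf runs one goroutine per value and returns the largest value stored.
// The shared variable starts at 0, so this sketch assumes positive inputs.
func maxOf(vals []int64) int64 {
	var m atomic.Int64
	var wg sync.WaitGroup
	for _, v := range vals {
		wg.Add(1)
		go func(v int64) {
			defer wg.Done()
			storeMax(&m, v)
		}(v)
	}
	wg.Wait()
	return m.Load()
}

func main() {
	fmt.Println(maxOf([]int64{3, 41, 7, 12})) // prints 41
}
```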
        
       | perryizgr8 wrote:
       | // Check support for LSE atomics
       | MOVBU internal/cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4
       | CBZ R4, load_store_loop
       | 
       | Why is this a runtime decision? Shouldn't the compiler know if
       | the target machine supports the instruction or not?
        
         | enz wrote:
         | I believe Go wants to support an "ARM64" target that just works
         | across heterogeneous fleets of Arm machines (from Raspberry Pi
         | to Graviton EC2 instances) without the programmer having to
         | worry about it. GCC seems to do the same, judging by the article
         | linked on "MySQL on ARM": it emits code that dynamically
         | decides to use either CAS or LL/SC depending on LSE
         | support.
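
The check-a-flag, then branch pattern can be sketched in plain Go, which is only an analogy for the assembly quoted above. `hasSingleInstr` is a hypothetical stand-in for a CPU feature flag like the one the runtime reads, and both paths here are ordinary `sync/atomic` calls rather than real LSE or LL/SC instructions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// hasSingleInstr is a hypothetical feature flag. A real runtime would set
// it once at startup from CPU detection; here it is just a variable.
var hasSingleInstr = false

// addSingle stands in for the path that uses one dedicated instruction.
func addSingle(p *int64, d int64) int64 {
	return atomic.AddInt64(p, d)
}

// addLoop stands in for the fallback path built from a retry loop.
func addLoop(p *int64, d int64) int64 {
	for {
		old := atomic.LoadInt64(p)
		if atomic.CompareAndSwapInt64(p, old, old+d) {
			return old + d
		}
	}
}

// add picks a path at run time, so the same binary works on both kinds
// of machine at the cost of one predictable branch per call.
func add(p *int64, d int64) int64 {
	if hasSingleInstr {
		return addSingle(p, d)
	}
	return addLoop(p, d)
}

func main() {
	var n int64
	for _, flag := range []bool{false, true} {
		hasSingleInstr = flag
		fmt.Println(add(&n, 5)) // same result whichever path is taken
	}
}
```

The trade-off is the one described in the reply: a single build covers every machine, where a compile-time choice would need a separate build per CPU generation.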
        
       ___________________________________________________________________
       (page generated 2024-01-08 23:01 UTC)