[HN Gopher] From slow to SIMD: A Go optimization story
       ___________________________________________________________________
        
       From slow to SIMD: A Go optimization story
        
       Author : rbanffy
       Score  : 109 points
       Date   : 2024-01-23 17:57 UTC (5 hours ago)
        
 (HTM) web link (sourcegraph.com)
 (TXT) w3m dump (sourcegraph.com)
        
       | declan_roberts wrote:
        | I appreciate that there are so many engineers out there who
        | _love_ to make these kinds of improvements. I need them! I'm a
        | high-level person focused mostly on the product, and we would
        | never have a good product without these smart people.
       | 
       | Keep up the good work!
        
       | koala_man wrote:
       | TIL that the Go compiler still does not have autovectorization.
        
         | mcronce wrote:
         | They still don't have a register-based calling convention on
         | architectures other than x86-64, right? Or is that information
         | out of date?
        
           | mseepgood wrote:
            | That information is out of date; register-based calling
            | conventions are now also used on ARM64, PPC64, and RISC-V:
            | https://go.dev/src/cmd/compile/abi-internal
        
         | hoten wrote:
          | Maybe that's a blessing in disguise. I just ran into a nasty
          | MSVC autovectorization bug. Apparently it's hard to get right.
         | 
         | https://developercommunity.visualstudio.com/t/Bad-codegen-du...
        
         | jsheard wrote:
         | Is it really worth the trouble if you're not building on top of
         | something like LLVM which already has a vectorizer? We're still
          | waiting for the mythical sufficiently-smart vectorizer; even
          | the better ones are still extremely brittle, and any serious
          | high-performance work still does explicit SIMD rather than
          | trying to coax the vectorizer into cooperating.
         | 
          | I'd rather see new languages focus on making better explicit
          | SIMD abstractions a la Intel's ISPC, rather than writing yet
         | another magic vectorizer that only actually works in trivial
         | cases.
        
           | mgaunard wrote:
            | Any of the polyhedral frameworks is reasonably good at
            | splitting loop nests into parallelizable ones.
           | 
           | Then it's just a codegen problem.
           | 
           | But yes, ultimately, the user needs to be aware of how the
           | language works, what is parallelizable and what isn't, and of
           | the cost of the operations that they ask their computer to
           | execute.
        
           | neonsunset wrote:
           | C# is doing that :)
           | 
           | https://learn.microsoft.com/en-
           | us/dotnet/api/system.runtime....
           | 
           | Examples of usage:
           | 
           | - https://github.com/U8String/U8String/blob/main/Sources/U8St
           | r...
           | 
           | - https://github.com/nietras/1brc.cs/blob/main/src/Brc/BrcAcc
           | u...
           | 
           | - https://github.com/dotnet/runtime/blob/main/src/libraries/S
           | y...
           | 
           | (and many more if you search github for the uses of
           | Vector128/256<byte> and the like!)
        
         | neonsunset wrote:
          | Even manual vectorization is a pain... writing ASM, really?
         | 
          | Rust has unstable portable SIMD and a few third-party crates,
          | and C++ has that as well. C# has _stable_ portable SIMD and a
          | small out-of-the-box BLAS-like library to help with the most
          | common tasks (SoftMax, Magnitude, etc. over spans of floats,
          | instead of writing them manually); hell, it even exercises
          | PackedSIMD when run in a browser. And now Java is getting
          | Panama vectors some time in the future (though the question
          | of codegen quality stands open given planned changes to the
          | unsafe API).
         | 
          | Go among these is uniquely disadvantaged. And if that's not
          | enough, you may want to visit the 1BRC challenge discussions
          | and see that Go _struggles_ to get anywhere close to the 2s
          | mark while both C# and C++ blaze past it:
         | 
         | https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
         | 
         | https://github.com/gunnarmorling/1brc/discussions/67
        
         | icholy wrote:
         | Last time I checked, it didn't even unroll loops.
        
       | carbocation wrote:
       | I wonder whether avo could have been useful here?[1] I mention it
       | because it came up the last time we were talking about AVX
        | operations in Go.[2, 3]
       | 
       | 1 = https://github.com/mmcloughlin/avo
       | 
       | 2 = https://news.ycombinator.com/item?id=34465297
       | 
       | 3 =
       | https://www.reddit.com/r/golang/comments/10hmh07/how_to_use_...
        
         | camdencheek wrote:
         | I did consider Avo! I even went as far as to implement a
         | version using Avo since it has a nice dot product example I
         | could use as a starting point. But ultimately, for as small as
         | these functions are, I felt that Avo was an unnecessary extra
         | layer to grok. Additionally, it's x86-only, and I knew in
         | advance I'd want to implement an ARM version as well since we
         | also do some embeddings stuff locally.
         | 
         | If I were to ever take this further and add loop unrolling or
          | something, I'd absolutely reach for Avo.
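          | 
          | For reference, Avo generates Go assembly from a Go program; a
          | minimal sketch modeled on Avo's README example (the real dot
          | product generator is more involved):
          | 
          |     //go:build ignore
          | 
          |     package main
          | 
          |     import . "github.com/mmcloughlin/avo/build"
          | 
          |     func main() {
          |         // Declare the Go signature the assembly implements.
          |         TEXT("Add", NOSPLIT, "func(x, y uint64) uint64")
          |         Doc("Add adds x and y.")
          |         x := Load(Param("x"), GP64())
          |         y := Load(Param("y"), GP64())
          |         ADDQ(x, y)
          |         Store(y, ReturnIndex(0))
          |         RET()
          |         Generate()
          |     }
          | 
          | Running it with `go run asm.go -out add.s -stubs stub.go`
          | emits the .s file plus matching Go stub declarations.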
        
       | miki123211 wrote:
       | I wonder how well a simple C for loop with -O3 and maybe -march
       | would do here.
       | 
       | From my brief forays into reading (mostly AARCH64) assembly, it
       | looks like C compilers can detect these kinds of patterns now and
       | just convert them all to SIMD by themselves, with no work from
       | the programmer. Even at -O2, converting an index-based loop into
        | one based on start and end pointers is not unusual. Go doesn't
        | seem to do this; the assembly output by the Go compiler looks
        | much closer to the actual code than what you get from C.
       | 
        | Rust iterators would also be fun to benchmark here; they're
        | supposed to be as fast as plain old loops, and they're probably
        | optimized to omit bounds checks entirely.
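        | 
        | For reference, the scalar loop under discussion has roughly
        | this shape in Go (an illustrative reconstruction, not the
        | post's exact code):
        | 
        |     package dot
        | 
        |     // DotNaive takes the shorter of the two lengths and
        |     // accumulates pairwise products, one at a time.
        |     func DotNaive(a, b []float32) float32 {
        |         n := len(a)
        |         if len(b) < n {
        |             n = len(b)
        |         }
        |         var sum float32
        |         for i := 0; i < n; i++ {
        |             sum += a[i] * b[i]
        |         }
        |         return sum
        |     }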
        
         | mratsim wrote:
         | It depends.
         | 
          | You need 2-3 accumulators to saturate instruction-level
          | parallelism with a parallel sum reduction. But the compiler
          | won't do this on its own, because the transform is only valid
          | when the operation is associative, i.e. (a+b)+c = a+(b+c),
          | which is true for integers but not for floats.
         | 
         | There is an escape hatch in -ffast-math.
         | 
         | I have extensive benches on this here:
         | https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...
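          | 
          | In Go terms, the hand-written version of that transform looks
          | roughly like this (a sketch, not code from the post):
          | 
          |     package dot
          | 
          |     // DotUnrolled splits the reduction across four
          |     // independent accumulators so the additions can overlap
          |     // in the pipeline. Compilers won't do this for floats on
          |     // their own because it reorders the additions.
          |     func DotUnrolled(a, b []float32) float32 {
          |         var s0, s1, s2, s3 float32
          |         i := 0
          |         for ; i+4 <= len(a) && i+4 <= len(b); i += 4 {
          |             s0 += a[i] * b[i]
          |             s1 += a[i+1] * b[i+1]
          |             s2 += a[i+2] * b[i+2]
          |             s3 += a[i+3] * b[i+3]
          |         }
          |         for ; i < len(a) && i < len(b); i++ { // remainder
          |             s0 += a[i] * b[i]
          |         }
          |         return (s0 + s1) + (s2 + s3)
          |     }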
        
         | steveklabnik wrote:
         | > Rust iterators would also be fun to benchmark here
         | 
          | I started to write this out, and then thought "you know what,
          | given how common this is, I bet I could even just google it,"
          | which seemed more interesting, as it makes it feel more "real
          | world." The first result I got is what I would have written:
          | https://stackoverflow.com/a/30422958/24817
         | 
          | Here's a godbolt with three different outputs: one at -O, one
          | at -O3, and one at -O3 and -march=native.
         | 
         | https://godbolt.org/z/6xf9M1cf3
         | 
          | Eyeballing it, some comments:
         | 
         | Looks like 2 and 3 both provide extremely similar if not
         | identical output.
         | 
          | Adding the native flag generates slightly different codegen;
          | I am not at the level of being able to simply look at it and
          | know how meaningful the difference is.
         | 
         | It does appear to have eliminated the bounds check entirely,
         | and it's using xmm registers.
         | 
         | I am pleasantly surprised at this output because zip in
         | particular can sometimes hinder optimizations, but rustc did a
         | great job here.
         | 
         | ----------------------
         | 
          | For fun, I figured "why not also try as direct a translation
          | of the original Go as possible?" The only trick here is that
          | Rust doesn't really do the C-style for loop the way Go does,
          | so I tried to translate what I saw as the spirit of the
          | example: compare the two lengths and use the minimum for the
          | loop length.
         | 
         | Here it is: https://godbolt.org/z/cTcddc8Gs
         | 
         | ... literally the same. I am very surprised at this outcome. It
         | makes me wonder if LLVM has some sort of idiom recognition for
         | dot product specifically.
         | 
          | EDIT: looks like it does not currently; see the comment at
          | lines 28 and 29:
         | https://llvm.org/doxygen/LoopIdiomRecognize_8cpp_source.html
        
           | camel-cdr wrote:
            | Those aren't vectorized at all, just unrolled. vmulss/vaddss
            | just multiply/add a single single-precision floating-point
            | value in the vector register.
           | 
            | With clang you get basically the same codegen, although it
            | uses fused multiply-adds.
           | 
            | The problem is that you need to enable -ffast-math;
            | otherwise the compiler can't change the order of floating-
            | point operations, and thus can't vectorize.
           | 
            | With clang that works wonderfully and gives us a lovely
            | four-times-unrolled AVX2 fused-multiply-add loop, but
            | enabling it in Rust doesn't seem to work:
           | https://godbolt.org/z/G4Enf59Kb
           | 
           | Edit: from what I can tell this is still an open issue???
           | https://github.com/rust-lang/rust/issues/21690
           | 
           | Edit: relevant SO post:
           | https://stackoverflow.com/questions/76055058/why-cant-the-
           | ru... Apparently you need to use
           | `#![feature(core_intrinsics)]`, `std::intrinsics::fadd_fast`
           | and `std::intrinsics::fmul_fast`.
        
             | steveklabnik wrote:
             | Ahh yes, thank you.
             | 
              | Rust doesn't have a -ffast-math flag, though it is
              | interesting that you passed it directly to LLVM. I am
              | kinda glad that escape hatch doesn't work, to be honest.
             | 
             | There are currently unstable intrinsics that let you do
             | this, and you seemingly get close to clang codegen with
             | them: https://godbolt.org/z/EEW79Gbxv
             | 
             | The thread tracking this discusses another attempt at a
             | flag to enable this by turning on the CPU feature directly,
             | but that doesn't seem to affect codegen in this case.
             | https://github.com/rust-lang/rust/issues/21690
             | 
             | It would be nice to get these intrinsics stabilized, at
             | least.
             | 
             | EDIT: oops you figured this out while I was writing it,
             | haha.
        
               | cbm-vic-20 wrote:
               | Even without a -ffast-math flag, the current stable Rust
               | compiler will vectorize loops on integer types.
               | 
               | https://godbolt.org/z/KjErzacfv
               | 
               | Edit: ...and I now realize who I responded to, I'm sure
               | you already know this. :)
        
             | nemothekid wrote:
              | It was just last week that I was reading a comment that
              | made it seem like you shouldn't really use
              | -ffast-math[0], but this looks like a not-so-rare reason
              | why you would want to enable it.
             | 
              | What is the correct idiom here? It feels like if this
              | sort of thing really matters to you, you should have the
              | know-how to hand-roll a couple of lines of ASM. I want to
              | say this is rare, but I had a project a couple of years
              | ago where I needed to hand-roll some vectorized
              | instructions on a Raspberry Pi.
             | 
             | [0] https://news.ycombinator.com/item?id=39013277
        
               | steveklabnik wrote:
                | The right path forward for Rust here, in my opinion, is
                | to do the same thing as is done for math operations like
                | saturating arithmetic: stabilize a function or method
                | that performs the operation with this semantic, and then
                | build thin wrappers on top to make using them more
                | convenient.
        
               | camel-cdr wrote:
                | Imo the best solution would be special "fast" floating-
                | point types that have less strict requirements.
               | 
                | I personally almost always use -ffast-math by default
                | in my C programs that care about performance, because I
                | almost never care enough about the loss in accuracy.
                | The only case where I remember needing strict behavior
                | was when doing some random-number distribution tests
                | where I cared about subnormals, and I got confused for
                | a second because they didn't seem to exist (-ffast-math
                | disables them on x86).
        
               | sevagh wrote:
               | Definitely don't take an HN comment as a serious
                | suggestion. Enable fast-math for your code, run an
                | objective evaluation suitable for your domain, and if
                | it passes the test, enjoy the added speed.
               | 
               | FWIW I have oodles of numerical C++ code where fast-math
               | doesn't change the output.
        
             | sroussey wrote:
          | fast-math can change the behavior even of libraries not
          | compiled with it (linking it in can set the FPU's flush-to-
          | zero mode process-wide), so buyer beware.
        
       | zeehio wrote:
        | Isn't there a BLAS equivalent in Go that could help here?
        
         | camdencheek wrote:
         | Copying in a response to a similar question on Reddit:
         | 
          | I should really add some discussion around BLAS in
          | particular, which has a good implementation[0] of the float32
          | dot product that outperforms any of the float32
          | implementations in the blog post. I'm getting ~1.9M vecs/s on
          | my benchmarking rig.
         | 
          | However, BLAS became unusable for us as soon as we switched
          | to quantized vectors, because there is no int8 implementation
          | of the dot product in BLAS (though I'd love to be proven
          | wrong).
         | 
         | [0]:
         | https://pkg.go.dev/gonum.org/v1/gonum@v0.14.0/blas/blas32#Do...
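          | 
          | For anyone wanting to try it, usage looks roughly like this
          | (a sketch, assuming the gonum v0.14 blas32 API linked above):
          | 
          |     package main
          | 
          |     import (
          |         "fmt"
          | 
          |         "gonum.org/v1/gonum/blas/blas32"
          |     )
          | 
          |     func main() {
          |         a := []float32{1, 2, 3, 4}
          |         b := []float32{5, 6, 7, 8}
          | 
          |         // Dot computes the float32 dot product via gonum's
          |         // assembly-accelerated level-1 BLAS routine.
          |         dot := blas32.Dot(
          |             blas32.Vector{N: len(a), Inc: 1, Data: a},
          |             blas32.Vector{N: len(b), Inc: 1, Data: b},
          |         )
          |         fmt.Println(dot) // 70
          |     }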
        
           | mkhnews wrote:
           | OpenBLAS with INTERFACE64=1 ??
        
             | camdencheek wrote:
              | AFAICT, that still doesn't support an 8-bit integer dot
              | product? (To clarify, I meant int8 as in 8-bit integer,
              | not 8-byte; INTERFACE64 only controls the integer index
              | width.)
        
       | mkhnews wrote:
        | Great stuff. We used to do this sort of optimization for dot,
        | axpy, gemm, etc. in the BLAS and LAPACK libs a long time ago.
        
       | malkia wrote:
        | I learned yesterday about Go's assembler
        | (https://go.dev/doc/asm) while browsing how Arrow is
        | implemented for different languages (my experience is mainly
        | C/C++): https://github.com/apache/arrow/tree/main/go/arrow/math
        | - there are a bunch of .s ("asm") files, and I'm still not able
        | to comprehend how these work exactly (I guess it'll take more
        | reading) - it seems very peculiar.
       | 
        | The last time I used inline assembly was back in Turbo/Borland
        | Pascal, then a bit in Visual Studio (32-bit) until it got
        | disabled. Then I did very little with gcc and its stricter
        | specification (with the former you had to know how the ABI
        | worked; with the latter too, but at least it was specced out).
       | 
        | Anyway - I wasn't expecting to find this in "Go" :) But I guess
        | you can always start with .go code, produce assembly (-S), then
        | optimize it, or find/hire someone to do it.
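        | 
        | For a sense of how those .s files hook up: you declare a Go
        | function with no body, and the Plan 9-style assembly in the .s
        | file provides it. A minimal (hypothetical) amd64 example:
        | 
        |     // add.go
        |     package mathx
        | 
        |     // Add is implemented in add_amd64.s.
        |     func Add(x, y int64) int64
        | 
        |     // add_amd64.s
        |     #include "textflag.h"
        | 
        |     // func Add(x, y int64) int64
        |     TEXT ·Add(SB), NOSPLIT, $0-24
        |         MOVQ x+0(FP), AX   // load first argument
        |         MOVQ y+8(FP), BX   // load second argument
        |         ADDQ BX, AX
        |         MOVQ AX, ret+16(FP) // store the result slot
        |         RET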
        
       | blobbers wrote:
       | This is the type of optimization we used to see in embedded
       | hardware constrained environments.
       | 
        | Nowadays, even on most embedded platforms, resources are so
        | abundant and applications are HORIZONTALLY scaled that it feels
        | like nobody bothers with VERTICAL scaling.
       | 
        | The thing about multiplication is that it's commutative: an 8x
        | improvement in the performance of a single node means you need
        | 8x fewer nodes. What a fun way to cut down your AWS bill and
        | send your SaaS company to profitability.
        
         | whatshisface wrote:
          | I don't know the truth of it, but what people say is that
          | most unprofitable SaaS startups are the way they are because
          | they're making _no money at all_, not because their costs are
          | higher than their 2x-one-developer's-compensation revenues.
        
           | __loam wrote:
            | I think there are plenty of SaaS companies out there that
            | are subsidizing what their customers are paying them with
            | VC cash. I've been at 2 SaaS companies where infrastructure
            | optimization was starting to be a real concern. If your
            | unit economics are in the red and additional customers cost
            | you money, you need to optimize somehow.
        
       | Thaliana wrote:
       | > why is it significant that we slice like a[i:i+4:i+4] rather
       | than just a[i:i+4]?
       | 
        | Well, I had never seen that "full slice" expression syntax
        | before. It turns out to be important because it controls the
        | capacity of the new slice: the capacity is now i+4 - i = 4.
       | 
        | So by using the full slice expression you get a slice of length
        | 4 and capacity 4. Without the third index, the capacity would
        | extend to the end of the original slice.
       | 
        | I suppose that by controlling the capacity you eliminate the
        | bounds check.
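        | 
        | A quick illustration of the difference (illustrative code, not
        | from the post):
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     func main() {
        |         a := make([]float32, 16)
        | 
        |         s1 := a[4:8]   // capacity runs to the end of a
        |         s2 := a[4:8:8] // third index caps capacity at 8-4
        | 
        |         fmt.Println(len(s1), cap(s1)) // 4 12
        |         fmt.Println(len(s2), cap(s2)) // 4 4
        |     }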
        
       | jeremycw wrote:
        | The overhead of cgo seems like it would not be an issue if you
        | could pass in an array of vectors and move the outer loop into
        | the procedure, instead of calling the procedure once per
        | vector. This may only be viable if there is a low-overhead way
        | to pass arrays back and forth between Go and C. I'm no expert,
        | but a quick google makes me think this is possible.
       | 
        | I get that hand-rolling assembly is fun, but as a theoretical
        | future maintainer I would much prefer C to ASM.
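        | 
        | A batched binding along those lines might look like this (a
        | sketch; dot_batch is a hypothetical C kernel, not code from the
        | post):
        | 
        |     package dot
        | 
        |     /*
        |     #include <stdint.h>
        | 
        |     // Hypothetical batched kernel: one C call computes `count`
        |     // dot products over contiguous `dims`-long float vectors.
        |     static void dot_batch(const float *a, const float *b,
        |                           int64_t dims, int64_t count,
        |                           float *out) {
        |         for (int64_t i = 0; i < count; i++) {
        |             float sum = 0;
        |             for (int64_t j = 0; j < dims; j++) {
        |                 sum += a[i*dims+j] * b[i*dims+j];
        |             }
        |             out[i] = sum;
        |         }
        |     }
        |     */
        |     import "C"
        | 
        |     import "unsafe"
        | 
        |     // DotBatch crosses the Go/C boundary once per batch,
        |     // amortizing the fixed cgo call cost over len(out) pairs.
        |     func DotBatch(a, b []float32, dims int, out []float32) {
        |         C.dot_batch(
        |             (*C.float)(unsafe.Pointer(&a[0])),
        |             (*C.float)(unsafe.Pointer(&b[0])),
        |             C.int64_t(dims), C.int64_t(len(out)),
        |             (*C.float)(unsafe.Pointer(&out[0])),
        |         )
        |     }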
        
         | camdencheek wrote:
         | Definitely possible! However, I prefer to avoid cgo wherever
         | possible for more than just the overhead.
         | 
         | In my experience:
         | 
         | - It complicates builds by requiring a C toolchain
         | 
         | - It makes single, static binaries more difficult
         | 
         | - It makes portability more difficult (not that assembly is
         | portable though)
         | 
          | - It causes difficult-to-debug issues (I recently ran into an
          | issue where macOS signing changed, causing all my cgo
          | binaries to be killed on startup)
         | 
          | - Debuggers don't work across cgo boundaries (they do with Go
          | ASM!)
         | 
         | I think Dave Cheney said it best:
         | https://dave.cheney.net/2016/01/18/cgo-is-not-go
        
       | sroussey wrote:
        | For other languages (including Node.js/Bun/Rust/Python, etc.)
        | you can have a look at SimSIMD, which I have contributed to
        | this year (I made precompiled binaries for Node.js/Bun part of
        | the build process, for x86_64 and arm64 on Mac and Linux, and
        | for x86 and x86_64 on Windows).
       | 
       | [0] https://github.com/ashvardanian/SimSIMD
        
       ___________________________________________________________________
       (page generated 2024-01-23 23:00 UTC)