[HN Gopher] From slow to SIMD: A Go optimization story
___________________________________________________________________
From slow to SIMD: A Go optimization story
Author : rbanffy
Score : 109 points
Date : 2024-01-23 17:57 UTC (5 hours ago)
(HTM) web link (sourcegraph.com)
(TXT) w3m dump (sourcegraph.com)
| declan_roberts wrote:
| I appreciate that there are so many engineers out there who
| _love_ to make these kinds of improvements. I need them! I'm a
| high-level person focused mostly on the product, and we would
| never have a good product without these smart people.
|
| Keep up the good work!
| koala_man wrote:
| TIL that the Go compiler still does not have autovectorization.
| mcronce wrote:
| They still don't have a register-based calling convention on
| architectures other than x86-64, right? Or is that information
| out of date?
| mseepgood wrote:
| That information is out of date: ARM64, PPC64, and RISC-V now
| use it too.
| https://go.dev/src/cmd/compile/abi-internal
| hoten wrote:
| Maybe a blessing in disguise. I just ran into a nasty MSVC
| autovectorization bug. Apparently it's hard to get right.
|
| https://developercommunity.visualstudio.com/t/Bad-codegen-du...
| jsheard wrote:
| Is it really worth the trouble if you're not building on top of
| something like LLVM, which already has a vectorizer? We're
| still waiting for the mythical sufficiently-smart vectorizer;
| even the better ones are still extremely brittle, and any
| serious high-performance work still does explicit SIMD rather
| than trying to coax the vectorizer into cooperating.
|
| I'd rather see new languages focus on making better explicit
| SIMD abstractions à la Intel's ISPC, rather than writing yet
| another magic vectorizer that only actually works in trivial
| cases.
| mgaunard wrote:
| Any of the polyhedral frameworks is reasonably good at
| splitting loop nests into parallelizable ones.
|
| Then it's just a codegen problem.
|
| But yes, ultimately, the user needs to be aware of how the
| language works, what is parallelizable and what isn't, and of
| the cost of the operations that they ask their computer to
| execute.
| neonsunset wrote:
| C# is doing that :)
|
| https://learn.microsoft.com/en-us/dotnet/api/system.runtime....
|
| Examples of usage:
|
| - https://github.com/U8String/U8String/blob/main/Sources/U8Str...
|
| - https://github.com/nietras/1brc.cs/blob/main/src/Brc/BrcAccu...
|
| - https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
|
| (and many more if you search github for the uses of
| Vector128/256<byte> and the like!)
| neonsunset wrote:
| Even manual vectorization is a pain... writing ASM, really?
|
| Rust has unstable portable SIMD and a few third-party crates,
| C++ has that as well, C# has _stable_ portable SIMD and a small
| out-of-the-box BLAS-like library to help with the most common
| tasks (like SoftMax, Magnitude, etc. on top of spans of floats,
| instead of writing them manually); hell, it even exercises
| PackedSIMD when run in a browser. And now Java is getting
| Panama vectors some time in the future (though the question of
| codegen quality remains open given the planned changes to the
| unsafe API).
|
| Go among these is uniquely disadvantaged. And if that's not
| enough, you may want to visit the 1BRC challenge discussions
| and see that Go _struggles_ to get anywhere close to the 2s
| mark while both C# and C++ blaze past it:
|
| https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
|
| https://github.com/gunnarmorling/1brc/discussions/67
| icholy wrote:
| Last time I checked, it didn't even unroll loops.
| carbocation wrote:
| I wonder whether avo could have been useful here?[1] I mention it
| because it came up the last time we were talking about AVX
| operations in go.[2, 3]
|
| 1 = https://github.com/mmcloughlin/avo
|
| 2 = https://news.ycombinator.com/item?id=34465297
|
| 3 =
| https://www.reddit.com/r/golang/comments/10hmh07/how_to_use_...
| camdencheek wrote:
| I did consider Avo! I even went as far as to implement a
| version using Avo since it has a nice dot product example I
| could use as a starting point. But ultimately, for as small as
| these functions are, I felt that Avo was an unnecessary extra
| layer to grok. Additionally, it's x86-only, and I knew in
| advance I'd want to implement an ARM version as well since we
| also do some embeddings stuff locally.
|
| If I were to ever take this further and add loop unrolling or
| something, I'd absolutely reach for Avo.
| miki123211 wrote:
| I wonder how well a simple C for loop with -O3 and maybe -march
| would do here.
|
| From my brief forays into reading (mostly AARCH64) assembly, it
| looks like C compilers can detect these kinds of patterns now and
| just convert them all to SIMD by themselves, with no work from
| the programmer. Even at -O2, converting an index-based loop into
| one based on start and end pointers is not unusual. Go doesn't
| seem to do this; the assembly output by the Go compiler looks
| much closer to the actual code than what you get from C.
|
| Rust iterators would also be fun to benchmark here, they're
| supposed to be as fast as plain old loops, and they're probably
| optimized to omit bounds checks entirely.
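For context, the loop pattern under discussion is a plain scalar dot product. A minimal sketch in Go (the function name is hypothetical; the blog post's actual code may differ) of the pattern C compilers can now vectorize on their own, and which the Go compiler emits mostly as written:

```go
package main

import "fmt"

// DotNaive is the scalar baseline: one multiply-add per element,
// with a bounds check on each index in Go unless the compiler can
// prove it away.
func DotNaive(a, b []float32) float32 {
	var sum float32
	for i := 0; i < len(a) && i < len(b); i++ {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	a := []float32{1, 2, 3, 4}
	b := []float32{5, 6, 7, 8}
	fmt.Println(DotNaive(a, b)) // 5 + 12 + 21 + 32 = 70
}
```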
| mratsim wrote:
| It depends.
|
| You need 2~3 accumulators to saturate instruction-level
| parallelism with a parallel sum reduction. But the compiler
| won't do it because it only creates those when the operation is
| associative, i.e. (a+b)+c = a+(b+c), which is true for integers
| but not for floats.
|
| There is an escape hatch in -ffast-math.
|
| I have extensive benches on this here:
| https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...
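The multiple-accumulator idea above can be sketched directly in Go (hypothetical name; a sketch, not the blog post's code). Note that splitting the sum changes the order of the float additions, which is exactly the non-associativity the compiler refuses to introduce on its own:

```go
package main

import "fmt"

// Dot4Acc uses four independent accumulators so the CPU can keep
// several floating-point multiply-adds in flight at once
// (instruction-level parallelism), then combines them at the end.
func Dot4Acc(a, b []float32) float32 {
	var s0, s1, s2, s3 float32
	i := 0
	for ; i+4 <= len(a) && i+4 <= len(b); i += 4 {
		s0 += a[i] * b[i]
		s1 += a[i+1] * b[i+1]
		s2 += a[i+2] * b[i+2]
		s3 += a[i+3] * b[i+3]
	}
	// Scalar tail for lengths that aren't a multiple of 4.
	for ; i < len(a) && i < len(b); i++ {
		s0 += a[i] * b[i]
	}
	return (s0 + s1) + (s2 + s3)
}

func main() {
	v := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println(Dot4Acc(v, v)) // 1+4+9+16+25+36+49+64 = 204
}
```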
| steveklabnik wrote:
| > Rust iterators would also be fun to benchmark here
|
| I started to write this out, and then thought "you know what
| given how common this is, I bet I could even just google it"
| and thought that would be more interesting, as it makes it feel
| more "real world." The first result I got is what I would have
| written: https://stackoverflow.com/a/30422958/24817
|
| Here's a godbolt with three different outputs: one at -O, one
| at -O3, and one at -O3 and -march=native
|
| https://godbolt.org/z/6xf9M1cf3
|
| Eyeballing it, some comments:
|
| Looks like 2 and 3 both provide extremely similar if not
| identical output.
|
| Adding the native flag ends up generating slightly different
| codegen; I am not at the level where I can simply look at that
| and know how meaningful the difference is.
|
| It does appear to have eliminated the bounds check entirely,
| and it's using xmm registers.
|
| I am pleasantly surprised at this output because zip in
| particular can sometimes hinder optimizations, but rustc did a
| great job here.
|
| ----------------------
|
| For fun, I figured "why not also try as direct a translation
| of the original Go as possible." The only trick here is that
| Rust doesn't really do the C-style for loop the way Go does,
| so I tried to translate what I saw as the spirit of the
| example: compare the two lengths and use the minimum as the
| loop bound.
|
| Here it is: https://godbolt.org/z/cTcddc8Gs
|
| ... literally the same. I am very surprised at this outcome. It
| makes me wonder if LLVM has some sort of idiom recognition for
| dot product specifically.
|
| EDIT: looks like it does not currently, see the comment at line
| 28 and 29:
| https://llvm.org/doxygen/LoopIdiomRecognize_8cpp_source.html
| camel-cdr wrote:
| Those aren't vectorized at all, just unrolled. vmulss/vaddss
| just multiply/add a single single-precision float in the
| vector register.
|
| With clang you get basically the same codegen, although it
| uses fused multiply adds.
|
| The problem is that you need to enable -ffast-math; otherwise
| the compiler can't change the order of floating-point
| operations, and thus can't vectorize the reduction.
|
| With clang that works wonderfully and it gives us a lovely
| four times unrolled AVX2 fused multiply add loop, but
| enabling it in rust doesn't seem to work:
| https://godbolt.org/z/G4Enf59Kb
|
| Edit: from what I can tell this is still an open issue???
| https://github.com/rust-lang/rust/issues/21690
|
| Edit: relevant SO post:
| https://stackoverflow.com/questions/76055058/why-cant-the-ru...
| Apparently you need to use
| `#![feature(core_intrinsics)]`, `std::intrinsics::fadd_fast`
| and `std::intrinsics::fmul_fast`.
| steveklabnik wrote:
| Ahh yes, thank you.
|
| Rust doesn't have a -ffast-math flag, though it is
| interesting that you passed it directly to llvm. I am kinda
| glad that escape hatch doesn't work, to be honest.
|
| There are currently unstable intrinsics that let you do
| this, and you seemingly get close to clang codegen with
| them: https://godbolt.org/z/EEW79Gbxv
|
| The thread tracking this discusses another attempt at a
| flag to enable this by turning on the CPU feature directly,
| but that doesn't seem to affect codegen in this case.
| https://github.com/rust-lang/rust/issues/21690
|
| It would be nice to get these intrinsics stabilized, at
| least.
|
| EDIT: oops you figured this out while I was writing it,
| haha.
| cbm-vic-20 wrote:
| Even without a -ffast-math flag, the current stable Rust
| compiler will vectorize loops on integer types.
|
| https://godbolt.org/z/KjErzacfv
|
| Edit: ...and I now realize who I responded to, I'm sure
| you already know this. :)
| nemothekid wrote:
| It was just last week that I was reading a comment that made
| it seem like you shouldn't really use -ffast-math[0], but this
| looks like a not-so-rare reason why you would want to enable
| it.
|
| What is the correct idiom here? It feels like if this sort of
| thing really matters to you, you should have the know-how to
| hand-roll a couple of lines of ASM. I want to say this is
| rare, but I had a project a couple of years ago where I needed
| to hand-roll some vectorized instructions on a Raspberry Pi.
|
| [0] https://news.ycombinator.com/item?id=39013277
| steveklabnik wrote:
| The right path forward for Rust here in my opinion is to
| do the same thing as is done for math operations like
| saturating: stabilize a function or method that performs
| the operation with this semantic, and then build thin
| wrappers on top to make using them more convenient.
| camel-cdr wrote:
| Imo the best solution would be special "fast" floating-point
| types that have less strict requirements.
|
| I personally almost always use -ffast-math by default in my C
| programs that care about performance, because I almost never
| care enough about the loss in accuracy. The only case I
| remember needing it off was when doing some random number
| distribution tests where I cared about subnormals, and got
| confused for a second because they didn't seem to exist
| (-ffast-math disables them on x86).
| sevagh wrote:
| Definitely don't take an HN comment as a serious
| suggestion. Enable fast-math for your code, run your
| objective evaluation that's suitable for your domain, and
| if it passes the test, enjoy the added speed.
|
| FWIW I have oodles of numerical C++ code where fast-math
| doesn't change the output.
| sroussey wrote:
| fast-math can change the behavior even of libraries not
| compiled with it (linking any -ffast-math object can flip the
| FPU's flush-to-zero mode for the whole process), so buyer
| beware.
| zeehio wrote:
| Isn't there a BLAS equivalent in Go that could help here?
| camdencheek wrote:
| Copying in a response to a similar question on Reddit:
|
| I should really add some discussion around BLAS in particular,
| which has a good implementation[0] of the float32 dot product
| that outperforms any of the float32 implementations in the
| blog post. I'm getting ~1.9m vecs/s on my benchmarking rig.
|
| However, that BLAS became unusable for us as soon as we
| switched to quantized vectors because there is no int8
| implementation of the dot product in BLAS (though I'd love to
| be proven wrong)
|
| [0]:
| https://pkg.go.dev/gonum.org/v1/gonum@v0.14.0/blas/blas32#Do...
| mkhnews wrote:
| OpenBLAS with INTERFACE64=1 ??
| camdencheek wrote:
| AFAICT, that still doesn't support 8-bit integer dot
| product? (To clarify, I was using int8 to mean 8-bit
| integer, not 8-byte)
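For reference, the missing 8-bit operation is simple to state in pure Go (a sketch with a hypothetical name; the point is the widening to an int32 accumulator so products of small quantized values can't overflow, which standard BLAS interfaces don't offer):

```go
package main

import "fmt"

// DotInt8 computes a dot product over 8-bit quantized vectors,
// widening each operand to int32 before multiplying so the
// accumulator never overflows.
func DotInt8(a, b []int8) int32 {
	var sum int32
	for i := 0; i < len(a) && i < len(b); i++ {
		sum += int32(a[i]) * int32(b[i])
	}
	return sum
}

func main() {
	fmt.Println(DotInt8([]int8{1, -2, 3}, []int8{4, 5, -6})) // 4 - 10 - 18 = -24
}
```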
| mkhnews wrote:
| Great stuff. We used to do this sort of opt for dot, axpy, gemm,
| etc. in the BLAS and LAPACK libs a long time ago.
| malkia wrote:
| I learned yesterday about Go's assembler
| https://go.dev/doc/asm - after browsing how Arrow is
| implemented for different languages (my experience is mainly
| C/C++) - https://github.com/apache/arrow/tree/main/go/arrow/math -
| there are a bunch of .S ("asm") files, and I'm still not able
| to comprehend how these work exactly (I guess it'll take more
| reading) - it seems very peculiar.
|
| The last time I used inline assembly was back in Turbo/Borland
| Pascal, then a bit in Visual Studio (32-bit), until it got
| disabled. Then I did a little gcc with its stricter
| specification (with the former you had to know how the ABI
| worked; with the latter too, but it was specced out).
|
| Anyway - I wasn't expecting to find this in "Go" :) But I guess
| you can always start with .go code then produce assembly (-S)
| then optimize it, or find/hire someone to do it.
| blobbers wrote:
| This is the type of optimization we used to see in embedded
| hardware constrained environments.
|
| Nowadays, even on most embedded platforms, resources are so
| abundant and applications so HORIZONTALLY scaled that it feels
| like nobody bothers with VERTICAL scaling anymore.
|
| The nice thing about multiplication is that it cuts both ways:
| an 8x improvement in the performance of a single node means
| you need 8x fewer nodes! What a fun way to cut down your AWS
| bill and send your SaaS company to profitability.
| whatshisface wrote:
| I don't know the truth of it, but what people say is that most
| unprofitable SaaS startups are the way they are because
| they're making _no money at all_, not because their costs are
| higher than their 2x-one-developer's-compensation revenues.
| __loam wrote:
| I think there are plenty of SaaS companies out there that are
| subsidizing what their customers are paying them with VC cash.
| I've been at two SaaS companies where infrastructure
| optimization was starting to be a real concern. If your unit
| economics are in the red and additional customers cost you
| money, you need to optimize somehow.
| Thaliana wrote:
| > why is it significant that we slice like a[i:i+4:i+4] rather
| than just a[i:i+4]?
|
| Well, I had never seen that "full slice" expression syntax
| before. It turns out that it's important because the third
| index controls the capacity of the new slice: the capacity
| becomes (i+4) - i = 4.
|
| So by using the full slice expression you get a slice of
| length 4 and capacity 4. Without it, the capacity would be
| cap(a) - i, running to the end of the original backing array.
|
| I suppose that by pinning the capacity you let the compiler
| eliminate the bounds checks.
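The capacity effect is easy to verify in a few lines (a minimal sketch): with len == cap == 4, the compiler has a provably fixed-size view, which is what makes dropping the bounds checks possible.

```go
package main

import "fmt"

func main() {
	a := make([]float32, 16)
	i := 4

	plain := a[i : i+4]      // two-index slice: len 4, cap runs to the end: 12
	full := a[i : i+4 : i+4] // full slice expression: len 4, cap pinned to 4

	fmt.Println(len(plain), cap(plain)) // 4 12
	fmt.Println(len(full), cap(full))   // 4 4
}
```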
| jeremycw wrote:
| The overhead of cgo seems like it would not be an issue if you
| could pass in an array of vectors and move the outer loop into
| the procedure instead of calling the procedure for each
| vector. This may only be viable if there is a low-overhead way
| to pass arrays back and forth between Go and C. I'm no expert,
| but a quick google makes me think this is possible.
|
| I get that hand rolling assembly is fun but as a theoretical
| future maintainer I would much prefer C to ASM.
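The batching idea can be sketched in pure Go (names are hypothetical; in the real thing, dotOne or all of DotBatch would sit behind a single cgo call, so the Go/C boundary is crossed once per batch rather than once per vector):

```go
package main

import "fmt"

// dotOne stands in for the per-vector procedure that would
// otherwise be a cgo call.
func dotOne(a, b []float32) float32 {
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// DotBatch scores one query against n vectors stored back to back
// in a flat buffer, moving the outer loop inside the single call.
func DotBatch(query, flat, out []float32) {
	d := len(query)
	for i := range out {
		out[i] = dotOne(query, flat[i*d:(i+1)*d])
	}
}

func main() {
	query := []float32{1, 1}
	flat := []float32{1, 2, 3, 4, 5, 6} // three 2-dim vectors
	out := make([]float32, 3)
	DotBatch(query, flat, out)
	fmt.Println(out) // [3 7 11]
}
```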
| camdencheek wrote:
| Definitely possible! However, I prefer to avoid cgo wherever
| possible for more than just the overhead.
|
| In my experience:
|
| - It complicates builds by requiring a C toolchain
|
| - It makes single, static binaries more difficult
|
| - It makes portability more difficult (not that assembly is
| portable though)
|
| - It causes difficult-to-debug issues (I recently ran into an
| issue where macOS signing changed, causing all my cgo binaries
| to be killed on startup)
|
| - Debuggers don't work across Cgo boundaries (they do with go
| ASM!)
|
| I think Dave Cheney said it best:
| https://dave.cheney.net/2016/01/18/cgo-is-not-go
| sroussey wrote:
| For other languages (including Node.js/Bun/Rust/Python, etc.)
| you can have a look at SimSIMD, which I have contributed to
| this year (I made precompiled binaries part of the build
| process for x86_64 and arm64 on Mac and Linux, and x86 and
| x86_64 on Windows).
|
| [0] https://github.com/ashvardanian/SimSIMD
___________________________________________________________________
(page generated 2024-01-23 23:00 UTC)