[HN Gopher] Understanding SIMD: Infinite complexity of trivial p...
       ___________________________________________________________________
        
       Understanding SIMD: Infinite complexity of trivial problems
        
       Author : verdagon
       Score  : 235 points
       Date   : 2024-11-25 17:08 UTC (5 days ago)
        
 (HTM) web link (www.modular.com)
 (TXT) w3m dump (www.modular.com)
        
       | Agingcoder wrote:
        | This is the first time I've heard 'hyperscalar'. Is this
        | generally accepted? (I've been using SIMD since the MMX days, so
        | I'm a bit surprised.)
        
         | dragontamer wrote:
         | I don't think so.
         | 
          | Superscalar is a real term (multiple operations in one clock
          | tick due to parallel pipelines within a core). But hyperscalar
          | is cringe to me. There are tons of words describing SIMD
          | already; it seems unclear why someone would make up a new word
          | to describe an already existing concept.
          | 
          | Especially when a similar word (superscalar) is already defined
          | and likely gets confused with this new word.
        
           | ashvardanian wrote:
           | That may have been my mistake. I use super & hyper
           | interchangeably and don't always notice :)
           | 
           | PS: Should be an easy patch, will update!
        
             | dragontamer wrote:
             | Maybe not.
             | 
              | Superscalar is when, say... think of the following assembly
              | code:
              | 
              |     add r1, r2
              |     sub r3, r4
              | 
              | And the add and subtract both happen on the same clock
              | tick. The important thing is that a modern CPU core (and
              | even a GPU core) has multiple parallel ALU pipelines inside
              | of it.
             | 
              | Because r1, r2, r3 and r4 are fully independent, a modern
              | CPU can detect the potential parallelism here and act on it
              | in parallel. After CPUs mastered this trick, out-of-order
              | processors were invented next (which not only allowed for
              | superscalar operation, but also allowed the subtract to
              | execute first if for some reason the CPU core was waiting
              | on r1 or r2).
             | 
              | There are a ton of ways that modern CPUs and GPUs extract
              | parallelism from seemingly nothing. And because all the
              | techniques are independent, we can have superscalar out-of-
              | order SIMD (which is what happens with AVX512 in practice).
              | SIMD is... SIMD. It's one instruction applied to lots of
              | data in parallel. It's totally different.
             | 
             | You really need to use the correct word for the specific
             | kind of parallelism that you are trying to highlight. I
             | expect that the only word that makes sense in this article
             | is SIMD.
        
               | pyrolistical wrote:
                | I wish hardware exposed an API that allowed us to submit
                | a tree of instructions, so the hardware doesn't need to
                | figure out which instructions are independent.
                | 
                | Lots of this kind of work can be done during compilation,
                | but it cannot be communicated to the hardware because the
                | code is linear.
        
               | dragontamer wrote:
                | That's called VLIW, and Intel's Itanium is considered one
                | of the biggest chip failures of all time.
               | 
               | There is an argument that today's compilers are finally
               | good enough for VLIW to go mainstream, but good luck
               | convincing anyone in today's market to go for it.
               | 
               | ------
               | 
                | A big problem with VLIW is that it's impossible to
                | predict L1, L2, L3 or DRAM access times, meaning the
                | compiler cannot schedule loads/stores.
               | 
                | NVidia has interesting barriers that get compiled into
                | its SASS (a level lower than PTX assembly). These
                | barriers seem to allow the compiler to assist in the
                | dependency-management process, but they still require a
                | final decode stage in the NVidia core before execution.
        
               | neerajsi wrote:
                | VLIW is kind of the dual of what pyrolistical was asking
                | for. VLIW lets you bundle instructions that are known to
                | be independent, rather than encode instructions to mark
                | known dependencies.
                | 
                | The idea pyrolistical mentioned is closer to explicit
                | data graph execution:
                | https://en.m.wikipedia.org/wiki/Explicit_data_graph_execution
        
               | creato wrote:
                | VLIW is still in use in multiple DSP products on the
                | market today, and they are successful products in their
                | niche.
                | 
                | They work very well if your code can be written as a loop
                | with no (or very limited) branches in the body, and with
                | a lot of instruction-level parallelism in the body.
                | 
                | Unfortunately for Intel, most code doesn't look like
                | that. But for most workloads that also happen to be a
                | good case for SIMD, it is (or can be) great.
        
               | bjourne wrote:
               | Agree with this. Calling SIMD superscalar is a misnomer
               | since it is _single instruction_ (multiple data) with
               | very wide data paths. Superscalar implies multiple
               | _different_ instructions in parallel, such as adding a
               | pair of numbers, while subtracting another pair (or even
               | dividing).
        
         | spacemanspiff01 wrote:
          | I thought it was referring to this?
          | 
          | https://en.m.wikipedia.org/wiki/Hyperscale_computing
          | 
          | I.e. our SIMD implementation allows you to scale across
          | different architectures/CPU revisions without having to rewrite
          | assembly for each processor?
          | 
          | Edit: Rereading, that does not make much sense...
        
       | Joker_vD wrote:
       | > SIMD instructions are complex, and even Arm is starting to look
       | more "CISCy" than x86!
       | 
       | Thank you for saying it out loud. XLAT/XLATB of x86 is positively
       | tame compared to e.g. vrgatherei16.vv/vrgather.vv.
        
       | dragontamer wrote:
       | Intel needs to see what has happened to their AVX instructions
       | and why NVidia has taken over.
       | 
       | If you just wrote your SIMD in CUDA 15 years ago, NVidia
       | compilers would have given you maximum performance across all
       | NVidia GPUs rather than being forced to write and rewrite in SSE
       | vs AVX vs AVX512.
       | 
        | GPU SIMD is still SIMD. Just... better at it. I think AMD and
        | Intel GPUs can keep up, btw. But the software advantage and long-
        | term benefits of rewriting into CUDA are readily apparent.
       | 
       | Intel ISPC is a great project btw if you need high level code
       | that targets SSE, AVX, AVX512 and even ARM NEON all with one
       | codebase + auto compiling across all the architectures.
       | 
       | -------
       | 
        | Intel's AVX512 is pretty good at a hardware level. But software
       | methodology to interact with SIMD using GPU-like languages should
       | be a priority.
       | 
       | Intrinsics are good for maximum performance but they are too hard
       | for mainstream programmers.
        
         | dist-epoch wrote:
         | > If you just wrote your SIMD in CUDA 15 years ago, NVidia
         | compilers would have given you maximum performance across all
         | NVidia GPUs
         | 
         | That's not true. For maximum performance you need to tweak the
         | code to a particular GPU model/architecture.
         | 
          | Intel has SSE/AVX/AVX2/AVX512, but CUDA has had something like
          | 10 iterations of this (with increasing capabilities). Code
          | written 15 years ago would not use modern capabilities, like
          | more flexible memory access or atomics.
        
           | dragontamer wrote:
            | Maximum performance? Okay, you'll have to upgrade to ballot
            | instructions or whatever and rearchitect your algorithms. (Or
            | to the other wavefront/voting/etc. new instructions that have
            | been invented - especially those 4x4 matrix-multiplication AI
            | instructions.)
           | 
           | But CUDA -> PTX intermediate code has allowed for
           | significantly more flexibility. For crying out loud, the
           | entire machine code (aka SASS) of NVidia GPUs has been cycled
           | out at least 4 times in the past decade (128-bit bundles,
           | changes to instruction formats, acquire/release semantics,
            | etc. etc.).
           | 
           | It's amazing what backwards compatibility NVidia has achieved
           | in the past 15 years thanks to this architecture. SASS
           | changes so dramatically from generation to generation but the
           | PTX intermediate code has stayed highly competitive.
        
             | dist-epoch wrote:
             | Intel code from 15 years ago also runs today. But it will
             | not use AVX512.
             | 
             | Which is the same with PTX, right? If you didn't use the
             | tensor core instructions or wavefront voting in the CUDA
             | code, the PTX generated from it will not either, and NVIDIA
             | will not magically add those capabilities in when compiling
             | to SASS.
             | 
              | Maybe it remains competitive because the code is inherently
              | parallel anyway, so it will naturally scale to fill the
              | extra execution units of the GPU, which is where most of
              | the improvement comes from generation to generation -
              | while AVX code can't automatically scale to use the AVX512
              | units.
        
               | dragontamer wrote:
               | It's not the same. AVX2 instructions haven't changed and
               | never will change.
               | 
                | In contrast, NVidia can go from 64-bit instruction
                | bundles to 128-bit machine code (96-bit instruction +
                | 32-bit control information) between Pascal (aka PTX
                | Compute Capability 5) and Volta (aka PTX Compute
                | Capability 7), and all the old PTX code just recompiles
                | to the new machine-instruction format and takes
                | advantage of all the new memory barriers added in Volta.
                | 
                | Having a PTX translation layer is a MAJOR advantage for
                | the NVidia workflow.
        
               | ashvardanian wrote:
               | There is still a lot of similarity between CPU and GPU
               | programming - between AVX and PTX. Different generations
               | of CPU cores handle the same AVX2 instructions
               | differently. The microcode changes and the schedulers
               | change, but the process is transparent for the user,
               | similar to PTX.
        
               | mmoskal wrote:
                | I imagine there is an order of magnitude of difference
                | between how much you can translate in software, with
                | large memory and a significant time budget to work with,
                | compared to microcode.
        
               | dragontamer wrote:
                | Most CPU instructions are 1-to-1 with their microcode. I
                | dare say that microcode is nearly irrelevant: any high-
                | performance instruction (ex: multiply, add, XOR, etc.)
                | maps to a single micro-op anyway.
               | 
               | Load/Store are memory dependent in all architectures. So
               | that's just a different story as CPUs and GPUs have
               | completely different ideas of how caches should work.
               | (CPUs aim for latency, GPUs for bandwidth + incredibly
               | large register spaces with substantial hiding of latency
               | thanks to large occupancies).
               | 
               | -------------
               | 
                | That being said: reorder buffers on CPUs are well over
                | 400 instructions these days, and on super-large cores
                | (like Apple's M4) they are apparently on the order of 600
                | to 800 instructions.
               | 
               | Reorder buffers are _NOT_ translation. They're Tomasulo's
               | algorithm
               | (https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm).
               | If you want to know how CPUs do out-of-order, study that.
               | 
                | I'd say CPUs have small architectural register spaces (16
                | registers, maybe 32), but large physical register files
                | of maybe 300 or 400+. Tomasulo's algorithm is what
                | provides out-of-order access to those registers.
               | 
               | You should think of instructions like "mov rax, [memory]"
               | as closer to "rax = malloc(register); delayed-load(rax,
               | memory); Out-of-order execute all instructions that don't
               | use RAX ahead of us in instruction stream".
               | 
                | Tomasulo's algorithm means using a ~300-entry register
                | file to _pretend_ to be just 16 architectural registers.
                | The 300 physical registers hold the out-of-order data and
                | allow you to execute. Registers in modern CPUs are closer
                | to unique_ptr<int> in C++: assigning to one frees the old
                | physical register (via the reorder buffer) and also
                | mallocs a new register off the register file.
        
               | janwas wrote:
               | I hope people aren't writing directly to AVX2. When using
                | a wrapper such as Highway, you get exactly this kind of
                | update after a recompile, or even just by running your
                | code on a CPU that supports newer instructions.
               | 
               | The cost is that the binary carries around both AVX2 and
               | AVX-512 codepaths, but that is not an issue IMO.
        
               | imtringued wrote:
                | Most video encoders and decoders consist of kernels with
                | hand-written SIMD instructions/intrinsics.
        
               | janwas wrote:
               | Agreed. FWIW we demonstrated with JPEG XL (image codec,
               | though also with animation 'video' support) that it is
               | possible to write such kernels using the portable Highway
               | intrinsics.
        
               | dragontamer wrote:
               | > I hope people aren't writing directly to AVX2.
               | 
               | Did you not read the article? It's using AVX intrinsics
               | and NEON intrinsics.
        
               | jandrewrogers wrote:
               | Many use cases for SIMD aren't trivially expressible
               | through wrappers and abstractions. It is sometimes
               | cleaner, easier, and produces more optimized codegen to
               | write the intrinsics directly. It isn't ideal but it
               | often produces the best result for the effort involved.
               | 
               | An issue with the abstractions that does not go away is
               | that the optimal code architecture -- well above the
               | level of the SIMD wrappers -- is dependent on the
               | capabilities of the silicon. The wrappers can't solve for
               | that. And if you optimize the code architecture for the
               | silicon architecture, it quickly approximates writing
               | architecture-specific intrinsics with an additional layer
               | of indirection, which significantly reduces any notional
               | benefit from the abstractions.
               | 
               | The wrappers can't abstract enough, and higher level
               | abstractions (written with architecture aware intrinsics)
               | are often too use case specific to reuse widely.
        
               | Remnant44 wrote:
               | I would wager that most real world SIMD use is with
               | direct intrinsics.
        
         | jsheard wrote:
         | > Intel ISPC is a great project btw if you need high level code
         | that targets SSE, AVX, AVX512 and even ARM NEON
         | 
         | It's pretty funny how NEON ended up in there. A former Intel
         | employee decided to implement it for fun and submitted it as a
         | pull request, which Intel quietly ignored for obvious reasons,
         | but then _another_ former Intel employee who still had commit
          | rights merged the PR, and the optics of publicly reverting it
          | would be even worse than stonewalling, so Intel begrudgingly
          | let it stand (but they did revoke that dev's commit rights).
         | 
         | https://pharr.org/matt/blog/2018/04/29/ispc-retrospective
        
         | pjmlp wrote:
          | It is worse than that, given that AVX is the survivor of
          | Larrabee's great plan to kill GPUs.
          | 
          | Larrabee was going to take over it all; I enjoyed its
          | presentation at GDCE 2009.
        
           | dragontamer wrote:
            | I mean, 288 E-core Xeons are about to ship. Xeon 6900 series,
            | right? (Estimated to ship in Q1 2025.)
            | 
            | So Larrabee lives on for... some reason. These E-cores are
            | well known to be modified Intel Atom cores, and those were
            | modified Xeon Phi cores, which were Larrabee-based.
            | 
            | Just with... AVX512 disabled. (Lost when Xeon Phi turned into
            | Intel Atom.)
            | 
            | Intel's technical strategy is completely bonkers. In a bad
            | way. Intel invented all this tech 10 to 20 years ago but
            | fails to have a cohesive strategy to bring it to market.
            | There are clearly smart people there, but somehow all the
            | top-level decisions are just awful.
        
             | ashvardanian wrote:
             | Yes, a lot of weird decisions were made at Intel.
             | 
              | Ironically, AMD waited so long to implement AVX-512, but
              | now has it on both server and mobile chips (natively and
              | via 256-bit emulation, respectively). Intel started the
              | whole thing, has a very fragmented stack, and is now
              | preparing those E cores with even more new extensions.
             | 
              | Most importantly for Search and AI, it adds AVX_VNNI, which
              | can be used for faster 8-bit integer dot-products:
              | https://github.com/ashvardanian/SimSIMD/blob/75c426fb190a9d4...
             | 
             | Would be interesting to see how matrix multiplication
             | throughput will differ between AVX-512-capable P cores and
             | a larger quantity of AVX_VNNI-capable E cores!
        
               | alfiedotwtf wrote:
               | A former Intel CEO even wrote a book where every product
               | was planned 20+ years in advance.
               | 
                | Imagine planning 20 years in advance while Moore's Law
                | was still going strong. Come to think of it, Moore was
                | also CEO of Intel lol
        
           | Earw0rm wrote:
            | And a few years later, Intel said we'd get AVX512 on
            | everything by 2016, and that the instruction encoding
            | supported a future extension to 1024 bits.
            | 
            | And then the Skylake and Cannon Lake debacle...
           | 
           | First they pulled it from the consumer chips a fairly short
           | time before launch. Then the server chips it was present in
           | would downclock aggressively when you did use it, so you
           | could get at best maybe 40% more performance, certainly far
           | from the 2x+ it promised.
           | 
           | Ten years on and the AMD 9950X does a pretty good job with
           | it, however.
        
             | Earw0rm wrote:
             | Oh, and I neglected to mention the protracted development,
             | and short, miserable life, of Cannon Lake itself.
             | 
             | First announced in 2013, it eventually shipped five years
             | later in only a single, crippled dual-core mobile SKU,
             | which lasted just a year in the market before they killed
             | it off.
             | 
             | "Let's put our only consumer implementation of our highest
             | performing vector architecture on a lame-duck NUC chip.",
             | good move guys.
        
         | variadix wrote:
         | How much of this is because CUDA is designed for GPU execution
         | and because the GPU ISA isn't a stable interface? E.g. new GPU
         | instructions can be utilized by new CUDA compilers for new
         | hardware because the code wasn't written to a specific ISA?
         | Also, don't people fine tune GPU kernels per architecture
         | manually (either by hand or via automated optimizers that test
         | combinations in the configuration space)?
        
           | dragontamer wrote:
           | NVidia PTX is a very stable interface.
           | 
            | And the PTX-to-SASS compiler DOES do a degree of automatic
            | fine-tuning between architectures. Nothing amazing or
            | anything, but it's a minor speed boost that has made PTX just
            | an easier 'assembly-like language' to build on top of.
        
           | janwas wrote:
           | My understanding is that there is a lot of hand-writing (not
            | just fine-tuning) going on. AFAIK cuDNN and TensorRT are
            | written directly in SASS, not CUDA. And the presence of FP8
           | in H100, but not A100, would likely require a complete
           | rewrite.
        
             | dragontamer wrote:
              | CUB, Thrust and many other libraries that make those
              | kernels possible don't need to be rewritten.
              | 
              | When you write a merge sort in CUDA, you can keep it across
              | all versions. Maybe the new instructions can improve a few
              | corner cases, but it's not like AVX to AVX512, where you
              | need to rewrite everything.
              | 
              | Ex: https://github.com/NVIDIA/cub/blob/main/cub/device/device_me...
        
         | snihalani wrote:
         | > software methodology to interact with SIMD using GPU-like
         | languages should be a priority.
         | 
         | What's your opinion on sycl?
         | 
         | https://www.khronos.org/sycl/
        
         | ip26 wrote:
         | Is CUDA not more analogous to using MKL, rather than AVX?
        
         | jabl wrote:
          | Sometimes I wonder about an alternative-history scenario where
          | CPU ISAs would have chosen a SIMT-style model instead of SIMD.
          | "Just" have something like fork/join instructions to start/stop
          | vector mode, and otherwise use the standard scalar instructions
          | in both scalar and vector mode. It would have avoided a lot of
          | combinatorial explosion in instructions. (Of course, you'd have
          | to do something for cross-lane operations, and later tensor
          | instructions, etc.)
        
           | janwas wrote:
            | Not sure why SIMT would help; it requires more compiler
           | transforms than if the code is written for packets/vectors or
           | whatever we want to call them. As you note, cross-lane is a
           | key part of a good SIMD abstraction. Vulkan calls it
           | "subgroups", but from where I sit it's still SIMD.
        
         | janwas wrote:
         | Max performance is a stretch - recompilation would not utilize
         | tensor cores, right?
         | 
         | "too hard for mainstream programmers" seems overly pessimistic.
         | I've run several workshops where devs have written dot-product
         | kernels using Highway after 30 minutes of introduction.
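          | 
          | Roughly what such a kernel looks like (a sketch only: static
          | dispatch, and assuming n is a multiple of the lane count; real
          | code would dispatch dynamically and handle the tail):
          | 
          |     #include "hwy/highway.h"
          |     namespace hn = hwy::HWY_NAMESPACE;
          | 
          |     float Dot(const float* a, const float* b, size_t n) {
          |         const hn::ScalableTag<float> d;  // widest native lanes
          |         auto sum = hn::Zero(d);
          |         for (size_t i = 0; i < n; i += hn::Lanes(d))
          |             sum = hn::MulAdd(hn::LoadU(d, a + i),
          |                              hn::LoadU(d, b + i), sum);
          |         // horizontal reduction of the vector accumulator
          |         return hn::GetLane(hn::SumOfLanes(d, sum));
          |     }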
        
           | kristianp wrote:
           | They said intrinsics. Highway is an abstraction on top of
           | intrinsics.
        
         | adrian_b wrote:
          | While there is some truth in what you say, it makes it seem
          | like writing in the CUDA style is something new and
          | revolutionary invented by NVIDIA, which it is not.
         | 
          | The CUDA style of writing parallel programs is nothing other
          | than the use of the so-called "parallel do" a.k.a. "parallel
          | for" program structure, which was already being discussed in
          | 1963. Notable later evolutions of this concept appeared in
          | "Communicating Sequential Processes" by C.A.R. Hoare (1978-08:
          | "arrays of processes"), then in the programming language Occam,
          | which was designed based on what Hoare had described, then in
          | the OpenMP extension of Fortran (1997-10), then in the OpenMP
          | extensions of C and C++ (1998-10).
         | 
          | Programming in CUDA does not bring anything new, except that in
          | comparison with e.g. OpenMP, some keywords are implicit and
          | others are different, so the equivalence is not immediately
          | obvious.
          | 
          | Programming for CPUs in the much older OpenMP is equivalent to
          | programming in CUDA for GPUs.
         | 
         | The real innovation of NVIDIA has been the high quality of the
         | NVIDIA CUDA compiler and CUDA runtime GPU driver, which are
         | able to distribute the work that must be done on the elements
         | of an array over all the available cores, threads and SIMD
         | lanes, in a manner that is transparent for the programmer, so
         | in many cases the programmer is free to ignore which is the
         | actual structure of the GPU that will run the program.
         | 
          | Previous compilers for OpenMP, or for other such parallel-
          | programming language extensions, have been much less capable of
          | producing efficient parallel programs without being tuned by
          | the programmer for each hardware variant.
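          | 
          | To make the equivalence concrete, a rough sketch of the OpenMP
          | side (a saxpy loop; in CUDA, the same body becomes a kernel
          | with the loop index read from threadIdx/blockIdx):
          | 
          |     // Iterations are distributed over cores and SIMD lanes,
          |     // transparently to the programmer.
          |     void saxpy(int n, float a, const float* x, float* y) {
          |         #pragma omp parallel for simd
          |         for (int i = 0; i < n; ++i)
          |             y[i] = a * x[i] + y[i];
          |     }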
        
           | dragontamer wrote:
           | Oh all of this was being done in the 1980s by Lisp*
           | programmers.
           | 
            | I'm not calling it new. I'm just saying that the intrinsics
            | style is much, much harder than what Lisp*, DirectX HLSL,
            | CUDA, OpenCL (etc. etc.) offer.
           | 
           | A specialized SIMD language makes writing SIMD easier
           | compared to intrinsic style. Look at any CUDA code today and
           | compare it to the AVX that is in the above article and it
           | becomes readily apparent.
        
         | synack wrote:
          | I've been playing with a new Lunar Lake laptop, and they've
          | complicated things even further with the Neural Processing Unit
          | (NPU).
          | 
          | Now, if your vectors are INT8/FP8, you're supposed to shovel
          | them into this accelerator via PCIe rather than packing them
          | into registers for AVX512.
         | 
         | I wish they'd just pick an interface for vector ops and stick
         | with it.
        
         | david-gpu wrote:
         | _> If you just wrote your SIMD in CUDA 15 years ago, NVidia
         | compilers would have given you maximum performance across all
         | NVidia GPUs rather than being forced to write and rewrite in
         | SSE vs AVX vs AVX512_
         | 
         | NVidia compilers would have compiled your code into something
         | functional, but if you want to approach peak performance you
         | need to at least tweak your kernels, and sometimes rewrite them
         | from scratch. See for example the various MMA instructions that
         | were introduced over time.
         | 
         |  _Edit:_ I see somebody made a similar comment and you
         | addressed it. Sorry for the churn.
        
       | EVa5I7bHFq9mnYK wrote:
       | C# vectors do a great job of simplifying those intrinsics in a
       | safe and portable manner.
        
         | ashvardanian wrote:
         | There are dozens of libraries, frameworks, and compiler
         | toolchains that try to abstract away SIMD capabilities, but I
         | don't think it's a great approach.
         | 
         | The only 2 approaches that still make sense to me:
         | 
          | A. Writing serial vectorization-aware code in a native compiled
          | language, hoping your compiler will auto-vectorize (a sketch of
          | this below).
          | 
          | B. Implementing natively for every hardware platform, as the
          | ISA differences are too big to efficiently abstract away
          | anything beyond 128-bit float multiplication and addition.
         | 
          | This article is, in a way, an attempt to show how big the
          | differences are, even for simple data-parallel floating-point
          | tasks.
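          | 
          | For approach A, the kind of serial loop compilers reliably
          | auto-vectorize looks something like this (a sketch; assuming
          | __restrict and -O3 so the compiler can rule out aliasing):
          | 
          |     #include <cstddef>
          | 
          |     void scale(float* __restrict dst,
          |                const float* __restrict src,
          |                float k, size_t n) {
          |         for (size_t i = 0; i < n; ++i)
          |             dst[i] = k * src[i];  // vectorizes on x86 and Arm
          |     }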
        
           | dzaima wrote:
            | There's the middle-ground approach of having primarily
            | target-specific operations, but with the intersecting ones
            | named the same, and making it easy to build custom
            | abstractions on top that paper over the differences however
            | best makes sense for the given application. That's the
            | approach https://github.com/mlochbaum/Singeli takes.
           | 
            | There's a good amount of stuff that can clearly utilize SIMD
            | without much platform-specificity but doesn't easily
            | autovectorize: early-exit checks in a loop, packed-bit
            | boolean stuff, some data rearranging, probing hashmap checks,
            | some very-short-variable-length-loop things. And while there
            | might often be some parts that do just need to be entirely
            | target-specific, they'll usually be surrounded by stuff that
            | doesn't (the loop, trip-count calculation, loads/stores,
            | probably some arithmetic).
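            | 
            | An example of the early-exit case (a sketch using SSE2
            | intrinsics plus GCC/Clang builtins): trivial to express
            | directly, but compilers rarely autovectorize it.
            | 
            |     #include <immintrin.h>
            |     #include <cstddef>
            | 
            |     // Index of the first zero byte, or n if none.
            |     size_t find_zero(const char* p, size_t n) {
            |         size_t i = 0;
            |         for (; i + 16 <= n; i += 16) {
            |             __m128i v =
            |                 _mm_loadu_si128((const __m128i*)(p + i));
            |             int m = _mm_movemask_epi8(
            |                 _mm_cmpeq_epi8(v, _mm_setzero_si128()));
            |             if (m) return i + __builtin_ctz(m);  // early exit
            |         }
            |         for (; i < n; ++i)  // scalar tail
            |             if (!p[i]) return i;
            |         return n;
            |     }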
        
           | neonsunset wrote:
            | Numerics in .NET are not a high-level abstraction, and they
            | do out of the box what many mature vectorized libraries end
            | up doing themselves - there is significant overlap between
            | NEON, SSE* and, if we overlook vector width, AVX2/512 and
            | WASM's PackedSIMD.
           | 
           | .NET has roughly three vector APIs:
           | 
            | - Vector<T>, a platform-defined-width vector that exposes a
            | common set of operations
            | 
            | - Vector64/128/256/512<T>, which have a wider API than the
            | previous one
           | 
           | - Platform intrinsics - basically immintrin.h
           | 
            | Notably, platform intrinsics use the respective VectorXXX<T>
            | types, which allows writing the common parts of an algorithm
            | in a portable way and applying platform intrinsics in the
            | specific areas where it makes sense. Some methods also have
            | 'Unsafe' and 'Native' variants that allow a vector to exhibit
            | platform-specific behavior, like shuffles, since in many
            | situations this is still the desired output for the common
            | case.
            | 
            | The .NET compiler produces codegen competitive with GCC, and
            | sometimes Clang, for these. It's gotten particularly good at
            | lowering AVX512.
        
           | Earw0rm wrote:
           | The library approach does a pretty good job in conjunction
           | with a good compiler, and sensible algorithm design.
           | 
            | You're writing C++ code, but as if it were shader code.
           | 
           | I've seen impressive results with clang doing this sort of
           | thing.
        
           | exDM69 wrote:
           | I will respectfully disagree with your statement, with the
           | caveat that I mostly dabble in arithmetic with 128b/256b
           | float and int vectors.
           | 
           | Using C or C++ with vector extensions (Gcc/Clang) or Rust
           | (nightly) std::simd is very easy and you get code that is
           | portable to different CPUs and ISAs.
           | 
            | But most importantly, they have a zero-cost fallback option
            | to CPU-specific intrinsics when you need them. An f32x8 can
            | be passed at zero cost as __m256 to any _mm intrinsic in
            | core::arch::x86_64 (or xmmintrin.h in C++ land).
           | 
            | You gain portable arithmetic, swizzles, and SIMD vector
            | types, but lose nothing. Not having to write everything for
            | both x86_64 and aarch64 is a huge win, even if it doesn't
            | quite cover everything.
           | 
            | Additionally, you can use wider vectors than your hardware
            | supports; the compiler is able to split your f64x64 into
            | 128-, 256- or 512-bit registers as needed, depending on the
            | compile target.
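            | 
            | For anyone who hasn't seen this style, a small sketch
            | (GCC/Clang vector extensions; compile with -mavx for the
            | intrinsic call):
            | 
            |     #include <immintrin.h>
            | 
            |     typedef float f32x8 __attribute__((vector_size(32)));
            | 
            |     f32x8 lerp(f32x8 a, f32x8 b, f32x8 t) {
            |         return a + (b - a) * t;  // portable, element-wise
            |     }
            | 
            |     f32x8 root(f32x8 x) {
            |         // zero-cost fallback to a CPU-specific intrinsic
            |         return (f32x8)_mm256_sqrt_ps((__m256)x);
            |     }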
        
           | kolbe wrote:
           | I still like ispc, but that's not going to catch on.
        
       | juancn wrote:
       | The main problem is that there are no good abstractions in
       | popular programming languages to take advantage of SIMD
       | extensions.
       | 
       | Also, the feature set being all over the place (e.g. integer
       | support is fairly recent) doesn't help either.
       | 
        | ISPC is a good idea, but the execution is meh... it's hard to set
        | up and integrate.
       | 
       | Ideally you would want to be able to easily use this from other
       | popular languages, like Java, Python, Javascript, without having
       | to resort to linking a library written in C/C++.
       | 
        | Granted, language extensions may be required to approach
        | something like that in an ergonomic way, but most somehow end up
        | just mimicking what C++ does and exposing a pseudo-assembler.
        
         | pjmlp wrote:
          | The best is the GPU programming approach, with specialised
          | languages.
          | 
          | Just like using SQL is much more sane than low-level C APIs for
          | handling BTree nodes.
         | 
         | The language extensions help, but code still requires too much
         | low level expertise, with algorithms and data structures having
         | to take SIMD/MIMD capabilities into account anyway.
        
         | Conscat wrote:
         | I think the EVE library for C++ is a great abstraction. It's
         | got an unusual syntax using subscript operator overloading, but
         | that winds up being a very ergonomic and flexible way to
         | program with masked-SIMD.
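          | 
          | For reference, the subscript style looks something like this (a
          | sketch from memory, not checked against current EVE):
          | 
          |     #include <eve/module/core.hpp>
          | 
          |     eve::wide<float> safe_sqrt(eve::wide<float> x) {
          |         // apply sqrt only where x >= 0;
          |         // other lanes pass through unchanged
          |         return eve::sqrt[x >= 0.0f](x);
          |     }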
        
           | secondcoming wrote:
            | I'm not sure about EVE. I trialled it by trying to uppercase
            | a string, and even though I got it working in the end, it was
            | quite unpleasant. Their docs need to be better.
        
         | Earw0rm wrote:
          | std::experimental::simd is happening. It should be part of
          | C++26.
        
           | janwas wrote:
           | Unfortunately a bit late :) Highway reached v1.0 about 2.5
           | years ago. How long would it take until Clang/GCC/MSVC are
            | ready, _and_ all users' distros have updated? Not to mention
            | that the number of ops provided by std::experimental::simd is
            | extremely limited - basically only math operators, and zero
            | support for shuffling/crypto/rounding/interleaving/table
            | lookups, which seem indispensable for many applications.
        
         | colejohnson66 wrote:
         | C#'s Vector<T> does a pretty great job.
        
       | bob1029 wrote:
       | I see a lot of "just use the GPU" and you'd often be right.
       | 
       | SIMD on the CPU is most compelling to me due to the latency
       | characteristics. You are nanoseconds away from the control flow.
       | If the GPU needs some updated state regarding the outside world,
       | it takes significantly longer to propagate this information.
       | 
       | For most use cases, the GPU will win the trade off. But, there is
       | a reason you don't hear much about systems like order matching
       | engines using them.
        
         | pclmulqdq wrote:
         | You would be surprised. The GPU often loses even for small
         | neural nets given the large latency. Anything that needs high
         | throughput or is sized like an HPC problem should use a GPU,
         | but a lot of code benefits from SIMD on small problems.
        
           | gmueckl wrote:
           | If you run many small tasks on the GPU, you can increase
           | throughput by overlapping transfers and computation. There
           | may also be other ways to batch problems together, but that
           | depends on the algorithms.
           | 
           | The one truly unfixable issue is round-trip latency.
        
           | gopalv wrote:
           | > The GPU often loses even for small neural nets given the
           | large latency
           | 
           | Apple's neural engine shows that you can live in between
           | those two worlds.
           | 
            | As you said, the trouble is the latency; the programming
            | model is still great.
        
         | moldavi wrote:
         | Do Apple's chips (M1 etc) change this at all, since they share
         | memory with the GPU?
        
           | bob1029 wrote:
           | I think an argument could be made depending on the real world
           | timings. How much closer in time is the Apple GPU vs one on a
           | PCIe bus?
        
           | one_even_prime wrote:
           | Apple chips share the same physical memory between the GPU
           | and the CPU. Still, they don't have USM/UVM (Unified Shared
           | Memory/Unified Virtual Memory), that is, the GPU and the CPU
            | can't access the same data concurrently and easily. Programs
            | must map/unmap pages to control which device accesses them,
            | and that's a very expensive operation.
        
             | tubs wrote:
             | They don't need to be unmapped just for the other one to
             | use it. source: I wrote GPU drivers for over 10 years.
        
           | Earw0rm wrote:
           | Not much. Synchronisation of tasks is still a big overhead.
           | 
           | If you've got tens to hundreds of microseconds worth of
           | workload, sure, get the GPU to do it.
           | 
            | But bear in mind that 1000 clocks at 4GHz is 250ns; there's
            | still a sizeable region where tight CPU/GPU integration isn't
            | tight enough.
        
         | dragontamer wrote:
         | Despite my 'Use a GPU' post below, you are absolutely correct.
         | 
         | Maximizing performance on a CPU today requires all the steps in
         | the above article, and the article is actually very well
         | written with regards to the 'mindset' needed to tackle a
         | problem such as this.
         | 
         | It's a great article for people aiming to maximize the
         | performance on Intel or AMD systems.
         | 
         | ------
         | 
          | CPUs have the memory-capacity advantage and will continue to
          | hold said advantage for the foreseeable future (despite
          | NVidia's NVLink and other tech that tries to bridge the gap).
          | 
          | And CPU code remains far easier than learning CUDA, despite how
          | hard these AVX intrinsics are in comparison to CUDA.
        
           | marmaduke wrote:
           | > CPUs have the memory capacity advantage
           | 
            | Perhaps more precisely, they also have quite an advantage on
            | anything that needs, and plays nicely with, caches? When I
            | sliced my problem to maximize cache usage, I saw pretty clear
            | scalability with cores: per-core L1/L2 cache bandwidth is
            | ~30GB/s, so e.g. a 32-core system starts to compete with the
            | big consumer GPUs.
        
             | dragontamer wrote:
             | Caches between CPU and GPU are extremely complex.
             | 
              | Not only because caching is complex in CPU land, but
              | because the GPU has a completely different set of caches
              | (and registers) that depends entirely on the architecture.
             | 
              | Case in point: GPUs often have access to 256 architectural
              | registers, aka 1024 bytes of register space. Now this
              | depends on how much occupancy your GPU code is targeting
              | (maybe 4-occupancy?), but there's a lot you can do with
              | even 64 registers (aka 4-occupancy and 256 bytes of
              | register space), and this is key to making those blazing-
              | fast FP16 matrix-multiplication kernels "for AI" everyone's
              | so hot about right now.
             | 
              | For something like FP16 matrix multiply (a very cache-
              | friendly problem), the entire SIMD unit (all 32 lanes) of
              | the GPU symmetric multiprocessor works together on the
              | problem. So we're talking about an effective 32kB of
              | __register space__ (let alone cache or other memory in the
              | hierarchy).
             | 
             | Even before FP16 matrix-multiply instructions, this absurd
             | register space advantage is why GPUs were the king of
             | matrix multiplication.
             | 
             | --------
             | 
             | GPUs are worse at larger cache sizes, say 1MB or 2MB. At
             | 1MB+, a modern CPU core's L2 cache can hold all of that
             | (either Intel P-core, AMD Zen5, or even Intel E-core can
             | hold many MB in L2).
             | 
             | GPUs have a secret though: a crossbar at the __shared__
             | memory level. Rearranging memory and data across your lanes
             | can be done through this crossbar (including many-to-one
             | reductions for atomics, or one-to-many broadcasts in just a
              | single clock tick). So your GPU lanes have incredible
              | communication available to them (and this crossbar is the
              | key to the ballot/voting-based horizontal compute of modern
              | GPU styles). It's only ~64kB of space, shared between all
              | GPU lanes, but with 1024 lanes supporting communication
              | it's an important element of GPU memory.
             | 
              | CPU L3 cache is very nice from a latency perspective, but
              | bandwidth-wise, L3 cache is on the order of a GPU's GDDR6X
              | or HBM: 500GB/s to 2000GB/s, depending on the technology.
             | 
              | Finally, CPU DDR5 RAM may be the slowest in this
              | discussion, but it's the biggest. 2TB+ Xeon servers aren't
              | even that expensive and can be assumed for any serious tech
              | firm these days (ie: all-RAM databases and whatnot).
             | 
             | ---------
             | 
              | So at different sizes and in different use scenarios, GPUs
              | and CPUs will trade places. I'd expect CPUs to win most
              | red-black tree races, but GPUs will win matrix
              | multiplication. Both take advantage of "cache", but in very
              | different ways.
        
       | benchmarkist wrote:
       | Looks like a great use case for AI. Set up the logical
       | specification and constraints and let the AI find the optimal
       | sequence of SIMD operations to fulfill the requirements.
        
         | fooblaster wrote:
         | No, there are decades of compiler literature for solving this
         | problem.
        
           | benchmarkist wrote:
           | That's even better then. Just let the AI read the literature
           | and write the optimal compiler.
        
             | fooblaster wrote:
             | It would probably be easier to clone the existing
             | repository than get an llm to regurgitate llvm.
        
               | benchmarkist wrote:
               | The AI would learn from llvm as well.
        
         | almostgotcaught wrote:
        | lol so says every person that has no clue how hard (NP-hard)
        | combinatorial optimization is.
        
           | benchmarkist wrote:
           | For humans it's very hard but it will be a breeze for the AI.
           | I thought HN was a community of builders. This is an obvious
           | startup opportunity.
        
             | stouset wrote:
             | All we have to do is ascribe magical properties to AI and
             | we can solve anything as if P=NP!
        
               | benchmarkist wrote:
                | Those distinctions are irrelevant for an AI because it is
                | a pure form of intelligence that simply computes answers
                | without worrying about P or NP complexity classes.
        
               | ConspiracyFact wrote:
               | You had me going.
               | 
               | B-
        
               | imtringued wrote:
               | That's the same as being confidently wrong.
        
         | bartekpacia wrote:
         | high quality bait
        
       | TinkersW wrote:
        | You can simplify the 2x sqrts as sqrt(a*b): overall fewer
        | operations, so perhaps more accurate. It would also let you get
        | rid of the funky lane swizzles.
        | 
        | As this would only use 1 lane, perhaps if you have multiple of
        | these to normalize, you could vectorize across them.
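        | 
        | In code, the suggested simplification is just the following (a
        | sketch, with a2/b2 standing for the two squared norms; the
        | identity holds since both are non-negative):
        | 
        |     #include <cmath>
        |     // a2, b2: the two squared norms (both >= 0)
        |     double denom(double a2, double b2) {
        |         // before: std::sqrt(a2) * std::sqrt(b2)
        |         return std::sqrt(a2 * b2);  // one root, no shuffles
        |     }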
        
         | a_gopher wrote:
         | my thoughts exactly - crazy to know all these arcane SIMD
         | opcodes but not know basic maths!!
        
           | ashvardanian wrote:
            | Square root computation can be tricky, often relying on
            | approximations. These approximations tend to perform best for
            | mid-range values, while accuracy can degrade for very large
            | or very small values; the product a*b can also overflow or
            | underflow before the root is ever taken. With this in mind, a
            | product of roots is generally more accurate than a root of
            | products.
           | 
           | From a SIMD perspective, it's worth noting that on most
           | platforms, the cost of computing one square root or two is
           | the same. On modern x86 server CPUs, for instance, you can
           | calculate up to 8 double-precision roots in parallel with
           | identical latency. So there's no additional cost in terms of
           | performance.
           | 
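            | For instance (a sketch; one AVX-512 instruction computes
            | eight double-precision roots, so compile with -mavx512f):
            | 
            |     #include <immintrin.h>
            |     // eight f64 square roots in a single instruction
            |     __m512d roots(__m512d x) { return _mm512_sqrt_pd(x); }
            | 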
           | I hope this sheds some light on the design of my code.
           | 
           | PS: In a previous life, I did research in Astro- and Plasma
           | Physics. While I don't claim to remember all the Math, it's
           | usually more productive to ask for clarification than to
           | assume ignorance ;)
        
             | harry8 wrote:
             | > it's usually more productive to ask for clarification
             | than to assume ignorance ;)
             | 
             | Good reminder for me and anyone else right there, nicely
             | put.
        
           | nine_k wrote:
            | Moments like that are enlightening. When you see something
            | really improbable (knowing advanced SIMD while appearing to
            | ignore basic algebra), it's likely the moment you see a gap
            | in _your_ picture of the world. So it's time to learn
            | something new and likely unexpected (or else you could have
            | guessed it).
        
       | rishi_devan wrote:
       | Interesting article. The article mentions "...the NumPy
       | implementation illustrates a marked improvement over the naive
       | algorithm...", but I couldn't find a NumPy implementation in the
       | article.
        
         | andix wrote:
          | Yes, they are really great at abstracting the SIMD operations,
          | but the abstraction has only very few common methods. I'm not
          | sure how much real-world benefit those abstractions have.
          | 
          | Once you need more complex operations, you need to use the
          | specific operations from System.Runtime.Intrinsics.(X86|ARM)
          | based on the current architecture. And you need to adjust your
          | implementation to the CPU's capabilities. There are still a lot
          | of older x64 CPUs around that don't have AVX512, for example.
        
       | big-chungus4 wrote:
       | can the authors please share the numpy code too
        
         | ashvardanian wrote:
          | There are several ways to implement it in NumPy, often
          | resulting in a ~20% performance variance. I've added a
          | reference implementation to my mirror of the blog post, and the
          | Modular team will soon update the original posting as well:
          | https://ashvardanian.com/posts/understanding-simd-complexity...
        
       | marmaduke wrote:
        | My approach to this is to write a bunch of tiny "kernels" which
        | are obviously SIMD-friendly and then inline them all; it does a
        | pretty good job on x86 and Arm.
       | 
       | https://github.com/maedoc/tvbk/blob/nb-again/src/util.h
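        | 
        | The kernels are in the spirit of the following (a hypothetical
        | example, not taken from the repo):
        | 
        |     // tiny fixed-width kernel; compilers vectorize this
        |     // trivially once it's inlined into a hot loop
        |     static inline void axpy8(float* __restrict y,
        |                              const float* __restrict x,
        |                              float a) {
        |         for (int i = 0; i < 8; i++)
        |             y[i] += a * x[i];
        |     }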
        
       | remram wrote:
       | Did they write bfloat16 and bfloat32 when they meant float16 and
       | float32?
       | 
        | On the image:
        | https://www.modular.com/blog/understanding-simd-infinite-com...
        
         | sgerenser wrote:
          | Yeah, I was really confused at first; pretty sure they messed
          | up the labels.
        
           | ashvardanian wrote:
           | Patched ;)
        
       ___________________________________________________________________
       (page generated 2024-11-30 23:01 UTC)