[HN Gopher] Understanding SIMD: Infinite complexity of trivial p...
       ___________________________________________________________________
        
       Understanding SIMD: Infinite complexity of trivial problems
        
       Author : verdagon
       Score  : 104 points
       Date   : 2024-11-25 17:08 UTC (4 days ago)
        
 (HTM) web link (www.modular.com)
 (TXT) w3m dump (www.modular.com)
        
       | Agingcoder wrote:
        | This is the first time I've heard 'hyperscalar'. Is this
        | generally accepted? (I've been using SIMD since the MMX days,
        | so I'm a bit surprised.)
        
         | dragontamer wrote:
         | I don't think so.
         | 
          | Superscalar is a real term (multiple operations in one clock
          | tick thanks to parallel pipelines within a core). But
          | 'hyperscalar' is cringe to me. There are already plenty of
          | words describing SIMD; it's unclear why someone would coin a
          | new word for an already existing concept.
          | 
          | Especially when a similar word (superscalar) is already
          | defined and is likely to be confused with this new word.
        
           | ashvardanian wrote:
           | That may have been my mistake. I use super & hyper
           | interchangeably and don't always notice :)
           | 
           | PS: Should be an easy patch, will update!
        
             | dragontamer wrote:
             | Maybe not.
             | 
              | Superscalar is when, say... think of the following
              | assembly code:
              | 
              |     Add r1, r2
              |     Sub r3, r4
              | 
              | And the add and subtract both happen on the same clock
              | tick. The important thing is that a modern CPU core (and
              | even a GPU core) has multiple parallel ALU pipelines
              | inside it.
             | 
              | Because r1, r2, r3 and r4 are fully independent, a
              | modern CPU can detect the potential parallelism here and
              | act in parallel. After CPUs mastered this trick, out-of-
              | order processors were invented next (which not only
              | allowed superscalar operation, but also allowed the
              | subtract to execute first if for some reason the CPU
              | core was waiting on r1 or r2).
             | 
              | There are a ton of ways that modern CPUs and GPUs
              | extract parallelism from seemingly nothing. And because
              | all the techniques are independent, we can have
              | superscalar, out-of-order SIMD (which is what happens
              | with AVX512 in practice). SIMD is... SIMD. It's one
              | instruction applied to lots of data in parallel. It's
              | totally different.
             | 
             | You really need to use the correct word for the specific
             | kind of parallelism that you are trying to highlight. I
             | expect that the only word that makes sense in this article
             | is SIMD.
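              | 
              | (The same idea at the C++ level, purely for intuition;
              | the function and variable names are made up:)
              | 
              |     // Minimal sketch of instruction-level parallelism.
              |     int demo(int a, int b, int c, int d) {
              |         // Independent: no shared operands, so a
              |         // superscalar core can issue both per cycle.
              |         int x = a + b;
              |         int y = c - d;
              | 
              |         // Dependent chain: each statement needs the
              |         // previous result, so these serialize.
              |         int t = x + y;
              |         int u = t - c;
              |         return u;
              |     }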
        
               | pyrolistical wrote:
                | I wish hardware exposed an API that allowed us to
                | submit a tree of instructions, so the hardware doesn't
                | need to figure out which instructions are independent.
                | 
                | A lot of this kind of work can be done during
                | compilation but cannot be communicated to the
                | hardware, because the code we hand it is linear.
        
               | dragontamer wrote:
               | That's called VLIW and Intel Itanium is considered one of
               | the biggest chip failures of all time.
               | 
               | There is an argument that today's compilers are finally
               | good enough for VLIW to go mainstream, but good luck
               | convincing anyone in today's market to go for it.
               | 
               | ------
               | 
                | A big problem with VLIW is that it's impossible to
                | predict whether an access will hit L1, L2, L3 or
                | DRAM, meaning the compiler cannot statically schedule
                | loads/stores.
                | 
                | NVidia has interesting barriers that get compiled into
                | its SASS (a level lower than PTX assembly). These
                | barriers seem to allow the compiler to assist in the
                | dependency management process, but a decoder in the
                | NVidia core is ultimately still required as the final
                | level before execution.
        
               | neerajsi wrote:
                | VLIW is kind of the dual of what pyrolistical was
                | asking for. VLIW lets you bundle instructions that are
                | known to be independent, rather than encode
                | instructions to mark known dependencies.
                | 
                | The idea pyrolistical mentioned is closer to explicit
                | data graph execution:
                | https://en.m.wikipedia.org/wiki/Explicit_data_graph_execution
        
         | spacemanspiff01 wrote:
         | I thought it was referring to this?
         | 
         | https://en.m.wikipedia.org/wiki/Hyperscale_computing
         | 
          | I.e., our SIMD implementation allows you to scale across
          | different architectures / CPU revisions without having to
          | rewrite assembly for each processor?
          | 
          | Edit: Rereading, that does not make much sense...
        
       | Joker_vD wrote:
       | > SIMD instructions are complex, and even Arm is starting to look
       | more "CISCy" than x86!
       | 
       | Thank you for saying it out loud. XLAT/XLATB of x86 is positively
       | tame compared to e.g. vrgatherei16.vv/vrgather.vv.
        
       | dragontamer wrote:
       | Intel needs to see what has happened to their AVX instructions
       | and why NVidia has taken over.
       | 
       | If you just wrote your SIMD in CUDA 15 years ago, NVidia
       | compilers would have given you maximum performance across all
       | NVidia GPUs rather than being forced to write and rewrite in SSE
       | vs AVX vs AVX512.
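        | 
        | As a rough sketch of that per-ISA rewrite burden (the function
        | names and the lack of tail handling are simplifications for
        | illustration): the same reduction written once with SSE and
        | once with AVX intrinsics:
        | 
        |     #include <immintrin.h>
        |     #include <cstddef>
        | 
        |     // SSE version: 4 floats per iteration.
        |     float sum_sse(const float *x, size_t n) {
        |         __m128 acc = _mm_setzero_ps();
        |         for (size_t i = 0; i + 4 <= n; i += 4)
        |             acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
        |         float t[4];
        |         _mm_storeu_ps(t, acc);
        |         return t[0] + t[1] + t[2] + t[3];
        |     }
        | 
        |     // AVX version: same logic, rewritten for 8-wide registers.
        |     float sum_avx(const float *x, size_t n) {
        |         __m256 acc = _mm256_setzero_ps();
        |         for (size_t i = 0; i + 8 <= n; i += 8)
        |             acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
        |         float t[8], s = 0.0f;
        |         _mm256_storeu_ps(t, acc);
        |         for (int j = 0; j < 8; j++) s += t[j];
        |         return s;
        |     }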
       | 
        | GPU SIMD is still SIMD. Just... better at it. I think AMD and
        | Intel GPUs can keep up, btw. But NVidia's software advantage
        | and the long-term benefits of rewriting into CUDA are readily
        | apparent.
       | 
        | Intel ISPC is a great project, btw, if you need high-level
        | code that targets SSE, AVX, AVX512 and even ARM NEON, all with
        | one codebase and automatic compilation for each architecture.
       | 
       | -------
       | 
        | Intel's AVX512 is pretty good at the hardware level. But a
        | software methodology for interacting with SIMD through GPU-
        | like languages should be a priority.
        | 
        | Intrinsics are good for maximum performance, but they are too
        | hard for mainstream programmers.
        
         | dist-epoch wrote:
         | > If you just wrote your SIMD in CUDA 15 years ago, NVidia
         | compilers would have given you maximum performance across all
         | NVidia GPUs
         | 
         | That's not true. For maximum performance you need to tweak the
         | code to a particular GPU model/architecture.
         | 
          | Intel has SSE/AVX/AVX2/AVX512, but CUDA has had something
          | like 10 iterations of this (each with increasing
          | capabilities). Code written 15 years ago would not use
          | modern capabilities, such as more flexible memory access or
          | atomics.
        
           | dragontamer wrote:
           | Maximum performance? Okay, you'll have to upgrade to ballot
           | instructions or whatever and rearchitect your algorithms. (Or
           | other wavefront / voting / etc. etc. new instructions that
           | have been invented. Especially those 4x4 matrix
           | multiplication AI instructions).
           | 
            | But the CUDA -> PTX intermediate code has allowed for
            | significantly more flexibility. For crying out loud, the
            | entire machine code (aka SASS) of NVidia GPUs has been
            | cycled out at least 4 times in the past decade (128-bit
            | bundles, changes to instruction formats, acquire/release
            | semantics, etc.).
           | 
           | It's amazing what backwards compatibility NVidia has achieved
           | in the past 15 years thanks to this architecture. SASS
           | changes so dramatically from generation to generation but the
           | PTX intermediate code has stayed highly competitive.
        
             | dist-epoch wrote:
             | Intel code from 15 years ago also runs today. But it will
             | not use AVX512.
             | 
             | Which is the same with PTX, right? If you didn't use the
             | tensor core instructions or wavefront voting in the CUDA
             | code, the PTX generated from it will not either, and NVIDIA
             | will not magically add those capabilities in when compiling
             | to SASS.
             | 
             | Maybe it remains competitive because the code is inherently
             | parallel anyway, so it will naturally scale to fill the
             | extra execution units of the GPU, which is where most of
             | the improvement is generation to generation.
             | 
             | While AVX code can't automatically scale to use the AVX512
             | units.
        
               | dragontamer wrote:
               | It's not the same. AVX2 instructions haven't changed and
               | never will change.
               | 
                | In contrast, NVidia can go from 64-bit instruction
                | bundles to 128-bit machine code (96-bit instruction +
                | 32-bit control information) between Pascal (aka PTX
                | compute capability 6) and Volta (aka PTX compute
                | capability 7), and all the old PTX code just
                | recompiles to the new instruction format and takes
                | advantage of all the new memory barriers added in
                | Volta.
                | 
                | Having a PTX translation layer is a MAJOR advantage
                | for the NVidia workflow.
        
         | jsheard wrote:
         | > Intel ISPC is a great project btw if you need high level code
         | that targets SSE, AVX, AVX512 and even ARM NEON
         | 
          | It's pretty funny how NEON ended up in there. A former Intel
          | employee decided to implement it for fun and submitted it as
          | a pull request, which Intel quietly ignored for obvious
          | reasons, but then _another_ former Intel employee who still
          | had commit rights merged the PR, and the optics of publicly
          | reverting it would have been even worse than stonewalling,
          | so Intel begrudgingly let it stand (though they did revoke
          | that dev's commit rights).
         | 
         | https://pharr.org/matt/blog/2018/04/29/ispc-retrospective
        
         | pjmlp wrote:
          | It is worse than that, given that AVX is the survivor of
          | Larrabee's grand plan to kill GPUs.
          | 
          | Larrabee was going to take over it all, or at least that was
          | the pitch when I enjoyed its presentation at GDCE 2009.
        
           | dragontamer wrote:
            | I mean, 288-E-core Xeons are about to ship. Xeon 6900
            | series, right? (Estimated to ship in Q1 2025.)
            | 
            | So Larrabee lives on for... some reason. These E-cores are
            | well known to be modified Intel Atom cores, and those were
            | modified Xeon Phi cores, which were Larrabee-based.
           | 
           | Just with.... AVX512 being disabled. (Lost when Xeon Phi
           | turned into Intel Atoms).
           | 
            | Intel's technical strategy is completely bonkers. In a bad
            | way. Intel invented all this tech 10 to 20 years ago but
            | fails to have a cohesive strategy to bring it to market.
            | There are clearly smart people there, but somehow all the
            | top-level decisions are just awful.
        
         | variadix wrote:
         | How much of this is because CUDA is designed for GPU execution
         | and because the GPU ISA isn't a stable interface? E.g. new GPU
         | instructions can be utilized by new CUDA compilers for new
         | hardware because the code wasn't written to a specific ISA?
         | Also, don't people fine tune GPU kernels per architecture
         | manually (either by hand or via automated optimizers that test
         | combinations in the configuration space)?
        
           | dragontamer wrote:
           | NVidia PTX is a very stable interface.
           | 
            | And the PTX-to-SASS compiler DOES do a degree of automatic
            | fine-tuning between architectures. Nothing amazing, but
            | it's a minor speed boost that has made PTX an easier
            | 'assembly-like language' to build on top of.
        
       | EVa5I7bHFq9mnYK wrote:
       | C# vectors do a great job of simplifying those intrinsics in a
       | safe and portable manner.
        
         | ashvardanian wrote:
         | There are dozens of libraries, frameworks, and compiler
         | toolchains that try to abstract away SIMD capabilities, but I
         | don't think it's a great approach.
         | 
         | The only 2 approaches that still make sense to me:
         | 
         | A. Writing serial vectorization-aware code in a native compiled
         | language, hoping your compiler will auto-vectorize.
         | 
          | B. Implementing natively for every hardware platform, as the
          | ISA differences are too big to efficiently abstract away
          | anything beyond 128-bit-register float multiplication and
          | addition.
          | 
          | This article is, in a way, an attempt to show how big the
          | differences are even for simple data-parallel floating-point
          | tasks.
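          | 
          | (A tiny sketch of approach A, for concreteness; the function
          | is made up: plain serial code, written so the arrays cannot
          | alias, that compilers will typically auto-vectorize with
          | optimizations enabled.)
          | 
          |     #include <cstddef>
          | 
          |     // Approach A: vectorization-aware but serial code. The
          |     // __restrict hints (a common compiler extension) tell
          |     // the compiler the arrays don't overlap, which is what
          |     // usually unlocks auto-vectorization to SSE/AVX/NEON.
          |     void scale_add(float *__restrict dst,
          |                    const float *__restrict a,
          |                    const float *__restrict b,
          |                    float k, std::size_t n) {
          |         for (std::size_t i = 0; i < n; ++i)
          |             dst[i] = a[i] + k * b[i];
          |     }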
        
           | dzaima wrote:
            | There's the middle-ground approach of having primarily
            | target-specific operations, but with the intersecting ones
            | named the same, and making it easy to build custom
            | abstractions on top of them to paper over the differences
            | in whatever way makes sense for the given application.
            | That's the approach https://github.com/mlochbaum/Singeli
            | takes.
           | 
           | There's a good amount of stuff that can clearly utilize SIMD
           | without much platform-specificness, but doesn't easily
           | autovectorize - early-exit checks in a loop, packed bit
           | boolean stuff, some data rearranging, probing hashmap checks,
           | some very-short-variable-length-loop things. And while there
           | might often be some parts that do just need to be entirely
           | target-specific, they'll usually be surrounded by stuff that
           | doesn't (the loop, trip count calculation, loads/stores,
           | probably some arithmetic).
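            | 
            | (For instance, the early-exit case mentioned above: a
            | find-first-match loop is straightforward with intrinsics,
            | but compilers rarely auto-vectorize it. A minimal SSE2
            | sketch, purely illustrative, assuming n is a multiple of
            | 16 and a GCC/Clang builtin for the bit scan:)
            | 
            |     #include <immintrin.h>
            |     #include <cstddef>
            | 
            |     // Index of the first byte equal to `needle`, or n if
            |     // there is none.
            |     size_t find_byte(const unsigned char *p, size_t n,
            |                      unsigned char needle) {
            |         const __m128i pat = _mm_set1_epi8((char)needle);
            |         for (size_t i = 0; i < n; i += 16) {
            |             __m128i chunk =
            |                 _mm_loadu_si128((const __m128i *)(p + i));
            |             int mask = _mm_movemask_epi8(
            |                 _mm_cmpeq_epi8(chunk, pat));
            |             if (mask)  // early exit on the first hit
            |                 return i + (size_t)__builtin_ctz(mask);
            |         }
            |         return n;
            |     }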
        
           | neonsunset wrote:
           | Numerics in .NET are not a high-level abstraction and do out
           | of box what many mature vectorized libraries end up doing
           | themselves - there is significant overlap between NEON, SSE*
           | and, if we overlook vector width, AVX2/512 and WASMs
           | PackedSIMD.
           | 
           | .NET has roughly three vector APIs:
           | 
            | - Vector<T>, a platform-defined-width vector that exposes
            | a common set of operations
            | 
            | - Vector64/128/256/512<T>, which have a wider API than the
            | previous one
           | 
           | - Platform intrinsics - basically immintrin.h
           | 
            | Notably, platform intrinsics use the respective
            | VectorXXX<T> types, which lets you write the common parts
            | of an algorithm in a portable way and apply platform
            | intrinsics only in the specific areas where it makes
            | sense. Also, some methods have 'Unsafe' and 'Native'
            | variants that allow a vector operation to exhibit
            | platform-specific behavior (e.g. shuffles), since in many
            | situations that is still the desired output for the
            | common case.
            | 
            | The .NET compiler produces codegen for these that is
            | competitive with GCC and sometimes Clang. It's gotten
            | particularly good at lowering AVX512.
        
       | juancn wrote:
       | The main problem is that there are no good abstractions in
       | popular programming languages to take advantage of SIMD
       | extensions.
       | 
       | Also, the feature set being all over the place (e.g. integer
       | support is fairly recent) doesn't help either.
       | 
        | ISPC is a good idea, but the execution is meh... it's hard to
        | set up and integrate.
       | 
       | Ideally you would want to be able to easily use this from other
       | popular languages, like Java, Python, Javascript, without having
       | to resort to linking a library written in C/C++.
       | 
        | Granted, language extensions may be required to approach
        | something like that in an ergonomic way, but most somehow end
        | up just mimicking what C++ does and exposing a pseudo-
        | assembler.
        
         | pjmlp wrote:
          | The best is the GPU programming approach, with specialised
          | languages.
          | 
          | Just like using SQL is much more sane than using low-level C
          | APIs to handle B-tree nodes.
         | 
         | The language extensions help, but code still requires too much
         | low level expertise, with algorithms and data structures having
         | to take SIMD/MIMD capabilities into account anyway.
        
         | Conscat wrote:
          | I think the EVE library for C++ is a great abstraction. It's
          | got an unusual syntax based on subscript operator
          | overloading, but that winds up being a very ergonomic and
          | flexible way to program with masked SIMD.
        
       | bob1029 wrote:
       | I see a lot of "just use the GPU" and you'd often be right.
       | 
       | SIMD on the CPU is most compelling to me due to the latency
       | characteristics. You are nanoseconds away from the control flow.
       | If the GPU needs some updated state regarding the outside world,
       | it takes significantly longer to propagate this information.
       | 
       | For most use cases, the GPU will win the trade off. But, there is
       | a reason you don't hear much about systems like order matching
       | engines using them.
        
         | pclmulqdq wrote:
         | You would be surprised. The GPU often loses even for small
         | neural nets given the large latency. Anything that needs high
         | throughput or is sized like an HPC problem should use a GPU,
         | but a lot of code benefits from SIMD on small problems.
        
           | gmueckl wrote:
           | If you run many small tasks on the GPU, you can increase
           | throughput by overlapping transfers and computation. There
           | may also be other ways to batch problems together, but that
           | depends on the algorithms.
           | 
           | The one truly unfixable issue is round-trip latency.
        
           | gopalv wrote:
           | > The GPU often loses even for small neural nets given the
           | large latency
           | 
           | Apple's neural engine shows that you can live in between
           | those two worlds.
           | 
            | As you said, the trouble is the latency; the programming
            | model is still great.
        
         | moldavi wrote:
         | Do Apple's chips (M1 etc) change this at all, since they share
         | memory with the GPU?
        
           | bob1029 wrote:
           | I think an argument could be made depending on the real world
           | timings. How much closer in time is the Apple GPU vs one on a
           | PCIe bus?
        
           | one_even_prime wrote:
           | Apple chips share the same physical memory between the GPU
           | and the CPU. Still, they don't have USM/UVM (Unified Shared
           | Memory/Unified Virtual Memory), that is, the GPU and the CPU
            | can't access the same data concurrently and easily.
            | Programs must map/unmap pages to control which device can
            | access them, and that's a very expensive operation.
        
         | dragontamer wrote:
         | Despite my 'Use a GPU' post below, you are absolutely correct.
         | 
         | Maximizing performance on a CPU today requires all the steps in
         | the above article, and the article is actually very well
         | written with regards to the 'mindset' needed to tackle a
         | problem such as this.
         | 
         | It's a great article for people aiming to maximize the
         | performance on Intel or AMD systems.
         | 
         | ------
         | 
          | CPUs have the memory capacity advantage and will continue to
          | hold that advantage for the foreseeable future (despite
          | NVidia's NVLink and other techs that try to bridge the gap).
          | 
          | And CPU code remains far easier than learning CUDA, despite
          | how hard these AVX intrinsics are in comparison.
        
       | a1o wrote:
        | This reads like AI in multiple sentences; it's really off-
        | putting to read.
        
       | benchmarkist wrote:
       | Looks like a great use case for AI. Set up the logical
       | specification and constraints and let the AI find the optimal
       | sequence of SIMD operations to fulfill the requirements.
        
         | fooblaster wrote:
         | No, there are decades of compiler literature for solving this
         | problem.
        
           | benchmarkist wrote:
           | That's even better then. Just let the AI read the literature
           | and write the optimal compiler.
        
             | fooblaster wrote:
             | It would probably be easier to clone the existing
             | repository than get an llm to regurgitate llvm.
        
               | benchmarkist wrote:
               | The AI would learn from llvm as well.
        
         | almostgotcaught wrote:
          | lol, so says every person that has no clue how hard (NP-
          | hard) combinatorial optimization is.
        
           | benchmarkist wrote:
           | For humans it's very hard but it will be a breeze for the AI.
           | I thought HN was a community of builders. This is an obvious
           | startup opportunity.
        
             | stouset wrote:
             | All we have to do is ascribe magical properties to AI and
             | we can solve anything as if P=NP!
        
               | benchmarkist wrote:
                | Those distinctions are irrelevant for an AI because it
                | is a pure form of intelligence that simply computes
                | answers without worrying about P or NP complexity
                | classes.
        
       | TinkersW wrote:
        | You can simplify the 2x sqrts to sqrt(a*b): fewer operations
        | overall, so perhaps more accurate. It would also let you get
        | rid of the funky lane swivels.
        | 
        | As this would only use 1 lane, perhaps you could vectorize it
        | if you have multiple of these to normalize.
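        | 
        | (The identity being used: for non-negative a and b,
        | 1/(sqrt(a)*sqrt(b)) = 1/sqrt(a*b), so a normalization of the
        | form dot / (sqrt(a) * sqrt(b)) collapses to a single square
        | root. A scalar sketch, assuming that is the shape of the
        | normalization in the article; names are made up:)
        | 
        |     #include <cmath>
        | 
        |     // Two square roots, as in the original formulation.
        |     float normalize2(float dot, float a, float b) {
        |         return dot / (std::sqrt(a) * std::sqrt(b));
        |     }
        | 
        |     // One square root, per the suggestion above.
        |     float normalize1(float dot, float a, float b) {
        |         return dot / std::sqrt(a * b);
        |     }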
        
       ___________________________________________________________________
       (page generated 2024-11-29 23:00 UTC)