[HN Gopher] Understanding SIMD: Infinite complexity of trivial problems
___________________________________________________________________
Understanding SIMD: Infinite complexity of trivial problems
Author : verdagon
Score : 104 points
Date : 2024-11-25 17:08 UTC (4 days ago)
(HTM) web link (www.modular.com)
(TXT) w3m dump (www.modular.com)
| Agingcoder wrote:
| This is the first time I've heard 'hyperscalar'. Is it
| generally accepted? (I've been using SIMD since the MMX days,
| so I'm a bit surprised.)
| dragontamer wrote:
| I don't think so.
|
| Superscalar is a real term (multiple operations in one clock
| tick due to parallel pipelines within a core). But hyperscalar
| is cringe to me. There are tons of words describing SIMD
| already; it's unclear why someone would make up a new word to
| describe an already existing concept.
|
| Especially when a similar word (superscalar) is already
| defined and is likely to be confused with this new word.
| ashvardanian wrote:
| That may have been my mistake. I use super & hyper
| interchangeably and don't always notice :)
|
| PS: Should be an easy patch, will update!
| dragontamer wrote:
| Maybe not.
|
| Superscalar is when, say... think of the following assembly
| code:
|
|   add r1, r2
|   sub r3, r4
|
| And the add and subtract both happen on the same clock
| tick. The important thing is that a modern CPU core (and
| even GPU core) has multiple parallel ALU pipelines inside
| of it.
|
| Because r1, r2, r3, and r4 are fully independent, a modern
| CPU can detect the potential parallelism here and act in
| parallel. After CPUs mastered this trick, out-of-order
| processors were invented (which not only allowed for
| superscalar operation, but also allowed the subtract to
| execute first if for some reason the CPU core was waiting
| on r1 or r2).
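|
| A minimal C++ sketch of the same idea (a hypothetical
| micro-benchmark, not from the article):
|
|   #include <cstdint>
|   #include <cstdio>
|
|   int main() {
|       std::uint64_t a = 0, b = 0;
|       for (std::uint64_t i = 1; i <= 1000000; ++i) {
|           a += i;  // chain 1: depends only on the previous a
|           b -= i;  // chain 2: depends only on the previous b
|       }
|       // The two chains are independent, so a superscalar core
|       // can issue one add and one sub on the same clock tick.
|       // A single dependent chain (a += i; a += i;) would be
|       // serialized on a and run notably slower per operation.
|       std::printf("%llu %llu\n",
|                   (unsigned long long)a, (unsigned long long)b);
|   }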
|
| There are a ton of ways that modern CPUs and GPUs extract
| parallelism from seemingly nothing. And because all the
| techniques are independent, we can have superscalar out-of-
| order SIMD (which is what happens with AVX512 in practice).
| SIMD is... SIMD. It's one instruction applied to lots of data
| in parallel. It's totally different.
|
| You really need to use the correct word for the specific
| kind of parallelism that you are trying to highlight. I
| expect that the only word that makes sense in this article
| is SIMD.
| pyrolistical wrote:
| I wish hardware exposed an API that allowed us to submit
| a tree of instructions, so the hardware doesn't need to
| figure out which instructions are independent.
|
| Lots of this kind of work can be done during compilation
| but cannot be communicated to the hardware, due to code
| being linear.
| dragontamer wrote:
| That's called VLIW and Intel Itanium is considered one of
| the biggest chip failures of all time.
|
| There is an argument that today's compilers are finally
| good enough for VLIW to go mainstream, but good luck
| convincing anyone in today's market to go for it.
|
| ------
|
| A big problem with VLIW is that it's impossible to
| predict L1, L2, L3, or DRAM access latencies, meaning
| loads/stores are impossible for the compiler to schedule.
|
| NVidia has interesting barriers that get compiled into
| its SASS (a level lower than PTX assembly). These
| barriers seem to allow the compiler to assist in the
| dependency-management process, but ultimately still
| require a decoder in the NVidia core as the final level
| before execution.
| neerajsi wrote:
| VLIW is kind of the dual of what pyrolistical was asking
| for: VLIW lets you bundle instructions that are known to
| be independent, rather than encode instructions to mark
| known dependencies.
|
| The idea pyrolistical mentioned is closer to explicit
| data graph execution:
| https://en.m.wikipedia.org/wiki/Explicit_data_graph_execution
| spacemanspiff01 wrote:
| I thought it was referring to this?
|
| https://en.m.wikipedia.org/wiki/Hyperscale_computing
|
| I.e., our SIMD implementation allows you to scale across
| different architectures/CPU revisions without having to
| rewrite assembly for each processor?
|
| Edit: Rereading, that does not make much sense...
| Joker_vD wrote:
| > SIMD instructions are complex, and even Arm is starting to look
| more "CISCy" than x86!
|
| Thank you for saying it out loud. XLAT/XLATB of x86 is positively
| tame compared to e.g. vrgatherei16.vv/vrgather.vv.
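|
| For context, a hedged C++ illustration of the x86 SIMD
| analogue (pshufb; RISC-V's vrgather generalizes the same idea
| to arbitrary element widths and vector lengths):
|
|   #include <immintrin.h>
|
|   // XLAT does one table lookup per instruction; pshufb does
|   // 16 at once: each byte of idx selects a byte of table
|   // (low 4 bits; a set high bit zeroes that lane instead).
|   __m128i lookup16(__m128i table, __m128i idx) {
|       return _mm_shuffle_epi8(table, idx);  // requires SSSE3
|   }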
| dragontamer wrote:
| Intel needs to see what has happened to their AVX instructions
| and why NVidia has taken over.
|
| If you just wrote your SIMD in CUDA 15 years ago, NVidia
| compilers would have given you maximum performance across all
| NVidia GPUs rather than being forced to write and rewrite in SSE
| vs AVX vs AVX512.
|
| GPU SIMD is still SIMD. Just... better at it. I think AMD and
| Intel GPUs can keep up, btw. But the software advantage and
| long-term benefits of rewriting into CUDA are plainly
| apparent.
|
| Intel ISPC is a great project btw if you need high level code
| that targets SSE, AVX, AVX512 and even ARM NEON all with one
| codebase + auto compiling across all the architectures.
|
| -------
|
| Intel's AVX512 is pretty good at a hardware level. But a
| software methodology for interacting with SIMD through
| GPU-like languages should be a priority.
|
| Intrinsics are good for maximum performance but they are too hard
| for mainstream programmers.
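|
| As a taste of why: even a horizontal sum of 8 floats takes a
| small dance of shuffles in AVX (a sketch, not from the
| article):
|
|   #include <immintrin.h>
|
|   float hsum_avx(__m256 v) {
|       __m128 lo = _mm256_castps256_ps128(v);          // lanes 0-3
|       __m128 hi = _mm256_extractf128_ps(v, 1);        // lanes 4-7
|       lo = _mm_add_ps(lo, hi);                        // 4 partial sums
|       lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));     // 2 partial sums
|       lo = _mm_add_ss(lo, _mm_shuffle_ps(lo, lo, 1)); // final sum
|       return _mm_cvtss_f32(lo);
|   }
|
| The CUDA-style equivalent is a plain per-element loop.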
| dist-epoch wrote:
| > If you just wrote your SIMD in CUDA 15 years ago, NVidia
| compilers would have given you maximum performance across all
| NVidia GPUs
|
| That's not true. For maximum performance you need to tweak
| the code for a particular GPU model/architecture.
|
| Intel has SSE/AVX/AVX2/AVX512, but CUDA has had something
| like 10 iterations of this (with increasing capabilities).
| Code written 15 years ago would not use modern capabilities
| like more flexible memory access or atomics.
| dragontamer wrote:
| Maximum performance? Okay, you'll have to upgrade to ballot
| instructions or whatever and rearchitect your algorithms
| (or to other wavefront / voting / etc. instructions that
| have been invented since, especially those 4x4 matrix-
| multiplication AI instructions).
|
| But the CUDA -> PTX intermediate code has allowed for
| significantly more flexibility. For crying out loud, the
| entire machine code (aka SASS) of NVidia GPUs has been cycled
| out at least 4 times in the past decade (128-bit bundles,
| changes to instruction formats, acquire/release semantics,
| etc.).
|
| It's amazing what backwards compatibility NVidia has achieved
| in the past 15 years thanks to this architecture. SASS
| changes so dramatically from generation to generation but the
| PTX intermediate code has stayed highly competitive.
| dist-epoch wrote:
| Intel code from 15 years ago also runs today. But it will
| not use AVX512.
|
| Which is the same with PTX, right? If you didn't use the
| tensor core instructions or wavefront voting in the CUDA
| code, the PTX generated from it will not either, and NVIDIA
| will not magically add those capabilities in when compiling
| to SASS.
|
| Maybe it remains competitive because the code is inherently
| parallel anyway, so it will naturally scale to fill the
| extra execution units of the GPU, which is where most of
| the improvement is generation to generation.
|
| While AVX code can't automatically scale to use the AVX512
| units.
| dragontamer wrote:
| It's not the same. AVX2 instructions haven't changed and
| never will change.
|
| In contrast, NVidia went from 64-bit instruction
| bundles to 128-bit machine code (96-bit instruction +
| 32-bit control information) between Pascal (PTX compute
| capability 6) and Volta (PTX compute capability 7), and
| all the old PTX code just auto-compiles to the new
| instruction format and takes advantage of all the new
| memory barriers added in Volta.
|
| Having a PTX translation layer is a MAJOR advantage for
| the NVidia workflow.
| jsheard wrote:
| > Intel ISPC is a great project btw if you need high level code
| that targets SSE, AVX, AVX512 and even ARM NEON
|
| It's pretty funny how NEON ended up in there. A former Intel
| employee decided to implement it for fun and submitted it as
| a pull request, which Intel quietly ignored for obvious
| reasons. But then _another_ former Intel employee who still
| had commit rights merged the PR, and the optics of publicly
| reverting it would have been even worse than stonewalling, so
| Intel begrudgingly let it stand (but they did revoke that
| dev's commit rights).
|
| https://pharr.org/matt/blog/2018/04/29/ispc-retrospective
| pjmlp wrote:
| It is worse than that, given that AVX is the survivor of
| Larrabee's great plan to kill GPUs.
|
| Larrabee was going to take over it all, as I learned while
| enjoying its presentation at GDCE 2009.
| dragontamer wrote:
| I mean, 288 E-core Xeons are about to ship. Xeon 6900
| series, right? (Estimated to ship in Q1 2025.)
|
| So Larrabee lives on for... some reason. These E-cores are
| well known to be modified Intel Atom cores, and those were
| modified Xeon Phi cores, which were Larrabee-based.
|
| Just with... AVX512 disabled (lost when Xeon Phi turned
| into Intel Atom).
|
| Intel's technical strategy is completely bonkers, in a bad
| way. Intel invented all this tech 10 to 20 years ago but
| fails to have a cohesive strategy to bring it to market.
| There are clearly smart people there, but somehow all the
| top-level decisions are just awful.
| variadix wrote:
| How much of this is because CUDA is designed for GPU execution
| and because the GPU ISA isn't a stable interface? E.g. new GPU
| instructions can be utilized by new CUDA compilers for new
| hardware because the code wasn't written to a specific ISA?
| Also, don't people fine-tune GPU kernels per architecture
| (either by hand or via automated optimizers that test
| combinations in the configuration space)?
| dragontamer wrote:
| NVidia PTX is a very stable interface.
|
| And the PTX-to-SASS compiler DOES do a degree of automatic
| fine-tuning between architectures. Nothing amazing or
| anything, but it's a minor speed boost that has made PTX
| just an easier 'assembly-like language' to build on top of.
| EVa5I7bHFq9mnYK wrote:
| C# vectors do a great job of simplifying those intrinsics in a
| safe and portable manner.
| ashvardanian wrote:
| There are dozens of libraries, frameworks, and compiler
| toolchains that try to abstract away SIMD capabilities, but I
| don't think it's a great approach.
|
| The only 2 approaches that still make sense to me:
|
| A. Writing serial vectorization-aware code in a native compiled
| language, hoping your compiler will auto-vectorize.
|
| B. Implementing natively for every hardware platform, as the
| ISA differences are too big to efficiently abstract away
| anything beyond 128-bit-register float multiplication and
| addition.
|
| This article is, in a way, an attempt to show how big the
| differences are, even for simple data-parallel floating-point
| tasks.
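|
| A minimal sketch of approach A (my example, not from the
| article): a dot product written so the auto-vectorizer has
| an easy job (unit-stride access, no aliasing, one reduction):
|
|   #include <cstddef>
|
|   float dot(const float* __restrict a,
|             const float* __restrict b, std::size_t n) {
|       float sum = 0.0f;
|       for (std::size_t i = 0; i < n; ++i)
|           sum += a[i] * b[i];
|       return sum;
|   }
|
| GCC and Clang vectorize this at -O3 (the reduction needs
| -ffast-math or an OpenMP reduction pragma to let the
| compiler reorder the additions).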
| dzaima wrote:
| There's the middle-ground approach of having primarily
| target-specific operations, but with the intersecting ones
| named the same, and allowing custom abstractions to be built
| easily on top of them to paper over the differences however
| best makes sense for the given application. That's the
| approach https://github.com/mlochbaum/Singeli takes.
|
| There's a good amount of stuff that can clearly utilize SIMD
| without much platform-specificity but doesn't easily
| autovectorize: early-exit checks in a loop, packed-bit
| boolean stuff, some data rearranging, probing hashmap checks,
| some very-short-variable-length-loop things. And while there
| might often be some parts that do just need to be entirely
| target-specific, they'll usually be surrounded by stuff that
| doesn't (the loop, trip-count calculation, loads/stores,
| probably some arithmetic).
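|
| A concrete instance of the early-exit case (a sketch with
| SSE2 intrinsics and a GCC/Clang builtin; tail handling
| omitted, n assumed a multiple of 16):
|
|   #include <immintrin.h>
|   #include <cstddef>
|
|   // Index of the first zero byte, 16 bytes per iteration.
|   // The early exit defeats most autovectorizers but is
|   // trivial to write by hand.
|   std::ptrdiff_t find_zero(const unsigned char* p,
|                            std::size_t n) {
|       const __m128i zero = _mm_setzero_si128();
|       for (std::size_t i = 0; i < n; i += 16) {
|           __m128i v =
|               _mm_loadu_si128((const __m128i*)(p + i));
|           int m = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
|           if (m)  // some lane matched: exit with its index
|               return (std::ptrdiff_t)(i + __builtin_ctz(m));
|       }
|       return -1;
|   }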
| neonsunset wrote:
| Numerics in .NET are not a high-level abstraction, and they do
| out of the box what many mature vectorized libraries end up
| doing themselves; there is significant overlap between NEON,
| SSE*, and, if we overlook vector width, AVX2/512 and WASM's
| PackedSIMD.
|
| .NET has roughly three vector APIs:
|
| - Vector<T>, a platform-defined-width vector that exposes a
| common set of operations
|
| - Vector64/128/256/512<T>, which have a wider API than the
| previous one
|
| - Platform intrinsics - basically immintrin.h
|
| Notably, platform intrinsics use the respective VectorXXX<T>
| types, which allows writing the common parts of an algorithm
| in a portable way and applying platform intrinsics in the
| specific areas where it makes sense. Also, some methods have
| 'Unsafe' and 'Native' variants that allow a vector to exhibit
| platform-specific behavior, like shuffles, since in many
| situations this is still the desired output for the common
| case.
|
| .NET's compiler produces codegen competitive with GCC, and
| sometimes Clang, for these. It's gotten particularly good at
| lowering AVX512.
| juancn wrote:
| The main problem is that there are no good abstractions in
| popular programming languages to take advantage of SIMD
| extensions.
|
| Also, the feature set being all over the place (e.g. integer
| support is fairly recent) doesn't help either.
|
| ISPC is a good idea, but the execution is meh... it's hard to
| set up and integrate.
|
| Ideally you would want to be able to easily use this from other
| popular languages, like Java, Python, Javascript, without having
| to resort to linking a library written in C/C++.
|
| Granted, language extensions may be required to approach
| something like that in an ergonomic way, but most somehow end
| up just mimicking what C++ does and exposing a pseudo-
| assembler.
| pjmlp wrote:
| The best is the GPU programming approach, with specialised
| languages.
|
| Just like using SQL is much saner than using low-level C APIs
| to handle B-tree nodes.
|
| The language extensions help, but the code still requires too
| much low-level expertise, with algorithms and data structures
| having to take SIMD/MIMD capabilities into account anyway.
| Conscat wrote:
| I think the EVE library for C++ is a great abstraction. It's
| got an unusual syntax using subscript-operator overloading,
| but that winds up being a very ergonomic and flexible way to
| program with masked SIMD.
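|
| Roughly, the masked style looks like this (a sketch from
| memory; check the EVE docs for exact headers and semantics):
|
|   #include <eve/module/core.hpp>
|   #include <eve/wide.hpp>
|
|   // The subscript selects the lanes the call applies to:
|   // lanes where x >= 0 get sqrt(x); other lanes pass x
|   // through untouched.
|   eve::wide<float> safe_sqrt(eve::wide<float> x) {
|       return eve::sqrt[x >= 0.0f](x);
|   }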
| bob1029 wrote:
| I see a lot of "just use the GPU" and you'd often be right.
|
| SIMD on the CPU is most compelling to me due to the latency
| characteristics. You are nanoseconds away from the control flow.
| If the GPU needs some updated state regarding the outside world,
| it takes significantly longer to propagate this information.
|
| For most use cases, the GPU will win the trade off. But, there is
| a reason you don't hear much about systems like order matching
| engines using them.
| pclmulqdq wrote:
| You would be surprised. The GPU often loses even for small
| neural nets given the large latency. Anything that needs high
| throughput or is sized like an HPC problem should use a GPU,
| but a lot of code benefits from SIMD on small problems.
| gmueckl wrote:
| If you run many small tasks on the GPU, you can increase
| throughput by overlapping transfers and computation. There
| may also be other ways to batch problems together, but that
| depends on the algorithms.
|
| The one truly unfixable issue is round-trip latency.
| gopalv wrote:
| > The GPU often loses even for small neural nets given the
| large latency
|
| Apple's neural engine shows that you can live in between
| those two worlds.
|
| As you said, the trouble is the latency, but the programming
| model is still great.
| moldavi wrote:
| Do Apple's chips (M1 etc) change this at all, since they share
| memory with the GPU?
| bob1029 wrote:
| I think an argument could be made depending on the real world
| timings. How much closer in time is the Apple GPU vs one on a
| PCIe bus?
| one_even_prime wrote:
| Apple chips share the same physical memory between the GPU
| and the CPU. Still, they don't have USM/UVM (Unified Shared
| Memory/Unified Virtual Memory), that is, the GPU and the CPU
| can't access the same data concurrently and easily. Programs
| must map/unmap pages to control which device accesses it, and
| that's a very expensive operation.
| dragontamer wrote:
| Despite my 'Use a GPU' post below, you are absolutely correct.
|
| Maximizing performance on a CPU today requires all the steps in
| the above article, and the article is actually very well
| written with regards to the 'mindset' needed to tackle a
| problem such as this.
|
| It's a great article for people aiming to maximize the
| performance on Intel or AMD systems.
|
| ------
|
| CPUs have the memory-capacity advantage and will continue to
| hold said advantage for the foreseeable future (despite
| NVidia's NVLink and other techs that try to bridge the gap).
|
| And CPU code remains far easier than learning CUDA, despite
| how hard these AVX intrinsics are in comparison.
| a1o wrote:
| This reads like AI in multiple sentences; it's really
| off-putting to read.
| benchmarkist wrote:
| Looks like a great use case for AI. Set up the logical
| specification and constraints and let the AI find the optimal
| sequence of SIMD operations to fulfill the requirements.
| fooblaster wrote:
| No, there are decades of compiler literature for solving this
| problem.
| benchmarkist wrote:
| That's even better then. Just let the AI read the literature
| and write the optimal compiler.
| fooblaster wrote:
| It would probably be easier to clone the existing
| repository than to get an LLM to regurgitate LLVM.
| benchmarkist wrote:
| The AI would learn from LLVM as well.
| almostgotcaught wrote:
| lol, so says every person that has no clue how hard (NP-hard)
| combinatorial optimization is.
| benchmarkist wrote:
| For humans it's very hard but it will be a breeze for the AI.
| I thought HN was a community of builders. This is an obvious
| startup opportunity.
| stouset wrote:
| All we have to do is ascribe magical properties to AI and
| we can solve anything as if P=NP!
| benchmarkist wrote:
| Those distinctions are irrelevant for an AI because it is
| a pure form of intelligence that simply computes answers
| without worrying about P or NP complexity classes.
| TinkersW wrote:
| You can simplify the 2x sqrts into sqrt(a*b): fewer
| operations overall, so perhaps more accurate. It would also
| let you get rid of the funky lane swivels.
|
| As this would only use 1 lane, perhaps if you have multiple
| of these to normalize, you could vectorize across them.
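|
| I.e., assuming the usual three accumulated sums (dot, aa,
| bb; my naming), instead of dot / (sqrt(aa) * sqrt(bb)):
|
|   #include <cmath>
|
|   // One sqrt instead of two; identical result for
|   // non-negative aa and bb, since sqrt(aa) * sqrt(bb)
|   // == sqrt(aa * bb).
|   float cosine(float dot, float aa, float bb) {
|       return dot / std::sqrt(aa * bb);
|   }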
___________________________________________________________________
(page generated 2024-11-29 23:00 UTC)