[HN Gopher] Understanding SIMD: Infinite complexity of trivial problems
___________________________________________________________________
Understanding SIMD: Infinite complexity of trivial problems
Author : verdagon
Score : 235 points
Date : 2024-11-25 17:08 UTC (5 days ago)
(HTM) web link (www.modular.com)
(TXT) w3m dump (www.modular.com)
| Agingcoder wrote:
| This is the first time I've heard 'hyperscalar'. Is this
| generally accepted? (I've been using SIMD since the MMX days, so
| I'm a bit surprised.)
| dragontamer wrote:
| I don't think so.
|
| Superscalar is a real term (multiple operations in one clock
| tick due to parallel pipelines within a core). But hyperscalar
| is cringe to me. There are already plenty of words describing
| SIMD; it's unclear why someone would make up a new word to
| describe an existing concept.
|
| Especially when a similar word (superscalar) is already defined
| and is likely to be confused with this new word.
| ashvardanian wrote:
| That may have been my mistake. I use super & hyper
| interchangeably and don't always notice :)
|
| PS: Should be an easy patch, will update!
| dragontamer wrote:
| Maybe not.
|
| Superscalar is when, say... Think of the following assembly
| code:
|
|     add r1, r2
|     sub r3, r4
|
| And the add and subtract both happen on the same clock
| tick. The important thing is that modern CPU cores (and
| even GPU cores) have multiple parallel ALU pipelines inside
| of them.
|
| Because r1, r2, r3 and r4 are fully independent, a modern
| CPU can detect the potential parallelism here and act in
| parallel. After CPUs mastered this trick, out-of-order
| processors were invented next (which not only allowed for
| superscalar operations, but also allowed the subtract to
| execute first if for some reason the CPU core were waiting
| on r1 or r2).
|
| There are a ton of ways that modern CPUs and GPUs extract
| parallelism from seemingly nothing. And because all the
| techniques are independent, we can have superscalar out-of-
| order SIMD (like what happens in AVX512 in practice). SIMD
| is... SIMD. It's one instruction applied to lots of data in
| parallel. It's totally different.
|
| You really need to use the correct word for the specific
| kind of parallelism that you are trying to highlight. I
| expect that the only word that makes sense in this article
| is SIMD.
| pyrolistical wrote:
| I wish hardware exposed an API that allowed us to submit
| a tree of instructions, so the hardware doesn't need to
| figure out which instructions are independent.
|
| Lots of this kind of work can be done during compilation
| but cannot be communicated to the hardware, because the
| code is linear.
| dragontamer wrote:
| That's called VLIW and Intel Itanium is considered one of
| the biggest chip failures of all time.
|
| There is an argument that today's compilers are finally
| good enough for VLIW to go mainstream, but good luck
| convincing anyone in today's market to go for it.
|
| ------
|
| A big problem with VLIW is that it's impossible to
| predict L1, L2, L3 or DRAM access latencies, meaning
| loads/stores are impossible for the compiler to schedule.
|
| NVidia has interesting barriers that get compiled into
| its SASS (a level lower than PTX assembly). These
| barriers seem to allow the compiler to assist in the
| dependency management process, but they ultimately still
| require a final decode stage in the NVidia core before
| execution.
| neerajsi wrote:
| VLIW is kind of the dual of what pyrolistical was asking
| for. VLIW lets you bundle instructions that are known to
| be independent, rather than encode instructions to mark
| known dependencies.
|
| The idea pyrolistical mentioned is closer to explicit
| data graph execution:
| https://en.m.wikipedia.org/wiki/Explicit_data_graph_executio....
| creato wrote:
| VLIW is still in use in multiple DSP products on the
| market today, and they are good successful products in
| their niche.
|
| They work very well if your code can be written as a loop
| with no branches (or very limited branching) and with a lot
| of instruction-level parallelism in the body.
|
| Unfortunately for Intel, most code doesn't look like
| that. But for most workloads that happen to also be a
| good case for SIMD, it is (can be) great.
| bjourne wrote:
| Agree with this. Calling SIMD superscalar is a misnomer
| since it is _single instruction_ (multiple data) with
| very wide data paths. Superscalar implies multiple
| _different_ instructions in parallel, such as adding a
| pair of numbers, while subtracting another pair (or even
| dividing).
| spacemanspiff01 wrote:
| I thought it was referring to this?
|
| https://en.m.wikipedia.org/wiki/Hyperscale_computing
|
| I.e., our SIMD implementation allows you to scale across different
| architectures/CPU revisions without having to rewrite assembly
| for each processor?
|
| Edit: Rereading, that does not make much sense...
| Joker_vD wrote:
| > SIMD instructions are complex, and even Arm is starting to look
| more "CISCy" than x86!
|
| Thank you for saying it out loud. XLAT/XLATB of x86 is positively
| tame compared to e.g. vrgatherei16.vv/vrgather.vv.
| dragontamer wrote:
| Intel needs to see what has happened to their AVX instructions
| and why NVidia has taken over.
|
| If you just wrote your SIMD in CUDA 15 years ago, NVidia
| compilers would have given you maximum performance across all
| NVidia GPUs rather than being forced to write and rewrite in SSE
| vs AVX vs AVX512.
|
| GPU SIMD is still SIMD. Just... better at it. I think AMD and
| Intel GPUs can keep up, btw. But NVidia's software advantage and
| the long-term benefits of rewriting into CUDA are readily apparent.
|
| Intel ISPC is a great project btw if you need high level code
| that targets SSE, AVX, AVX512 and even ARM NEON all with one
| codebase + auto compiling across all the architectures.
|
| -------
|
| Intel's AVX512 is pretty good at a hardware level. But a software
| methodology for interacting with SIMD through GPU-like languages
| should be a priority.
|
| Intrinsics are good for maximum performance, but they are too hard
| for mainstream programmers.
| dist-epoch wrote:
| > If you just wrote your SIMD in CUDA 15 years ago, NVidia
| compilers would have given you maximum performance across all
| NVidia GPUs
|
| That's not true. For maximum performance you need to tweak the
| code to a particular GPU model/architecture.
|
| Intel has SSE/AVX/AVX2/AVX512, but CUDA has like 10 iterations
| of this (increasing capabilities). Code written 15 years ago
| would not use modern capabilities, like more flexible memory
| access, atomics.
| dragontamer wrote:
| Maximum performance? Okay, you'll have to upgrade to ballot
| instructions or whatever and rearchitect your algorithms. (Or
| other wavefront / voting / etc. etc. new instructions that
| have been invented. Especially those 4x4 matrix
| multiplication AI instructions).
|
| But CUDA -> PTX intermediate code has allowed for
| significantly more flexibility. For crying out loud, the
| entire machine code (aka SASS) of NVidia GPUs has been cycled
| out at least 4 times in the past decade (128-bit bundles,
| changes to instruction formats, acquire/release semantics,
| etc etc)
|
| It's amazing what backwards compatibility NVidia has achieved
| in the past 15 years thanks to this architecture. SASS
| changes so dramatically from generation to generation but the
| PTX intermediate code has stayed highly competitive.
| dist-epoch wrote:
| Intel code from 15 years ago also runs today. But it will
| not use AVX512.
|
| Which is the same with PTX, right? If you didn't use the
| tensor core instructions or wavefront voting in the CUDA
| code, the PTX generated from it will not either, and NVIDIA
| will not magically add those capabilities in when compiling
| to SASS.
|
| Maybe it remains competitive because the code is inherently
| parallel anyway, so it will naturally scale to fill the
| extra execution units of the GPU, which is where most of
| the improvement is generation to generation.
|
| While AVX code can't automatically scale to use the AVX512
| units.
| dragontamer wrote:
| It's not the same. AVX2 instructions haven't changed and
| never will change.
|
| In contrast, NVidia can go from 64-bit instruction
| bundles to 128-bit machine code (96-bit instruction +
| 32-bit control information) between Pascal (aka PTX
| Compute Capability 5) and Volta (aka PTX Compute Capability
| 7), and all the old PTX code just autocompiles to the new
| assembly instruction format and takes advantage of all
| the new memory barriers added in Volta.
|
| Having a PTX translation layer is a MAJOR advantage for
| the NVidia workflow.
| ashvardanian wrote:
| There is still a lot of similarity between CPU and GPU
| programming - between AVX and PTX. Different generations
| of CPU cores handle the same AVX2 instructions
| differently. The microcode changes and the schedulers
| change, but the process is transparent for the user,
| similar to PTX.
| mmoskal wrote:
| I imagine there is an order of magnitude of difference
| between how much you can translate in software, with
| large memory and significant time budget to work with,
| compared to microcode.
| dragontamer wrote:
| Most CPU instructions are 1-to-1 with their microcode. I
| dare say that microcode is nearly irrelevant; any high-
| performance instruction (ex: multiply, add, XOR, etc.
| etc.) is but a single micro-op anyway.
|
| Load/Store are memory dependent in all architectures. So
| that's just a different story as CPUs and GPUs have
| completely different ideas of how caches should work.
| (CPUs aim for latency, GPUs for bandwidth + incredibly
| large register spaces with substantial hiding of latency
| thanks to large occupancies).
|
| -------------
|
| That being said: reorder buffers on CPUs are well over
| 400 instructions these days, and super-large cores (like
| Apple's M4) are apparently on the order of 600 to 800
| instructions.
|
| Reorder buffers are _NOT_ translation. They're Tomasulo's
| algorithm
| (https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm).
| If you want to know how CPUs do out-of-order, study that.
|
| I'd say CPUs have small architectural register spaces (16
| registers, maybe 32), but large register files of maybe
| 300 or 400+. Tomasulo's algorithm is used to access
| registers out of order.
|
| You should think of instructions like "mov rax, [memory]"
| as closer to "rax = malloc(register); delayed-load(rax,
| memory); Out-of-order execute all instructions that don't
| use RAX ahead of us in instruction stream".
|
| Tomasulo's algorithm means using a ~300-entry register file
| to _pretend_ to be just 16 architectural registers. The 300
| registers keep the data out-of-order and allow you to
| execute. Registers in modern CPUs are closer to
| unique_ptr<int> in C++: assigning to one frees the old
| physical register (via the reorder buffer) and also mallocs
| a new register off the register file.
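|
| A toy sketch of that renaming idea (my own model; the register
| counts are made up, and real hardware only frees the old
| physical register once the instruction retires):
|
|     #include <array>
|     #include <vector>
|
|     struct RenameTable {
|       std::array<int, 16> map{};   // architectural -> physical
|       std::vector<int> free_list;  // unused physical registers
|
|       RenameTable() {
|         for (int a = 0; a < 16; ++a) map[a] = a;
|         for (int p = 16; p < 300; ++p) free_list.push_back(p);
|       }
|
|       // A write like "mov rax, [mem]" grabs a fresh physical
|       // register, so older in-flight readers of rax still see
|       // their own copy and can run out of order.
|       int rename_dest(int arch_reg) {
|         int phys = free_list.back();
|         free_list.pop_back();
|         map[arch_reg] = phys;  // later readers use the new reg
|         return phys;
|       }
|     };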
| janwas wrote:
| I hope people aren't writing directly to AVX2. When using
| a wrapper such as Highway, you get exactly this kind of
| update after a recompile, or even just by running your code
| on a CPU that supports newer instructions.
|
| The cost is that the binary carries around both AVX2 and
| AVX-512 codepaths, but that is not an issue IMO.
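|
| A minimal static-dispatch sketch of what such a kernel looks
| like (my own example, written from memory of the Highway API;
| the runtime-dispatch setup that picks AVX2 vs AVX-512 on the
| fly needs a bit more boilerplate than shown):
|
|     #include <cstddef>
|     #include "hwy/highway.h"
|     namespace hn = hwy::HWY_NAMESPACE;
|
|     // Scales an array in place; the same source compiles to
|     // SSE4, AVX2, AVX-512 or NEON depending on the target.
|     void Scale(float* p, std::size_t n, float s) {
|       const hn::ScalableTag<float> d;  // widest vector here
|       const auto vs = hn::Set(d, s);
|       std::size_t i = 0;
|       for (; i + hn::Lanes(d) <= n; i += hn::Lanes(d))
|         hn::Store(hn::Mul(hn::Load(d, p + i), vs), d, p + i);
|       for (; i < n; ++i) p[i] *= s;    // scalar tail
|     }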
| imtringued wrote:
| Most video encoders and decoders consist of kernels with
| hand-written SIMD instructions/intrinsics.
| janwas wrote:
| Agreed. FWIW we demonstrated with JPEG XL (image codec,
| though also with animation 'video' support) that it is
| possible to write such kernels using the portable Highway
| intrinsics.
| dragontamer wrote:
| > I hope people aren't writing directly to AVX2.
|
| Did you not read the article? It's using AVX intrinsics
| and NEON intrinsics.
| jandrewrogers wrote:
| Many use cases for SIMD aren't trivially expressible
| through wrappers and abstractions. It is sometimes
| cleaner, easier, and produces more optimized codegen to
| write the intrinsics directly. It isn't ideal but it
| often produces the best result for the effort involved.
|
| An issue with the abstractions that does not go away is
| that the optimal code architecture -- well above the
| level of the SIMD wrappers -- is dependent on the
| capabilities of the silicon. The wrappers can't solve for
| that. And if you optimize the code architecture for the
| silicon architecture, it quickly approximates writing
| architecture-specific intrinsics with an additional layer
| of indirection, which significantly reduces any notional
| benefit from the abstractions.
|
| The wrappers can't abstract enough, and higher level
| abstractions (written with architecture aware intrinsics)
| are often too use case specific to reuse widely.
| Remnant44 wrote:
| I would wager that most real world SIMD use is with
| direct intrinsics.
| jsheard wrote:
| > Intel ISPC is a great project btw if you need high level code
| that targets SSE, AVX, AVX512 and even ARM NEON
|
| It's pretty funny how NEON ended up in there. A former Intel
| employee decided to implement it for fun and submitted it as a
| pull request, which Intel quietly ignored for obvious reasons,
| but then _another_ former Intel employee who still had commit
| rights merged the PR, and the optics of publicly reverting it
| would be even worse than stonewalling, so Intel begrudgingly let
| it stand (but they did revoke that dev's commit rights).
|
| https://pharr.org/matt/blog/2018/04/29/ispc-retrospective
| pjmlp wrote:
| It is worse than that, given that AVX is the survivor of
| Larrabee's great plan to kill GPUs.
|
| Larrabee was going to take over it all, or so I gathered from
| its presentation at GDCE 2009, which I enjoyed.
| dragontamer wrote:
| I mean, 288 E-core Xeons are about to ship. Xeon 6900 series,
| right? (Estimated to ship in Q1 2025.)
|
| So Larrabee lives on for... some reason. These E cores are
| well known to be modified Intel Atom cores, and those were
| modified Xeon Phi cores, which were Larrabee-based.
|
| Just with... AVX512 being disabled. (Lost when Xeon Phi
| turned into Intel Atoms.)
|
| Intel's technical strategy is completely bonkers. In a bad
| way. Intel invented all this tech 10 to 20 years ago but
| fails to have a cohesive strategy to bring it to market.
| There are clearly smart people there, but somehow all the top-
| level decisions are just awful.
| ashvardanian wrote:
| Yes, a lot of weird decisions were made at Intel.
|
| Ironically, AMD waited so long to implement AVX-512, but
| now has it on both server and mobile chips (natively and via
| 256-bit emulation, respectively). Intel started the whole
| thing, has a very fragmented stack, and is now preparing
| those E cores with even more new extensions.
|
| Most importantly for Search and AI, it adds AVX_VNNI, which
| can be used for faster 8-bit integer dot-products:
| https://github.com/ashvardanian/SimSIMD/blob/75c426fb190a9d4...
|
| Would be interesting to see how matrix multiplication
| throughput will differ between AVX-512-capable P cores and
| a larger quantity of AVX_VNNI-capable E cores!
| alfiedotwtf wrote:
| A former Intel CEO even wrote a book where every product
| was planned 20+ years in advance.
|
| Imagine planning 20 years in advance while Moore's Law is
| still going strong. Come to think of it, Moore was also
| CEO of Intel, lol.
| Earw0rm wrote:
| And a few years later, Intel said we'd get AVX512 on
| everything by 2016, and that the instruction encoding
| supported a future extension to 1024 bits.
|
| And then the Skylake and Cannon Lake debacle...
|
| First they pulled it from the consumer chips a fairly short
| time before launch. Then the server chips it was present in
| would downclock aggressively when you did use it, so you
| could get at best maybe 40% more performance, certainly far
| from the 2x+ it promised.
|
| Ten years on and the AMD 9950X does a pretty good job with
| it, however.
| Earw0rm wrote:
| Oh, and I neglected to mention the protracted development,
| and short, miserable life, of Cannon Lake itself.
|
| First announced in 2013, it eventually shipped five years
| later in only a single, crippled dual-core mobile SKU,
| which lasted just a year in the market before they killed
| it off.
|
| "Let's put our only consumer implementation of our highest
| performing vector architecture on a lame-duck NUC chip.",
| good move guys.
| variadix wrote:
| How much of this is because CUDA is designed for GPU execution
| and because the GPU ISA isn't a stable interface? E.g. new GPU
| instructions can be utilized by new CUDA compilers for new
| hardware because the code wasn't written to a specific ISA?
| Also, don't people fine tune GPU kernels per architecture
| manually (either by hand or via automated optimizers that test
| combinations in the configuration space)?
| dragontamer wrote:
| NVidia PTX is a very stable interface.
|
| And the PTX-to-SASS compiler DOES do a degree of automatic fine-
| tuning between architectures. Nothing amazing or anything,
| but it's a minor speed boost that has made PTX just an easier
| 'assembly-like language' to build on top of.
| janwas wrote:
| My understanding is that there is a lot of hand-writing (not
| just fine-tuning) going on. AFAIK CuDNN and TensorRT are
| written directly as SASS, not CUDA. And the presence of FP8
| in H100, but not A100, would likely require a complete
| rewrite.
| dragontamer wrote:
| CUB, Thrust and many other libraries that make those
| kernels possible don't need to be rewritten.
|
| When you write a merge sort in CUDA, you can keep it across
| all versions. Maybe the new instructions can improve a few
| corner cases, but it's not like AVX to AVX512, where you
| need to rewrite everything.
|
| Ex: https://github.com/NVIDIA/cub/blob/main/cub/device/device_me...
| snihalani wrote:
| > software methodology to interact with SIMD using GPU-like
| languages should be a priority.
|
| What's your opinion on sycl?
|
| https://www.khronos.org/sycl/
| ip26 wrote:
| Is CUDA not more analogous to using MKL, rather than AVX?
| jabl wrote:
| Sometimes I wonder about an alternative-history scenario where
| CPU ISAs would have chosen a SIMT-style model instead of SIMD.
| "Just" have something like fork/join instructions to start/stop
| vector mode, and otherwise use the standard scalar instructions in
| both scalar and vector mode. It would have avoided a lot of
| combinatorial explosion in instructions. (Of course you'd have
| to do something for cross-lane operations, and later tensor
| instructions, etc.)
| janwas wrote:
| Not sure why SIMT would help; it requires more compiler
| transforms than if the code is written for packets/vectors or
| whatever we want to call them. As you note, cross-lane is a
| key part of a good SIMD abstraction. Vulkan calls it
| "subgroups", but from where I sit it's still SIMD.
| janwas wrote:
| Max performance is a stretch - recompilation would not utilize
| tensor cores, right?
|
| "too hard for mainstream programmers" seems overly pessimistic.
| I've run several workshops where devs have written dot-product
| kernels using Highway after 30 minutes of introduction.
| kristianp wrote:
| They said intrinsics. Highway is an abstraction on top of
| intrinsics.
| adrian_b wrote:
| While there is some truth in what you say, it makes it seem
| like writing in the CUDA style is something new and revolutionary
| invented by NVIDIA, which it is not.
|
| The CUDA style of writing parallel programs is nothing else
| than the use of the so-called "parallel do" a.k.a. "parallel
| for" program structure, which was already being discussed in
| 1963. Notable later evolutions of this concept have been
| present in "Communicating Sequential Processes" by C.A.R. Hoare
| (1978-08: "arrays of processes"), then in the programming
| language Occam, which was designed based on what Hoare had
| described, then in the OpenMP extension of Fortran (1997-10),
| then in the OpenMP extension of C and C++ (1998-10).
|
| Programming in CUDA does not bring anything new, except that in
| comparison e.g. with OpenMP some keywords are implicit and
| others are different, so the equivalence is not immediately
| obvious.
|
| Programming for CPUs in the much older OpenMP is equivalent
| to programming in CUDA for GPUs.
|
| The real innovation of NVIDIA has been the high quality of the
| NVIDIA CUDA compiler and CUDA runtime GPU driver, which are
| able to distribute the work that must be done on the elements
| of an array over all the available cores, threads and SIMD
| lanes, in a manner that is transparent to the programmer, so
| in many cases the programmer is free to ignore the actual
| structure of the GPU that will run the program.
|
| Previous compilers for OpenMP or for other such programming-
| language extensions for parallel programming have been much
| less capable of producing efficient parallel programs without
| being tuned by the programmer for each hardware variant.
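|
| As a rough sketch of that equivalence (my own example, with a
| hypothetical saxpy function), here is the "parallel for"
| structure in OpenMP; in CUDA the same loop body becomes a
| kernel where the index is derived from blockIdx/threadIdx
| instead of a loop:
|
|     #include <cstddef>
|
|     void saxpy(float a, const float* x, float* y,
|                std::size_t n) {
|       // The programmer only states that iterations are
|       // independent; the runtime spreads them over cores.
|       #pragma omp parallel for
|       for (long long i = 0; i < (long long)n; ++i)
|         y[i] += a * x[i];
|     }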
| dragontamer wrote:
| Oh all of this was being done in the 1980s by Lisp*
| programmers.
|
| I'm not calling it new. I'm just saying that the intrinsics
| style is much much harder than what Lisp*, DirectX HLSL,
| CUDA, OpenCL (etc. etc) does.
|
| A specialized SIMD language makes writing SIMD easier than
| the intrinsics style. Look at any CUDA code today, compare it
| to the AVX in the above article, and it becomes readily
| apparent.
| synack wrote:
| I've been playing with a new Lunar Lake laptop and they've
| complicated things even further with the Neural Processing Unit
| (NPU)
|
| Now if your vectors are INT8/FP8 you're supposed to shovel them
| into this accelerator via PCIe, rather than packing into
| registers for AVX512.
|
| I wish they'd just pick an interface for vector ops and stick
| with it.
| david-gpu wrote:
| _> If you just wrote your SIMD in CUDA 15 years ago, NVidia
| compilers would have given you maximum performance across all
| NVidia GPUs rather than being forced to write and rewrite in
| SSE vs AVX vs AVX512_
|
| NVidia compilers would have compiled your code into something
| functional, but if you want to approach peak performance you
| need to at least tweak your kernels, and sometimes rewrite them
| from scratch. See for example the various MMA instructions that
| were introduced over time.
|
| _Edit:_ I see somebody made a similar comment and you
| addressed it. Sorry for the churn.
| EVa5I7bHFq9mnYK wrote:
| C# vectors do a great job of simplifying those intrinsics in a
| safe and portable manner.
| ashvardanian wrote:
| There are dozens of libraries, frameworks, and compiler
| toolchains that try to abstract away SIMD capabilities, but I
| don't think it's a great approach.
|
| The only 2 approaches that still make sense to me:
|
| A. Writing serial, vectorization-aware code in a native compiled
| language, hoping your compiler will auto-vectorize.
|
| B. Implementing natively for every hardware platform, as the
| ISA differences are too big to efficiently abstract away
| anything beyond 128-bit-register float multiplication and
| addition.
|
| This article is, in a way, an attempt to show how big the
| differences are, even for simple data-parallel floating-point
| tasks.
| dzaima wrote:
| There's the middle-ground approach of having primarily
| target-specific operations, but with the intersecting ones named
| the same, and allowing you to easily build custom abstractions
| on top of them to paper over the differences however best makes
| sense for the given application. That's the approach
| https://github.com/mlochbaum/Singeli takes.
|
| There's a good amount of stuff that can clearly utilize SIMD
| without much platform-specificness, but doesn't easily
| autovectorize - early-exit checks in a loop, packed bit
| boolean stuff, some data rearranging, probing hashmap checks,
| some very-short-variable-length-loop things. And while there
| might often be some parts that do just need to be entirely
| target-specific, they'll usually be surrounded by stuff that
| doesn't (the loop, trip count calculation, loads/stores,
| probably some arithmetic).
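|
| To make the early-exit case concrete, here's a sketch of mine
| (not Singeli) of a "first index above a threshold" scan: the
| early return keeps compilers from auto-vectorizing it, but it
| is easy to write by hand with an AVX2 compare plus movemask:
|
|     #include <immintrin.h>
|     #include <cstddef>
|
|     // Index of the first element > limit, or n if none.
|     std::size_t first_above(const float* p, std::size_t n,
|                             float limit) {
|       const __m256 vlimit = _mm256_set1_ps(limit);
|       std::size_t i = 0;
|       for (; i + 8 <= n; i += 8) {
|         __m256 v = _mm256_loadu_ps(p + i);
|         int m = _mm256_movemask_ps(
|             _mm256_cmp_ps(v, vlimit, _CMP_GT_OQ));
|         if (m) return i + __builtin_ctz(m);  // early exit
|       }
|       for (; i < n; ++i)
|         if (p[i] > limit) return i;          // scalar tail
|       return n;
|     }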
| neonsunset wrote:
| Numerics in .NET are not a high-level abstraction and do out of
| the box what many mature vectorized libraries end up doing
| themselves - there is significant overlap between NEON, SSE*
| and, if we overlook vector width, AVX2/512 and WASM's
| PackedSIMD.
|
| .NET has roughly three vector APIs:
|
| - Vector<T>, a platform-defined-width vector that
| exposes a common set of operations
|
| - Vector64/128/256/512<T>, which have a wider API than the
| previous one
|
| - Platform intrinsics - basically immintrin.h
|
| Notably, platform intrinsics use the respective VectorXXX<T>
| types, which lets you write the common parts of an algorithm in
| a portable way and apply platform intrinsics in specific
| areas where it makes sense. Also, some methods have 'Unsafe'
| and 'Native' variants that allow a vector op to exhibit
| platform-specific behavior, like shuffles, since in many
| situations this is still the desired output for the common
| case.
|
| The .NET compiler produces codegen competitive with GCC, and
| sometimes Clang, for these. It's gotten particularly
| good at lowering AVX512.
| Earw0rm wrote:
| The library approach does a pretty good job in conjunction
| with a good compiler, and sensible algorithm design.
|
| You're writing C++ code, but as if it were shader code.
|
| I've seen impressive results with Clang doing this sort of
| thing.
| exDM69 wrote:
| I will respectfully disagree with your statement, with the
| caveat that I mostly dabble in arithmetic with 128b/256b
| float and int vectors.
|
| Using C or C++ with vector extensions (GCC/Clang) or Rust's
| (nightly) std::simd is very easy, and you get code that is
| portable to different CPUs and ISAs.
|
| But most importantly, they have a zero-cost fallback option to
| CPU-specific intrinsics when you need them. An f32x8 can be
| passed at zero cost as __m256 to any intrinsic in
| core::arch::x86_64 (or from xmmintrin.h in C++ land).
|
| You gain portable arithmetic, swizzles and SIMD vector
| types, but lose nothing. Not having to write everything twice
| for x86_64 and aarch64 is a huge win, even if it doesn't quite
| cover everything.
|
| Additionally, you can use wider vectors than your hardware
| supports; the compiler is able to split your f64x64 into 128-,
| 256- or 512-bit registers as needed, depending on the compile
| target.
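|
| A minimal sketch of that zero-cost fallback on x86-64 (my own
| example; assumes a reasonably recent GCC or Clang, where __m256
| is itself a 32-byte vector type, so the casts are free):
|
|     #include <immintrin.h>
|
|     typedef float f32x8 __attribute__((vector_size(32)));
|
|     f32x8 approx_rsqrt(f32x8 x) {
|       // Portable arithmetic works directly on the vector type...
|       f32x8 half_x = x * 0.5f;
|       // ...and the same value drops into a target-specific
|       // intrinsic with no conversion cost when needed.
|       f32x8 r = (f32x8)_mm256_rsqrt_ps((__m256)x);
|       // One Newton-Raphson step to sharpen the estimate.
|       return r * (1.5f - half_x * r * r);
|     }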
| kolbe wrote:
| I still like ispc, but that's not going to catch on.
| juancn wrote:
| The main problem is that there are no good abstractions in
| popular programming languages to take advantage of SIMD
| extensions.
|
| Also, the feature set being all over the place (e.g. integer
| support is fairly recent) doesn't help either.
|
| ISPC is a good idea, but the execution is meh... it's hard to
| set up and integrate.
|
| Ideally you would want to be able to easily use this from other
| popular languages, like Java, Python, Javascript, without having
| to resort to linking a library written in C/C++.
|
| Granted, language extensions may be required to approach
| something like that in an ergonomic way, but most somehow end up
| just mimicking what C++ does and exposing a pseudo-assembler.
| pjmlp wrote:
| The best is the GPU programming approach, with specialised
| languages.
|
| Just like using SQL is much more sane than using low-level C APIs
| to handle B-tree nodes.
|
| The language extensions help, but code still requires too much
| low level expertise, with algorithms and data structures having
| to take SIMD/MIMD capabilities into account anyway.
| Conscat wrote:
| I think the EVE library for C++ is a great abstraction. It's
| got an unusual syntax using subscript operator overloading, but
| that winds up being a very ergonomic and flexible way to
| program with masked-SIMD.
| secondcoming wrote:
| I'm not sure about EVE. I trialled it by trying to uppercase
| a string and even though I got it working in the end it was
| quite unpleasant. Their docs need to be better.
| Earw0rm wrote:
| std::experimental::simd is happening. It should be part of
| C++26.
| janwas wrote:
| Unfortunately a bit late :) Highway reached v1.0 about 2.5
| years ago. How long would it take until Clang/GCC/MSVC are
| ready, _and_ all users' distros have updated? Not to mention
| that the number of ops provided by std::experimental::simd is
| extremely limited - basically only math operators, and zero
| support for shuffling/crypto/rounding/interleaving/table
| lookups which seem indispensable for many applications.
| colejohnson66 wrote:
| C#'s Vector<T> does a pretty great job.
| bob1029 wrote:
| I see a lot of "just use the GPU" and you'd often be right.
|
| SIMD on the CPU is most compelling to me due to the latency
| characteristics. You are nanoseconds away from the control flow.
| If the GPU needs some updated state regarding the outside world,
| it takes significantly longer to propagate this information.
|
| For most use cases, the GPU will win the trade off. But, there is
| a reason you don't hear much about systems like order matching
| engines using them.
| pclmulqdq wrote:
| You would be surprised. The GPU often loses even for small
| neural nets given the large latency. Anything that needs high
| throughput or is sized like an HPC problem should use a GPU,
| but a lot of code benefits from SIMD on small problems.
| gmueckl wrote:
| If you run many small tasks on the GPU, you can increase
| throughput by overlapping transfers and computation. There
| may also be other ways to batch problems together, but that
| depends on the algorithms.
|
| The one truly unfixable issue is round-trip latency.
| gopalv wrote:
| > The GPU often loses even for small neural nets given the
| large latency
|
| Apple's neural engine shows that you can live in between
| those two worlds.
|
| As you said, the trouble is the latency; the programming
| model is still great.
| moldavi wrote:
| Do Apple's chips (M1 etc) change this at all, since they share
| memory with the GPU?
| bob1029 wrote:
| I think an argument could be made depending on the real world
| timings. How much closer in time is the Apple GPU vs one on a
| PCIe bus?
| one_even_prime wrote:
| Apple chips share the same physical memory between the GPU
| and the CPU. Still, they don't have USM/UVM (Unified Shared
| Memory/Unified Virtual Memory), that is, the GPU and the CPU
| can't access the same data concurrently and easily. Programs
| must map/unmap pages to control which device accesses it, and
| that's a very expensive operation.
| tubs wrote:
| They don't need to be unmapped just for the other one to
| use it. source: I wrote GPU drivers for over 10 years.
| Earw0rm wrote:
| Not much. Synchronisation of tasks is still a big overhead.
|
| If you've got tens to hundreds of microseconds worth of
| workload, sure, get the GPU to do it.
|
| But bear in mind 1000 clocks at 4GHz is 250ns; there's still
| a sizeable region where tight CPU/GPU integration isn't tight
| enough.
| dragontamer wrote:
| Despite my 'Use a GPU' post below, you are absolutely correct.
|
| Maximizing performance on a CPU today requires all the steps in
| the above article, and the article is actually very well
| written with regards to the 'mindset' needed to tackle a
| problem such as this.
|
| It's a great article for people aiming to maximize the
| performance on Intel or AMD systems.
|
| ------
|
| CPUs have the memory capacity advantage and will continue to
| hold said advantage for the foreseeable future (despite NVidia's
| NVLink and other techs that try to bridge the gap).
|
| And CPU code remains far easier than learning CUDA, despite how
| hard these AVX intrinsics are in comparison to CUDA.
| marmaduke wrote:
| > CPUs have the memory capacity advantage
|
| Perhaps, more precisely, they also have quite an advantage
| on anything that needs and plays nicely with caches? When I
| sliced my problem to maximize cache usage, I saw pretty clear
| scalability with cores: L1/L2 cache bandwidth is ~30GB/s, so
| e.g. a 32-core system starts to compete with the big consumer
| GPUs.
| dragontamer wrote:
| Caches between CPU and GPU are extremely complex.
|
| Not only because caching is complex in CPU land, but
| because GPU has a completely different set of caches (and
| registers) that depends entirely on architecture.
|
| Case in point: GPUs often have access to 256 architectural
| registers, aka 1024 bytes of register space. Now this
| depends on how much occupancy your GPU code is targeting
| (maybe 4-occupancy?), but there's a lot you can do with
| even 64 registers (aka 4-occupancy and 256 bytes of
| register space), and that is key to making those blazing fast
| FP16 matrix-multiplication kernels "for AI" everyone's so
| hot about right now.
|
| For something like FP16 matrix multiply (a very cache-
| friendly problem), the entire SIMD (all 32-lanes) of the
| GPU symmetric multiprocessor work together on the problem.
| So we're talking about an effective 32kB of __register
| space__ (let alone cache or other memory in the hierarchy).
|
| Even before FP16 matrix-multiply instructions, this absurd
| register space advantage is why GPUs were the king of
| matrix multiplication.
|
| --------
|
| GPUs are worse at larger cache sizes, say 1MB or 2MB. At
| 1MB+, a modern CPU core's L2 cache can hold all of that
| (either Intel P-core, AMD Zen5, or even Intel E-core can
| hold many MB in L2).
|
| GPUs have a secret though: a crossbar at the __shared__
| memory level. Rearranging memory and data across your lanes
| can be done through this crossbar (including many-to-one
| reductions for atomics, or one-to-many broadcasts in just a
| single clock tick). So your GPU lanes have incredible
| communication available to them (and this crossbar is the
| key to the ballot/voting-based horizontal compute of
| modern GPU styles). This is only ~64kB of space, shared
| between all GPU lanes, but with 1024 lanes supporting
| communication it's an important element of GPU memory.
|
| CPU L3 cache is very nice from a latency perspective, but
| bandwidth wise L3 cache is on the order of GPU's GDDR6x or
| HBM. 500GB/s to 2000GB/s, depending on the technology.
|
| Finally, CPU DDR5 RAM may be the slowest in this
| discussion, but it's the biggest. 2TB+ Xeon servers aren't
| even that expensive and can be assumed for any serious tech
| firm these days (ie: all-RAM Databases and whatnot).
|
| ---------
|
| So at different sizes and different use-scenarios, GPUs and
| CPUs will trade places. I'd expect CPUs to win most
| red-black tree races, but GPUs will win matrix
| multiplication. Both take advantage of "cache" but in very
| different ways.
| benchmarkist wrote:
| Looks like a great use case for AI. Set up the logical
| specification and constraints and let the AI find the optimal
| sequence of SIMD operations to fulfill the requirements.
| fooblaster wrote:
| No, there are decades of compiler literature for solving this
| problem.
| benchmarkist wrote:
| That's even better then. Just let the AI read the literature
| and write the optimal compiler.
| fooblaster wrote:
| It would probably be easier to clone the existing
| repository than to get an LLM to regurgitate LLVM.
| benchmarkist wrote:
| The AI would learn from llvm as well.
| almostgotcaught wrote:
| lol, so says every person that has no clue how hard (NP-hard)
| combinatorial optimization is.
| benchmarkist wrote:
| For humans it's very hard but it will be a breeze for the AI.
| I thought HN was a community of builders. This is an obvious
| startup opportunity.
| stouset wrote:
| All we have to do is ascribe magical properties to AI and
| we can solve anything as if P=NP!
| benchmarkist wrote:
| Those distinctions are irrelevant for an AI because it is
| a pure form of intelligence that simply computes answers
| without worrying about P or NP complexity classes.
| ConspiracyFact wrote:
| You had me going.
|
| B-
| imtringued wrote:
| That's the same as being confidently wrong.
| bartekpacia wrote:
| high quality bait
| TinkersW wrote:
| You can simplify the 2x sqrts into sqrt(a*b): fewer
| operations overall, so perhaps more accurate. It would also let
| you get rid of the funky lane swizzles.
|
| As this would only use 1 lane, perhaps if you have multiple of
| these to normalize, you could vectorize across them.
| a_gopher wrote:
| My thoughts exactly - crazy to know all these arcane SIMD
| opcodes but not basic maths!!
| ashvardanian wrote:
| Square root computation can be tricky, often relying on
| approximations. These approximations tend to perform best for
| mid-range values, while accuracy can degrade for very large
| or very small values. With this in mind, a product of roots
| is generally more accurate than a root of products.
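|
| A tiny illustration of the extreme-value case, assuming
| IEEE-754 doubles:
|
|     #include <cmath>
|     #include <cstdio>
|
|     int main() {
|       double a = 1e-300, b = 1e-300;
|       // a*b underflows to 0, so the root of the product is 0,
|       std::printf("%g\n", std::sqrt(a * b));
|       // while the product of the roots stays near 1e-300.
|       std::printf("%g\n", std::sqrt(a) * std::sqrt(b));
|     }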
|
| From a SIMD perspective, it's worth noting that on most
| platforms, the cost of computing one square root or two is
| the same. On modern x86 server CPUs, for instance, you can
| calculate up to 8 double-precision roots in parallel with
| identical latency. So there's no additional cost in terms of
| performance.
|
| I hope this sheds some light on the design of my code.
|
| PS: In a previous life, I did research in Astro- and Plasma
| Physics. While I don't claim to remember all the Math, it's
| usually more productive to ask for clarification than to
| assume ignorance ;)
| harry8 wrote:
| > it's usually more productive to ask for clarification
| than to assume ignorance ;)
|
| Good reminder for me and anyone else right there, nicely
| put.
| nine_k wrote:
| Moments like that are enlightening. When you see something
| really improbable (knowing advanced SIMD while appearing to
| ignore basic algebra), it's likely the moment you see a gap
| in _your_ picture of the world. So it's time to learn
| something new and likely unexpected (else you could have
| guessed it).
| rishi_devan wrote:
| Interesting article. The article mentions "...the NumPy
| implementation illustrates a marked improvement over the naive
| algorithm...", but I couldn't find a NumPy implementation in the
| article.
| andix wrote:
| Yes, they are really great at abstracting the SIMD operations,
| but the abstraction offers only very few common methods. I'm not
| sure how much real-world benefit those abstractions have.
|
| Once you need more complex operations, you need to use the
| specific operations from System.Runtime.Intrinsics.(X86|ARM)
| based on the current architecture. And you need to adjust your
| implementation to the CPU's capabilities. There are still a lot
| of older x64 CPUs around that don't have AVX512, for example.
| big-chungus4 wrote:
| Can the authors please share the NumPy code too?
| ashvardanian wrote:
| There are several ways to implement it in NumPy, often
| resulting in 20% variance. I've added a reference
| implementation to my mirror of the blogpost and the Modular
| team will soon update the original posting as well:
| https://ashvardanian.com/posts/understanding-simd-complexity...
| marmaduke wrote:
| My approach to this is to write a bunch of tiny "kernels" that
| are obvious candidates for SIMD and then inline them all, and it
| does a pretty good job on x86 and ARM:
|
| https://github.com/maedoc/tvbk/blob/nb-again/src/util.h
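|
| A sketch of the shape such a kernel takes (an illustrative
| example of mine, not copied from that file): fixed trip count,
| no branches, restrict-qualified pointers, so both GCC and Clang
| vectorize it once it's inlined at the call site:
|
|     // One tiny kernel: an 8-wide multiply-add that callers
|     // inline into larger loops.
|     static inline void axpy8(float* __restrict y,
|                              const float* __restrict x,
|                              float a) {
|       for (int i = 0; i < 8; ++i)
|         y[i] += a * x[i];
|     }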
| remram wrote:
| Did they write bfloat16 and bfloat32 when they meant float16 and
| float32?
|
| On the image:
| https://www.modular.com/blog/understanding-simd-infinite-com...
| sgerenser wrote:
| Yeah I was really confused at first, pretty sure they messed up
| the labels.
| ashvardanian wrote:
| Patched ;)
___________________________________________________________________
(page generated 2024-11-30 23:01 UTC)