[HN Gopher] C is not suited to SIMD (2019)
___________________________________________________________________
C is not suited to SIMD (2019)
Author : zetalyrae
Score : 72 points
Date : 2025-01-23 21:01 UTC (4 days ago)
(HTM) web link (blog.vmchale.com)
(TXT) w3m dump (blog.vmchale.com)
| PaulHoule wrote:
| So are those functional languages. The really cool SIMD code I
| see people writing now is by people like Lemire using the latest
| extensions who do clever things like decoding UTF-8 which I think
| will always take assembly language to express unless you are
| getting an SMT solver to write it for you.
| dagelf wrote:
| Link?
| mananaysiempre wrote:
| https://simdjson.org/publications/
|
| https://simdutf.github.io/simdutf/
|
| https://lemire.me/blog/
|
| https://branchfree.org/
|
| http://0x80.pl/
| throwawaymaths wrote:
| not even an SMT solver, that's not what they are for.
| 1propionyl wrote:
| No? SMT solvers underpin the majority of program synthesis
| research tools.
|
| When you get down to it, you're optimizing (searching) for
| some program that maximizes/minimizes some objective function
| with terms for error relative to specification/examples and
| size of synthesized program, while applying some form of
| cleverness to pruning the search space.
|
| This is absolutely within the wheelhouse of SMT solvers, and
| something that they are used for.
|
| SMT doesn't _have_ to be used, but for implementation it
| enables iterating on the concept more quickly (see also, more
| broadly: prototyping in Prolog), and in other cases it's
| simply the most effective tool for the job. So, it tends to
| get a lot of play.
| kibwen wrote:
| _> So are those functional languages._
|
| The author is suggesting array languages as the solution, which
| are a separate category from functional languages.
| janwas wrote:
| Lemire and collaborators often write in C++ intrinsics, or thin
| platform-specific wrappers on top of them.
|
| I count ~8 different implementations [1], which demonstrates
| considerable commitment :) Personally, I prefer to write once
| with portable intrinsics.
|
| https://github.com/simdutf/simdutf/tree/1d5b5cd2b60850954df5...
| AnimalMuppet wrote:
| The argument seems to be that C isn't suited to SIMD because some
| functions (exp, for instance) are not. But it then talks about
| how exp is implemented in hardware, in a way that forces it to be
| a non-SIMD calculation. That's not a C issue, that's a CPU issue.
| No language can save you there (unless it uses a software-only
| function for exp).
|
| Is this article really confused, or did I misunderstand it?
|
| The thing that makes C/C++ a good language for SIMD is how easily
| it lets you control memory alignment.
| high_na_euv wrote:
| Even high level languages like c# allow you to express such a
| thing
|
| https://stackoverflow.com/questions/73269625/create-memory-a...
|
| Also structlayout and fieldoffset
| leecarraher wrote:
| i think the article is confused. i write quite a bit of SIMD
| code for HPC and c is my goto for low level computing because
| of the low level memory allocation in c. In fact that tends to
| be the lion's share of work to implement simd operations where
| memory isn't the bottleneck. if only it were as simple as
| calling vfmadd132ps on two datatypes.
| woooooo wrote:
| Not in the article, but I've read that the way C does pointers,
| a single address with length implicit, makes it hard for
| compilers to assert that 2 arrays don't overlap/alias and this
| is an obstacle for generating SIMD instructions.
| 1718627440 wrote:
| Not really. Pointers from different allocations are
| guaranteed to never alias. If they do, it is UB.
| AnimalMuppet wrote:
| Right, but the compiler in general has no way to know that
| pointers are from different allocations - especially
| several function calls away from the allocations, with some
| pointer arithmetic in between.
| 1718627440 wrote:
| Yes, that is why restrict was introduced.
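A minimal sketch of what restrict buys the compiler (illustrative function; the name is not from the thread). With both pointers qualified, the compiler may assume they never overlap and vectorize the loop without runtime overlap checks:

```c
#include <stddef.h>

/* With restrict, the compiler may assume dst and src never alias,
   so it can use SIMD loads/stores without emitting overlap checks.
   Passing overlapping arrays here would be undefined behavior. */
void add_arrays(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}
```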
| bee_rider wrote:
| In general, the information as to whether or not the
| pointers will alias might not even be known, right? Like
| we could be compiling a library that will take in some
| pointers from the user.
| wayoverthecloud wrote:
| In C++ there is the aligned_alloc specifier.
| https://en.cppreference.com/w/c/memory/aligned_alloc
|
| Not sure for C
| ryanpetrich wrote:
| aligned_alloc is a C function. It doesn't help the compiler
| prove that two pointers can't alias. restrict is the
| keyword that does.
| alkh wrote:
| I thought you could simply add `restrict`[1] to indicate that
| only one pointer points to a specific object? Shouldn't this
| help?
|
| [1]https://en.cppreference.com/w/c/language/restrict
| aidenn0 wrote:
| You can add that qualifier, but nothing will enforce that
| objects don't overlap. So you replaced slow correct code
| with fast incorrect code.
| TheNewAndy wrote:
| If you believed the pointers didn't alias and they
| actually did then you replaced slow incorrect code with
| fast incorrect code
| aidenn0 wrote:
| That doesn't follow? If you replaced every call of memcpy
| with memmove (to use an example from the standard
| library), then your program is no less correct than it
| was before (and possibly more). The converse is that
| adding "restrict" qualifiers to existing functions can
| only make your program less correct, not more.
| drysine wrote:
| That's what the restrict type qualifier[0] is for.
|
| [0] https://en.cppreference.com/w/c/language/restrict
| bee_rider wrote:
| When compiling a library, it is generally impossible for the
| compiler to know whether or not the pointers will be aliased,
| right? That decision is made after the library is already
| compiled.
| drwu wrote:
| The function declaration in the header file of the library
| can carry the required 'restrict'. It works for c++
| invokers too, as most c++ compilers also support and check
| the __restrict__ for plain old pointer types.
| adgjlsfhk1 wrote:
| > But it then talks about how exp is implemented in hardware,
| in a way that forces it to be a non-SIMD calculation.
|
| This is especially surprising given that very few architectures
| have an `exp` in hardware. It's almost always done in software.
| stabbles wrote:
| I think the interpretation is more conceptual about functions,
| reusability and composability.
|
| The function exp has an abstract mathematical definition as
| a mapping from numbers to numbers, and you could implement that
| generically if the language allows for it, but in C you cannot
| because it's bound to the signature `double exp(double)` and
| fixed like that at compile time. You cannot use this function
| in a different context and pass e.g. __m256d.
|
| Basically because C does not have function dispatch, it's ill-
| suited for generic programming. You cannot write an algorithm
| on scalar, and now pass arrays instead.
| Someone wrote:
| > You cannot write an algorithm on scalar, and now pass
| arrays instead.
|
| It is limited to cases where you know all overloads
| beforehand and limited because of C's weak type system (can't
| have an overload for arrays of _int_ and one for pointers to
| _int_, for example), but you can.
| https://en.cppreference.com/w/c/language/generic:
| #include <math.h>
| #include <stdio.h>
|
| // Possible implementation of the tgmath.h macro cbrt
| #define cbrt(X) _Generic((X), \
|     long double: cbrtl, \
|     default: cbrt, \
|     float: cbrtf \
| )(X)
|
| int main(void)
| {
|     double x = 8.0;
|     const float y = 3.375;
|     printf("cbrt(8.0) = %f\n", cbrt(x));    // selects the default cbrt
|     printf("cbrtf(3.375) = %f\n", cbrt(y)); // converts const float to float,
|                                             // then selects cbrtf
| }
|
| It also, IMO, is a bit ugly, but that fits in nicely with the
| rest of C, as seen through modern eyes.
|
| There also is a drawback for lots of cases that you cannot
| overload operators. Implementing vector addition as '+' isn't
| possible, for example.
|
| Many compilers partially support that as a language
| extension, though, see for example
| https://clang.llvm.org/docs/LanguageExtensions.html#vectors-...
| uecker wrote:
| I am not sure what you mean by "weak" type system. C's type
| system is not weak IMHO and also differentiates between
| arrays and pointers:
|
|     _Generic(typeof(x), int*: 1, int[]: 2)
|
| _Generic with a type argument is a new feature; it is a bit
| less elegant without it:
|
|     _Generic((typeof(x)*)0, typeof(int*)*: 1, typeof(int[])*: 2)
| glouwbug wrote:
| Any pointer converts to void* without warning. That's
| pretty weak.
| ta3401 wrote:
| Perhaps not in the sense of weak vs strong, but rather in
| weak vs powerful. C type system isn't very powerful, if
| you compare it to languages from the scientific community
| (Python is an exception here imho). It is simple and
| probably good enough for what it was designed for (system
| programming), but it's not very good for scientific
| computing.
| uecker wrote:
| You can definitely implement generic algorithms in C and
| there are different ways to do this. For example, you create
| an abstract data type "ring" with addition, multiplication,
| and multiplication with a scalar, and pass objects to this
| type and functions for these three operations to a generic
| exponential function.
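A hedged sketch of that approach (the names `ring` and `ring_pow` are illustrative, not from any real library): an abstract type described by an element size plus a function pointer, and a generic power routine that never mentions a concrete element type:

```c
#include <string.h>

/* Illustrative "ring": element size plus a multiplication over
   opaque elements. A generic pow routine works for any such type. */
typedef struct {
    size_t size;                                  /* sizeof one element */
    void (*mul)(void *out, const void *a, const void *b);
} ring;

/* out = base^n for n >= 1, using tmp as scratch space. */
static void ring_pow(const ring *r, void *out, const void *base,
                     unsigned n, void *tmp)
{
    memcpy(out, base, r->size);
    for (unsigned i = 1; i < n; i++) {
        r->mul(tmp, out, base);
        memcpy(out, tmp, r->size);
    }
}

/* One instantiation: doubles under ordinary multiplication. */
static void double_mul(void *out, const void *a, const void *b)
{
    *(double *)out = *(const double *)a * *(const double *)b;
}
```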
| ta3401 wrote:
| > But it then talks about how exp is implemented in hardware,
| in a way that forces it to be a non-SIMD calculation.
|
| The fact that exp is implemented in hardware is not the
| argument. The argument is that exp is a library function,
| compiled separately, and thus the compiler cannot inline the
| function and fuse it with an array-wide loop, to detect later
| on an opportunity to generate the SIMD instructions.
|
| It is true however that exp is implemented in hardware in the
| X86 world, and to be fair, perhaps a C compiler only needs to
| represent that function as an intrinsic instead, to give itself
| a chance to later replace it either with the function call or
| some SIMD instruction; but I guess that the standard doesn't
| provide that possibility?
| notorandit wrote:
| C was invented when CPUs, those few available, were just
| single-core and single-threaded.
|
| Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use
| (inline) assembler functions. Or design your own language.
| LiamPowell wrote:
| That's missing the point of the article, but it's also not
| true, at least not at the time of C89. This can be easily
| verified by a Google search. As an aside, Ada also had multi-
| threading support before C89 was released, but this article is
| about SIMD, not multi-threading.
| johnnyjeans wrote:
| c89 is 19 years into c's life
| LiamPowell wrote:
| Yes, but before standardisation there were many
| implementations all doing different things based on a book
| with no particularly formal specifications, so picking 1989
| makes more sense than 1970 in my opinion. Multi-threading
| also existed in 1970 but wasn't as widespread.
| dragontamer wrote:
| Fortran is older but its array manipulation maps naturally to
| SIMD.
| mlochbaum wrote:
| Several comments seem confused about this point: the article is
| not about manual SIMD, which of course C is perfectly capable of
| with intrinsics. It's discussing problems in compiling
| architecture-independent code to SIMD instructions, which in
| practice C compilers very often fail to do (so, exp having scalar
| hardware support may force _other_ arithmetic not to be
| vectorized). An alternative mentioned is array programming, where
| operations are run one at a time on all the data; these languages
| serve as a proof of concept that useful programs can be run in a
| way that uses SIMD nearly all the time it's applicable, but at
| the cost of having to write every intermediate result to memory.
| So the hope is that "more fluent methods for compilation" can
| generate SIMD code without losing the advantages of scalar
| compilation.
|
| As an array implementer I've thought about the issue a lot and
| have been meaning to write a full page on it. For now I have some
| comments at
| https://mlochbaum.github.io/BQN/implementation/versusc.html#...
| and the last paragraph of
| https://mlochbaum.github.io/BQN/implementation/compile/intro....
| mlochbaum wrote:
| Got the introduction written at
| https://mlochbaum.github.io/BQN/implementation/compile/fusio....
| brundolf wrote:
| Thank you. The title is confusing
| mlochbaum wrote:
| Finding it increasingly funny how many people have come out of
| the woodwork to "defend" C by offering this or that method of
| explicit vectorization. What an indictment! C programmers can
| no longer even conceive of a language that works from a
| description of what should be done and not which instructions
| should be used!
| jiehong wrote:
| It seems that Zig is well suited for writing SIMD [0].
|
| If only GPU makers could standardise an extended ISA like AVX on
| CPU, and we could all run SIMD or SIMT code without needing any
| libraries, just our compilers.
|
| [0]: https://zig.guide/language-basics/vectors/
| dundarious wrote:
| I like zig, but no, what it provides pales in comparison to
| what intrinsics offer.
| camel-cdr wrote:
| This depends entirely on compiler support. Intel's ICX compiler
| can easily vectorize a sigmoid loop by calling SVML's vectorized
| expf function: https://godbolt.org/z/no6zhYGK6
|
| If you implement a scalar expf in a vectorizer friendly way, and
| it's visible to the compiler, then it could also be vectorized:
| https://godbolt.org/z/zxTn8hbEe
| dzaima wrote:
| gcc and clang are also capable of it, given certain compiler
| flags: https://godbolt.org/z/z766hc64n
| camel-cdr wrote:
| Thanks, I didn't know about this. Interesting that it seems
| to require fast-math.
| ComputerGuru wrote:
| That means it'll never be used. ffast-math is verboten in
| most serious codebases.
| dzaima wrote:
| Seems "-fno-math-errno" is enough for clang. gcc needs a
| whole "-fno-math-errno -funsafe-math-optimizations
| -ffinite-math-only".
| Avamander wrote:
| So is the optimization "wrong" or "unsafe" even in the
| case of Intel's ICX compiler? Is it that you can't
| express the right (error) semantics in C?
|
| I'm just wondering why those two require the flag and the
| other doesn't.
| dzaima wrote:
| ICX appears to just default to fast-math:
| https://godbolt.org/z/jzPazGjoh
|
| Requiring -fno-math-errno is sane enough, essentially
| noone needs math errno anyway (and that flag is needed to
| vectorize even sqrt, for which there's a proper full
| hardware SIMD instruction, but which obviously doesn't
| set errno on a negative input or whatever).
|
| Probably depends on the vector math library used whether
| it handles near-infinity or inf/NaN values properly. And
| there's also the potential concern that the scalar and
| vector exp() likely give different results, leading to
| weird behavior, which might be justification for
| -funsafe-math-optimizations.
| bee_rider wrote:
| I'm not sure what "suited to SIMD" means exactly in this
| context. I mean, it is clearly possible for a compiler to apply
| some SIMD optimizations. But the program is essentially
| expressed as a sequential thing, and then the compiler
| discovers the SIMD potential. Of course, we write programs that
| we hope will make it easy to discover that potential. But it
| can be difficult to reason about how a compiler is going to
| optimize, for anything other than a simple loop.
| camel-cdr wrote:
| Suited for SIMD means you write the scalar equivalent of what
| you'd do on a single element in a SIMD implementation.
|
| E.g. you avoid lookup tables when you can, or only use
| smaller ones you know to fit in one or two SIMD registers.
| gcc and clang can't vectorize it as is, but they do if you
| remove the branches that handle infinity and over/under-flow.
|
| In the godbolt link I copied the musl expf implementation and
| icx was able to vectorize it, even though it uses a LUT too
| large for SIMD registers.
|
| #pragma omp simd and equivalents will encourage the compiler
| to vectorize a specific loop and produce a warning if a loop
| isn't vectorized.
| bee_rider wrote:
| I shouldn't have started my comment with the sort of
| implied question or note of confusion. Sorry, that was
| unclear communication.
|
| I agree that it is possible to write some C programs that
| some compilers will be able to discover the parallel
| potential of. But it isn't very ergonomic or dependable.
| So, I think this is not a strong counter-argument to the
| theory of the blog post. It is possible to write SIMD
| friendly C, but often it is easier for the programmer to
| fall back to intrinsics to express their intent.
| sifar wrote:
| It means auto-vectorization. Write scalar code that can be
| automatically vectorized by the compiler by using SIMD
| instructions
| Vosporos wrote:
| Woo, very happy to see more posts by McHale these days!
| Const-me wrote:
| I would rather conclude that automatic vectorizers are still less
| than ideal, despite SIMD instructions have been widely available
| in commodity processors for 25 years now.
|
| The language is actually great with SIMD, you just have to do it
| yourself with intrinsics, or use libraries. BTW, here's a library
| which implements 4-wide vectorized exponent functions for FP32
| precision on top of SSE, AVX and NEON SIMD intrinsics (MIT
| license):
| https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
| nine_k wrote:
| Why does FORTRAN, which is older than C, have no trouble with
| auto-vectorization?
|
| Mostly because of a very different memory model, of course.
| Arrays are first-class things, not hacks on top of pointers,
| and can't alias each other.
| MisterTea wrote:
| > Arrays are first-class things, not hacks on top of
| pointers, and can't alias each other.
|
| "Hacks" feels like the wrong term to use here. Comparing
| managed objects surrounded by runtime to memory isn't a fair
| comparison.
| nine_k wrote:
| There's no language-level difference between int[] and int*
| in C. Indexing is just pointer arithmetics. No support for
| (non-jagged) multidimensional arrays either.
|
| A proper array would at least know its length. It does not
| require a lot of runtime to do, and most C code requires
| libc anyway for basic stuff like malloc().
| jjmarr wrote:
| This isn't true. int[] decays into an int* but is a
| different type.
|
| An array member of a struct will have its data allocated
| contiguously with other struct members (subject to
| compiler padding). An int* would point outside the
| struct. This is possible even with variable length
| arrays. Specifically, you can declare the final member of
| a struct as an int[] which makes it a "flexible array
| member". Then you have to malloc the struct to be the
| sizeof the struct + whatever size you want the array to
| be.
|
| https://en.wikipedia.org/wiki/Flexible_array_member
|
| This is used in struct sockaddr for generic APIs where
| the actual struct can be of different sizes.
|
| https://lwn.net/Articles/997094/
|
| This is rather pedantic but the memory model for arrays
| _is_ important here. If you're iterating over multiple
| struct members that are arrays, the fact the arrays are
| stored contiguously in memory is going to matter to SIMD.
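A minimal sketch of the flexible array member pattern described above (the struct and helper names are hypothetical):

```c
#include <stdlib.h>

/* The int[] member's storage is contiguous with len, unlike an int*
   member, which would point at a separate allocation. */
struct packet {
    size_t len;
    int data[];        /* flexible array member: must be last */
};

/* Allocate the struct plus room for n trailing ints in one block. */
struct packet *packet_new(size_t n)
{
    struct packet *p = malloc(sizeof *p + n * sizeof(int));
    if (p) p->len = n;
    return p;
}
```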
| nine_k wrote:
| This is fair, thanks!
| Gibbon1 wrote:
| The C language spec isn't what's important here; it's that the
| ABI doesn't know the difference.
| jjmarr wrote:
| if using an int* over an int[] changes the memory layout
| of a struct, that necessarily implies a difference in the
| ABI.
|
| As an example, C++'s std::array can be implemented as a
| struct containing a fixed-size C-style array. This can be
| passed-by-value. This means that returning a std::array
| can be more performant than returning a std::vector
| (which might be implemented as an int* that is
| reallocated when you add too many elements), because a
| std::array is returned on the stack while a std::vector
| is returned on the heap.
|
| I was bit by this once when returning a
| std::unordered_map because I was doing a heap-allocation
| in a performance-critical section.
| MisterTea wrote:
| > There's no language-level difference between int[] and
| int* in C
|
| sizeof(int[n]) vs sizeof(int*). Then there's: 'int
| getint(int *I){return *I;}' which won't let me pass
| '&int[n]' but I can 'int* = int[n]' then pass &int*. It's
| not so much a hack but a very primitive implementation on
| top of pointers (alright maybe it is a bit hacky.)
|
| > A proper array would at least know its length. It does
| not require a lot of runtime to do, and most C code
| require libc anyway for basic stuff like malloc().
|
| Something like a fat pointer 'typedef struct {long len;
| void *data;} MyArray;' then make a libMyArray to operate
| on MyArray objects. But now you have a libMyArray
| dependency. You also lose type information unless you
| add fields to MyArray with an enum or have a MyArray for
| each type, which gets messy. Then you can't do shit like
| just use an identifier in a conditional e.g.
| 'while(MyArray[n++])' unless you change how C works, which
| makes it notC (uh, maybe pick another name.)
|
| I feel that C is fine as it's an old machine oriented
| language. Leave it alone. I would look to reach for
| another tool when the primitive design of C makes life
| difficult.
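A runnable sketch of the fat-pointer idea from the comment above (MyArray and its helpers are illustrative, not an existing library):

```c
#include <stdlib.h>

/* Length-carrying array: the length travels with the data pointer,
   so bounds are known without a sentinel or a separate argument. */
typedef struct {
    size_t len;
    int *data;
} MyArray;

MyArray MyArray_new(size_t len)
{
    MyArray a = { len, calloc(len, sizeof(int)) };
    return a;
}

int MyArray_sum(MyArray a)
{
    int sum = 0;
    for (size_t i = 0; i < a.len; i++)
        sum += a.data[i];
    return sum;
}
```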
| Const-me wrote:
| I believe it's the ecosystem. For example, Python is a very
| high-level language; I'm not sure it even has a memory model. But it
| has libraries like NumPy which support all these vectorized
| exponents when processing long arrays of numbers.
| goku12 wrote:
| Everything that's heavily vectorized in the Python
| ecosystem, including numpy achieves it using optimized
| backend code written in other languages - fortran in
| particular. Python is only a thin veneer over those
| backends. In fact, you're constantly reminded to offload
| the control flow as much as possible to those backends for
| the sake of performance, instead of doing things like
| looping in Python. If that's enough to consider Python to
| be good in vectorization, I can just link high performance
| fortran libraries with C, handle non-vector control flow
| from there and call it a day. I guarantee you that this
| arrangement will be far more performant than what Python
| can ever achieve. I have to strongly agree with the other
| commenter's observation that the memory model is the key to
| vector performance.
|
| And of course Python has a memory model. While that model
| is not as well understood as C's model, it is the key to
| Python's success and popularity as a generic programming
| language and as a numeric/scientific programming language.
| Python's memory model unlike C's or Fortran's, isn't
| designed for high performance. It's designed for rich
| abstractions, high ergonomics and high interoperability
| with those performant languages. For most people, the
| processing time lost executing python code is an acceptable
| tradeoff for the highly expressive control that Python
| gives them over the scheduling of lower level operations.
| ilayn wrote:
| NumPy has no Fortran code, for quite a long time now.
| SciPy has and it is being rewritten. What you mention is
| the ufunc machinery underneath which is all C. NumPy also
| has SIMD support (albeit limited to certain functions).
| BLAS is also C/Assembly but only LAPACK is F77 (which is
| too much code to be rewritten).
|
| This does not mean Fortran is bad (obligatory disclaimer
| for Fortran fans).
| nitwit005 wrote:
| When I tested this before (some years ago), adding "restrict"
| to the C pointers resulted in the same behavior, so the
| aliasing default seem to be the key issue.
| Keyframe wrote:
| isn't DirectXMath C++, not C?
| npalli wrote:
| Yes, it is in C++.
| Const-me wrote:
| Yeah, but if you need C versions of low-level SIMD algorithms
| from there, copy-paste the implementation and it will
| probably compile as C code. That's why I mentioned the MIT
| license: copy-pasting from GPL libraries may or may not work
| depending on the project, but the MIT license is copy-paste
| friendly.
| svilen_dobrev wrote:
| reminds me of this old article.. ~2009, Sun/MIT - "The Future is
| Parallel" .. and "sequential decomposition and usual math are
| wrong horse.."
|
| https://ocw.mit.edu/courses/6-945-adventures-in-advanced-sym...
| npalli wrote:
| C++ does OK.
|
| Google's highway [1]
|
| Microsoft DirectXMath [2]
|
| [1] https://github.com/google/highway
|
| [2] https://github.com/microsoft/DirectXMath
| commandlinefan wrote:
| This seems more of a generalization of Richard Steven's
| observation:
|
| "Modularity is the enemy of performance"
|
| If you want optimal performance, you have to collapse the layers.
| Look at Deepseek, for example.
| vkaku wrote:
| I also think that vectorizers and compilers can detect parallel
| memory adds/moves/subs, but many do not take the time to
| provide adequate hints to the compiler about them.
|
| Some people have vectorized successfully with C, even with all
| the hacks/pointers/union/opaque business. It requires careful
| programming, for sure. The ffmpeg cases are super good examples
| of how compiler misses happen, and how to optimize for full
| throughput in those cases. Worth a look for all compiler
| engineers.
___________________________________________________________________
(page generated 2025-01-27 23:01 UTC)