[HN Gopher] C is not suited to SIMD (2019)
       ___________________________________________________________________
        
       C is not suited to SIMD (2019)
        
       Author : zetalyrae
       Score  : 72 points
       Date   : 2025-01-23 21:01 UTC (4 days ago)
        
 (HTM) web link (blog.vmchale.com)
 (TXT) w3m dump (blog.vmchale.com)
        
       | PaulHoule wrote:
        | So are those functional languages. The really cool SIMD code I
        | see people writing now is by people like Lemire, using the latest
        | extensions to do clever things like decoding UTF-8, which I think
        | will always take assembly language to express unless you are
        | getting an SMT solver to write it for you.
        
         | dagelf wrote:
         | Link?
        
           | mananaysiempre wrote:
           | https://simdjson.org/publications/
           | 
           | https://simdutf.github.io/simdutf/
           | 
           | https://lemire.me/blog/
           | 
           | https://branchfree.org/
           | 
           | http://0x80.pl/
        
         | throwawaymaths wrote:
         | not even an SMT solver, that's not what they are for.
        
           | 1propionyl wrote:
           | No? SMT solvers underpin the majority of program synthesis
           | research tools.
           | 
           | When you get down to it, you're optimizing (searching) for
           | some program that maximizes/minimizes some objective function
           | with terms for error relative to specification/examples and
           | size of synthesized program, while applying some form of
           | cleverness to pruning the search space.
           | 
           | This is absolutely within the wheelhouse of SMT solvers, and
           | something that they are used for.
           | 
           | SMT doesn't _have_ to be used, but for implementation it
           | enables iterating on the concept more quickly (see also, more
            | broadly: prototyping in Prolog), and in other cases it's
           | simply the most effective tool for the job. So, it tends to
           | get a lot of play.
        
         | kibwen wrote:
         | _> So are those functional languages._
         | 
         | The author is suggesting array languages as the solution, which
         | are a separate category from functional languages.
        
         | janwas wrote:
         | Lemire and collaborators often write in C++ intrinsics, or thin
         | platform-specific wrappers on top of them.
         | 
         | I count ~8 different implementations [1], which demonstrates
         | considerable commitment :) Personally, I prefer to write once
         | with portable intrinsics.
         | 
         | https://github.com/simdutf/simdutf/tree/1d5b5cd2b60850954df5...
        
       | AnimalMuppet wrote:
       | The argument seems to be that C isn't suited to SIMD because some
       | functions (exp, for instance) are not. But it then talks about
       | how exp is implemented in hardware, in a way that forces it to be
       | a non-SIMD calculation. That's not a C issue, that's a CPU issue.
       | No language can save you there (unless it uses a software-only
       | function for exp).
       | 
       | Is this article really confused, or did I misunderstand it?
       | 
       | The thing that makes C/C++ a good language for SIMD is how easily
       | it lets you control memory alignment.
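        | 
        | For instance, controlling alignment is straightforward (a minimal
        | sketch assuming C11; the names are illustrative):
        | 
        |     #include <stdalign.h>
        |     #include <stdlib.h>
        | 
        |     /* 32-byte-aligned static buffer for aligned AVX loads. */
        |     alignas(32) static float buf[1024];
        | 
        |     /* 32-byte-aligned heap allocation; aligned_alloc wants a
        |        size that is a multiple of the alignment, so round up. */
        |     float *make_aligned(size_t n)
        |     {
        |         size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
        |         return aligned_alloc(32, bytes);
        |     }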
        
         | high_na_euv wrote:
          | Even high-level languages like C# allow you to express such a
          | thing.
         | 
         | https://stackoverflow.com/questions/73269625/create-memory-a...
         | 
         | Also structlayout and fieldoffset
        
         | leecarraher wrote:
          | I think the article is confused. I write quite a bit of SIMD
          | code for HPC, and C is my go-to for low-level computing because
          | of the low-level memory allocation in C. In fact, that tends to
          | be the lion's share of the work to implement SIMD operations
          | where memory isn't the bottleneck. If only it were as simple as
          | calling vfmadd132ps on two datatypes.
        
         | woooooo wrote:
         | Not in the article, but I've read that the way C does pointers,
         | a single address with length implicit, makes it hard for
         | compilers to assert that 2 arrays don't overlap/alias and this
         | is an obstacle for generating SIMD instructions.
        
           | 1718627440 wrote:
           | Not really. Pointers from different allocations are
           | guaranteed to never alias. If they do, it is UB.
        
             | AnimalMuppet wrote:
             | Right, but the compiler in general has no way to know that
             | pointers are from different allocations - especially
             | several function calls away from the allocations, with some
             | pointer arithmetic in between.
        
               | 1718627440 wrote:
               | Yes, that is why restrict was introduced.
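                | 
                | A minimal sketch (the function is just illustrative):
                | with restrict the compiler may assume the arrays never
                | overlap, so it is free to vectorize the loop.
                | 
                |     /* At -O2/-O3, gcc and clang typically emit SIMD adds
                |        here because restrict rules out aliasing. */
                |     void add_arrays(float *restrict a,
                |                     const float *restrict b, int n)
                |     {
                |         for (int i = 0; i < n; i++)
                |             a[i] += b[i];
                |     }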
        
               | bee_rider wrote:
               | In general, the information as to whether or not the
               | pointers will alias might not even be known, right? Like
               | we could be compiling a library that will take in some
               | pointers from the user.
        
           | wayoverthecloud wrote:
            | In C++ there is the aligned_alloc specifier.
           | https://en.cppreference.com/w/c/memory/aligned_alloc
           | 
           | Not sure for C
        
             | ryanpetrich wrote:
             | aligned_alloc is a C function. It doesn't help the compiler
             | prove that two pointers can't alias. restrict is the
             | keyword that does.
        
           | alkh wrote:
           | I thought you could simply add `restrict`[1] to indicate that
           | only one pointer points to a specific object? Shouldn't this
           | help?
           | 
           | [1]https://en.cppreference.com/w/c/language/restrict
        
             | aidenn0 wrote:
             | You can add that qualifier, but nothing will enforce that
             | objects don't overlap. So you replaced slow correct code
             | with fast incorrect code.
        
               | TheNewAndy wrote:
               | If you believed the pointers didn't alias and they
               | actually did then you replaced slow incorrect code with
               | fast incorrect code
        
               | aidenn0 wrote:
               | That doesn't follow? If you replaced every call of memcpy
               | with memmove (to use an example from the standard
               | library), then your program is no less correct than it
               | was before (and possibly more). The converse is that
               | adding "restrict" qualifiers to existing functions can
               | only make your program less correct, not more.
        
           | drysine wrote:
           | That's what the restrict type qualifier[0] is for.
           | 
           | [0] https://en.cppreference.com/w/c/language/restrict
        
           | bee_rider wrote:
           | When compiling a library, it is generally impossible for the
           | compiler to know whether or not the pointers will be aliased,
           | right? That decision is made after the library is already
           | compiled.
        
             | drwu wrote:
             | The function declaration in the header file of the library
              | can carry the required 'restrict'. It works for C++ callers
              | too, as most C++ compilers also support and check
              | __restrict__ for plain old pointer types.
        
         | adgjlsfhk1 wrote:
         | > But it then talks about how exp is implemented in hardware,
         | in a way that forces it to be a non-SIMD calculation.
         | 
         | This is especially surprising given that very few architectures
         | have an `exp` in hardware. It's almost always done in software.
        
         | stabbles wrote:
         | I think the interpretation is more conceptual about functions,
         | reusability and composability.
         | 
         | The function exp has an abstract mathematical definition as
         | mapping from numbers to numbers, and you could implement that
         | generically if the language allows for it, but in C you cannot
         | because it's bound to the signature `double exp(double)` and
         | fixed like that at compile time. You cannot use this function
         | in a different context and pass e.g. __m256d.
         | 
            | Basically, because C does not have function dispatch, it's
            | ill-suited for generic programming. You cannot write an
            | algorithm on scalars and then pass arrays instead.
        
           | Someone wrote:
           | > You cannot write an algorithm on scalar, and now pass
           | arrays instead.
           | 
           | It is limited to cases where you know all overloads
           | beforehand and limited because of C's weak type system (can't
           | have an overload for arrays of _int_ and one for pointers to
           | _int_ , for example), but you can.
            | https://en.cppreference.com/w/c/language/generic:
            | 
            |     #include <math.h>
            |     #include <stdio.h>
            | 
            |     // Possible implementation of the tgmath.h macro cbrt
            |     #define cbrt(X) _Generic((X),     \
            |                   long double: cbrtl, \
            |                       default: cbrt,  \
            |                         float: cbrtf  \
            |               )(X)
            | 
            |     int main(void)
            |     {
            |         double x = 8.0;
            |         const float y = 3.375;
            |         // selects the default cbrt
            |         printf("cbrt(8.0) = %f\n", cbrt(x));
            |         // converts const float to float, then selects cbrtf
            |         printf("cbrtf(3.375) = %f\n", cbrt(y));
            |     }
           | 
            | It also, IMO, is a bit ugly, but that fits in nicely with the
            | rest of C, as seen through modern eyes.
            | 
            | There is also the drawback, in lots of cases, that you cannot
            | overload operators. Implementing vector addition as '+' isn't
            | possible, for example.
           | 
            | Many compilers partially support that as a language
            | extension, though; see for example
            | https://clang.llvm.org/docs/LanguageExtensions.html#vectors-....
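            | 
            | A minimal sketch of those vector extensions (assuming a
            | recent gcc or clang; the typedef name is illustrative):
            | 
            |     /* Element-wise '+' works on these types without any
            |        operator overloading in the language itself. */
            |     typedef float float4 __attribute__((vector_size(16)));
            | 
            |     float4 add4(float4 a, float4 b)
            |     {
            |         return a + b;
            |     }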
        
             | uecker wrote:
              | I am not sure what you mean by "weak" type system. C's type
              | system is not weak IMHO and also differentiates between
              | arrays and pointers:
              | 
              |     _Generic(typeof(x), int*: 1, int[]: 2)
              | 
              | _Generic with a type argument is a new feature; it is a bit
              | less elegant without it:
              | 
              |     _Generic((typeof(x)*)0,
              |              typeof(int*)*: 1, typeof(int[])*: 2)
        
               | glouwbug wrote:
               | Any pointer converts to void* without warning. That's
               | pretty weak.
        
               | ta3401 wrote:
               | Perhaps not in the sense of weak vs strong, but rather in
                | weak vs powerful. C's type system isn't very powerful, if
               | you compare it to languages from the scientific community
               | (Python is an exception here imho). It is simple and
               | probably good enough for what it was designed for (system
               | programming), but it's not very good for scientific
               | computing.
        
           | uecker wrote:
           | You can definitely implement generic algorithms in C and
           | there are different ways to do this. For example, you create
           | an abstract data type "ring" with addition, multiplication,
           | and multiplication with a scalar, and pass objects to this
           | type and functions for these three operations to a generic
           | exponential function.
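            | 
            | A minimal sketch of that idea (the struct, the names, and the
            | truncated Taylor series are just illustrative):
            | 
            |     #include <stddef.h>
            |     #include <string.h>
            | 
            |     /* A "ring" described by function pointers; elements are
            |        opaque blobs of r->size bytes. */
            |     typedef struct {
            |         size_t size;
            |         void (*set_one)(void *out);
            |         void (*add)(void *out, const void *a, const void *b);
            |         void (*mul)(void *out, const void *a, const void *b);
            |         void (*scale)(void *out, double s, const void *a);
            |     } ring;
            | 
            |     /* Generic exp via a truncated series sum of x^k / k!.
            |        Assumes add/scale tolerate out aliasing an input. */
            |     void ring_exp(const ring *r, void *out, const void *x,
            |                   int terms)
            |     {
            |         unsigned char pw[r->size], term[r->size];
            |         r->set_one(out);
            |         r->set_one(pw);
            |         double fact = 1.0;
            |         for (int k = 1; k <= terms; k++) {
            |             r->mul(term, pw, x);   /* x^k */
            |             memcpy(pw, term, r->size);
            |             fact *= k;
            |             r->scale(term, 1.0 / fact, term);
            |             r->add(out, out, term);
            |         }
            |     }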
        
         | ta3401 wrote:
         | > But it then talks about how exp is implemented in hardware,
         | in a way that forces it to be a non-SIMD calculation.
         | 
         | The fact that exp is implemented in hardware is not the
         | argument. The argument is that exp is a library function,
         | compiled separately, and thus the compiler cannot inline the
         | function and fuse it with an array-wide loop, to detect later
         | on an opportunity to generate the SIMD instructions.
         | 
         | It is true however that exp is implemented in hardware in the
         | X86 world, and to be fair, perhaps a C compiler only needs to
         | represent that function as an intrinsic instead, to give itself
         | a chance to later replace it either with the function call or
         | some SIMD instruction; but I guess that the standard doesn't
         | provide that possibility?
        
       | notorandit wrote:
        | C was invented when CPUs, the few that were available, were
        | single-core and single-threaded.
        | 
        | Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use
        | (inline) assembler functions. Or design your own language.
        
         | LiamPowell wrote:
         | That's missing the point of the article, but it's also not
         | true, at least not at the time of C89. This can be easily
         | verified by a Google search. As an aside, Ada also had multi-
         | threading support before C89 was released, but this article is
         | about SIMD, not multi-threading.
        
           | johnnyjeans wrote:
              | C89 is 19 years into C's life
        
             | LiamPowell wrote:
             | Yes, but before standardisation there were many
             | implementations all doing different things based on a book
             | with no particularly formal specifications, so picking 1989
             | makes more sense than 1970 in my opinion. Multi-threading
             | also existed in 1970 but wasn't as widespread.
        
         | dragontamer wrote:
          | Fortran is older, but its array manipulation maps naturally to
          | SIMD.
        
       | mlochbaum wrote:
       | Several comments seem confused about this point: the article is
       | not about manual SIMD, which of course C is perfectly capable of
       | with intrinsics. It's discussing problems in compiling
       | architecture-independent code to SIMD instructions, which in
       | practice C compilers very often fail to do (so, exp having scalar
       | hardware support may force _other_ arithmetic not to be
       | vectorized). An alternative mentioned is array programming, where
       | operations are run one at a time on all the data; these languages
       | serve as a proof of concept that useful programs can be run in a
        | way that uses SIMD nearly all the time it's applicable, but at
       | the cost of having to write every intermediate result to memory.
       | So the hope is that "more fluent methods for compilation" can
       | generate SIMD code without losing the advantages of scalar
       | compilation.
       | 
       | As an array implementer I've thought about the issue a lot and
       | have been meaning to write a full page on it. For now I have some
       | comments at
       | https://mlochbaum.github.io/BQN/implementation/versusc.html#...
       | and the last paragraph of
       | https://mlochbaum.github.io/BQN/implementation/compile/intro....
        
         | mlochbaum wrote:
          | Got the introduction written at
          | https://mlochbaum.github.io/BQN/implementation/compile/fusio....
        
         | brundolf wrote:
         | Thank you. The title is confusing
        
         | mlochbaum wrote:
         | Finding it increasingly funny how many people have come out of
          | the woodwork to "defend" C by offering this or that method of
         | explicit vectorization. What an indictment! C programmers can
         | no longer even conceive of a language that works from a
         | description of what should be done and not which instructions
         | should be used!
        
       | jiehong wrote:
       | It seems that Zig is well suited for writing SIMD [0].
       | 
        | If only GPU makers could standardise an extended ISA like AVX on
        | CPUs, and we could all run SIMD or SIMT code without needing any
        | libraries, just our compilers.
       | 
       | [0]: https://zig.guide/language-basics/vectors/
        
         | dundarious wrote:
         | I like zig, but no, what it provides pales in comparison to
         | what intrinsics offer.
        
       | camel-cdr wrote:
        | This depends entirely on compiler support. Intel's ICX compiler
        | can easily vectorize a sigmoid loop by calling SVML's vectorized
        | expf function: https://godbolt.org/z/no6zhYGK6
        | 
        | If you implement a scalar expf in a vectorizer-friendly way, and
        | it's visible to the compiler, then it could also be vectorized:
       | https://godbolt.org/z/zxTn8hbEe
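        | 
        | For reference, the loop in question is roughly this shape (a
        | sketch, not the exact code from the godbolt links):
        | 
        |     #include <math.h>
        | 
        |     /* Element-wise sigmoid; a vectorizing compiler paired with a
        |        vector math library (e.g. SVML) can swap expf for a SIMD
        |        variant and vectorize the whole loop. */
        |     void sigmoid(float *dst, const float *src, int n)
        |     {
        |         for (int i = 0; i < n; i++)
        |             dst[i] = 1.0f / (1.0f + expf(-src[i]));
        |     }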
        
         | dzaima wrote:
         | gcc and clang are also capable of it, given certain compiler
         | flags: https://godbolt.org/z/z766hc64n
        
           | camel-cdr wrote:
           | Thanks, I didn't know about this. Interesting that it seems
           | to require fast-math.
        
             | ComputerGuru wrote:
             | That means it'll never be used. ffast-math is verboten in
             | most serious codebases.
        
             | dzaima wrote:
             | Seems "-fno-math-errno" is enough for clang. gcc needs a
             | whole "-fno-math-errno -funsafe-math-optimizations
             | -ffinite-math-only".
        
               | Avamander wrote:
               | So is the optimization "wrong" or "unsafe" even in the
               | case of Intel's ICX compiler? Is it that you can't
               | express the right (error) semantics in C?
               | 
               | I'm just wondering why those two require the flag and the
               | other doesn't.
        
               | dzaima wrote:
               | ICX appears to just default to fast-math:
               | https://godbolt.org/z/jzPazGjoh
               | 
               | Requiring -fno-math-errno is sane enough, essentially
                | no one needs math errno anyway (and that flag is needed to
               | vectorize even sqrt, for which there's a proper full
               | hardware SIMD instruction, but which obviously doesn't
               | set errno on a negative input or whatever).
               | 
               | Probably depends on the vector math library used whether
               | it handles near-infinity or inf/NaN values properly. And
               | there's also the potential concern that the scalar and
               | vector exp() likely give different results, leading to
               | weird behavior, which might be justification for
               | -funsafe-math-optimizations.
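                | 
                | A minimal sketch of the sqrt case (assuming gcc or clang
                | at -O3):
                | 
                |     #include <math.h>
                | 
                |     /* With -fno-math-errno this loop can compile down to
                |        sqrtps/sqrtpd; without it the compiler has to keep
                |        the libm call so errno is set on negative input. */
                |     void sqrt_all(double *x, int n)
                |     {
                |         for (int i = 0; i < n; i++)
                |             x[i] = sqrt(x[i]);
                |     }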
        
         | bee_rider wrote:
         | I'm not sure what "suited to SIMD" means exactly in this
         | context. I mean, it is clearly possible for a compiler to apply
         | some SIMD optimizations. But the program is essentially
         | expressed as a sequential thing, and then the compiler
         | discovers the SIMD potential. Of course, we write programs that
         | we hope will make it easy to discover that potential. But it
         | can be difficult to reason about how a compiler is going to
         | optimize, for anything other than a simple loop.
        
           | camel-cdr wrote:
            | Suited for SIMD means you write the scalar equivalent of what
            | you'd do on a single element in a SIMD implementation.
            | 
            | E.g. you avoid lookup tables when you can, or only use
            | smaller ones you know to fit in one or two SIMD registers.
            | gcc and clang can't vectorize it as is, but they do if you
            | remove the branches that handle infinity and over/underflow.
            | 
            | In the godbolt link I copied the musl expf implementation and
            | icx was able to vectorize it, even though it uses a LUT too
            | large for SIMD registers.
           | 
           | #pragma omp simd and equivalents will encourage the compiler
           | to vectorize a specific loop and produce a warning if a loop
           | isn't vectorized.
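            | 
            | A minimal sketch (assuming -fopenmp-simd or -fopenmp; the
            | function is just illustrative):
            | 
            |     /* Asks the compiler to vectorize this loop; whether a
            |        missed vectorization is reported depends on flags like
            |        -fopt-info-vec-missed (gcc) or
            |        -Rpass-missed=loop-vectorize (clang). */
            |     void axpy(float *restrict y, const float *restrict x,
            |               float a, int n)
            |     {
            |         #pragma omp simd
            |         for (int i = 0; i < n; i++)
            |             y[i] = a * x[i] + y[i];
            |     }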
        
             | bee_rider wrote:
             | I shouldn't have started my comment with the sort of
             | implied question or note of confusion. Sorry, that was
             | unclear communication.
             | 
             | I agree that it is possible to write some C programs that
             | some compilers will be able to discover the parallel
             | potential of. But it isn't very ergonomic or dependable.
             | So, I think this is not a strong counter-argument to the
             | theory of the blog post. It is possible to write SIMD
             | friendly C, but often it is easier for the programmer to
             | fall back to intrinsics to express their intent.
        
           | sifar wrote:
           | It means auto-vectorization. Write scalar code that can be
            | automatically vectorized by the compiler using SIMD
            | instructions.
        
       | Vosporos wrote:
       | Woo, very happy to see more posts by McHale these days!
        
       | Const-me wrote:
        | I would rather conclude that automatic vectorizers are still less
        | than ideal, despite SIMD instructions having been widely available
        | in commodity processors for 25 years now.
       | 
       | The language is actually great with SIMD, you just have to do it
       | yourself with intrinsics, or use libraries. BTW, here's a library
       | which implements 4-wide vectorized exponent functions for FP32
       | precision on top of SSE, AVX and NEON SIMD intrinsics (MIT
       | license):
       | https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
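        | 
        | For comparison, manual vectorization with intrinsics looks
        | roughly like this (a sketch assuming SSE, n a multiple of 4 and
        | 16-byte-aligned pointers; the function name is illustrative):
        | 
        |     #include <immintrin.h>
        | 
        |     /* dst[i] += a[i] * b[i], four floats at a time. */
        |     void madd4(float *dst, const float *a, const float *b, int n)
        |     {
        |         for (int i = 0; i < n; i += 4) {
        |             __m128 va = _mm_load_ps(a + i);
        |             __m128 vb = _mm_load_ps(b + i);
        |             __m128 vd = _mm_load_ps(dst + i);
        |             _mm_store_ps(dst + i,
        |                          _mm_add_ps(vd, _mm_mul_ps(va, vb)));
        |         }
        |     }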
        
         | nine_k wrote:
          | Why does FORTRAN, which is older than C, have no trouble with
          | auto-vectorization?
         | 
         | Mostly because of a very different memory model, of course.
         | Arrays are first-class things, not hacks on top of pointers,
         | and can't alias each other.
        
           | MisterTea wrote:
           | > Arrays are first-class things, not hacks on top of
           | pointers, and can't alias each other.
           | 
           | "Hacks" feels like the wrong term to use here. Comparing
           | managed objects surrounded by runtime to memory isn't a fair
           | comparison.
        
             | nine_k wrote:
             | There's no language-level difference between int[] and int*
                | in C. Indexing is just pointer arithmetic. No support for
             | (non-jagged) multidimensional arrays either.
             | 
             | A proper array would at least know its length. It does not
                | require a lot of runtime to do, and most C code requires
                | libc anyway for basic stuff like malloc().
        
               | jjmarr wrote:
               | This isn't true. int[] decays into an int* but is a
               | different type.
               | 
               | An array member of a struct will have its data allocated
               | contiguously with other struct members (subject to
               | compiler padding). An int* would point outside the
               | struct. This is possible even with variable length
               | arrays. Specifically, you can declare the final member of
               | a struct as an int[] which makes it a "flexible array
               | member". Then you have to malloc the struct to be the
               | sizeof the struct + whatever size you want the array to
               | be.
               | 
               | https://en.wikipedia.org/wiki/Flexible_array_member
               | 
               | This is used in struct sockaddr for generic APIs where
               | the actual struct can be of different sizes.
               | 
               | https://lwn.net/Articles/997094/
               | 
               | This is rather pedantic but the memory model for arrays
                | _is_ important here. If you're iterating over multiple
               | struct members that are arrays, the fact the arrays are
               | stored contiguously in memory is going to matter to SIMD.
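                | 
                | A minimal sketch of a flexible array member (the names
                | are illustrative):
                | 
                |     #include <stdlib.h>
                | 
                |     struct buf {
                |         size_t len;
                |         int data[];   /* must be the last member */
                |     };
                | 
                |     /* One allocation holds the header and the payload,
                |        laid out contiguously. */
                |     struct buf *buf_new(size_t n)
                |     {
                |         struct buf *b = malloc(sizeof *b
                |                                + n * sizeof b->data[0]);
                |         if (b) b->len = n;
                |         return b;
                |     }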
        
               | nine_k wrote:
               | This is fair, thanks!
        
               | Gibbon1 wrote:
                | The C language spec isn't what's important here; it's
                | that the ABI doesn't know the difference.
        
               | jjmarr wrote:
                | If using an int* over an int[] changes the memory layout
               | of a struct, that necessarily implies a difference in the
               | ABI.
               | 
               | As an example, C++'s std::array can be implemented as a
               | struct containing a fixed-size C-style array. This can be
               | passed-by-value. This means that returning a std::array
               | can be more performant than returning a std::vector
               | (which might be implemented as an int* that is
               | reallocated when you add too many elements), because a
               | std::array is returned on the stack while a std::vector
               | is returned on the heap.
               | 
               | I was bit by this once when returning a
               | std::unordered_map because I was doing a heap-allocation
               | in a performance-critical section.
        
               | MisterTea wrote:
               | > There's no language-level difference between int[] and
               | int* in C
               | 
               | sizeof(int[n]) vs sizeof(int*). Then there's: 'int
               | getint(int *I){return *I;}' which wont let me pass
               | '&int[n]' but I can 'int* = int[n]' then pass &int*. It's
               | not so much a hack but a very primitive implementation on
               | top of pointers (alright maybe it is a bit hacky.)
               | 
               | > A proper array would at least know its length. It does
               | not require a lot of runtime to do, and most C code
               | require libc anyway for basic stuff like malloc().
               | 
                | Something like a fat pointer 'typedef struct { long len;
                | void *data; } MyArray;' then make a libMyArray to operate
                | on MyArray objects. But now you have a libMyArray
                | dependency. You also lose type information unless you
                | add fields to MyArray with an enum or have a MyArray for
                | each type, which gets messy. Then you can't do shit like
                | just use an identifier in a conditional, e.g.
                | 'while(MyArray[n++])', unless you change how C works,
                | which makes it notC (uh, maybe pick another name.)
               | 
               | I feel that C is fine as it's an old machine oriented
               | language. Leave it alone. I would look to reach for
               | another tool when the primitive design of C makes life
               | difficult.
        
           | Const-me wrote:
            | I believe it's the ecosystem. For example, Python is a very
            | high-level language; I'm not sure it even has a memory model.
            | But it
           | has libraries like NumPy which support all these vectorized
           | exponents when processing long arrays of numbers.
        
             | goku12 wrote:
             | Everything that's heavily vectorized in the Python
              | ecosystem, including NumPy, achieves it using optimized
              | backend code written in other languages - Fortran in
             | particular. Python is only a thin veneer over those
             | backends. In fact, you're constantly reminded to offload
             | the control flow as much as possible to those backends for
             | the sake of performance, instead of doing things like
             | looping in Python. If that's enough to consider Python to
             | be good in vectorization, I can just link high performance
             | fortran libraries with C, handle non-vector control flow
             | from there and call it a day. I guarantee you that this
             | arrangement will be far more performant than what Python
             | can ever achieve. I have to strongly agree with the other
             | commenter's observation that the memory model is the key to
             | vector performance.
             | 
             | And of course Python has a memory model. While that model
             | is not as well understood as C's model, it is the key to
             | Python's success and popularity as a generic programming
             | language and as a numeric/scientific programming language.
             | Python's memory model unlike C's or Fortran's, isn't
             | designed for high performance. It's designed for rich
             | abstractions, high ergonomics and high interoperability
             | with those performant languages. For most people, the
             | processing time lost executing python code is an acceptable
             | tradeoff for the highly expressive control that Python
             | gives them over the scheduling of lower level operations.
        
               | ilayn wrote:
               | NumPy has no Fortran code, for quite a long time now.
               | SciPy has and it is being rewritten. What you mention is
               | the ufunc machinery underneath which is all C. NumPy also
               | has SIMD support (albeit limited to certain functions).
               | BLAS is also C/Assembly but only LAPACK is F77 (which is
               | too much code to be rewritten).
               | 
               | This does not mean Fortran is bad (obligatory disclaimer
               | for Fortran fans).
        
           | nitwit005 wrote:
           | When I tested this before (some years ago), adding "restrict"
           | to the C pointers resulted in the same behavior, so the
            | aliasing default seems to be the key issue.
        
         | Keyframe wrote:
         | isn't DirectXMath C++, not C?
        
           | npalli wrote:
           | Yes, it is in C++.
        
           | Const-me wrote:
           | Yeah, but if you need C versions of low-level SIMD algorithms
           | from there, copy-paste the implementation and it will
           | probably compile as C code. That's why I mentioned the MIT
           | license: copy-pasting from GPL libraries may or may not work
           | depending on the project, but the MIT license is copy-paste
           | friendly.
        
       | svilen_dobrev wrote:
       | reminds me of this old article.. ~2009, Sun/MIT - "The Future is
       | Parallel" .. and "sequential decomposition and usual math are
       | wrong horse.."
       | 
       | https://ocw.mit.edu/courses/6-945-adventures-in-advanced-sym...
        
       | npalli wrote:
       | C++ does OK.
       | 
       | Google's highway [1]
       | 
       | Microsoft DirectXMath [2]
       | 
       | [1] https://github.com/google/highway
       | 
       | [2] https://github.com/microsoft/DirectXMath
        
       | commandlinefan wrote:
        | This seems more of a generalization of Richard Stevens's
       | observation:
       | 
       | "Modularity is the enemy of performance"
       | 
       | If you want optimal performance, you have to collapse the layers.
       | Look at Deepseek, for example.
        
       | vkaku wrote:
        | I also think that vectorizers and compilers can detect parallel
        | memory adds/moves/subs; without that, many people do not take
        | the time to provide adequate hints to the compiler about it.
       | 
       | Some people have vectorized successfully with C, even with all
       | the hacks/pointers/union/opaque business. It requires careful
       | programming, for sure. The ffmpeg cases are super good examples
       | of how compiler misses happen, and how to optimize for full
       | throughput in those cases. Worth a look for all compiler
       | engineers.
        
       ___________________________________________________________________
       (page generated 2025-01-27 23:01 UTC)