[HN Gopher] Do low-level optimizations matter? Faster quicksort ...
       ___________________________________________________________________
        
       Do low-level optimizations matter? Faster quicksort with cmov
       (2020)
        
       Author : fanf2
       Score  : 23 points
       Date   : 2024-08-21 20:42 UTC (2 hours ago)
        
 (HTM) web link (cantrip.org)
 (TXT) w3m dump (cantrip.org)
        
       | mbroncano wrote:
       | (2020)
        
       | karmakaze wrote:
       | I would expect eliminating branches in a busy inner loop to
       | matter.
       | 
       | The interesting part is how that was done:
       | 
       | > A New Primitive, swap_if
       | 
       | > How can we use this method in our sort? First, let us make a
       | swap_if:                 inline bool swap_if(bool c, int& a, int&
       | b) {         int ta = a, mask = -c;  // false -> 0, true ->
       | 111..111         a = (b & mask) | (ta & ~mask);         b = (ta &
       | mask) | (b & ~mask);         return c;       }
       | 
       | > In our partition function, then, we can transform
       | if (*right <= pivot) {           int tmp = *left; *left = *right,
       | *right = tmp;           ++left;         }
       | 
       | > into just                   left += swap_if(*right <= pivot,
       | *left, *right);
        
       | memset wrote:
       | Fascinating! How does one learn to write C code that will make
       | use of specific asm instructions?
       | 
       | SIMDjson is another example that comes to mind. The conceit of C
       | is that you do t have control over the underlying machine
       | instructions without inlining it yourself. So how do people write
       | C code with cpu optimizations like this?
        
         | anonymoushn wrote:
         | Use inline assembly or intrinsics. Read the code of stuff that
         | does these things.
         | 
         | Resources sufficient for implementing simdjson or similar
         | vectorized parsers:
         | https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
         | https://lemire.me/blog/ http://0x80.pl/articles/
        
         | addaon wrote:
         | There are three approaches:
         | 
         | 1) Use intrinsics, if your platform provides intrinsics for the
         | instructions you want to use.
         | 
         | 2) Use inline assembly (which I suppose doesn't technically
         | count as writing C code, but is very much part of the story).
         | 
         | 3) Write C code carefully against a chosen compiler with chosen
         | optimization flags, and inspect the assembly. Especially for
         | small chunks of code, a tool like godbolt.org is indispensable.
         | Basically, you're writing the assembly you want the compiler to
         | generate (either explicitly, or just in your head), then
         | writing C code that seems likely to generate that sequence of
         | instructions, then tweaking the code until the output you want
         | is generated. If this is a super important optimization, it's
         | also reasonable to add a build step that inspects the generated
         | assembly and fails the build if it doesn't match the desired
         | pattern; but in that case, writing inline assembly is usually
         | easier.
        
           | vlovich123 wrote:
           | Option 4 is sometimes compiler annotations like
           | __builtin_expect_with_probability which you could use to
           | assign a branch a value of 50% which should coerce it to pick
           | cmov (same risk as 3 though in that you need to inspect
           | compiler output).
        
         | adelpozo wrote:
         | I would say it is a moving target if the goal is to write C
         | code that compiles to a seemingly magical cpu instruction. As
         | other have pointed out, learning assembly and compiler explorer
         | are useful things.
         | 
         | As a way to show the wonderful complexities of compilers check
         | the TOC of https://shop.elsevier.com/books/optimizing-
         | compilers-for-mod...
        
       | kimixa wrote:
       | > Back in 2000, AMD included cmov in its 64-bit x86 ISA
       | extensions. Then, Intel had to adopt them when Itanium flopped.
       | 
       | Wasn't "cmov" one of the things added for the pentium pro? So it
       | wasn't instruction compatible - hence the "i686" prefix to a lot
       | of compiler triples?
        
       ___________________________________________________________________
       (page generated 2024-08-21 23:00 UTC)