[HN Gopher] Counting bytes faster than you'd think possible
       ___________________________________________________________________
        
       Counting bytes faster than you'd think possible
        
       Author : asicsp
       Score  : 69 points
       Date   : 2024-07-23 09:34 UTC (3 days ago)
        
 (HTM) web link (blog.mattstuchlik.com)
 (TXT) w3m dump (blog.mattstuchlik.com)
        
       | rini17 wrote:
       | Can this optimization be applied to matmult for us, critters who
       | are running llama on cpu? XD
        
         | twoodfin wrote:
         | I don't think this really helps: It's a trick to drive maximum
         | possible memory bandwidth in service of a single executing
         | thread which can process data as fast as it's being delivered.
         | 
         | A parallel matrix multiply running across every core shouldn't
         | have any trouble maximizing memory bandwidth utilization.
        
           | owlbite wrote:
           | KV-cache based LLM inference is normally significantly memory
           | bound on the matrix-vector multiply. This is (part of) why
           | the quantization-based approaches are so popular.
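            | 
            | Back-of-envelope, assuming a hypothetical 7B-parameter
            | model: each generated token streams every weight once, so
            | fp16 weights mean ~14 GB read per token, while 4-bit
            | quantization cuts that to ~3.5 GB - a 4x win if you're
            | purely bandwidth bound.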
        
             | twoodfin wrote:
             | It's memory bound _and_ single threaded?
        
               | rini17 wrote:
                | I found it memory bound: it was fastest with 6
                | threads on my 8-thread Xeon.
        
               | twoodfin wrote:
               | Right: So this trick probably doesn't help much. You only
               | need prefetch when there isn't enough fetch otherwise.
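                | 
                | A minimal sketch of what explicit prefetch looks like
                | in that single-threaded setting (the 512-byte lookahead
                | is an arbitrary illustration, not a tuned value):
                | 
                |     #include <stddef.h>
                |     #include <stdint.h>
                | 
                |     uint64_t count127(const uint8_t *buf, size_t n) {
                |       uint64_t count = 0;
                |       for (size_t i = 0; i < n; ++i) {
                |         // Touch a line ~8 ahead, once per 64-byte line,
                |         // so loads overlap the compare/accumulate work.
                |         if ((i & 63) == 0)
                |           __builtin_prefetch(buf + i + 512);
                |         count += (buf[i] == 127);
                |       }
                |       return count;
                |     }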
        
       | anonymoushn wrote:
        | My own solution, which is ~1ms faster, uses another pattern
        | that was found experimentally, but I can't seem to get it to
        | go any faster by tuning the parameters, and the #1 spot
        | remains slightly out of reach.
       | 
        | Alexander Monakov has called the attention of the highload
        | Telegram chat to this paper[0], saying:
        | 
        |     Haswell is tricky for memory bw tuning, as even at fixed
        |     core frequency, uncore frequency is not fixed, and depends
        |     on factors such as hardware-measured stall cycles:
        | 
        |     > According to the respective patent [15], the uncore
        |     > frequency depends on the stall cycles of the cores, the
        |     > EPB of the cores, and c-states
        |     > ... uncore frequencies, in addition to EPB and stall
        |     > cycles, depend on the core frequency of the fastest
        |     > active core on the system. Moreover, the uncore
        |     > frequency is also a target of power limitations.
       | 
        | So one wonders if it's not really a matter of reading RAM in
        | the right pattern to appease the prefetchers, but of using
        | values in a pattern that creates the right stalls to drive
        | the uncore to its highest frequency.
       | 
       | [0]: https://tu-
       | dresden.de/zih/forschung/ressourcen/dateien/proje...
        
       | dinobones wrote:
       | Is there a path forward for compilers to eek out these
       | optimization gains eventually? Is there even a path?
       | 
        | 550x gains with some C++ mixed with gnarly low-level assembly
        | vs standard C++ is pretty shocking to me.
        
         | vient wrote:
          | Note that the "standard C++" solution uses std::cin while
          | the optimized one uses mmap - completely different things,
          | and a lot of the speed comes just from that. It would've
          | been nice to compare against a solution with optimized
          | input but an otherwise standard summing loop.
        
           | anonymoushn wrote:
            | For a solution that contains this stuff:
            | 
            |     const u8 *file_lo;
            |     file_lo = (const u8 *)mmap(0, 250000000ull, PROT_READ,
            |                                MAP_PRIVATE | MAP_POPULATE, 0, 0);
            |     const u8 *file_hi = file_lo + 250000000ull;
            |     u64 count = 0;
            |     while (file_lo < file_hi) {
            |       if (*file_lo == 127) {
            |         ++count;
            |       }
            |       file_lo++;
            |     }
           | 
           | I got a bit under 54ms. The solution in the article runs in a
           | bit under 16ms.
        
             | vient wrote:
              | Nice. Did some quick tests with your code on the site
              | and got a score of ~34000 - the best solution is around
              | 14700, so this one is only 2.3 times slower.
              | 
              | Used clang with -Ofast -march=native -static. Funnily,
              | gcc gets only 54000 with the same options, 1.6 times
              | slower.
        
               | vient wrote:
                | Wow, changing the type of `count` from uint64_t to
                | uint32_t or int radically changes the results - now
                | gcc gets 26500 and clang gets 25000, just 1.7 times
                | slower than the current best solution. Presumably the
                | narrower accumulator autovectorizes better.
                | 
                | So you can get 25k with the following code (clang
                | -Ofast -std=c++17 -march=native -static):
                |     #include <iostream>
                |     #include <cstdint>
                |     #include <sys/mman.h>
                |     #include <unistd.h>
                | 
                |     int main() {
                |       auto file_lo = (const uint8_t *)mmap(
                |           0, 250000000ull, PROT_READ,
                |           MAP_PRIVATE | MAP_POPULATE, STDIN_FILENO, 0);
                |       int count = 0;
                |       for (uint32_t i = 0; i < 250000000; ++i) {
                |         if (file_lo[i] == 127) {
                |           ++count;
                |         }
                |       }
                |       std::cout << count << std::endl;
                |       _exit(0);
                |       return 0;
                |     }
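                | 
                | For reference, the explicitly vectorized shape of the
                | same loop is roughly the sketch below (assumes AVX2;
                | the function name is made up and the tail is scalar):
                | 
                |     #include <immintrin.h>
                |     #include <stddef.h>
                |     #include <stdint.h>
                | 
                |     // cmpeq yields 0xFF (-1) per matching byte, so
                |     // subtracting it adds 1 to each byte-wide counter;
                |     // _mm256_sad_epu8 folds the byte counters into
                |     // 64-bit lanes before they can overflow at 255.
                |     uint64_t count127_avx2(const uint8_t *buf, size_t n) {
                |       const __m256i needle = _mm256_set1_epi8(127);
                |       const __m256i zero = _mm256_setzero_si256();
                |       __m256i wide = zero;
                |       size_t i = 0, vec_end = n & ~(size_t)31;
                |       while (i < vec_end) {
                |         size_t block_end = i + 255 * 32;
                |         if (block_end > vec_end) block_end = vec_end;
                |         __m256i acc = zero;
                |         for (; i < block_end; i += 32) {
                |           __m256i v =
                |               _mm256_loadu_si256((const __m256i *)(buf + i));
                |           acc = _mm256_sub_epi8(acc,
                |               _mm256_cmpeq_epi8(v, needle));
                |         }
                |         wide = _mm256_add_epi64(wide,
                |             _mm256_sad_epu8(acc, zero));
                |       }
                |       uint64_t lanes[4];
                |       _mm256_storeu_si256((__m256i *)lanes, wide);
                |       uint64_t count =
                |           lanes[0] + lanes[1] + lanes[2] + lanes[3];
                |       for (; i < n; ++i)  // scalar tail
                |         count += (buf[i] == 127);
                |       return count;
                |     }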
        
               | sYnfo wrote:
                | Neat! I'll add the best solution without explicit
                | SIMD/asm in this thread to the post after I wake up;
                | it's a great data point.
        
         | monktastic1 wrote:
         | FYI, "eke." "Eek" is an expression of alarm or surprise.
        
       | lumb63 wrote:
       | Does anyone have any tips for similar wizardry-level SIMD
       | optimization on ARM?
        
         | anonymoushn wrote:
          | If you learn AVX2 programming via highload, my impression is
          | that NEON is quite similar. The main difference is the lack
          | of movemask; you can read these[0] articles[1] about what to
          | do instead.
          | 
          | As for SVE, prior to very recent versions there was no tblq
          | (pshufb equivalent), so I didn't have much hope for using it
          | for general-purpose programming, though of course it would
          | be fine for stuff like TFA.
         | 
         | [0]: https://community.arm.com/arm-community-
         | blogs/b/infrastructu...
         | 
         | [1]: https://www.corsix.org/content/whirlwind-tour-
         | aarch64-vector...
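          | 
          | The core of the movemask substitute those articles describe
          | is the narrowing-shift trick. A minimal sketch (helper names
          | are made up), applied to TFA's count-bytes-equal-to-127
          | problem:
          | 
          |     #include <arm_neon.h>
          |     #include <stdint.h>
          | 
          |     // vshrn turns the 16 compare bytes (0xFF/0x00) into a
          |     // 64-bit value with one nibble (0xF/0x0) per input byte.
          |     static inline uint64_t neon_nibble_mask(uint8x16_t eq) {
          |       uint8x8_t narrowed =
          |           vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
          |       return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
          |     }
          | 
          |     // Count matches in one 16-byte chunk: each match
          |     // contributes a 0xF nibble, so popcount/4 is the count.
          |     static inline int count127_chunk(const uint8_t *p) {
          |       uint8x16_t eq = vceqq_u8(vld1q_u8(p), vdupq_n_u8(127));
          |       return __builtin_popcountll(neon_nibble_mask(eq)) / 4;
          |     }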
        
         | TinkersW wrote:
          | SIMD isn't as wide on ARM, so the gains will be smaller
          | (most ARM cores have only 128-bit wide NEON).
        
       | sYnfo wrote:
        | FYI vient [0] figured out that simply compiling with "-static
        | -fno-pie" and _exit(0)-ing at the end puts the solution
        | presented here at 15000 points and hence #4 on the
        | leaderboard. Pretty funny.
       | 
       | [0] https://news.ycombinator.com/user?id=vient
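        | 
        | i.e. something like (the file name here is hypothetical):
        | 
        |     clang++ -Ofast -std=c++17 -march=native -static -fno-pie \
        |         solution.cpp -o solution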
        
       | TacticalCoder wrote:
        | Let me hazard a guess: that blog post was _not_ written by an
        | LLM!?
        
       ___________________________________________________________________
       (page generated 2024-07-26 23:05 UTC)