[HN Gopher] Counting bytes faster than you'd think possible
___________________________________________________________________
Counting bytes faster than you'd think possible
Author : asicsp
Score : 69 points
Date : 2024-07-23 09:34 UTC (3 days ago)
(HTM) web link (blog.mattstuchlik.com)
(TXT) w3m dump (blog.mattstuchlik.com)
| rini17 wrote:
| Can this optimization be applied to matmult for us critters who
| are running llama on CPU? XD
| twoodfin wrote:
| I don't think this really helps: It's a trick to drive maximum
| possible memory bandwidth in service of a single executing
| thread which can process data as fast as it's being delivered.
|
| A parallel matrix multiply running across every core shouldn't
| have any trouble maximizing memory bandwidth utilization.
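|
| (Illustrative sketch of the kind of explicit prefetch trick
| being discussed, not the article's actual code - the function
| name, distance constant, and hint are guesses and would need
| tuning per machine:)
|
|     #include <immintrin.h>
|     #include <cstdint>
|     #include <cstddef>
|
|     // Count bytes equal to 127, issuing one software prefetch
|     // per 64-byte cache line a fixed distance ahead, so future
|     // lines are already in flight while current ones are used.
|     uint64_t count_with_prefetch(const uint8_t *data, size_t n) {
|         constexpr size_t kDistance = 4096; // machine-dependent
|         uint64_t count = 0;
|         for (size_t i = 0; i < n; ++i) {
|             if (i % 64 == 0 && i + kDistance < n)
|                 _mm_prefetch((const char *)(data + i + kDistance),
|                              _MM_HINT_NTA);
|             count += (data[i] == 127);
|         }
|         return count;
|     }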
| owlbite wrote:
| KV-cache based LLM inference is normally significantly memory
| bound on the matrix-vector multiply. This is (part of) why
| the quantization-based approaches are so popular.
| twoodfin wrote:
| It's memory bound _and_ single threaded?
| rini17 wrote:
| I found it so memory bound that it was fastest with 6
| threads on my 8-thread Xeon.
| twoodfin wrote:
| Right: So this trick probably doesn't help much. You only
| need prefetch when there isn't enough fetch otherwise.
| anonymoushn wrote:
| My own solution, which is ~1ms faster, uses some other pattern
| found experimentally, but I can't seem to get it to go any
| faster by tuning the parameters, and the #1 spot remains
| slightly out of reach.
|
| Alexander Monakov has called the attention of the highload
| Telegram chat to this paper [0], saying: Haswell is tricky for
| memory bw tuning, as even at fixed core frequency, uncore
| frequency is not fixed, and depends on factors such as
| hardware-measured stall cycles:
|
| > According to the respective patent [15], the uncore frequency
| depends on the stall cycles of the cores, the EPB of the cores,
| and c-states
|
| > ... uncore frequencies (in addition to EPB and stall cycles)
| depend on the core frequency of the fastest active core on the
| system. Moreover, the uncore frequency is also a target of power
| limitations.
|
| So one wonders whether the win isn't really about reading RAM
| in the right pattern to appease the prefetchers, but about
| consuming the values in a pattern that produces the right stall
| behavior, driving the uncore to its highest frequency.
|
| [0]: https://tu-
| dresden.de/zih/forschung/ressourcen/dateien/proje...
| dinobones wrote:
| Is there a path forward for compilers to eek out these
| optimization gains eventually? Is there even a path?
|
| 550x gains with some C++ mixed with gnarly low-level assembly vs
| standard C++ is pretty shocking to me.
| vient wrote:
| Note that the "standard C++" solution uses std::cin while the
| optimized one uses mmap - completely different things, and a lot
| of the speed comes just from that. It would've been nice to
| compare against a solution with optimized input but an otherwise
| standard summing loop.
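|
| (For reference, a minimal sketch of the kind of std::cin-based
| baseline being described - an assumption on my part, not the
| contest's actual reference code:)
|
|     #include <iostream>
|     #include <cstdint>
|
|     // Read stdin in 64 KiB chunks and count bytes equal to 127.
|     int main() {
|         std::ios::sync_with_stdio(false);
|         uint64_t count = 0;
|         char buf[1 << 16];
|         while (std::cin.read(buf, sizeof(buf)) ||
|                std::cin.gcount() > 0) {
|             std::streamsize got = std::cin.gcount();
|             for (std::streamsize i = 0; i < got; ++i)
|                 count += ((unsigned char)buf[i] == 127);
|         }
|         std::cout << count << "\n";
|         return 0;
|     }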
| anonymoushn wrote:
| For a solution that contains this stuff:
|
|     const u8 *file_lo;
|     // map all 250 MB of input from stdin (fd 0) up front
|     file_lo = (const u8 *)mmap(0, 250000000ull, PROT_READ,
|                                MAP_PRIVATE | MAP_POPULATE, 0, 0);
|     const u8 *file_hi = file_lo + 250000000ull;
|     u64 count = 0;
|     while (file_lo < file_hi) {
|         if (*file_lo == 127) { ++count; }
|         file_lo++;
|     }
|
| I got a bit under 54ms. The solution in the article runs in a
| bit under 16ms.
| vient wrote:
| Nice. Did some quick tests with your code on the site and got a
| score of ~34000 - the best solution is around 14700, so this one
| is only 2.3 times slower.
|
| Used clang with -Ofast -march=native -static. Funnily enough,
| gcc only gets 54000 with the same options, 1.6 times slower.
| vient wrote:
| Wow, changing the type of `count` from uint64_t to uint32_t or
| int radically changes the results - now gcc gets 26500 and
| clang gets 25000, which is just 1.7 times slower than the
| current best solution.
|
| So you can get 25k with the following code (clang -Ofast
| -std=c++17 -march=native -static):
|     #include <iostream>
|     #include <cstdint>
|     #include <sys/mman.h>
|     #include <unistd.h>
|
|     int main() {
|         // map all 250 MB of input from stdin up front
|         auto file_lo = (const uint8_t *)mmap(
|             0, 250000000ull, PROT_READ,
|             MAP_PRIVATE | MAP_POPULATE, STDIN_FILENO, 0);
|         int count = 0; // 32-bit counter is what made it faster
|         for (uint32_t i = 0; i < 250000000; ++i) {
|             if (file_lo[i] == 127) {
|                 ++count;
|             }
|         }
|         std::cout << count << std::endl;
|         _exit(0); // skip libc/static-object teardown
|         return 0;
|     }
| sYnfo wrote:
| Neat! I'll add the best solution in this thread without explicit
| SIMD/asm to the post after I wake up - it's a great data point.
| monktastic1 wrote:
| FYI, "eke." "Eek" is an expression of alarm or surprise.
| lumb63 wrote:
| Does anyone have any tips for similar wizardry-level SIMD
| optimization on ARM?
| anonymoushn wrote:
| If you learn AVX2 programming via highload, my impression is
| that NEON is quite similar. The main difference is the lack of
| movemask. You can read these[0] articles[1] about what to do
| instead.
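|
| (The common substitute, per those articles - a sketch, untested
| here: a shift-right-and-narrow packs the compare mask into a
| 64-bit scalar, 4 bits per input byte:)
|
|     #include <arm_neon.h>
|     #include <stdint.h>
|
|     // cmp holds 0x00/0xFF per lane (e.g. from vceqq_u8).
|     // vshrn packs it into a u64 "movemask", 4 bits per byte.
|     static inline uint64_t neon_movemask(uint8x16_t cmp) {
|         uint8x8_t packed =
|             vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
|         return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
|     }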
|
| As for SVE: prior to very recent versions there was no tblq
| (pshufb equivalent), so I didn't have much hope for using it
| for general-purpose programming, though of course it would be
| fine for stuff like TFA.
|
| [0]: https://community.arm.com/arm-community-
| blogs/b/infrastructu...
|
| [1]: https://www.corsix.org/content/whirlwind-tour-
| aarch64-vector...
| TinkersW wrote:
| It isn't as wide on ARM, so the gains will be smaller (most ARM
| cores only have 128-bit-wide NEON).
| sYnfo wrote:
| FYI vient [0] figured out that simply compiling with "-static
| -fno-pie" and _exit(0)-ing at the end brings the solution
| presented here to 15000 points and hence #4 on the leaderboard.
| Pretty funny.
|
| [0] https://news.ycombinator.com/user?id=vient
| TacticalCoder wrote:
| Let me hazard a guess: that blog post was _not_ written by an
| LLM!?
___________________________________________________________________
(page generated 2024-07-26 23:05 UTC)