[HN Gopher] Making my debug build run 100x faster so that it is ...
       ___________________________________________________________________
        
       Making my debug build run 100x faster so that it is finally usable
        
       Author : broken_broken_
       Score  : 40 points
       Date   : 2025-02-18 08:48 UTC (14 hours ago)
        
 (HTM) web link (gaultier.github.io)
 (TXT) w3m dump (gaultier.github.io)
        
       | bean-weevil wrote:
        | Why not just compile that particular object file with optimizations
        | on and the rest of the project with optimizations off?
        
         | broken_broken_ wrote:
          | Yes, that's the obvious (and boring!) answer, which I mention in
          | the introduction and which is, in a way, the implicit conclusion.
          | But then it does not teach us SIMD :)
        
           | saurik wrote:
            | Your article isn't really about how to speed up a debug build,
            | though, and I therefore think you're likely not going to
           | find the right audience. Like, to be honest, I gave up on
           | your article, because while I found the premise of speeding
           | up a debug build really interesting, I (currently) have no
           | interest in hand-optimizing SIMD... but, in another time, or
           | if I were someone else, I might find that really interesting,
           | but then would not have thought to look at this article.
           | "Hand-optimizing SHA-1 using SIMD intrinsics and assembly" is
           | just a very different mental space than "making my debug
           | build run 100x faster", even if they are two ways to describe
           | the same activity. "Using SIMD and assembly to avoid relying
           | on compiler optimizations for performance" also feels better?
           | I would at least get it if your title was a pun or a joke or
           | was in some way fun--at which point I would blame Hacker News
           | for pulling articles out of their context and not having a
           | good policy surrounding publicly facing titles or subtitles--
           | but it feels like, in this case, the title is merely a poor
           | way to describe the content.
        
         | mperham wrote:
          | Yep, I was hoping to learn how to do this. Seems like a much
          | better long-term lesson.
        
           | senkora wrote:
           | For gcc: #pragma GCC optimize ("O0")
           | 
           | For clang: #pragma clang optimize off
           | 
           | For MSVC: #pragma optimize("", off)
           | 
           | Put one of these at the top of your source file.
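            | 
            | The same mechanism works in the other direction too. A minimal
            | sketch of forcing -O2 on just the hot translation unit in an
            | otherwise -O0 -g build, assuming GCC (the file and function
            | names here are only illustrative):
            | 
            |     /* sha1.c -- illustrative sketch, GCC-specific.
            |        Every function defined after this pragma is compiled
            |        roughly as if with -O2, even when the rest of the
            |        project is built with -O0 -g. */
            |     #pragma GCC optimize ("O2")
            | 
            |     #include <stdint.h>
            |     #include <stddef.h>
            | 
            |     static inline uint32_t rotl32(uint32_t x, unsigned n) {
            |         return (x << n) | (x >> (32u - n));
            |     }
            | 
            |     /* Stand-in for the hot hashing loop. */
            |     uint32_t mix(const uint8_t *data, size_t len) {
            |         uint32_t h = 0x67452301u; /* SHA-1 h0, used as a seed */
            |         for (size_t i = 0; i < len; i++)
            |             h = rotl32(h, 5) + data[i];
            |         return h;
            |     }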
        
       | molenzwiebel wrote:
       | For this use-case, you can squeeze out even more performance by
       | using the SHA-1 implementation in Intel ISA-L Crypto [1]. The
       | SHA-1 implementation there allows for multi-buffer hashes, giving
       | you the ability to calculate the hashes for multiple chunks in
       | parallel on a single core. Given that that is basically your
        | use case, it might be worth considering. I doubt it'll provide
       | much speedup if you're already I/O bound here though.
       | 
       | [1]: https://github.com/intel/isa-l_crypto
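        | 
        | For anyone curious what the multi-buffer flow looks like, here is
        | a rough sketch based on the sha1_mb API from isa-l_crypto (type
        | and function names as found in its sha1_mb.h header; treat the
        | exact signatures as something to double-check): several
        | independent buffers are submitted to one context manager, and the
        | lanes are hashed together with SIMD on a single core.
        | 
        |     #include <stdio.h>
        |     #include <stdlib.h>
        |     #include <sha1_mb.h>   /* isa-l_crypto multi-buffer SHA-1 */
        | 
        |     #define NBUF 4
        | 
        |     int main(void) {
        |         SHA1_HASH_CTX_MGR *mgr;
        |         SHA1_HASH_CTX ctx[NBUF];
        |         static unsigned char bufs[NBUF][4096]; /* stand-in chunks */
        | 
        |         /* The context manager expects aligned storage. */
        |         if (posix_memalign((void **)&mgr, 16, sizeof(*mgr)) != 0)
        |             return 1;
        |         sha1_ctx_mgr_init(mgr);
        | 
        |         /* Submit each chunk as a whole job. */
        |         for (int i = 0; i < NBUF; i++) {
        |             hash_ctx_init(&ctx[i]);
        |             sha1_ctx_mgr_submit(mgr, &ctx[i], bufs[i],
        |                                 sizeof(bufs[i]), HASH_ENTIRE);
        |         }
        |         /* Drain jobs that are still in flight. */
        |         while (sha1_ctx_mgr_flush(mgr) != NULL)
        |             ;
        | 
        |         for (int i = 0; i < NBUF; i++)
        |             printf("chunk %d digest[0] = %08x\n", i,
        |                    ctx[i].job.result_digest[0]);
        |         free(mgr);
        |         return 0;
        |     }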
        
         | secondcoming wrote:
         | I came across this repo recently and it looks great. It's a
         | pity there doesn't seem to be an official Ubuntu package for it
          | though. There is one for the Intelligent Storage Acceleration
          | Library, however.
        
         | broken_broken_ wrote:
          | Thank you, I will definitely have a look and update the
          | article if there are any interesting findings.
        
       | ack_complete wrote:
       | SHA1 is difficult to vectorize due to a tight loop-carried
       | dependency in the main operation. In an optimized build, I've
       | only seen about a 15% speedup over the scalar version with x64
       | SSSE3 without hardware SHA1 support. Debug builds of course can
       | benefit more from the reduction in operations since the
       | inefficient code generation is a bigger issue there than the
       | dependency chains. I think the performance delta is bigger for
       | ARM64 CPUs, but it's pretty rare to not have the Crypto extension
       | (except notably some Raspberry Pi models).
       | 
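        | To make that dependency concrete: in the scalar round function,
        | the new 'a' computed in one round is rotated and added in the
        | very next round, so the 80 rounds form a single serial chain.
        | A minimal sketch of one round (f and k vary with the round
        | number; SIMD versions mostly vectorize the message schedule w[],
        | not this part):
        | 
        |     #include <stdint.h>
        | 
        |     static inline uint32_t rotl32(uint32_t x, unsigned n) {
        |         return (x << n) | (x >> (32 - n));
        |     }
        | 
        |     /* s[0..4] = a, b, c, d, e. The s[0] produced here feeds
        |        rotl32(s[0], 5) in the next round, which is the
        |        loop-carried dependency. */
        |     static void sha1_round(uint32_t s[5], uint32_t f, uint32_t k,
        |                            uint32_t wt) {
        |         uint32_t tmp = rotl32(s[0], 5) + f + s[4] + k + wt;
        |         s[4] = s[3];
        |         s[3] = s[2];
        |         s[2] = rotl32(s[1], 30);
        |         s[1] = s[0];
        |         s[0] = tmp;
        |     }
        | 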
        | The comments in the SSE2 version are a bit odd, as they reference
       | MMX, and the Pentium M and Efficeon CPUs. Those CPUs are
        | _ancient_ -- 2003/2004 era. The vectorized code you have also
       | uses SSE2 and not MMX, which is important since SSE2 is double
       | the width and has different performance characteristics from MMX.
       | IIRC, Intel CPUs didn't start supporting SHA until ~2019 with Ice
       | Lake, so the target for non-hardware-accelerated vectorized SHA1
       | for Intel CPUs would be mostly Skylake-based.
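        | 
        | Given that mix of CPUs, the usual pattern is to pick the
        | implementation at runtime. A minimal detection sketch for the x86
        | SHA extensions (CPUID leaf 7, subleaf 0, EBX bit 29), assuming
        | GCC or Clang's <cpuid.h>:
        | 
        |     #include <cpuid.h>
        |     #include <stdio.h>
        | 
        |     /* Returns 1 if the CPU advertises the SHA extensions. */
        |     static int has_sha_ext(void) {
        |         unsigned int eax, ebx, ecx, edx;
        |         if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        |             return 0;
        |         return (ebx >> 29) & 1;
        |     }
        | 
        |     int main(void) {
        |         puts(has_sha_ext() ? "SHA extensions: yes"
        |                            : "SHA extensions: no");
        |         return 0;
        |     }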
        
       | tvbusy wrote:
        | I understand the post is about learning to speed up SHA1
        | calculation; on that I have no comment. However, the state file is
        | a solved problem for me. It's rare for state files to get
        | corrupted, and when they do it's simple to just re-check the
        | files. I cannot imagine a torrent client checking the hash of TBs
        | of files on every single start. It's not a coincidence that many
        | torrent clients have a feature to skip hash checking and just
        | assume the files are correct and start seeding immediately.
        
       ___________________________________________________________________
       (page generated 2025-02-18 23:00 UTC)