[HN Gopher] Bit-twiddling optimizations in Zed's Rope
       ___________________________________________________________________
        
       Bit-twiddling optimizations in Zed's Rope
        
       Author : misternugget
       Score  : 45 points
       Date   : 2024-11-18 16:39 UTC (2 days ago)
        
 (HTM) web link (zed.dev)
 (TXT) w3m dump (zed.dev)
        
       | rgrmrts wrote:
       | I really like all the blog posts and videos the Zed team has put
       | out, thank you if you're reading this!
       | 
        | Unrelated to this specific post, I'm such a fan of Zed. It's
        | the first feature-complete text editor in recent memory that
        | I've truly enjoyed using (i.e. it stays out of the way, is
        | really fast, feels well engineered). I'm coming to Zed after
        | years of Emacs, which I still have love for but which no longer
        | feels like a competitive piece of software (it does not take
        | full advantage of how good computers are today, e.g. gpu
        | rendering or multicore). I really hope Zed stays a fast and
        | lightweight text editor instead of becoming some bloated
        | growth-at-all-cost VC ware (not that they've exhibited any
        | signs of that happening). I'd also happily pay for Zed without
        | a subscription-based thing for access to LLM features (which I
        | do not use).
        
         | seanw444 wrote:
         | > it does not take full advantage of how good computers are
         | today, e.g. gpu rendering or multicore
         | 
         | Why does Emacs need that though? I hear people say this all the
         | time and I don't get it. Multicore kind of works against the
         | structure that Emacs touts as a feature. And GPU rendering? In
         | many applications, I totally agree with these complaints. But
         | it's a text editor.
         | 
         | I tried Zed myself, and it's good. But it doesn't dethrone
         | Emacs (for me personally).
        
           | PittleyDunkin wrote:
           | > Multicore kind of works against the structure that Emacs
           | touts as a feature.
           | 
           | I have consistent issues with emacs locking up when executing
           | network requests. I'm sure there's a specific bug that could
           | be hunted down and addressed, but this sort of thing
           | shouldn't happen much in an editor that's multicore by
           | default.
           | 
           | I'm not trying to dismiss emacs' reasoning, of course, but I
           | can understand being disgruntled with it.
           | 
            | The actual rendering I've been quite pleased by, though!
        
             | rgrmrts wrote:
              | Yeah, this is one reason; another is Emacs freezing for
              | up to a minute when updating packages. I also notice
              | latency when using an LSP.
             | 
              | I use Emacs GUI (outside of the terminal), and compared
              | to something like Zed or Sublime the rendering
              | performance difference is definitely noticeable. It's
              | great that Emacs is so resource efficient, but sometimes
              | I wish it used more of my beefy computer(s).
             | 
             | Like I said I still love Emacs and it's okay for it to make
             | a different set of trade-offs. I honestly didn't think I'd
             | ever switch editors but here we are!
        
               | seanw444 wrote:
               | That's fair I guess. In the case of IO that can be an
               | issue. When I hear multicore, I assume we're talking
               | about parallelism, not concurrency.
               | 
               | As for LSP, other than the Nim langserver, I've been
               | quite satisfied with Eglot's performance. I'm curious
               | what your setup is like. To be fair, I'm running a highly
               | customized Doom Emacs config, so it's possible I
               | inherited some non-vanilla optimizations I'm not aware
               | of.
        
            | DrBenCarson wrote:
            | "Need" is strong, but using GPU rendering is definitely
            | better than CPU rendering and makes things feel very
            | snappy.
            | 
            | Most new TTY projects use GPUs for that reason.
        
       | dmitrygr wrote:
        | >  // Parallel bit count intermediates
        | >  let a = v - ((v >> 1) & (u64::MAX / 3));
        | >  let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
        | >  let c = (b + (b >> 4)) & (u64::MAX / 0x11);
        | >  let d = (c + (c >> 8)) & (u64::MAX / 0x101);
       | 
        | That "parallel bit count" is almost certainly slower than using
        | two POPCNT instructions on a modern CPU. You should just call
        | __builtin_popcount() and let the compiler do it in the most
        | optimal way. Luckily, people do this sort of thing so often
        | that many modern compilers will try (and often succeed) to
        | detect you attempting this insanity and convert it to a POPCNT
        | (or a pair of POPCNTs, as the case may be here).
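        | 
        | For illustration, a minimal Rust sketch of that suggestion
        | (u64::count_ones() is Rust's portable popcount builtin; the
        | helper name and two-chunk shape here are hypothetical, not
        | Zed's actual code):
        | 
        |     // count_ones() lowers to a hardware POPCNT when the target
        |     // supports it, and to a compiler-chosen fallback otherwise.
        |     fn popcount_two_chunks(lo: u64, hi: u64) -> u32 {
        |         lo.count_ones() + hi.count_ones()
        |     }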
        
         | akoboldfrying wrote:
         | Which compilers support __builtin_popcount()? From memory, it's
         | a gcc extension. If the compiler selects a CPU POPCOUNT
         | instruction for it, are you sure it will work on all machines
         | that you want to run it on?
         | 
         | The above code is completely source- and binary-portable and
         | reasonably fast -- certainly faster than naively looping
         | through the bits, and within a small constant factor of a CPU
         | POPCOUNT instruction.
        
           | woadwarrior01 wrote:
           | > Which compilers support __builtin_popcount()?
           | 
           | Clang supports __builtin_popcount() too. And MSVC has
           | __popcnt().
        
           | dmitrygr wrote:
            | Your compiler will know _the best_ way to popcount; that is
            | the point of that builtin. It'll use the best method -
            | sometimes this one. GCC does this, MSVC does this, Clang
            | does this, and I think even Rust has some way to do it
            | (EDIT: it does: count_ones()). On archs which lack POPCNT,
            | it will use this method or another, based on knowing the
            | target. On x86 this approach is OK as is. On arm64, for
            | example, it will be suboptimal due to all the literals
            | needed. On armv6m, this method is bad and table lookups are
            | faster.
        
             | SkiFire13 wrote:
             | Note that by default rustc targets x86-64-v1 when compiling
             | for x86-64, and that lacks the popcount instruction. You
             | need to change the target_cpu to at least x86-64-v2 or
             | enable the popcount target_feature. This means that even if
             | your cpu is relatively new and you intend to run your code
             | on relatively new cpus, rustc will still generate older and
             | slower code for count_ones() using bitshifts and masks.
             | That said, I don't see the point in writing them manually
             | if the compiler can generate them for you.
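              | 
              | For example, either of these (a hypothetical invocation;
              | -C target-cpu and -C target-feature are the relevant
              | rustc flags) lets count_ones() compile down to POPCNT:
              | 
              |     RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build
              |     # or enable only that one feature:
              |     RUSTFLAGS="-C target-feature=+popcnt" cargo build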
        
           | jandrewrogers wrote:
           | Most vaguely recent compilers will convert naively looping
           | through bits into a native POPCOUNT instruction. The parallel
           | bit count algorithm was not reliably detected until more
           | recently and therefore would sometimes produce unoptimized
           | code, though current versions of gcc/clang/msvc can all
           | detect it now.
           | 
           | Also, pretty much every compiler for a very long time has
           | supported __builtin_popcount or equivalent.
        
           | aseipp wrote:
           | Everything supports __builtin_popcount or some variant these
           | days (__popcnt for MSVC). It's a complete non-issue, really.
           | 
           | And the compiler is not required to lower it to a single
           | instruction. It will if the target architecture is specified
           | appropriately, but there's nothing that says it has to
           | explode if it can't. In fact, by doing it this way, the
           | compiler is actually more free to generate code in a way
           | that's optimal for the architecture in all cases, because all
            | the implementation details are hidden; e.g. loads of large
           | constants may be avoided if the compiler is allowed to choose
           | the exact implementation, while using the portable version
           | may tie its hands more depending on how it's feeling on that
           | day. Here's __builtin_popcount working just fine while
           | targeting a ~20yo architecture without native support for
           | SSE4.2; it can generate this code knowing what the proper
           | instructions and schedules are:
           | https://godbolt.org/z/ra7n5T5f3
           | 
           | The moral here is that the primitives are there for you to
           | use. Just use them and save yourself and your would-be code
           | reviewer's time.
        
       | camel-cdr wrote:
       | nth_set_bit_u64: wouldn't that be __builtin_ctzll(_pdep_u64(1<<n,
       | v)) with BMI2?
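        | 
        | A rough Rust sketch of that suggestion (a hypothetical helper,
        | not from the post; _pdep_u64 needs the bmi2 target feature, and
        | 1u64 << n keeps the shift in 64 bits):
        | 
        |     #[cfg(target_arch = "x86_64")]
        |     #[target_feature(enable = "bmi2")]
        |     unsafe fn nth_set_bit_u64(v: u64, n: u32) -> u32 {
        |         use std::arch::x86_64::_pdep_u64;
        |         // Deposit a lone 1 bit at the position of the n-th
        |         // (0-indexed) set bit of v, then locate it with a
        |         // trailing-zero count (CTZ/TZCNT).
        |         _pdep_u64(1u64 << n, v).trailing_zeros()
        |     }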
        
         | SkiFire13 wrote:
         | That's assuming you're ok with your program not running on some
         | older cpus.
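          | 
          | One common way around that (a sketch, not from the post) is
          | to check for BMI2 at runtime and keep a portable fallback:
          | 
          |     // Hypothetical runtime dispatch: take the BMI2 fast path
          |     // only when the running CPU reports support for it.
          |     fn nth_set_bit_u64(v: u64, n: u32) -> u32 {
          |         #[cfg(target_arch = "x86_64")]
          |         if std::arch::is_x86_feature_detected!("bmi2") {
          |             // Safe: we just verified BMI2 is available.
          |             return unsafe { nth_set_bit_bmi2(v, n) };
          |         }
          |         nth_set_bit_portable(v, n)
          |     }
          | 
          |     #[cfg(target_arch = "x86_64")]
          |     #[target_feature(enable = "bmi2")]
          |     unsafe fn nth_set_bit_bmi2(v: u64, n: u32) -> u32 {
          |         use std::arch::x86_64::_pdep_u64;
          |         _pdep_u64(1u64 << n, v).trailing_zeros()
          |     }
          | 
          |     // Portable fallback: clear the lowest set bit n times so
          |     // the n-th (0-indexed) set bit becomes the lowest one.
          |     fn nth_set_bit_portable(mut v: u64, n: u32) -> u32 {
          |         for _ in 0..n {
          |             v &= v.wrapping_sub(1);
          |         }
          |         v.trailing_zeros()
          |     }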
        
       ___________________________________________________________________
       (page generated 2024-11-20 23:00 UTC)