[HN Gopher] Bit-twiddling optimizations in Zed's Rope
___________________________________________________________________
Bit-twiddling optimizations in Zed's Rope
Author : misternugget
Score : 45 points
Date : 2024-11-18 16:39 UTC (2 days ago)
(HTM) web link (zed.dev)
(TXT) w3m dump (zed.dev)
| rgrmrts wrote:
| I really like all the blog posts and videos the Zed team has put
| out, thank you if you're reading this!
|
| Unrelated to this specific post I'm such a fan of Zed. It's the
| first feature complete text editor in recent memory that I've
| truly enjoyed using (i.e. it stays out of the way, is really
| fast, feels well engineered). I'm coming to Zed after years of
| Emacs which I still have love for but no longer feels like a
| competitive piece of software (it does not take full advantage of
| how good computers are today, e.g. gpu rendering or multicore). I
| really hope Zed stays a fast and lightweight text editor instead
| of becoming some bloated growth-at-all-cost VC ware (not that
| they've exhibited any signs of that happening). I'd also happily
| pay for Zed without a subscription based thing for access to LLM
| features (which I do not use).
| seanw444 wrote:
| > it does not take full advantage of how good computers are
| today, e.g. gpu rendering or multicore
|
| Why does Emacs need that though? I hear people say this all the
| time and I don't get it. Multicore kind of works against the
| structure that Emacs touts as a feature. And GPU rendering? In
| many applications, I totally agree with these complaints. But
| it's a text editor.
|
| I tried Zed myself, and it's good. But it doesn't dethrone
| Emacs (for me personally).
| PittleyDunkin wrote:
| > Multicore kind of works against the structure that Emacs
| touts as a feature.
|
| I have consistent issues with emacs locking up when executing
| network requests. I'm sure there's a specific bug that could
| be hunted down and addressed, but this sort of thing
| shouldn't happen much in an editor that's multicore by
| default.
|
| I'm not trying to dismiss emacs' reasoning, of course, but I
| can understand being disgruntled with it.
|
| The actual rendering I've been quite please by, though!
| rgrmrts wrote:
| Yeah this is one reason, or Emacs freezing for up to a
| minute when updating packages. Also when using an LSP I
| notice latency.
|
| I use Emacs GUI (outside of the terminal) and comparing
| performance for rending to something like Zed or Sublime is
| definitely noticeable. It's great that Emacs is so resource
| efficient but sometimes I wish it used more of my beefy
| computer(s).
|
| Like I said I still love Emacs and it's okay for it to make
| a different set of trade-offs. I honestly didn't think I'd
| ever switch editors but here we are!
| seanw444 wrote:
| That's fair I guess. In the case of IO that can be an
| issue. When I hear multicore, I assume we're talking
| about parallelism, not concurrency.
|
| As for LSP, other than the Nim langserver, I've been
| quite satisfied with Eglot's performance. I'm curious
| what your setup is like. To be fair, I'm running a highly
| customized Doom Emacs config, so it's possible I
| inherited some non-vanilla optimizations I'm not aware
| of.
| DrBenCarson wrote:
| "Need" is strong but using GPU rendering is definitely better
| than CPU rendering and makes things feel very snappy
|
| Most new TTY projects use GPUs for that reason
| dmitrygr wrote:
| > // Parallel bit count intermediates > let a = v - ((v
| >> 1) & (u64::MAX / 3)); > let b = (a & (u64::MAX / 5)) +
| ((a >> 2) & (u64::MAX / 5)); > let c = (b + (b >> 4)) &
| (u64::MAX / 0x11); > let d = (c + (c >> 8)) & (u64::MAX /
| 0x101);
|
| That "parallel bit count" is almost certainly slower than using
| two POPCNT instructions on a modern cpu. Should just call
| __builtin_popcount() and let the compiler do it the most optimal
| way. Luckily, people do this sort of thing so often that many
| modern compilers will try (and often succeed) to detect you
| trying this insanity and convert it to a POPCOUNT (or a pair of
| POPCOUNTs as the case may be here)
| akoboldfrying wrote:
| Which compilers support __builtin_popcount()? From memory, it's
| a gcc extension. If the compiler selects a CPU POPCOUNT
| instruction for it, are you sure it will work on all machines
| that you want to run it on?
|
| The above code is completely source- and binary-portable and
| reasonably fast -- certainly faster than naively looping
| through the bits, and within a small constant factor of a CPU
| POPCOUNT instruction.
| woadwarrior01 wrote:
| > Which compilers support __builtin_popcount()?
|
| Clang supports __builtin_popcount() too. And MSVC has
| __popcnt().
| dmitrygr wrote:
| Your compiler will know _the best_ way to popcount, that is
| the point of that builtin. It 'll use the best method -
| sometimes this one. GCC does this, MSVC does this, clang does
| this, i think even rust has some way to do it (EDIT: it does:
| count_ones()). On archs which lack POPCNT, it will use this
| method or another, based on knowing the target. On x86 this
| approach is OK as is. On arm64, for example, it will be
| suboptimal due to all the literals needed. On armv6m, this
| method is bad and table lookups are faster.
| SkiFire13 wrote:
| Note that by default rustc targets x86-64-v1 when compiling
| for x86-64, and that lacks the popcount instruction. You
| need to change the target_cpu to at least x86-64-v2 or
| enable the popcount target_feature. This means that even if
| your cpu is relatively new and you intend to run your code
| on relatively new cpus, rustc will still generate older and
| slower code for count_ones() using bitshifts and masks.
| That said, I don't see the point in writing them manually
| if the compiler can generate them for you.
| jandrewrogers wrote:
| Most vaguely recent compilers will convert naively looping
| through bits into a native POPCOUNT instruction. The parallel
| bit count algorithm was not reliably detected until more
| recently and therefore would sometimes produce unoptimized
| code, though current versions of gcc/clang/msvc can all
| detect it now.
|
| Also, pretty much every compiler for a very long time has
| supported __builtin_popcount or equivalent.
| aseipp wrote:
| Everything supports __builtin_popcount or some variant these
| days (__popcnt for MSVC). It's a complete non-issue, really.
|
| And the compiler is not required to lower it to a single
| instruction. It will if the target architecture is specified
| appropriately, but there's nothing that says it has to
| explode if it can't. In fact, by doing it this way, the
| compiler is actually more free to generate code in a way
| that's optimal for the architecture in all cases, because all
| the implementation details are hidden e.g. loads of large
| constants may be avoided if the compiler is allowed to choose
| the exact implementation, while using the portable version
| may tie its hands more depending on how it's feeling on that
| day. Here's __builtin_popcount working just fine while
| targeting a ~20yo architecture without native support for
| SSE4.2; it can generate this code knowing what the proper
| instructions and schedules are:
| https://godbolt.org/z/ra7n5T5f3
|
| The moral here is that the primitives are there for you to
| use. Just use them and save yourself and your would-be code
| reviewer's time.
| camel-cdr wrote:
| nth_set_bit_u64: wouldn't that be __builtin_ctzll(_pdep_u64(1<<n,
| v)) with BMI2?
| SkiFire13 wrote:
| That's assuming you're ok with your program not running on some
| older cpus.
___________________________________________________________________
(page generated 2024-11-20 23:00 UTC)