[HN Gopher] Bit-twiddling optimizations in Zed's Rope
       ___________________________________________________________________
        
       Bit-twiddling optimizations in Zed's Rope
        
       Author : misternugget
       Score  : 149 points
       Date   : 2024-11-18 16:39 UTC (3 days ago)
        
 (HTM) web link (zed.dev)
 (TXT) w3m dump (zed.dev)
        
       | rgrmrts wrote:
       | I really like all the blog posts and videos the Zed team has put
       | out, thank you if you're reading this!
       | 
       | Unrelated to this specific post I'm such a fan of Zed. It's the
       | first feature complete text editor in recent memory that I've
       | truly enjoyed using (i.e. it stays out of the way, is really
       | fast, feels well engineered). I'm coming to Zed after years of
       | Emacs which I still have love for but no longer feels like a
       | competitive piece of software (it does not take full advantage of
       | how good computers are today, e.g. gpu rendering or multicore). I
       | really hope Zed stays a fast and lightweight text editor instead
       | of becoming some bloated growth-at-all-cost VC ware (not that
       | they've exhibited any signs of that happening). I'd also happily
       | pay for Zed without a subscription based thing for access to LLM
       | features (which I do not use).
        
         | seanw444 wrote:
         | > it does not take full advantage of how good computers are
         | today, e.g. gpu rendering or multicore
         | 
         | Why does Emacs need that though? I hear people say this all the
         | time and I don't get it. Multicore kind of works against the
         | structure that Emacs touts as a feature. And GPU rendering? In
         | many applications, I totally agree with these complaints. But
         | it's a text editor.
         | 
         | I tried Zed myself, and it's good. But it doesn't dethrone
         | Emacs (for me personally).
        
           | PittleyDunkin wrote:
           | > Multicore kind of works against the structure that Emacs
           | touts as a feature.
           | 
           | I have consistent issues with emacs locking up when executing
           | network requests. I'm sure there's a specific bug that could
           | be hunted down and addressed, but this sort of thing
           | shouldn't happen much in an editor that's multicore by
           | default.
           | 
           | I'm not trying to dismiss emacs' reasoning, of course, but I
           | can understand being disgruntled with it.
           | 
            | The actual rendering I've been quite pleased by, though!
        
             | rgrmrts wrote:
              | Yeah, this is one reason. Another is Emacs freezing for
              | up to a minute when updating packages. I also notice
              | latency when using an LSP.
             | 
              | I use the Emacs GUI (outside of the terminal), and the
              | rendering performance compared to something like Zed or
              | Sublime is definitely noticeable. It's great that Emacs
              | is so resource
             | efficient but sometimes I wish it used more of my beefy
             | computer(s).
             | 
             | Like I said I still love Emacs and it's okay for it to make
             | a different set of trade-offs. I honestly didn't think I'd
             | ever switch editors but here we are!
        
               | seanw444 wrote:
                | That's fair, I guess. In the case of IO, that can be
                | an issue. When I hear multicore, I assume we're
                | talking about parallelism, not concurrency.
               | 
               | As for LSP, other than the Nim langserver, I've been
               | quite satisfied with Eglot's performance. I'm curious
               | what your setup is like. To be fair, I'm running a highly
               | customized Doom Emacs config, so it's possible I
               | inherited some non-vanilla optimizations I'm not aware
               | of.
        
               | PittleyDunkin wrote:
               | > When I hear multicore, I assume we're talking about
               | parallelism, not concurrency.
               | 
               | Parallelism trivially enables concurrency. The lack of
               | parallelism magnifies the issues emacs has with
               | concurrency.
        
               | karthink wrote:
               | Removing the interpreter lock for a few specialized tasks
               | (without sweeping runtime changes to Emacs) would be
               | enough to fix most of these issues -- parsing JSON from
               | process output into lisp data in a background thread is
               | one candidate. [1]
               | 
                | Installing packages does not need to block either;
                | there is no architectural limitation here. The Elpaca
                | package manager for Emacs provides async, parallel
                | package updates. Loading packages into the Lisp image
                | will block, though; there's no way around that.
               | 
               | The other big source of input lag is garbage collection,
               | and there are some ongoing efforts to use the MPS library
               | in Emacs for a copying, concurrent GC. This is a big
               | change and I don't know if this experiment will go
               | anywhere, but Eli Zaretskii and co are trying.
               | 
               | [1]: https://github.com/emacs-lsp/emacs
        
               | erk__ wrote:
                | Someone made an alternative way of optimizing LSP by
                | converting the JSON to elisp bytecode on the fly:
                | https://github.com/blahgeek/emacs-lsp-booster
        
           | DrBenCarson wrote:
           | "Need" is strong but using GPU rendering is definitely better
           | than CPU rendering and makes things feel very snappy
           | 
           | Most new TTY projects use GPUs for that reason
        
           | vinkelhake wrote:
           | > But it's a text editor.
           | 
            | Long time emacs user here (20+ years, yikes). I've used it
            | on
           | all kinds of computers during this time. Even a relatively
           | modest computer from 2024 is an absolute beast compared to
           | something from the year 2000.
           | 
           | With that said, there are text editing operations that can
           | cause it to grind to a complete halt, especially working with
            | large files (or very long lines). And it _shouldn't_, you
           | know? I think emacs users sort of internalize which
           | operations they should avoid. It's kind of ridiculous to have
           | to do that in a text editor with the massive amounts of
           | compute that our computers have today.
        
             | donio wrote:
             | > (or very long lines)
             | 
             | Long line handling has greatly improved in emacs-29. Multi-
             | megabyte lines are not a problem anymore.
        
                | canucker2016 wrote:
                | from the Emacs 29.1 NEWS file,
                | https://raw.githubusercontent.com/emacs-mirror/emacs/refs/he...
                | 
                |     ** Emacs is now capable of editing files with
                |     very long lines.
                | 
                |     The display of long lines has been optimized, and
                |     Emacs should no longer choke when a buffer on
                |     display contains long lines.  The variable
                |     'long-line-threshold' controls whether and when
                |     these display optimizations are in effect.
                | 
                |     A companion variable 'large-hscroll-threshold'
                |     controls when another set of display
                |     optimizations are in effect, which are aimed
                |     specifically at speeding up display of long lines
                |     that are truncated on display.
        
           | kevin_thibedeau wrote:
            | GPU rendering simplifies smooth text scrolling, which
            | used to be a thing on some dumb terminals and on
            | microcomputers like the Amiga that supported it in
            | hardware. Most emulators lock character cells to a fixed
            | grid, so we miss out on such niceties.
        
             | ku1ik wrote:
             | AMIIIIGAAAAA!
        
       | dmitrygr wrote:
        | > // Parallel bit count intermediates
        | > let a = v - ((v >> 1) & (u64::MAX / 3));
        | > let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
        | > let c = (b + (b >> 4)) & (u64::MAX / 0x11);
        | > let d = (c + (c >> 8)) & (u64::MAX / 0x101);
       | 
        | That "parallel bit count" is almost certainly slower than
        | using two POPCNT instructions on a modern CPU. You should
        | just call __builtin_popcount() and let the compiler do it the
        | most optimal way. Luckily, people do this sort of thing so
        | often that many modern compilers will try (and often succeed)
        | to detect this insanity and convert it to a POPCNT (or a pair
        | of POPCNTs, as the case may be here).
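
For readers following along: the quoted lines are the classic SWAR ("SIMD within a register") popcount intermediates. Below is a minimal sketch completing them into a full count and checking the result against Rust's built-in count_ones(), which compiles to POPCNT where the target supports it. The final multiply-and-shift fold is an assumption here, since the quoted snippet stops at `d`:

```rust
// SWAR popcount: each step widens the lanes in which partial bit
// counts are accumulated (2-bit, 4-bit, 8-bit, then 16-bit lanes).
fn swar_popcount(v: u64) -> u32 {
    let a = v - ((v >> 1) & (u64::MAX / 3)); // 2-bit lane counts
    let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5)); // 4-bit
    let c = (b + (b >> 4)) & (u64::MAX / 0x11); // 8-bit lane counts
    let d = (c + (c >> 8)) & (u64::MAX / 0x101); // 16-bit lane counts
    // Assumed final fold: sum the four 16-bit lanes via a multiply and
    // take the top 16 bits (wrapping, since the product overflows u64).
    (d.wrapping_mul(u64::MAX / 0xFFFF) >> 48) as u32
}

fn main() {
    for v in [0u64, 1, 0xFF, 0x8000_0000_0000_0001, u64::MAX] {
        assert_eq!(swar_popcount(v), v.count_ones());
    }
    println!("swar_popcount matches count_ones");
}
```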
        
         | akoboldfrying wrote:
         | Which compilers support __builtin_popcount()? From memory, it's
         | a gcc extension. If the compiler selects a CPU POPCOUNT
         | instruction for it, are you sure it will work on all machines
         | that you want to run it on?
         | 
         | The above code is completely source- and binary-portable and
         | reasonably fast -- certainly faster than naively looping
         | through the bits, and within a small constant factor of a CPU
         | POPCOUNT instruction.
        
           | woadwarrior01 wrote:
           | > Which compilers support __builtin_popcount()?
           | 
           | Clang supports __builtin_popcount() too. And MSVC has
           | __popcnt().
        
           | dmitrygr wrote:
            | Your compiler will know _the best_ way to popcount; that
            | is the point of that builtin. It'll use the best method,
            | sometimes this one. GCC does this, MSVC does this, clang
            | does this, and I think even Rust has a way to do it
            | (EDIT: it does: count_ones()). On archs which lack
            | POPCNT, it will use this method or another, based on
            | knowing the target. On x86 this approach is OK as is. On
            | arm64, for example, it will be suboptimal due to all the
            | literals needed. On armv6m, this method is bad and table
            | lookups are faster.
        
             | SkiFire13 wrote:
             | Note that by default rustc targets x86-64-v1 when compiling
             | for x86-64, and that lacks the popcount instruction. You
             | need to change the target_cpu to at least x86-64-v2 or
             | enable the popcount target_feature. This means that even if
             | your cpu is relatively new and you intend to run your code
             | on relatively new cpus, rustc will still generate older and
             | slower code for count_ones() using bitshifts and masks.
             | That said, I don't see the point in writing them manually
             | if the compiler can generate them for you.
        
               | vlovich123 wrote:
               | It's not unreasonable to think that Rust will change the
               | minimum version and you should always override the target
               | cpu anyway for C++-like toolchains when building
               | production binaries (`-Ctarget-cpu` for rust, `march=`
               | for clang/gcc).
        
             | Findecanor wrote:
             | I once wrote that algorithm, divided into single lines,
             | intending each line to be a single 64-bit ARM instruction.
             | The compiler did idiom detection, transforming it to
             | "builtin popcnt" and (because 64-bit ARMv8.0 lacks a POPCNT
             | instruction) back to the same algorithm. Only that the
             | emitted code was one instruction larger than my code.
             | 
              | 64-bit ARM actually has a very peculiar encoding of
              | immediates to logical instructions. It supports _only_
              | recurring bit patterns such as those used by this
              | algorithm. For example, "and x2, x3,
              | #0x3333333333333333" is encoded as one four-byte
              | instruction.
        
               | stassats wrote:
               | > because 64-bit ARMv8.0 lacks a POPCNT instruction
               | 
               | It does have this: https://developer.arm.com/documentatio
               | n/ddi0596/2021-09/SIMD...
               | 
               | And GCC happily uses it https://godbolt.org/z/dTW46f9Kf
        
           | jandrewrogers wrote:
           | Most vaguely recent compilers will convert naively looping
           | through bits into a native POPCOUNT instruction. The parallel
           | bit count algorithm was not reliably detected until more
           | recently and therefore would sometimes produce unoptimized
           | code, though current versions of gcc/clang/msvc can all
           | detect it now.
           | 
           | Also, pretty much every compiler for a very long time has
           | supported __builtin_popcount or equivalent.
        
           | aseipp wrote:
           | Everything supports __builtin_popcount or some variant these
           | days (__popcnt for MSVC). It's a complete non-issue, really.
           | 
           | And the compiler is not required to lower it to a single
           | instruction. It will if the target architecture is specified
           | appropriately, but there's nothing that says it has to
           | explode if it can't. In fact, by doing it this way, the
           | compiler is actually more free to generate code in a way
           | that's optimal for the architecture in all cases, because all
           | the implementation details are hidden e.g. loads of large
           | constants may be avoided if the compiler is allowed to choose
           | the exact implementation, while using the portable version
           | may tie its hands more depending on how it's feeling on that
           | day. Here's __builtin_popcount working just fine while
           | targeting a ~20yo architecture without native support for
           | SSE4.2; it can generate this code knowing what the proper
           | instructions and schedules are:
           | https://godbolt.org/z/ra7n5T5f3
           | 
           | The moral here is that the primitives are there for you to
           | use. Just use them and save yourself and your would-be code
           | reviewer's time.
        
           | SuchAnonMuchWow wrote:
           | In addition to the other comments, the iso C23 standard added
           | the <stdbit.h> header to the standard library with a
           | stdc_count_ones() function, so compiler support will become
           | standard.
        
           | Validark wrote:
            | I guess it depends on how many different compilers you
            | are targeting, but generally speaking, compilers look for
            | trivial implementations of popcount and can replace them
            | with the single instruction.
           | 
           | However, as someone already mentioned, this is not equivalent
           | to a pair of popcounts.
        
           | 3836293648 wrote:
            | The only one that exists. This is Rust, and rustc
            | supports popcnt on all platforms that have it and
            | emulates it on those that don't.
        
         | ot wrote:
         | What that code does is a per-byte-pair popcount, which is not
         | what the POPCNT instruction does (it computes the popcount for
         | the whole word).
         | 
         | On processors with BMI2 the whole algorithm reduces to a PDEP
         | as mentioned in another comment, but if you don't have that
         | this is pretty much the best you can do (unless you use lookup
         | tables but those have pros and cons).
        
       | camel-cdr wrote:
       | nth_set_bit_u64: wouldn't that be __builtin_ctzll(_pdep_u64(1<<n,
       | v)) with BMI2?
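
A sketch of that suggestion in Rust (the function name and fallback are illustrative, not Zed's API): PDEP deposits the lone bit 1 << n into the positions of v's set bits, so the trailing-zero count of the result is the position of the n-th set bit (0-indexed, assuming n is less than the popcount of v):

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
fn nth_set_bit(v: u64, n: u64) -> u32 {
    // BMI2 path: deposit a single bit into the mask, then locate it.
    unsafe { std::arch::x86_64::_pdep_u64(1u64 << n, v) }.trailing_zeros()
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "bmi2")))]
fn nth_set_bit(v: u64, n: u64) -> u32 {
    // Portable fallback: clear the lowest set bit n times; the answer
    // is then the position of the lowest remaining set bit.
    let mut v = v;
    for _ in 0..n {
        v &= v - 1; // clears the lowest set bit
    }
    v.trailing_zeros()
}

fn main() {
    // 0b1011_0100 has set bits at positions 2, 4, 5 and 7.
    assert_eq!(nth_set_bit(0b1011_0100, 0), 2);
    assert_eq!(nth_set_bit(0b1011_0100, 2), 5);
    assert_eq!(nth_set_bit(0b1011_0100, 3), 7);
    println!("ok");
}
```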
        
         | SkiFire13 wrote:
         | That's assuming you're ok with your program not running on some
         | older cpus.
        
           | zamadatix wrote:
            | That, and that you're not willing to entertain splitting
            | out the manual version as a #[cfg(not(target_feature =
            | "bmi2"))] fallback implementation. For something already
            | down to ~1 ns, both of those may well be very reasonable
            | assumptions, of course.
        
             | Validark wrote:
             | AMD machines prior to Zen 3 had a micro-coded
             | implementation of pdep and pext, so they're actually
             | relatively expensive for those earlier Zen machines (as
             | well as Bulldozer). Some people still have Ryzen 3000
             | series chips.
             | 
              | On the Intel side, pdep has been fast since its release
              | with Haswell in 2013, so pretty much everyone using
              | Intel should be fine in this regard.
        
         | kwillets wrote:
         | That's my guess as well.
         | 
         | Bitstring rank/select is a well-known problem, and the BMI and
         | non-BMI (Hacker's Delight) versions are available as a
         | reference.
        
         | stouset wrote:
         | I believe the equivalent ARM64 instructions are in SVE2 which
         | isn't yet supported on Apple's M-series chips as of M4, sadly.
        
       | ramon156 wrote:
        | Isn't the tab example wrong? I'd assume it to be
       | 
       | aa -> -> bb -> -> bb
       | 
       | It only takes up two spaces, after all
        
       | sapiogram wrote:
       | Is there a way to adjust text contrast in light mode in Zed yet?
       | The editor is unfortunately unusable for me, because of how
       | washed out the colors are.
        
         | mattbaker wrote:
          | There are themes now and a UI to install them; I also
          | didn't like the washed out colors.
        
           | onyno wrote:
            | Moved to Zed a while ago. I'm using a light theme now,
            | Xcode low key (with some settings to turn off the bold
            | fonts).
        
           | sapiogram wrote:
           | Have you found a light theme you're happy with? Which one?
        
       | Am4TIfIsER0ppos wrote:
       | > opacity: 0;
       | 
       | > filter: blur(1px);
       | 
       | Wonderful styling!
        
       | dzaima wrote:
       | SIMD can work quite well here too - with 128-bit SIMD (available
       | on both baseline x86-64 and aarch64) this can be just <=8 loop
       | iterations checking for the newline character (each iteration
       | counting the number of newline characters encountered, and a
       | lzcnt on the last iteration), and similar for characters
       | (assuming valid UTF-8, it's a single comparison to test if a byte
       | starts a new char).
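
A scalar sketch of the two per-byte tests described above (illustrative, not Zed's code); an explicit 128-bit SIMD version applies the same predicates 16 bytes per iteration, and optimizing compilers can often auto-vectorize loops of this shape:

```rust
// Count newline bytes in a chunk.
fn count_newlines(chunk: &[u8]) -> usize {
    chunk.iter().filter(|&&b| b == b'\n').count()
}

// Count characters, assuming valid UTF-8: a byte starts a new
// character unless it is a continuation byte (0b10xxxxxx).
fn count_chars(chunk: &[u8]) -> usize {
    chunk.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn main() {
    let s = "a\u{e4}\n\u{20ac}b\n"; // "aä\n€b\n": ä is 2 bytes, € is 3
    assert_eq!(count_newlines(s.as_bytes()), 2);
    assert_eq!(count_chars(s.as_bytes()), s.chars().count());
    println!("ok");
}
```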
        
       | eviks wrote:
       | Why not store just a small u8 count of newlines in a chunk
       | instead of their u128 positions and then only loop through the
       | last chunk for precision?
       | 
       | You don't need information about the position of newlines in all
       | the chunks located before the one your offset lands on
        
         | teo_zero wrote:
         | As I understand it, they do exactly what you say. TFA is about
         | optimizing the last chunk's loop.
        
           | eviks wrote:
           | Maybe I got confused, but how do they then count the newlines
           | in all the previous chunks? That information is still needed
           | to calculate the line for a specific position in the last
           | chunk
        
             | teo_zero wrote:
             | It's not you who's confused, it's that this part of the
             | process is not described. We only have some hints, here:
             | 
             | > If you called rope.offset_to_point(7234), the Rope would
             | traverse its SumTree to find the Chunk that contains offset
             | 7234 and then, on that Chunk, it would call offset_to_point
             | again.
             | 
             | And here:
             | 
             | > while the Rope can get us to the right Chunk in O(log(n))
             | 
             | I would guess that each node of the SumTree includes the
             | number of newlines of all the chunks before it.
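
That guess matches the usual shape of summed-tree designs. A minimal illustrative sketch (not Zed's actual types): each chunk carries a cached summary of byte and newline counts, so whole chunks can be skipped without rescanning them, and only the chunk containing the offset is scanned byte by byte. A real SumTree would locate that chunk in O(log n) rather than with this linear walk:

```rust
struct Summary {
    bytes: usize,
    newlines: usize,
}

struct Chunk(String);

impl Chunk {
    // In a real sum tree this summary is cached in the tree nodes.
    fn summary(&self) -> Summary {
        Summary {
            bytes: self.0.len(),
            newlines: self.0.bytes().filter(|&b| b == b'\n').count(),
        }
    }
}

// Map a global byte offset to its row (line index).
fn offset_to_row(chunks: &[Chunk], mut offset: usize) -> usize {
    let mut row = 0;
    for chunk in chunks {
        let s = chunk.summary();
        if offset >= s.bytes {
            // Skip the whole chunk using only its summary.
            offset -= s.bytes;
            row += s.newlines;
        } else {
            // Scan only the chunk the offset lands in.
            row += chunk.0.as_bytes()[..offset]
                .iter()
                .filter(|&&b| b == b'\n')
                .count();
            break;
        }
    }
    row
}

fn main() {
    let chunks = [Chunk("ab\ncd".into()), Chunk("ef\n\ngh".into())];
    // Full text "ab\ncdef\n\ngh": offset 9 points at 'g' on row 3.
    assert_eq!(offset_to_row(&chunks, 9), 3);
    assert_eq!(offset_to_row(&chunks, 0), 0);
    println!("ok");
}
```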
        
             | DylanSp wrote:
             | It's outlined in their previous post on the Rope/SumTree
             | data structure they use, which this article links to:
             | https://zed.dev/blog/zed-decoded-rope-sumtree.
        
       | benreesman wrote:
       | I really want to admire the Zed folks even though they are a new
       | faction in the editor wars where I'm an emacs guy: they take
       | mostly all the things I care about seriously.
       | 
        | They are serious about local vs. remote vs. shared. They are
       | serious about hardware acceleration because they care about users
       | who type fast and edit big files. They care about real computer
       | science on how to push the current hardware to serve the user,
       | rather than treating the hardware as a crutch to give the user a
       | slightly worse experience at a fraction of the development cost.
        | They even highlight the code snippets on their blog in
        | something very similar to the default Zed theme.
       | 
       | These are cool, serious people. If they give me emacs key
       | bindings with full paredit-everywhere, I might switch my daily
       | driver.
       | 
        | And this is about using modern SIMD-style stuff, branchless
        | stuff, Lemire stuff in a modern and relevant context. Other
        | commenters have pointed out that you can do the mask or
        | popcount better with intrinsics, and yeah, but (a) they
        | probably know that, and (b) they got the massive win; the
        | remaining factor of k on the popcount is coming.
       | 
       | Long Zed.
        
       | m1keil wrote:
       | This is all really cool, but I just want soft wrap to be fixed
        
       | veltas wrote:
       | > Turns out CPUs are pretty good with zeros and ones.
       | 
        | I've been saying this a lot recently: CPUs are powerful
        | 1-bit vector machines!
        
       | eliasson wrote:
        | Does anyone know of any good presentations about Zed's
        | architecture and internals?
        | 
        | I know that they have some recordings from their coding
        | sessions, but I am looking for something more high-level.
        
       ___________________________________________________________________
       (page generated 2024-11-21 23:02 UTC)