[HN Gopher] Bit-twiddling optimizations in Zed's Rope
___________________________________________________________________
Bit-twiddling optimizations in Zed's Rope
Author : misternugget
Score : 149 points
Date : 2024-11-18 16:39 UTC (3 days ago)
(HTM) web link (zed.dev)
(TXT) w3m dump (zed.dev)
| rgrmrts wrote:
| I really like all the blog posts and videos the Zed team has put
| out, thank you if you're reading this!
|
| Unrelated to this specific post, I'm such a fan of Zed. It's the
| first feature-complete text editor in recent memory that I've
| truly enjoyed using (i.e. it stays out of the way, is really
| fast, feels well engineered). I'm coming to Zed after years of
| Emacs which I still have love for but no longer feels like a
| competitive piece of software (it does not take full advantage of
| how good computers are today, e.g. gpu rendering or multicore). I
| really hope Zed stays a fast and lightweight text editor instead
| of becoming some bloated growth-at-all-cost VC ware (not that
| they've exhibited any signs of that happening). I'd also happily
| pay for Zed without a subscription-based thing for access to LLM
| features (which I do not use).
| seanw444 wrote:
| > it does not take full advantage of how good computers are
| today, e.g. gpu rendering or multicore
|
| Why does Emacs need that though? I hear people say this all the
| time and I don't get it. Multicore kind of works against the
| structure that Emacs touts as a feature. And GPU rendering? In
| many applications, I totally agree with these complaints. But
| it's a text editor.
|
| I tried Zed myself, and it's good. But it doesn't dethrone
| Emacs (for me personally).
| PittleyDunkin wrote:
| > Multicore kind of works against the structure that Emacs
| touts as a feature.
|
| I have consistent issues with emacs locking up when executing
| network requests. I'm sure there's a specific bug that could
| be hunted down and addressed, but this sort of thing
| shouldn't happen much in an editor that's multicore by
| default.
|
| I'm not trying to dismiss emacs' reasoning, of course, but I
| can understand being disgruntled with it.
|
| The actual rendering I've been quite pleased by, though!
| rgrmrts wrote:
| Yeah this is one reason, or Emacs freezing for up to a
| minute when updating packages. Also when using an LSP I
| notice latency.
|
| I use Emacs GUI (outside of the terminal) and comparing
| rendering performance to something like Zed or Sublime, the difference is
| definitely noticeable. It's great that Emacs is so resource
| efficient but sometimes I wish it used more of my beefy
| computer(s).
|
| Like I said I still love Emacs and it's okay for it to make
| a different set of trade-offs. I honestly didn't think I'd
| ever switch editors but here we are!
| seanw444 wrote:
| That's fair I guess. In the case of IO that can be an
| issue. When I hear multicore, I assume we're talking
| about parallelism, not concurrency.
|
| As for LSP, other than the Nim langserver, I've been
| quite satisfied with Eglot's performance. I'm curious
| what your setup is like. To be fair, I'm running a highly
| customized Doom Emacs config, so it's possible I
| inherited some non-vanilla optimizations I'm not aware
| of.
| PittleyDunkin wrote:
| > When I hear multicore, I assume we're talking about
| parallelism, not concurrency.
|
| Parallelism trivially enables concurrency. The lack of
| parallelism magnifies the issues emacs has with
| concurrency.
| karthink wrote:
| Removing the interpreter lock for a few specialized tasks
| (without sweeping runtime changes to Emacs) would be
| enough to fix most of these issues -- parsing JSON from
| process output into lisp data in a background thread is
| one candidate. [1]
|
| Installing packages does not need to block either, there
| is no architectural limitation here. The Elpaca package
| manager for Emacs provides async, parallel package
| updates. Loading packages into the Lisp image will block
| though, there's no way around that.
|
| The other big source of input lag is garbage collection,
| and there are some ongoing efforts to use the MPS library
| in Emacs for a copying, concurrent GC. This is a big
| change and I don't know if this experiment will go
| anywhere, but Eli Zaretskii and co are trying.
|
| [1]: https://github.com/emacs-lsp/emacs
| erk__ wrote:
| Someone made an alternative way of optimizing LSP by
| converting the JSON to elisp bytecode on the fly:
| https://github.com/blahgeek/emacs-lsp-booster
| DrBenCarson wrote:
| "Need" is strong but using GPU rendering is definitely better
| than CPU rendering and makes things feel very snappy
|
| Most new TTY projects use GPUs for that reason
| vinkelhake wrote:
| > But it's a text editor.
|
| Long time emacs user here (+20 years, yikes). I've used it on
| all kinds of computers during this time. Even a relatively
| modest computer from 2024 is an absolute beast compared to
| something from the year 2000.
|
| With that said, there are text editing operations that can
| cause it to grind to a complete halt, especially working with
| large files (or very long lines). And it _shouldn't_, you
| know? I think emacs users sort of internalize which
| operations they should avoid. It's kind of ridiculous to have
| to do that in a text editor with the massive amounts of
| compute that our computers have today.
| donio wrote:
| > (or very long lines)
|
| Long line handling has greatly improved in emacs-29. Multi-
| megabyte lines are not a problem anymore.
| canucker2016 wrote:
| from Emacs 29.1 NEWS file,
| https://raw.githubusercontent.com/emacs-mirror/emacs/refs/he...
|
|     ** Emacs is now capable of editing files with very long lines.
|     The display of long lines has been optimized, and Emacs should
|     no longer choke when a buffer on display contains long lines.
|     The variable 'long-line-threshold' controls whether and when
|     these display optimizations are in effect.
|
|     A companion variable 'large-hscroll-threshold' controls when
|     another set of display optimizations are in effect, which are
|     aimed specifically at speeding up display of long lines that
|     are truncated on display.
| kevin_thibedeau wrote:
| GPU rendering simplifies smooth text scrolling, which used to
| be a thing on some dumb terminals and microcomputers like the
| Amiga that supported it in hardware. Most terminal emulators
| lock character cells to a fixed grid, so we miss out on such
| niceties.
| ku1ik wrote:
| AMIIIIGAAAAA!
| dmitrygr wrote:
| > // Parallel bit count intermediates
| > let a = v - ((v >> 1) & (u64::MAX / 3));
| > let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
| > let c = (b + (b >> 4)) & (u64::MAX / 0x11);
| > let d = (c + (c >> 8)) & (u64::MAX / 0x101);
|
| That "parallel bit count" is almost certainly slower than using
| two POPCNT instructions on a modern cpu. Should just call
| __builtin_popcount() and let the compiler do it the most optimal
| way. Luckily, people do this sort of thing so often that many
| modern compilers will try (and often succeed) to detect you
| trying this insanity and convert it to a POPCOUNT (or a pair of
| POPCOUNTs as the case may be here)
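|
| For reference, a rough sketch (not the article's code) of the
| classic SWAR count finished to a full 64-bit result, next to the
| intrinsic-backed count_ones() suggested above:
|
|     // Hacker's Delight-style SWAR popcount; count_ones() is what
|     // the compiler lowers to POPCNT when the target supports it.
|     fn swar_popcount(v: u64) -> u32 {
|         let a = v - ((v >> 1) & (u64::MAX / 3));                    // 2-bit sums
|         let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5)); // 4-bit sums
|         let c = (b + (b >> 4)) & (u64::MAX / 0x11);                 // per-byte sums
|         ((c.wrapping_mul(u64::MAX / 0xff)) >> 56) as u32            // gather all bytes
|     }
|
|     fn main() {
|         for v in [0u64, 1, 42, u64::MAX, 0xdead_beef_cafe_babe] {
|             assert_eq!(swar_popcount(v), v.count_ones());
|         }
|     }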
| akoboldfrying wrote:
| Which compilers support __builtin_popcount()? From memory, it's
| a gcc extension. If the compiler selects a CPU POPCOUNT
| instruction for it, are you sure it will work on all machines
| that you want to run it on?
|
| The above code is completely source- and binary-portable and
| reasonably fast -- certainly faster than naively looping
| through the bits, and within a small constant factor of a CPU
| POPCOUNT instruction.
| woadwarrior01 wrote:
| > Which compilers support __builtin_popcount()?
|
| Clang supports __builtin_popcount() too. And MSVC has
| __popcnt().
| dmitrygr wrote:
| Your compiler will know _the best_ way to popcount; that is
| the point of that builtin. It'll use the best method -
| sometimes this one. GCC does this, MSVC does this, Clang does
| this, and I think even Rust has some way to do it (EDIT: it
| does: count_ones()). On archs which lack POPCNT, it will use
| this method or another, based on knowing the target. On x86
| this approach is OK as is. On arm64, for example, it will be
| suboptimal due to all the literals needed. On armv6m, this
| method is bad and table lookups are faster.
| SkiFire13 wrote:
| Note that by default rustc targets x86-64-v1 when compiling
| for x86-64, and that lacks the popcount instruction. You
| need to change the target_cpu to at least x86-64-v2 or
| enable the popcount target_feature. This means that even if
| your cpu is relatively new and you intend to run your code
| on relatively new cpus, rustc will still generate older and
| slower code for count_ones() using bitshifts and masks.
| That said, I don't see the point in writing them manually
| if the compiler can generate them for you.
| vlovich123 wrote:
| It's not unreasonable to think that Rust will change the
| minimum version, and you should always override the target
| CPU anyway for C++-like toolchains when building
| production binaries (`-Ctarget-cpu` for Rust, `-march=`
| for clang/gcc).
| Findecanor wrote:
| I once wrote that algorithm, divided into single lines,
| intending each line to be a single 64-bit ARM instruction.
| The compiler did idiom detection, transforming it to
| "builtin popcnt" and (because 64-bit ARMv8.0 lacks a POPCNT
| instruction) back to the same algorithm, except that the
| emitted code was one instruction larger than mine.
|
| 64-bit ARM actually has a very peculiar encoding of
| immediates to arithmetic instructions. It supports _only_
| recurring bit patterns such as used by this algorithm. For
| example "add x2, x3, #3333333333333333" is encoded as one
| four-byte instruction.
| stassats wrote:
| > because 64-bit ARMv8.0 lacks a POPCNT instruction
|
| It does have this:
| https://developer.arm.com/documentation/ddi0596/2021-09/SIMD...
|
| And GCC happily uses it https://godbolt.org/z/dTW46f9Kf
| jandrewrogers wrote:
| Most vaguely recent compilers will convert naively looping
| through bits into a native POPCOUNT instruction. The parallel
| bit count algorithm was not reliably detected until more
| recently and therefore would sometimes produce unoptimized
| code, though current versions of gcc/clang/msvc can all
| detect it now.
|
| Also, pretty much every compiler for a very long time has
| supported __builtin_popcount or equivalent.
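|
| For illustration, a rough sketch (in Rust, the article's
| language) of the kind of naive bit-counting loop described
| above, which compilers can recognize and replace with a native
| POPCOUNT where available:
|
|     fn popcount_loop(mut v: u64) -> u32 {
|         let mut count = 0;
|         while v != 0 {
|             v &= v - 1; // clear the lowest set bit
|             count += 1;
|         }
|         count
|     }
|
|     fn main() {
|         assert_eq!(popcount_loop(0b1011_0010), 4);
|         assert_eq!(popcount_loop(u64::MAX), 64);
|     }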
| aseipp wrote:
| Everything supports __builtin_popcount or some variant these
| days (__popcnt for MSVC). It's a complete non-issue, really.
|
| And the compiler is not required to lower it to a single
| instruction. It will if the target architecture is specified
| appropriately, but there's nothing that says it has to
| explode if it can't. In fact, by doing it this way, the
| compiler is actually more free to generate code in a way
| that's optimal for the architecture in all cases, because all
| the implementation details are hidden e.g. loads of large
| constants may be avoided if the compiler is allowed to choose
| the exact implementation, while using the portable version
| may tie its hands more depending on how it's feeling on that
| day. Here's __builtin_popcount working just fine while
| targeting a ~20yo architecture without native support for
| SSE4.2; it can generate this code knowing what the proper
| instructions and schedules are:
| https://godbolt.org/z/ra7n5T5f3
|
| The moral here is that the primitives are there for you to
| use. Just use them and save yourself and your would-be code
| reviewer's time.
| SuchAnonMuchWow wrote:
| In addition to the other comments, the iso C23 standard added
| the <stdbit.h> header to the standard library with a
| stdc_count_ones() function, so compiler support will become
| standard.
| Validark wrote:
| I guess it depends how many different compilers you are
| targeting but generally speaking, compilers look for trivial
| implementations of popcount and can replace it with the
| single instruction.
|
| However, as someone already mentioned, this is not equivalent
| to a pair of popcounts.
| 3836293648 wrote:
| The only one that exists. This is Rust, and rustc supports
| popcnt on all platforms that have it and emulates it on those
| that don't.
| ot wrote:
| What that code does is a per-byte-pair popcount, which is not
| what the POPCNT instruction does (it computes the popcount for
| the whole word).
|
| On processors with BMI2 the whole algorithm reduces to a PDEP,
| as mentioned in another comment, but if you don't have that,
| this is pretty much the best you can do (unless you use lookup
| tables, but those have pros and cons).
| camel-cdr wrote:
| nth_set_bit_u64: wouldn't that be __builtin_ctzll(_pdep_u64(1<<n,
| v)) with BMI2?
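|
| A rough Rust equivalent of that expression, assuming x86-64 with
| BMI2 (nth_set_bit_u64 is the name used in the article; this
| sketch is not its implementation):
|
|     #[cfg(target_arch = "x86_64")]
|     #[target_feature(enable = "bmi2")]
|     unsafe fn nth_set_bit_pdep(v: u64, n: u32) -> u32 {
|         // Deposit a lone 1 at the position of the n-th set bit of
|         // v (n zero-based), then read off its index.
|         core::arch::x86_64::_pdep_u64(1u64 << n, v).trailing_zeros()
|     }
|
| Calling it safely still needs either a build with bmi2 enabled or
| a runtime is_x86_feature_detected!("bmi2") check.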
| SkiFire13 wrote:
| That's assuming you're ok with your program not running on some
| older cpus.
| zamadatix wrote:
| That, and that you're not willing to entertain keeping the
| manual version as a #[cfg(not(target_feature = "bmi2"))]
| fallback implementation. For something already down to ~1 ns,
| both of those may well be very reasonable assumptions, of
| course.
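|
| A rough sketch of that compile-time split (names illustrative,
| not Zed's):
|
|     #[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
|     fn nth_set_bit(v: u64, n: u32) -> u32 {
|         // SAFETY: this arm is only compiled when bmi2 is enabled
|         // at build time (e.g. -C target-cpu=x86-64-v3).
|         unsafe { core::arch::x86_64::_pdep_u64(1u64 << n, v).trailing_zeros() }
|     }
|
|     #[cfg(not(all(target_arch = "x86_64", target_feature = "bmi2")))]
|     fn nth_set_bit(v: u64, n: u32) -> u32 {
|         // Portable fallback: clear the n lowest set bits, then the
|         // position of the next remaining set bit is the answer.
|         let mut v = v;
|         for _ in 0..n {
|             v &= v.wrapping_sub(1);
|         }
|         v.trailing_zeros()
|     }
|
|     fn main() {
|         // 0b1011_0100 has set bits at positions 2, 4, 5 and 7.
|         assert_eq!(nth_set_bit(0b1011_0100, 0), 2);
|         assert_eq!(nth_set_bit(0b1011_0100, 3), 7);
|     }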
| Validark wrote:
| AMD machines prior to Zen 3 had a micro-coded
| implementation of pdep and pext, so they're actually
| relatively expensive for those earlier Zen machines (as
| well as Bulldozer). Some people still have Ryzen 3000
| series chips.
|
| On the Intel side, pdep has been fast since its release
| with Haswell in 2013, so pretty much everyone using
| Intel should be fine in this regard.
| kwillets wrote:
| That's my guess as well.
|
| Bitstring rank/select is a well-known problem, and the BMI and
| non-BMI (Hacker's Delight) versions are available as a
| reference.
| stouset wrote:
| I believe the equivalent ARM64 instructions are in SVE2 which
| isn't yet supported on Apple's M-series chips as of M4, sadly.
| ramon156 wrote:
| Isn't the tab example wrong? I'd assume it to be
|
| aa -> -> bb -> -> bb
|
| It only takes up two spaces, after all
| sapiogram wrote:
| Is there a way to adjust text contrast in light mode in Zed yet?
| The editor is unfortunately unusable for me, because of how
| washed out the colors are.
| mattbaker wrote:
| There are themes now and a UI to install them; I also didn't
| like the washed-out colors.
| onyno wrote:
| Moved to Zed a while ago. I'm using a light theme - Xcode low
| key now (with some settings to turn off the bold fonts).
| sapiogram wrote:
| Have you found a light theme you're happy with? Which one?
| Am4TIfIsER0ppos wrote:
| > opacity: 0;
|
| > filter: blur(1px);
|
| Wonderful styling!
| dzaima wrote:
| SIMD can work quite well here too - with 128-bit SIMD (available
| on both baseline x86-64 and aarch64) this can be just <=8 loop
| iterations checking for the newline character (each iteration
| counting the number of newline characters encountered, and a
| lzcnt on the last iteration), and similar for characters
| (assuming valid UTF-8, it's a single comparison to test if a byte
| starts a new char).
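|
| A rough sketch (not Zed's code) of the SSE2 flavour of this:
| compare 16 bytes at a time against b'\n' and add up the set bits
| of the resulting mask, with a scalar loop for the tail:
|
|     #[cfg(target_arch = "x86_64")]
|     fn count_newlines_sse2(chunk: &[u8]) -> u32 {
|         use core::arch::x86_64::*;
|         // SSE2 is part of the x86-64 baseline, so no feature check needed.
|         let needle = unsafe { _mm_set1_epi8(b'\n' as i8) };
|         let mut count = 0u32;
|         let mut blocks = chunk.chunks_exact(16);
|         for block in blocks.by_ref() {
|             let bytes = unsafe { _mm_loadu_si128(block.as_ptr() as *const __m128i) };
|             let mask = unsafe { _mm_movemask_epi8(_mm_cmpeq_epi8(bytes, needle)) };
|             count += (mask as u32).count_ones();
|         }
|         count + blocks.remainder().iter().filter(|&&b| b == b'\n').count() as u32
|     }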
| eviks wrote:
| Why not store just a small u8 count of newlines in a chunk
| instead of their u128 positions and then only loop through the
| last chunk for precision?
|
| You don't need information about the position of newlines in all
| the chunks located before the one your offset lands on
| teo_zero wrote:
| As I understand it, they do exactly what you say. TFA is about
| optimizing the last chunk's loop.
| eviks wrote:
| Maybe I got confused, but how do they then count the newlines
| in all the previous chunks? That information is still needed
| to calculate the line for a specific position in the last
| chunk
| teo_zero wrote:
| It's not you who's confused, it's that this part of the
| process is not described. We only have some hints, here:
|
| > If you called rope.offset_to_point(7234), the Rope would
| traverse its SumTree to find the Chunk that contains offset
| 7234 and then, on that Chunk, it would call offset_to_point
| again.
|
| And here:
|
| > while the Rope can get us to the right Chunk in O(log(n))
|
| I would guess that each node of the SumTree includes the
| number of newlines of all the chunks before it.
| DylanSp wrote:
| It's outlined in their previous post on the Rope/SumTree
| data structure they use, which this article links to:
| https://zed.dev/blog/zed-decoded-rope-sumtree.
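|
| As a rough sketch of that idea (illustrative types, not Zed's
| actual SumTree code): each chunk carries a small additive
| summary, and walking down the tree accumulates the summaries of
| everything left of the target, so finding the chunk also tells
| you how many newlines precede it.
|
|     #[derive(Clone, Copy, Default)]
|     struct ChunkSummary {
|         bytes: usize,
|         newlines: usize,
|     }
|
|     impl ChunkSummary {
|         fn of(text: &str) -> Self {
|             ChunkSummary {
|                 bytes: text.len(),
|                 newlines: text.bytes().filter(|&b| b == b'\n').count(),
|             }
|         }
|
|         // Summaries add, which is what lets an O(log n) descent
|         // accumulate counts for all chunks left of the target offset.
|         fn add(self, other: Self) -> Self {
|             ChunkSummary {
|                 bytes: self.bytes + other.bytes,
|                 newlines: self.newlines + other.newlines,
|             }
|         }
|     }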
| benreesman wrote:
| I really want to admire the Zed folks even though they are a new
| faction in the editor wars where I'm an emacs guy: they take
| mostly all the things I care about seriously.
|
| They are serious about local vs. remote vs. shared. They are
| serious about hardware acceleration because they care about users
| who type fast and edit big files. They care about real computer
| science on how to push the current hardware to serve the user,
| rather than treating the hardware as a crutch to give the user a
| slightly worse experience at a fraction of the development cost.
| They care about highlighting the code snippets on their blog in
| something very similar to the default Zed theme.
|
| These are cool, serious people. If they give me emacs key
| bindings with full paredit-everywhere, I might switch my daily
| driver.
|
| And this is about using modern SIMD-style stuff, branchless
| stuff, Lemire stuff in a modern and relevant context. Other
| commenters have pointed out that you can do the mask or popcount
| better with intrinsics, and yeah, but (a) they probably know
| that, and (b) they got the massive win; the remaining factor of
| K on the popcount is coming.
|
| Long Zed.
| m1keil wrote:
| This is all really cool, but I just want soft wrap to be fixed
| veltas wrote:
| > Turns out CPUs are pretty good with zeros and ones.
|
| I've been saying this a lot recently: CPUs are powerful
| 1-bit vector machines!
| eliasson wrote:
| Does anyone know of any good presentations about the Zed
| architecture and internals?
|
| I know that they have some recordings from their coding sessions,
| but I am looking for something more comprehensive.
___________________________________________________________________
(page generated 2024-11-21 23:02 UTC)