[HN Gopher] Text Editor Data Structures
       ___________________________________________________________________
        
       Text Editor Data Structures
        
       Author : starfreakclone
       Score  : 208 points
       Date   : 2023-06-13 16:19 UTC (6 hours ago)
        
 (HTM) web link (cdacamar.github.io)
 (TXT) w3m dump (cdacamar.github.io)
        
       | phildenhoff wrote:
       | Thanks to this article, I learned that the core data structure
       | for VSCode's text is written in TypeScript[0]. I, uh, I knew VS
       | Code was written in TypeScript but I didn't realise _all_ of it
       | was. It's crazy to think that the editor works as well as it
       | does!
       | 
       | [0]:
       | https://github.com/microsoft/vscode/tree/main/src/vs/editor/...
        
         | cmrdporcupine wrote:
         | I used a red-black tree written in JS over 15 years ago at a
         | job where I needed to use lower bound and range operations for
         | nodes in a pivot-table type dashboard for telecoms metrics. I
         | was actually blown away at the time at how it actually
         | performed rather decently, and this is back in the pre-Chrome
         | Firefox & IE6&IE7 days.
         | 
         | Now in the world of V8 and JavaScriptCore, I don't think it's
         | crazy at all. _So_ much work has gone into JS runtime
         | optimization. For heavily concurrent and memory intensive
         | workloads, I can imagine problems though.
        
         | [deleted]
        
       | egonschiele wrote:
       | I've been wondering if there's a good algorithm to handle
       | highlights in a piece of text. If I highlight some text, lets say
       | the part between the => <= here:
       | 
       | "Idioteque" is a song => by the English rock band Radiohead <=,
       | released on their fourth album, Kid A (2000).
       | 
       | If I want to then edit this text, is there an efficient algorithm
       | for figuring out the start and end index of the highlight for the
       | edited text?
        
       | tonetheman wrote:
       | [dead]
        
       | nsajko wrote:
       | This seems revisionist/ignorant. The article attributes a "piece
       | tree" data structure to VS Code developers, however the must-read
       | 1998 paper by Charles Crowley, _Data Structures for Text
       | Sequences_ , already mentions search trees as being used to
       | enhance the naive piece table in some text editors.
       | 
       | https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48...
        
         | starfreakclone wrote:
         | Thank you for the pointer! I was actually not aware of this
         | paper at all, which is why it was not included here (not sure
         | how I missed it). I'll be sure to push a revision to the blog
         | which mentions this paper.
        
           | thaliaarchi wrote:
           | You may also want to take a look at "Deletion: The curse of
           | the red-black tree" (Germane and Might 2014), which presents
           | functional deletion for red-black trees.
           | 
           | https://matt.might.net/papers/germane2014deletion.pdf
        
       | badsectoracula wrote:
       | There are tons of articles about plain text editor data
       | structures, but what about _rich_ text editor data structures?
       | Let 's say i want to implement a text editor that can have bold,
       | italic, underline, etc text but also be able to do automatic word
       | breaking, align paragraph text to left/middle/right, insert
       | images and/or other objects, have floating images and/or other
       | objects around which the other (non floating) text/images/objects
       | wrap, etc.
       | 
       | What would be the data structures for that? I can only think of
       | trying to replicate something like the HTML DOM but i have a
       | feeling something like Write for Windows 3.1 used a simpler data
       | structure.
        
         | legulere wrote:
         | It seems like also Bravo, the first WYSIWYG-Editor, abiword and
         | word used/use piece tables:
         | https://en.wikipedia.org/wiki/Piece_table#Usage
        
         | convolvatron wrote:
         | its probably instructive to look at elisp text properties. not
         | so much that its a great system - but it raises some messy
         | questions about how to operate on substrings that have
         | different sets of properties
        
         | samwillis wrote:
         | I believe most do use a DOM like structure for block level
         | elements. For inline styling however they tend to not have
         | separate nested nodes for each style (bold, italic, underline),
         | and are closer to the old HTML4 <font> tag with multiple
         | properties for each style and don't nest them. It makes for
         | much simpler code.
         | 
         | The docs for ProseMirror are a brilliant insight into how many
         | of these editors are designed. ProseMirror maintains its own
         | document model in parallel to the html DOM and rectifies one to
         | the other as either changes.
         | 
         | For realtime collaborative rich text editors take a look at
         | PeriText, they have a brilliant article explaining the CRDT
         | structures used.
         | 
         | https://prosemirror.net
         | 
         | https://www.inkandswitch.com/peritext/
        
         | coldpie wrote:
         | Yes, I think your instinct to look at win32 APIs is a good one.
         | Many structures in riched are public in headers, and of course
         | you could reference Wine's implementation[1].
         | 
         | [1]
         | https://source.winehq.org/git/wine.git/tree/HEAD:/dlls/riche...
        
           | badsectoracula wrote:
           | That is neat, thanks. I didn't check RICHED itself but when i
           | was toying around with an HTML editor a few years ago (see my
           | other post) i did check out an HTML editor API that Microsoft
           | had (i don't remember its name but it was used by, e.g., Joel
           | Spolsky's CityDesk - IIRC the control itself was documented
           | in the MSDN CDs that came with Visual Studio 6), from where i
           | got the idea of using a two node pointers and a subrange
           | index to represent a selection.
        
         | macrael wrote:
         | Checkout the cocoa class for this:
         | https://developer.apple.com/documentation/foundation/nsattri...
        
         | eludwig wrote:
         | Many older word processors (the only ones I'm familiar with)
         | used style "runs", which are simply an array of structs that
         | contain startPosition, endPosition (or length) and then a bunch
         | of style info in whatever way makes sense to them.
         | 
         | I worked on Ready,Set,Go! back in the day and also wrote my own
         | styled editor for the Mac in the 90's.
         | 
         | One interesting thing about this is that you could use a lazy
         | "adjustor" to remove duplicate or overlapping styles. No need
         | to worry about it during typing or selecting. A low priority
         | task could analyze the runs and fix them up as needed and no
         | one was the wiser.
         | 
         | IMO, the hardest part about writing a word processor back then
         | was font management. You had to construct width tables to be
         | able to compose lines properly. This generally involved a bunch
         | of poorly documented APIs and peering into opaque data
         | structures.
        
         | KerrAvon wrote:
         | It's not actually that different or complicated if you're
         | already doing proportional plaintext rendering -- you need to
         | store style attributes for a range of text, and you need to
         | support different heights per line.
         | 
         | The real complexity is rendering all of unicode properly, and
         | supporting international fonts, bidi layout, vertical text,
         | etc.
        
           | badsectoracula wrote:
           | I think unicode, fonts, etc would be an issue even with a
           | plain text editor anyway.
           | 
           | The "styling a range of text" is something i thought but you
           | still need to somehow associate the text with the range - and
           | vice versa - and this doesn't handle things like inserting
           | images and other types of objects since these aren't text.
           | 
           | You could have a document be a series of "paragraphs", each
           | being a series of "elements" with each "element" being
           | something like "text" (with a style), "image", etc. But then
           | once tables enter the picture, you need to expand paragraphs
           | to be of "table" type and each table cell is itself a self-
           | contained "series of paragraphs" - and then start thinking
           | about nested tables or images in tables!
           | 
           | Generalize that enough to avoid special cases inside special
           | cases and you end up with more of a tree-like structure
           | representing a DOM and less with a linear structure with
           | range-based styling.
           | 
           | (of course, then again, i don't remember Write for Windows
           | 3.1 having tables in the first place :-P but i'm interested
           | if there are alternative approaches anyway)
           | 
           | EDIT: one thing i forgot to mention - and why i am curious
           | about non-DOM-based approaches - is that one problem with the
           | DOM approach is the selection: with a linear/range-based
           | structure the selection is just one or two indices inside the
           | range, but with the DOM the selection can start from a node
           | with node-specific subrange (e.g. character in a text node)
           | and end with another node and both being very unrelated to
           | each other (i.e. only having some distant common ancestor and
           | not necessarily at the same level).
        
             | alpaca128 wrote:
             | Text is usually stored as tree either way in an editor,
             | using a DOM-like approach might work well on top of the
             | usual datastructures.
             | 
             | > with the DOM the selection can start from a node with
             | node-specific subrange (e.g. character in a text node) and
             | end with another node and both being very unrelated to each
             | other
             | 
             | I'd just store the range as character indices, using those
             | the right nodes in the tree can be accessed pretty quickly
             | as needed.
        
               | badsectoracula wrote:
               | I don't think character indices are enough, what if your
               | selection begins at the middle of a table cell and ends
               | on an image that is the only child of a cell in a
               | completely different table (no text involved at all,
               | except some text in the cells in between)? If you want
               | to, e.g., delete those how do you find which nodes are to
               | be deleted and updated (e.g. for merging the two tables
               | if there are cells after the one that contains the
               | image)?
        
               | alpaca128 wrote:
               | Images and table cells are just nodes within the tree
               | holding the text, assuming all styling etc is represented
               | in a plain text syntax similar to Markdown, of course.
               | Looking up the nodes from char indices is quick if each
               | node stores how many chars it contains.
               | 
               | Other approaches would probably require the selection to
               | be a tree of its own, I can't really say whether that's
               | simpler overall or not.
        
             | b33j0r wrote:
             | I might have a way to simplify this?
             | 
             | A plaintext document is an array of chars, a richtext
             | document is tree, which may or may not be well-formed.
             | 
             | Think about someone trying to _bold semi-half of_a_
             | sentence_, and how MS Frontpage was made by smart people,
             | it's just really hard.
             | 
             | The most interesting thing lately is the HTML attribute
             | `contenteditable`, and how it almost just kinda works! You
             | still have to be full-stack to make something good, but
             | that was an amazing improvement to the browser.
        
               | badsectoracula wrote:
               | Yeah, that trying to bold half of a sentence - or even
               | better, the middle of a sentence - is why i was wondering
               | about simpler alternatives to DOM. Some time ago i toyed
               | around with an HTML editor[0] (that one had to use a DOM
               | anyway, but my question is for rich text editing in
               | general - BTW the rectangles in the shot show a selection
               | that goes across nodes) and doing something like that
               | involved traversing all the nodes (going both down and up
               | the node tree, starting from the cursor's starting
               | position), finding the closest common ancestors under the
               | selection, creating "B" siblings to them and then
               | reparenting them under these new "B" nodes.
               | 
               | You can move a lot of that stuff to reusable methods but
               | personally i find the whole "editing" aspect to be more
               | involved than the "drawing" side - and also the one more
               | likely to be different than a plain text editor - when
               | dealing with DOM-like structures. Hence why i am
               | interested to see what alternatives there are.
               | 
               | [0] https://i.imgur.com/jLlyNSS.png
        
               | b33j0r wrote:
               | Nice. I've searched far and wide, and most people still
               | end up starting a whole company that only does a text
               | editor.
               | 
               | Quill works but is basically dead since 2017. Almost
               | anything foss is in a similarly ambiguous boat.
               | 
               | And then people like us, defeated, eventually buy
               | something when we actually need it.
               | 
               | There are just so many ways that users try to use it.
               | It's a tough problem!
        
           | z3t4 wrote:
           | You don't need any fancy data structures. 95% of the
           | performance goes into glyph rendering. And with Unicode the
           | performance gain from monospace fonts goes out the window as
           | some Unicode characters are very large, and not only that
           | they also take up many bytes. So one Unicode character can be
           | up to 5 bytes long and take up the same canvas space as 3
           | characters. You also need to read ahead as there are
           | combination characters, for example a smiley combined with
           | the color brow becomes a brown smiley. I've blogged about
           | implement support for Unicode in an editor here: https://xn--
           | zta-qla.com//en/blog/editor10.htm
        
             | rcoveson wrote:
             | > So one Unicode character can be up to 5 bytes long and
             | take up the same canvas space as 3 characters.
             | 
             | 5 bytes? In what encoding?
        
               | WorldMaker wrote:
               | Some "single character" emoji easily exceed 5 bytes in
               | _all_ encodings. You may think ZWJ sequences are
               | cheating, but emoji isn 't the only language encoded in
               | Unicode with complex ZWJ sequences.
        
               | rcoveson wrote:
               | My question is about the phrase _up to 5_. What in
               | Unicode is up to 5? Codepoints are up to 4 in all the
               | encodings I know. ZWJ sequences may as well be
               | arbitrarily long. What is  "up to 5"?
        
               | cstrahan wrote:
               | The original quote for reference:
               | 
               | > So one Unicode character can be up to 5 bytes long and
               | take up the same canvas space as 3 characters.
               | 
               | FWIW, I didn't read that as suggesting an upper bound of
               | 5 bytes, but rather as an example using arbitrary
               | numbers: N bytes of code units could, depending on the
               | font providing the glyph(s) for the respective
               | grapheme(s), could be rendered at M times the size of,
               | say, the letter A, where N != M -- despite the font
               | otherwise being monospaced. Which is just another way of
               | saying that you _must_ consult the font for the character
               | widths involved.
               | 
               | I think you're reading that quote as an assertion that:
               | For any grapheme G, G can be encoded in at most 5 bytes.
               | 
               | While what I think was being said was:
               | There exists a grapheme G, where G is encoded in 5 bytes,
               | and the respective glyph happens to be displayed at 3
               | times a single character (e.g. the letter A), despite the
               | font otherwise being monospaced. Therefore you *must*
               | consult the font for each glyph to correctly determine
               | character widths.
        
               | rcoveson wrote:
               | Continue to the next sentence of context:
               | 
               | > You _also_ need to read ahead as there are combination
               | characters, for example a smiley combined with the color
               | brow becomes a brown smiley.
               | 
               | Emphasis mine. Clearly combination characters are being
               | treated separately.
               | 
               | Frankly I think it's crazy to read "up to 5 bytes" and
               | not think that it suggest an upper bound. I think you're
               | reaching for a highly questionably interpretation of a
               | totally unambiguous clause. If the author meant to
               | express what you're saying, they would certainly have
               | written: "Some Unicode characters are 5 bytes long and
               | take up the same canvas space as 3 characters". Which
               | would _still_ look incorrect if they followed it with the
               | sentence  "You _also_ need to read ahead as there are
               | combination characters... ".
               | 
               | It is far more likely that the author is simply mistaken
               | and should have said 4 bytes, and perhaps used the word
               | "codepoint" instead of "character" in the original
               | sentence. That's a perfectly understandable technical
               | error, while the reinterpretation you're putting together
               | would imply an error of colloquial language.
        
               | WorldMaker wrote:
               | I read this as colloquial English for "around" or
               | "approximately". Not setting a bound limit, but setting
               | an example size.
        
               | rcoveson wrote:
               | But you also read it as referring to ZWJ sequences? So
               | the author has picked a number that is actually _below
               | average_ and they 've worded it as _up to..._?
               | 
               | Saying a ZWJ sequence can be "up to 5 bytes" is like
               | saying "the current generation of Intel processors run at
               | clock speeds of up to 2 GHz".
               | 
               | If they were referring to ZWJ sequences (I don't think
               | they were; I think they were just misremembering the
               | maximum encoded length of a codepoint) and they had said
               | "up to 35 bytes", then I might agree with you. It's still
               | not technically accurate, but it's a reasonable
               | colloquial usage, like saying "human males can grow up to
               | seven feet tall".
        
               | WorldMaker wrote:
               | I think you are trying to read something that wasn't
               | meant to be technical documentation as if it was trying
               | to be exact technical specifications. I'm not the
               | original author, so I don't have reason to litigate this
               | any further, and I'm not sure what you are arguing about
               | at this point.
        
               | rcoveson wrote:
               | You've now replied to me up to 2 times.
        
               | jychang wrote:
               | Emojis with skin color, mostly
        
               | rcoveson wrote:
               | No, that's a ZWJ sequence. Those can be arbitrarily long.
               | Doesn't explain where "5 bytes" comes from.
        
               | lelanthran wrote:
               | > 5 bytes? In what encoding?
               | 
               | I believe UTF-8 reserved up to six bytes for a single
               | character.
        
               | rcoveson wrote:
               | Yes, UTF-8 was up to 6 bytes per character early on. Some
               | broken implementations like MySQL's limit it to up to 3
               | bytes per character. The actual number is 4.
               | 
               | So what is "up to 5"?
        
           | starfreakclone wrote:
           | Indeed, you could even use a piece tree just like in the blog
           | but you store additional information in each piece which tell
           | the renderer how to layout the associated text. My
           | understanding is that most rich text editors represent the
           | text as a node-based tree anyway.
        
             | WorldMaker wrote:
             | Some early rich text editors never used a tree
             | representation for formatting represented the formatting as
             | "control character sequences". This was part of why some
             | people loved WordPerfect so much because you could toggle a
             | view of all the control characters and just
             | delete/copy/paste them like any other text in the same
             | document.
             | 
             | It's basically how the classic RTF format [1] works, and
             | things like VT100/ANSI escape codes in terminals. It's kind
             | of like the difference between imperative code and
             | declarative code: "this character sequence means toggle the
             | state of bold" versus "this node of characters is bold".
             | 
             | [1] https://docs.fileformat.com/word-processing/rtf/
        
         | kjs3 wrote:
         | Perhaps look at how (Open,Libre)Office does it to support ODT
         | format (e.g. OASIS Open Document Format)?
        
         | thangalin wrote:
         | RichTeXtFX[1] has these abilities[2] along with demo code[3]
         | that provides an entry point for sleuthing. My text editor,
         | KeenWrite[4], uses RichTeXtFX, but takes a different approach
         | due to its usage of inline variable references. Rather than
         | edit the document in a WYSIWYG-like fashion, users edit the
         | Markdown document as plain text then export as PDF by selecting
         | a theme to apply for typesetting. This cleanly separates
         | content from presentation. The "themes" video tutorial
         | demonstrates how theme selection works.[5]
         | 
         | [1]:
         | https://github.com/FXMisc/RichTextFX/tree/master/richtextfx/...
         | 
         | [2]: https://gluonhq.com/presenting-a-new-richtextarea-control/
         | 
         | [3]: https://github.com/gluonhq/rich-text-
         | area/blob/master/sample...
         | 
         | [4]: https://github.com/DaveJarvis/keenwrite
         | 
         | [5]: https://www.youtube.com/watch?v=3QpX70O5S30&list=PLB-
         | WIt1cZY...
        
       | krackers wrote:
       | You might also like reading
       | https://news.ycombinator.com/item?id=16385440 (A Brief Glance at
       | How Various Text Editors Manage Their Textual Data) and the
       | comments therein.
        
       | hgs3 wrote:
       | > Create debugging utilities for data structures very early on.
       | 
       | Yes, this isn't encouraged enough. I often serialize my data
       | structures as either JSON or Graphviz DOT files for
       | visualization. It helps save an immense amount of time. You can
       | also use the generated files for regression testing, i.e. diff
       | the actual output with the serialized output and if they're
       | different, then a bug was introduced.
        
       | bjoli wrote:
       | I would probably reach for an RRB tree which would, apart from
       | being very very neat, give me free undo/redo.
        
       | vbezhenar wrote:
       | Can someone point me to structures/algorithms to use for text
       | editor which could support files of unlimited lengths (including
       | lines of unlimited length) without loading those fully in the
       | buffer?
       | 
       | I miss that editor so much, that I'm considering to write one
       | some day, but I have no idea how to do so. I can invent things
       | myself, but I guess those things were invented already back in
       | the days computers were different.
        
       | mannyv wrote:
       | Apple's MPW was Apple's programming environment for a long time.
       | 
       | It's been a while, but I believe that it represented everything
       | in a tree of n-character chunks (n = 6?). It was probably the
       | first editor that I used that could open files of pretty much any
       | length.
        
         | eschneider wrote:
         | I do miss the 'select and execute', and 'everything is a shell'
         | model of MPW. Definitely one of my favorite development
         | environments.
        
           | tomjakubowski wrote:
           | sounds like plan9's acme
        
           | scelerat wrote:
           | elsewhere on HN today is a short article on BBedit, which has
           | a really nice shell worksheet interface similar to the old
           | MPW. Select the text you want to execute, hit control-return,
           | and the shell output will appear below the selected text.
        
             | [deleted]
        
       | teddyh wrote:
       | I would instead suggest _The Craft of Text Editing_
       | <https://www.finseth.com/craft/>
        
       | toppy wrote:
       | There is also a great documentation on the topic (and CRDT) by
       | /u/raphlinus: https://xi-editor.io/docs.html
        
       ___________________________________________________________________
       (page generated 2023-06-13 23:01 UTC)