[HN Gopher] Text Editor Data Structures
___________________________________________________________________
Text Editor Data Structures
Author : starfreakclone
Score : 208 points
Date : 2023-06-13 16:19 UTC (6 hours ago)
(HTM) web link (cdacamar.github.io)
(TXT) w3m dump (cdacamar.github.io)
| phildenhoff wrote:
| Thanks to this article, I learned that the core data structure
| for VSCode's text is written in TypeScript[0]. I, uh, I knew VS
| Code was written in TypeScript but I didn't realise _all_ of it
| was. It's crazy to think that the editor works as well as it
| does!
|
| [0]:
| https://github.com/microsoft/vscode/tree/main/src/vs/editor/...
| cmrdporcupine wrote:
| I used a red-black tree written in JS over 15 years ago at a
| job where I needed to use lower bound and range operations for
| nodes in a pivot-table type dashboard for telecoms metrics. I
| was actually blown away at the time at how it actually
| performed rather decently, and this is back in the pre-Chrome
| Firefox & IE6&IE7 days.
|
| Now in the world of V8 and JavaScriptCore, I don't think it's
| crazy at all. _So_ much work has gone into JS runtime
| optimization. For heavily concurrent and memory intensive
| workloads, I can imagine problems though.
| [deleted]
| egonschiele wrote:
| I've been wondering if there's a good algorithm to handle
| highlights in a piece of text. If I highlight some text, lets say
| the part between the => <= here:
|
| "Idioteque" is a song => by the English rock band Radiohead <=,
| released on their fourth album, Kid A (2000).
|
| If I want to then edit this text, is there an efficient algorithm
| for figuring out the start and end index of the highlight for the
| edited text?
| tonetheman wrote:
| [dead]
| nsajko wrote:
| This seems revisionist/ignorant. The article attributes a "piece
| tree" data structure to VS Code developers, however the must-read
| 1998 paper by Charles Crowley, _Data Structures for Text
| Sequences_ , already mentions search trees as being used to
| enhance the naive piece table in some text editors.
|
| https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48...
| starfreakclone wrote:
| Thank you for the pointer! I was actually not aware of this
| paper at all, which is why it was not included here (not sure
| how I missed it). I'll be sure to push a revision to the blog
| which mentions this paper.
| thaliaarchi wrote:
| You may also want to take a look at "Deletion: The curse of
| the red-black tree" (Germane and Might 2014), which presents
| functional deletion for red-black trees.
|
| https://matt.might.net/papers/germane2014deletion.pdf
| badsectoracula wrote:
| There are tons of articles about plain text editor data
| structures, but what about _rich_ text editor data structures?
| Let 's say i want to implement a text editor that can have bold,
| italic, underline, etc text but also be able to do automatic word
| breaking, align paragraph text to left/middle/right, insert
| images and/or other objects, have floating images and/or other
| objects around which the other (non floating) text/images/objects
| wrap, etc.
|
| What would be the data structures for that? I can only think of
| trying to replicate something like the HTML DOM but i have a
| feeling something like Write for Windows 3.1 used a simpler data
| structure.
| legulere wrote:
| It seems like also Bravo, the first WYSIWYG-Editor, abiword and
| word used/use piece tables:
| https://en.wikipedia.org/wiki/Piece_table#Usage
| convolvatron wrote:
| its probably instructive to look at elisp text properties. not
| so much that its a great system - but it raises some messy
| questions about how to operate on substrings that have
| different sets of properties
| samwillis wrote:
| I believe most do use a DOM like structure for block level
| elements. For inline styling however they tend to not have
| separate nested nodes for each style (bold, italic, underline),
| and are closer to the old HTML4 <font> tag with multiple
| properties for each style and don't nest them. It makes for
| much simpler code.
|
| The docs for ProseMirror are a brilliant insight into how many
| of these editors are designed. ProseMirror maintains its own
| document model in parallel to the html DOM and rectifies one to
| the other as either changes.
|
| For realtime collaborative rich text editors take a look at
| PeriText, they have a brilliant article explaining the CRDT
| structures used.
|
| https://prosemirror.net
|
| https://www.inkandswitch.com/peritext/
| coldpie wrote:
| Yes, I think your instinct to look at win32 APIs is a good one.
| Many structures in riched are public in headers, and of course
| you could reference Wine's implementation[1].
|
| [1]
| https://source.winehq.org/git/wine.git/tree/HEAD:/dlls/riche...
| badsectoracula wrote:
| That is neat, thanks. I didn't check RICHED itself but when i
| was toying around with an HTML editor a few years ago (see my
| other post) i did check out an HTML editor API that Microsoft
| had (i don't remember its name but it was used by, e.g., Joel
| Spolsky's CityDesk - IIRC the control itself was documented
| in the MSDN CDs that came with Visual Studio 6), from where i
| got the idea of using a two node pointers and a subrange
| index to represent a selection.
| macrael wrote:
| Checkout the cocoa class for this:
| https://developer.apple.com/documentation/foundation/nsattri...
| eludwig wrote:
| Many older word processors (the only ones I'm familiar with)
| used style "runs", which are simply an array of structs that
| contain startPosition, endPosition (or length) and then a bunch
| of style info in whatever way makes sense to them.
|
| I worked on Ready,Set,Go! back in the day and also wrote my own
| styled editor for the Mac in the 90's.
|
| One interesting thing about this is that you could use a lazy
| "adjustor" to remove duplicate or overlapping styles. No need
| to worry about it during typing or selecting. A low priority
| task could analyze the runs and fix them up as needed and no
| one was the wiser.
|
| IMO, the hardest part about writing a word processor back then
| was font management. You had to construct width tables to be
| able to compose lines properly. This generally involved a bunch
| of poorly documented APIs and peering into opaque data
| structures.
| KerrAvon wrote:
| It's not actually that different or complicated if you're
| already doing proportional plaintext rendering -- you need to
| store style attributes for a range of text, and you need to
| support different heights per line.
|
| The real complexity is rendering all of unicode properly, and
| supporting international fonts, bidi layout, vertical text,
| etc.
| badsectoracula wrote:
| I think unicode, fonts, etc would be an issue even with a
| plain text editor anyway.
|
| The "styling a range of text" is something i thought but you
| still need to somehow associate the text with the range - and
| vice versa - and this doesn't handle things like inserting
| images and other types of objects since these aren't text.
|
| You could have a document be a series of "paragraphs", each
| being a series of "elements" with each "element" being
| something like "text" (with a style), "image", etc. But then
| once tables enter the picture, you need to expand paragraphs
| to be of "table" type and each table cell is itself a self-
| contained "series of paragraphs" - and then start thinking
| about nested tables or images in tables!
|
| Generalize that enough to avoid special cases inside special
| cases and you end up with more of a tree-like structure
| representing a DOM and less with a linear structure with
| range-based styling.
|
| (of course, then again, i don't remember Write for Windows
| 3.1 having tables in the first place :-P but i'm interested
| if there are alternative approaches anyway)
|
| EDIT: one thing i forgot to mention - and why i am curious
| about non-DOM-based approaches - is that one problem with the
| DOM approach is the selection: with a linear/range-based
| structure the selection is just one or two indices inside the
| range, but with the DOM the selection can start from a node
| with node-specific subrange (e.g. character in a text node)
| and end with another node and both being very unrelated to
| each other (i.e. only having some distant common ancestor and
| not necessarily at the same level).
| alpaca128 wrote:
| Text is usually stored as tree either way in an editor,
| using a DOM-like approach might work well on top of the
| usual datastructures.
|
| > with the DOM the selection can start from a node with
| node-specific subrange (e.g. character in a text node) and
| end with another node and both being very unrelated to each
| other
|
| I'd just store the range as character indices, using those
| the right nodes in the tree can be accessed pretty quickly
| as needed.
| badsectoracula wrote:
| I don't think character indices are enough, what if your
| selection begins at the middle of a table cell and ends
| on an image that is the only child of a cell in a
| completely different table (no text involved at all,
| except some text in the cells in between)? If you want
| to, e.g., delete those how do you find which nodes are to
| be deleted and updated (e.g. for merging the two tables
| if there are cells after the one that contains the
| image)?
| alpaca128 wrote:
| Images and table cells are just nodes within the tree
| holding the text, assuming all styling etc is represented
| in a plain text syntax similar to Markdown, of course.
| Looking up the nodes from char indices is quick if each
| node stores how many chars it contains.
|
| Other approaches would probably require the selection to
| be a tree of its own, I can't really say whether that's
| simpler overall or not.
| b33j0r wrote:
| I might have a way to simplify this?
|
| A plaintext document is an array of chars, a richtext
| document is tree, which may or may not be well-formed.
|
| Think about someone trying to _bold semi-half of_a_
| sentence_, and how MS Frontpage was made by smart people,
| it's just really hard.
|
| The most interesting thing lately is the HTML attribute
| `contenteditable`, and how it almost just kinda works! You
| still have to be full-stack to make something good, but
| that was an amazing improvement to the browser.
| badsectoracula wrote:
| Yeah, that trying to bold half of a sentence - or even
| better, the middle of a sentence - is why i was wondering
| about simpler alternatives to DOM. Some time ago i toyed
| around with an HTML editor[0] (that one had to use a DOM
| anyway, but my question is for rich text editing in
| general - BTW the rectangles in the shot show a selection
| that goes across nodes) and doing something like that
| involved traversing all the nodes (going both down and up
| the node tree, starting from the cursor's starting
| position), finding the closest common ancestors under the
| selection, creating "B" siblings to them and then
| reparenting them under these new "B" nodes.
|
| You can move a lot of that stuff to reusable methods but
| personally i find the whole "editing" aspect to be more
| involved than the "drawing" side - and also the one more
| likely to be different than a plain text editor - when
| dealing with DOM-like structures. Hence why i am
| interested to see what alternatives there are.
|
| [0] https://i.imgur.com/jLlyNSS.png
| b33j0r wrote:
| Nice. I've searched far and wide, and most people still
| end up starting a whole company that only does a text
| editor.
|
| Quill works but is basically dead since 2017. Almost
| anything foss is in a similarly ambiguous boat.
|
| And then people like us, defeated, eventually buy
| something when we actually need it.
|
| There are just so many ways that users try to use it.
| It's a tough problem!
| z3t4 wrote:
| You don't need any fancy data structures. 95% of the
| performance goes into glyph rendering. And with Unicode the
| performance gain from monospace fonts goes out the window as
| some Unicode characters are very large, and not only that
| they also take up many bytes. So one Unicode character can be
| up to 5 bytes long and take up the same canvas space as 3
| characters. You also need to read ahead as there are
| combination characters, for example a smiley combined with
| the color brow becomes a brown smiley. I've blogged about
| implement support for Unicode in an editor here: https://xn--
| zta-qla.com//en/blog/editor10.htm
| rcoveson wrote:
| > So one Unicode character can be up to 5 bytes long and
| take up the same canvas space as 3 characters.
|
| 5 bytes? In what encoding?
| WorldMaker wrote:
| Some "single character" emoji easily exceed 5 bytes in
| _all_ encodings. You may think ZWJ sequences are
| cheating, but emoji isn 't the only language encoded in
| Unicode with complex ZWJ sequences.
| rcoveson wrote:
| My question is about the phrase _up to 5_. What in
| Unicode is up to 5? Codepoints are up to 4 in all the
| encodings I know. ZWJ sequences may as well be
| arbitrarily long. What is "up to 5"?
| cstrahan wrote:
| The original quote for reference:
|
| > So one Unicode character can be up to 5 bytes long and
| take up the same canvas space as 3 characters.
|
| FWIW, I didn't read that as suggesting an upper bound of
| 5 bytes, but rather as an example using arbitrary
| numbers: N bytes of code units could, depending on the
| font providing the glyph(s) for the respective
| grapheme(s), could be rendered at M times the size of,
| say, the letter A, where N != M -- despite the font
| otherwise being monospaced. Which is just another way of
| saying that you _must_ consult the font for the character
| widths involved.
|
| I think you're reading that quote as an assertion that:
| For any grapheme G, G can be encoded in at most 5 bytes.
|
| While what I think was being said was:
| There exists a grapheme G, where G is encoded in 5 bytes,
| and the respective glyph happens to be displayed at 3
| times a single character (e.g. the letter A), despite the
| font otherwise being monospaced. Therefore you *must*
| consult the font for each glyph to correctly determine
| character widths.
| rcoveson wrote:
| Continue to the next sentence of context:
|
| > You _also_ need to read ahead as there are combination
| characters, for example a smiley combined with the color
| brow becomes a brown smiley.
|
| Emphasis mine. Clearly combination characters are being
| treated separately.
|
| Frankly I think it's crazy to read "up to 5 bytes" and
| not think that it suggest an upper bound. I think you're
| reaching for a highly questionably interpretation of a
| totally unambiguous clause. If the author meant to
| express what you're saying, they would certainly have
| written: "Some Unicode characters are 5 bytes long and
| take up the same canvas space as 3 characters". Which
| would _still_ look incorrect if they followed it with the
| sentence "You _also_ need to read ahead as there are
| combination characters... ".
|
| It is far more likely that the author is simply mistaken
| and should have said 4 bytes, and perhaps used the word
| "codepoint" instead of "character" in the original
| sentence. That's a perfectly understandable technical
| error, while the reinterpretation you're putting together
| would imply an error of colloquial language.
| WorldMaker wrote:
| I read this as colloquial English for "around" or
| "approximately". Not setting a bound limit, but setting
| an example size.
| rcoveson wrote:
| But you also read it as referring to ZWJ sequences? So
| the author has picked a number that is actually _below
| average_ and they 've worded it as _up to..._?
|
| Saying a ZWJ sequence can be "up to 5 bytes" is like
| saying "the current generation of Intel processors run at
| clock speeds of up to 2 GHz".
|
| If they were referring to ZWJ sequences (I don't think
| they were; I think they were just misremembering the
| maximum encoded length of a codepoint) and they had said
| "up to 35 bytes", then I might agree with you. It's still
| not technically accurate, but it's a reasonable
| colloquial usage, like saying "human males can grow up to
| seven feet tall".
| WorldMaker wrote:
| I think you are trying to read something that wasn't
| meant to be technical documentation as if it was trying
| to be exact technical specifications. I'm not the
| original author, so I don't have reason to litigate this
| any further, and I'm not sure what you are arguing about
| at this point.
| rcoveson wrote:
| You've now replied to me up to 2 times.
| jychang wrote:
| Emojis with skin color, mostly
| rcoveson wrote:
| No, that's a ZWJ sequence. Those can be arbitrarily long.
| Doesn't explain where "5 bytes" comes from.
| lelanthran wrote:
| > 5 bytes? In what encoding?
|
| I believe UTF-8 reserved up to six bytes for a single
| character.
| rcoveson wrote:
| Yes, UTF-8 was up to 6 bytes per character early on. Some
| broken implementations like MySQL's limit it to up to 3
| bytes per character. The actual number is 4.
|
| So what is "up to 5"?
| starfreakclone wrote:
| Indeed, you could even use a piece tree just like in the blog
| but you store additional information in each piece which tell
| the renderer how to layout the associated text. My
| understanding is that most rich text editors represent the
| text as a node-based tree anyway.
| WorldMaker wrote:
| Some early rich text editors never used a tree
| representation for formatting represented the formatting as
| "control character sequences". This was part of why some
| people loved WordPerfect so much because you could toggle a
| view of all the control characters and just
| delete/copy/paste them like any other text in the same
| document.
|
| It's basically how the classic RTF format [1] works, and
| things like VT100/ANSI escape codes in terminals. It's kind
| of like the difference between imperative code and
| declarative code: "this character sequence means toggle the
| state of bold" versus "this node of characters is bold".
|
| [1] https://docs.fileformat.com/word-processing/rtf/
| kjs3 wrote:
| Perhaps look at how (Open,Libre)Office does it to support ODT
| format (e.g. OASIS Open Document Format)?
| thangalin wrote:
| RichTeXtFX[1] has these abilities[2] along with demo code[3]
| that provides an entry point for sleuthing. My text editor,
| KeenWrite[4], uses RichTeXtFX, but takes a different approach
| due to its usage of inline variable references. Rather than
| edit the document in a WYSIWYG-like fashion, users edit the
| Markdown document as plain text then export as PDF by selecting
| a theme to apply for typesetting. This cleanly separates
| content from presentation. The "themes" video tutorial
| demonstrates how theme selection works.[5]
|
| [1]:
| https://github.com/FXMisc/RichTextFX/tree/master/richtextfx/...
|
| [2]: https://gluonhq.com/presenting-a-new-richtextarea-control/
|
| [3]: https://github.com/gluonhq/rich-text-
| area/blob/master/sample...
|
| [4]: https://github.com/DaveJarvis/keenwrite
|
| [5]: https://www.youtube.com/watch?v=3QpX70O5S30&list=PLB-
| WIt1cZY...
| krackers wrote:
| You might also like reading
| https://news.ycombinator.com/item?id=16385440 (A Brief Glance at
| How Various Text Editors Manage Their Textual Data) and the
| comments therein.
| hgs3 wrote:
| > Create debugging utilities for data structures very early on.
|
| Yes, this isn't encouraged enough. I often serialize my data
| structures as either JSON or Graphviz DOT files for
| visualization. It helps save an immense amount of time. You can
| also use the generated files for regression testing, i.e. diff
| the actual output with the serialized output and if they're
| different, then a bug was introduced.
| bjoli wrote:
| I would probably reach for an RRB tree which would, apart from
| being very very neat, give me free undo/redo.
| vbezhenar wrote:
| Can someone point me to structures/algorithms to use for text
| editor which could support files of unlimited lengths (including
| lines of unlimited length) without loading those fully in the
| buffer?
|
| I miss that editor so much, that I'm considering to write one
| some day, but I have no idea how to do so. I can invent things
| myself, but I guess those things were invented already back in
| the days computers were different.
| mannyv wrote:
| Apple's MPW was Apple's programming environment for a long time.
|
| It's been a while, but I believe that it represented everything
| in a tree of n-character chunks (n = 6?). It was probably the
| first editor that I used that could open files of pretty much any
| length.
| eschneider wrote:
| I do miss the 'select and execute', and 'everything is a shell'
| model of MPW. Definitely one of my favorite development
| environments.
| tomjakubowski wrote:
| sounds like plan9's acme
| scelerat wrote:
| elsewhere on HN today is a short article on BBedit, which has
| a really nice shell worksheet interface similar to the old
| MPW. Select the text you want to execute, hit control-return,
| and the shell output will appear below the selected text.
| [deleted]
| teddyh wrote:
| I would instead suggest _The Craft of Text Editing_
| <https://www.finseth.com/craft/>
| toppy wrote:
| There is also a great documentation on the topic (and CRDT) by
| /u/raphlinus: https://xi-editor.io/docs.html
___________________________________________________________________
(page generated 2023-06-13 23:01 UTC)