[HN Gopher] Overlapping markup
___________________________________________________________________
Overlapping markup
Author : akkartik
Score : 65 points
Date : 2022-12-12 06:33 UTC (16 hours ago)
(HTM) web link (en.wikipedia.org)
(TXT) w3m dump (en.wikipedia.org)
| laszlokorte wrote:
| Wouldn't one obvious solution be to allow tags from different
| namespaces to overlap? Maybe it is mentioned in the article but I
| could not see it:
|
|   <ns1:root> <ns2:root>
|   <ns1:elemA>This is some <ns2:mark>content</ns1:elemA>
|   <ns1:elemB>that is split</ns2:mark> into two nodes</ns1:elemB>
|   </ns2:root> </ns1:root>
|
| Then in this case two trees with common leaf nodes (4 text nodes)
| are constructed. From the point of view of ns2-root there are only
| 3 children (the 2 text nodes outside <mark> and the <mark>), and
| from the point of view of ns1-root there are two children (elemA
| and elemB).
|
| Then when parsing one could even pre-select which namespaces to
| parse and skip all other, for example if I am only interested in
| ns1, ns2 could be skipped during parsing.
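A minimal Python sketch of the per-namespace idea above, under stated assumptions: tags carry no attributes, a regex stands in for a real tokenizer, and the name `parse_layers` is hypothetical rather than part of any existing parser. It keeps one stack per namespace, so tags only have to nest properly within their own namespace, and it can skip namespaces the caller is not interested in.

```python
import re

# Matches <ns:name> and </ns:name>; attributes are not handled in this sketch.
TAG = re.compile(r"<(/?)(\w+):(\w+)>")

def parse_layers(source, wanted=None):
    """Return (text, spans); spans maps namespace -> [(name, start, end)].

    Offsets index into the tag-free text. Tags must nest properly within
    one namespace but may overlap tags of other namespaces.
    """
    parts, spans, stacks = [], {}, {}
    pos = last = 0
    for m in TAG.finditer(source):
        chunk = source[last:m.start()]
        parts.append(chunk)
        pos += len(chunk)
        last = m.end()
        closing, ns, name = m.groups()
        if wanted is not None and ns not in wanted:
            continue  # skip namespaces the caller did not ask for
        stack = stacks.setdefault(ns, [])
        if not closing:
            stack.append((name, pos))
        elif stack and stack[-1][0] == name:
            _, start = stack.pop()
            spans.setdefault(ns, []).append((name, start, pos))
        else:
            raise ValueError(f"unbalanced </{ns}:{name}>")
    parts.append(source[last:])
    return "".join(parts), spans
```

Calling `parse_layers(source, wanted={"ns1"})` would return only the ns1 spans, which is the "pre-select which namespaces to parse" behaviour described in the comment.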
| layer8 wrote:
| Your proposal is very similar to SGML's CONCUR feature
| mentioned in the Wikipedia article.
| low_tech_punk wrote:
| Maybe the "H" in HTML should stand for "Hierarchical"?
| codetrotter wrote:
| This Wikipedia article is sorely lacking concrete examples to
| help aid understanding..
|
| Anyone care to add some examples to the article??
| tannhaeuser wrote:
| SGML's CONCUR feature (criticized but not described in that
| Wikipedia article) allows tags to have optional _name groups_
| specifying one or more document type names (that must be
| declared in the prolog) to which the tag applies, and allows
| tag pairs with disjoint document type name qualifiers to
| overlap like this:
|
| <(a)x>bla <(b|c)y>bla</(a)x></(b|c)y>
|
| Traditionally used for poetry and lyrics/drama but could also
| be useful for postal addresses, lyrics in certain types of
| musical notation, in translations, and maybe specific text apps
| such as subtitles/tracks for the hearing impaired. Basically,
| wherever there's a desire to markup text in more than a single
| hierarchy.
| teej wrote:
| I immediately thought of this Vox breakdown of rhyming in rap.
| https://youtu.be/QWveXdj6oZU
| dspillett wrote:
| The "Approaches and implementations" section includes some
| clear (to my eyes at least) examples of overlapping lines and
| sentences in poetry represented as html-like markup.
|
| What sort of examples would improve the article's clarity for
| you?
|
| Wrt the existing examples, perhaps there should be a small
| section before that, explicitly called "examples", that
| contains a minimal summary of those examples to illustrate the
| concept before the reader delves deeper.
| codetrotter wrote:
| Yeah, I agree with what you are saying. I was viewing this
| article on mobile and it was hard to spot these examples on
| mobile, because all sections are collapsed by default and
| none of the sections had the examples stand out at a cursory
| glance on mobile. Now that I am on a laptop I easily spot
| them. I also agree with what you are saying that an explicit
| section named examples would be good. Especially for mobile
| reading.
| FigmentEngine wrote:
| overlapping b and i elements <p>he<b>ll<i>o w</b>or</i>ld</p>
|
| contrary to the article it can still be represented as a tree,
| by decomposing the children into their own nodes (so in this
| case characters become nodes with child nodes expressing what
| formatting is active, followed by the letter, and then turn off
| all the active formatting)
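One way to read the decomposition described above is as a flat list of characters, each carrying the set of formats active at that point. The Python sketch below assumes only attribute-free tags, and `per_char_formats` is a hypothetical name chosen for illustration.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")

def per_char_formats(html, formats=("b", "i")):
    """Return a list of (char, frozenset_of_active_formats) pairs."""
    active = set()
    out = []
    last = 0
    for m in TAG.finditer(html):
        # Every character before this tag gets the formats active so far.
        out.extend((ch, frozenset(active)) for ch in html[last:m.start()])
        last = m.end()
        closing, name = m.groups()
        if name not in formats:
            continue  # structural tags such as <p> are ignored in this sketch
        if closing:
            active.discard(name)
        else:
            active.add(name)
    out.extend((ch, frozenset(active)) for ch in html[last:])
    return out
```

Because open and close tags only toggle membership in a set, it does not matter whether they are properly nested in the input.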
| admax88qqq wrote:
| No that's just nesting. It's overlapping if the lifetime of a
| child tag is greater than the lifetime of the parent tag.
|
| Example if you have two paragraphs and bold the end of one
| and the start of the next
|
| <p>hello <b>world</p> <p>this is</b> your captain
| speaking</p>
|
| Obviously bold is a poor example as you can just terminate
| and start a new bold without penalty. But if these were more
| semantic elements like "sections" and "verses" and "lines"
| then it might not be possible.
| chrismorgan wrote:
| > _without penalty_
|
| It's actually fiddlier than you may think. Take "Ta" for an
| example: in most decent fonts, there will be a kerning pair
| that tightens those characters, tucking the "a" underneath
| the beam of the "T" a little. The shaper thus needs to
| follow the actual fonts being used, for kerning purposes,
| rather than the markup--but this is still visible at the
| element level, with getBoundingClientRect().
|
| Take this demo (which depends on your default font having
| such a kerning pair; if it doesn't, you may need to find
| one that does and change the font by inserting <html
| style="font-family:sans-serif"> or similar after the
| comma):
|
|   data:text/html,<p>Ta<p><b>T</b>a<p> T<b>a</b><p><b>Ta</b><p><b>T</b><b>a</b><script>document.querySelectorAll("b").forEach(e=>console.log(e.getBoundingClientRect().width))</script>
|
| This shows five variants of "Ta", with the last two being
| <b>Ta</b> and <b>T</b><b>a</b>, and prints five numbers to
| the console, the widths of each <b> element. Numbers one
| and four (both corresponding to a <b>T</b>) differ if you
| have a kerning pair such as I describe: for me, the first
| is 11.7px, and the second 10.73333px (though it overflows
| that width in its rendering) because of the <b>a</b> that
| follows it. If you gave bold elements the style `display:
| inline-block`, it wouldn't kern the pair and would thus go
| back to 11.7px.
|
| Most fonts could _really_ use italic-aware kerning (that
| is, kerning a pair where one glyph is regular and the other
| italic), but it's sadly not a thing.
| TreeRingCounter wrote:
| Can someone summarize this? 90% of the content on this page seems
| like excessively-verbose nonsense.
| thomascgalvin wrote:
| Many, if not most, computer models represent data as a tree.
| Some data, however, can't really be represented by a tree,
| because a "thing" can have multiple parents.
|
| The example in the link:
|
| Example, with lines marked up:
|
|   <line>I, by attorney, bless thee from thy mother,</line>
|   <line>Who prays continually for Richmond's good.</line>
|   <line>So much for that.--The silent hours steal on,</line>
|   <line>And flaky darkness breaks within the east.</line>
|
| With sentences marked up:
|
|   <sentence>I, by attorney, bless thee from thy mother, Who prays
|   continually for Richmond's good.</sentence>
|   <sentence>So much for that.</sentence>
|   <sentence>--The silent hours steal on, And flaky darkness
|   breaks within the east.</sentence>
|
| If you care about lines _and_ sentences, this is difficult to
| represent as a tree.
| TreeRingCounter wrote:
| lioeters wrote:
| One way to solve this could be to provide separate start/end
| tags without inner content.
|
|   <line-start/><sentence-start/>I, by attorney, bless thee from thy
|   mother,<line-end/>
|   <line-start/>Who prays continually for Richmond's good.<sentence-end/><line-end/>
| thomascgalvin wrote:
| Yeah, that's how the linked article does it, but that's ...
| icky? It's still a token spanning multiple parents, it's
| just masquerading as a couple of self-closing tags.
|
| Which, of course, is the point of the article, and why this
| is a difficult problem.
| lioeters wrote:
| Ah you're right, I should have read the article before
| commenting, haha. I agree it's not an ideal solution. A
| disadvantage I imagine is that this syntax pushes the
| problem onto the parser/consumer to keep track of
| overlapping regions.
|
| > Milestones are empty elements that mark the beginning
| and end of a component, typically using the XML ID
| mechanism to indicate which "begin" element goes with
| which "end" element.
|
| https://en.wikipedia.org/wiki/Overlapping_markup#Milestones
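To illustrate what "pushing the problem onto the consumer" might look like, here is a small Python sketch that turns the `<kind-start/>` / `<kind-end/>` markers from the example upthread into ranges over the plain text. It assumes markers of the same kind never nest, so no ID matching is needed; real milestone schemes typically pair markers by ID, as the quoted passage says. The name `milestone_ranges` is hypothetical.

```python
import re

MILESTONE = re.compile(r"<(\w+)-(start|end)/>")

def milestone_ranges(source):
    """Return (text, ranges); ranges is a list of (kind, start, end)."""
    parts, ranges = [], []
    open_at = {}  # kind -> offset of its currently open start marker
    pos = last = 0
    for m in MILESTONE.finditer(source):
        chunk = source[last:m.start()]
        parts.append(chunk)
        pos += len(chunk)
        last = m.end()
        kind, edge = m.groups()
        if edge == "start":
            open_at[kind] = pos
        else:
            # Raises KeyError if an end marker has no matching start.
            ranges.append((kind, open_at.pop(kind), pos))
    parts.append(source[last:])
    return "".join(parts), ranges
```

The consumer ends up with one flat text and a list of possibly overlapping ranges, which is essentially the stand-off representation discussed elsewhere in this thread.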
| captainmuon wrote:
| Back in the day when I was in school, and there was an IE
| monopoly, I wrote a simple HTML parser. Instead of parsing it
| into a tree, it just recorded the beginning and end position of
| tags as indices into the string. I think I did use a stack to
| match nested tags properly. But overlapping markup was common
| back then, and IE rendered it "correctly" IIRC. This simple
| parser was enough to power a scraper (I don't remember what I was
| scraping. Maybe a competitor's emule link site or something like
| that :-P) and a crude rich text renderer, which I was very proud
| of.
| dejj wrote:
| Consider Aftertext (draft): it separates the markup from the text
| entirely. Overlapping markup ranges becomes trivial.
|
| https://breckyunits.com/aftertext.html
| masswerk wrote:
| This is how styled SimpleText read-me files worked in classic
| Mac OS. A normal file was plain text, but styles could be
| appended based on indices (much like selection and regions work
| in modern web APIs).
| NWoodsman wrote:
| Change my view: given any data storage medium, the smallest
| granularity of data must also be the most-child element of any
| markup language. Given the immense overhead of storing markups on
| a granular level, processing markup therefore must be a perpetual
| exercise in recursion.
|
| I.e. Poem->Verse->Line-> <char>
| Book->Page->Chapter->Paragraph->Sentence->Word-> <char>
| HTML->Body->Div->P-> <char>
|
| Therefore, any given letter (here as a <char> type) can retain a
| back reference of parents, so the <char> object retains a hashset
| of {Line,Word,P} parent type references representing three
| domains, but really needs to be a Dictionary of key values, the
| key being the domain name, the value being the parent name, so
| that would be:
|
| Domain: Poetry, Value: Line
|
| Domain: Book Object Model, Value: Word
|
| Domain: HTML, Value: P Element
|
| We could then ask any letter arbitrarily "what is your Font Style
| in your HTML context?" and it would be able to walk up the parent
| P which obtains its style from a CSS markup, and return that
| correctly. Or "What is your Poem's name in your Poetry context?"
| and it could recurse up to the Poem element to find its Title.
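A rough Python sketch of the structure proposed above: each character holds a dictionary from domain name to its immediate parent in that domain, and a query walks up the parent chain until some ancestor answers it. The class names, the `lookup` method, and the property names are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    kind: str
    props: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

@dataclass
class Char:
    value: str
    parents: dict = field(default_factory=dict)  # domain name -> Node

    def lookup(self, domain, prop):
        """Walk up one domain's hierarchy until some ancestor defines prop."""
        node = self.parents.get(domain)
        while node is not None:
            if prop in node.props:
                return node.props[prop]
            node = node.parent
        return None

# One character that belongs to two independent hierarchies.
poem = Node("Poem", {"title": "Example"})
line = Node("Line", parent=Node("Verse", parent=poem))
para = Node("P", {"font-style": "italic"})
ch = Char("a", {"Poetry": line, "HTML": para})
```

Here `ch.lookup("Poetry", "title")` climbs Line, Verse, Poem, while `ch.lookup("HTML", "font-style")` stops at the P element. The sketch assumes exactly one parent per domain.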
| jerf wrote:
| Are you claiming the parents will always be unique? Because as
| the article says, you can easily have this, where going to the
| _right_ is a parent relationship:
|                -> Line -> Verse -> Poem
|   char -> Word
|                -> Clause -> Sentence -> Poem
|
| You can try adding a further constraint that any given property
| must have only one path, so you can then recurse over the tree
| and find the one match, but as your model gets richer you will
| find that breaks.
|
| And it's that last clause that is the killer for pretty much
| anything: "As your model gets richer you will find that
| breaks."
|
| Plus the UI experience for that is awful. "I want to add this
| property to this Line but you're telling me it's a duplicate
| for some particular character? What the hell does that mean?
| I'm not adding a property to the character!" etc. etc.
| mdciotti wrote:
| I've frequently wondered why a hierarchical approach is the norm
| for text formatting. It seems that many problems could be solved
| trivially using a text buffer and a list of formatting sequences
| defined by a starting index and a length. The only place I've
| seen this in practice is in Telegram's TL Schema [1]. Is this
| method found anywhere else?
|
| Edit to note: there is one obvious advantage to in-band markup
| such as HTML -- streaming formatted content. Though I wonder if
| this could be done with a non-hierarchical method, for example
| using in-band start tags which also encode the length.
|
| Edit 2: looks like Conde Nast maintains a similar technology
| called atjson [2].
|
| [1]: https://core.telegram.org/api/entities
|
| [2]: https://github.com/CondeNast/atjson
| jake-low wrote:
| There are a number of rich text editors that model documents as
| a flat array of characters and a separate table of formatting
| modifiers (each with an offset and length). Medium's text
| editor is one of them. This post [1] on their engineering blog
| introduced me to the idea, and I think it's a good starting
| point for anyone interested in this topic.
|
| ProseMirror (a JavaScript library for building rich text
| editors) also employs a document model like this. The docs for
| that project [2] do a good job of explaining how their
| implementation of this idea works, and what problems it solves.
|
| [1]: https://medium.engineering/why-contenteditable-is-
| terrible-1...
|
| [2]: https://prosemirror.net/docs/guide/#doc
| samwillis wrote:
| That list of formatting sequences would have to be updated with
| new indexes when the content of the buffer changed. Keeping the
| two in sync wouldn't be trivial (for a computer or a human), a
| tree of nodes fixes that and works for 99.99% of use cases.
| jerf wrote:
| It may not be trivial, but it's a solved problem. Many rich
| text UI widgets and corresponding backing data structures
| exist today, based on a tagging system where tags can
| trivially define regions that overlap with each other. It's
| tricky and full of corner cases, but not _that_ hard if you
| put your mind to it, and it's not computationally
| inefficient either.
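As a concrete picture of the bookkeeping being debated here, a minimal Python sketch of keeping stand-off ranges in sync with an insertion. The `Span` type and `insert_text` function are hypothetical, and the boundary policy (text inserted exactly at a span's start pushes the span right; text inserted exactly at its end does not extend it) is just one of several reasonable choices, which is where many of the corner cases live.

```python
from dataclasses import dataclass

@dataclass
class Span:
    tag: str
    start: int  # inclusive offset into the text
    end: int    # exclusive offset into the text

def insert_text(text, spans, index, new):
    """Insert `new` at `index` and return (new_text, shifted_spans)."""
    n = len(new)
    shifted = []
    for s in spans:
        # Offsets at or after the insertion point move right by len(new).
        start = s.start + n if s.start >= index else s.start
        # A span grows only if the insertion is strictly inside it.
        end = s.end + n if s.end > index else s.end
        shifted.append(Span(s.tag, start, end))
    return text[:index] + new + text[index:], shifted
```

Adding a word at the start of the document is then a single pass over the span list; editors that do this at scale usually use a tree or gap structure instead of a linear scan.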
| jcparkyn wrote:
| I guess because it would be a total pain for humans to read and
| write without specialised tooling. Imagine trying to add a word
| at the start of your document.
| jerf wrote:
| "I've frequently wondered why a hierarchical approach is the
| norm for text formatting."
|
| 80/20, if not 90/10, effectiveness. Most people are not trying
| to do what the Wikipedia article is talking about. About the
| most complicated thing that people want to do is the moral
| equivalent of <i>italic <b>bold and italic</i> bold</b>, and
| you can losslessly convert that to <i>italic <b>bold and
| italic</b></i><b> bold</b> for almost all practical purposes.
|
| It isn't until you're getting very precise about what your tags
| mean, for tags that intrinsically "cross" hierarchies like
| that, that you start seeing these issues. And then by the time
| you've gotten that far, you realize you have all sorts of
| problems, as the article says.
|
| But a good deal of the answer is that while the stuff mentioned
| in the Wikipedia article is true and important, it's also
| fairly specialist.
|
| As for "The only place I've seen this in practice is in
| Telegram's TL Schema [1]. Is this method found anywhere else?",
| tag-based formatting is the norm for rich text widgets, which
| generally can natively represent my first HTML example above in
| its internal format. Generally if you dig into your favorite
| language you'll find someone has already implemented this
| efficiently as a library you can pick up if you want to use the
| capability directly outside of a text widget. It has its own
| consequences, as anyone who has ever fought with them may
| realize, but it's not impossibly difficult to deal with.
|
| It isn't a magic solution to everything either, though. Even if
| it is what you think you want, a widget able to represent a
| bold section starting in the middle of a paragraph, then
| proceeding through the first three rows of a table, then
| stopping in the middle of a paragraph in the third column of
| the next row is generally weird. To some extent, people have a
| certain hierarchiness to their thinking about these matters
| too, whether it's cause or effect. But that hierarchiness is
| messy; I think it's fair to say most people wouldn't "mean"
| that bold to mean something in my table case, we don't
| necessarily expect tags to proceed through tables like that,
| but <i>i<b>bi</i>b</b> is something that people might
| intuitively expect to be able to do. It's a fractally messy
| space both in the computer science and human expectations, and
| the fractal messiness only gets messier when we try to
| harmonize those two things.
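The lossless conversion mentioned above, from overlapping ranges such as <i>italic <b>bold and italic</i> bold</b> to properly nested tags, can be sketched in Python as follows. This is one possible approach, not a description of how any particular widget does it: when a range ends while tags opened after it are still open, those tags are closed and immediately reopened. The name `to_nested_html` is hypothetical, and tag names are assumed to be plain identifiers.

```python
import html

def to_nested_html(text, spans):
    """spans: list of (tag, start, end), end exclusive. Returns nested HTML."""
    cuts = sorted({0, len(text)} | {p for _, s, e in spans for p in (s, e)})
    out, stack, prev = [], [], 0
    for pos in cuts:
        out.append(html.escape(text[prev:pos]))
        prev = pos
        # Close every span ending here, plus anything opened after it.
        ending = [i for i, sp in enumerate(stack) if sp[2] == pos]
        reopen = []
        if ending:
            lowest = ending[0]
            reopen = [sp for sp in stack[lowest:] if sp[2] != pos]
            for sp in reversed(stack[lowest:]):
                out.append(f"</{sp[0]}>")
            del stack[lowest:]
        # Open longer spans first so they need fewer close/reopen splits.
        starting = sorted((sp for sp in spans if sp[1] == pos and sp[2] > pos),
                          key=lambda sp: -sp[2])
        for sp in reopen + starting:
            out.append(f"<{sp[0]}>")
            stack.append(sp)
    return "".join(out)
```

For purely presentational tags the split is harmless; for semantic elements, where one logical range becoming two elements changes the meaning, it is exactly the problem the article describes.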
| samwillis wrote:
| There are so many odd edge cases in HTML, a good one I found was
| with forms. If you open a <form> but don't have a closing tag,
| the browser will close the form block "visually" at the end of
| the form's immediate parent, as you would expect. All styles are
| applied to it, or children via selectors, up to that
| automatically inserted end point. It's how browsers handle most
| unclosed block tags.
|
| However, the form's "functionality" isn't closed at that point,
| any inputs further down the page (outside of the form's DOM tree)
| are included in the post/get when the form is submitted. Or at
| least until another form is found in the DOM. Effectively an
| unclosed form is two things, a visual block that is closed
| automatically, and an "overlapping" form capturing inputs
| indefinitely.
| chrismorgan wrote:
| This behaviour is defined and explained in the HTML spec with
| the _form element pointer_
| <https://html.spec.whatwg.org/multipage/parsing.html#form-
| ele...>:
|
| > _The form element pointer points to the last form element
| that was opened and whose end tag has not yet been seen. It is
| used to make form controls associate with forms in the face of
| dramatically bad markup, for historical reasons._
|
| And search through the rest of the page for the term to find
| how it's implemented--it's straightforward, just set on a
| <form> open tag and reset on an (explicit) </form> close tag.
|
| This is somewhat unreliable: _browsers_ support it, but tools
| using XML pipelines are allowed to ignore it (§13.2.9), and
| lots of JavaScript code will assume hierarchy rather than using
| form.elements, and thus not catch such elements, or elements
| that manually specify a form owner via the _form_ attribute.
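A heavily simplified Python model of the mechanism described above, using the standard library's `html.parser`. It is not the spec's tree-construction algorithm: it ignores the stack of open elements, the `form` attribute, and templates, and the class name `FormPointerSketch` is hypothetical. It only shows the pointer being set on a <form> start tag, cleared on an explicit </form>, and used to associate each <input> the tokenizer sees.

```python
from html.parser import HTMLParser

class FormPointerSketch(HTMLParser):
    """Toy model: track a 'form element pointer' across a token stream."""

    def __init__(self):
        super().__init__()
        self.form_pointer = None  # id of the last opened, not yet closed form
        self.owners = {}          # input name -> owning form id (or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # The spec ignores a <form> start tag while the pointer is set.
            if self.form_pointer is None:
                self.form_pointer = attrs.get("id", "(anonymous form)")
        elif tag == "input":
            self.owners[attrs.get("name")] = self.form_pointer

    def handle_endtag(self, tag):
        # Only an explicit </form> resets the pointer; an implied close of
        # the form's parent (here </div>) does not.
        if tag == "form":
            self.form_pointer = None
```

Feeding it `<div><form id="f"><input name="a"></div><input name="b"></form><input name="c">` associates both `a` and `b` with form `f`, even though `b` sits outside the div that visually contains the form.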
| samwillis wrote:
| Thanks! My 2 minutes of googling back when I found it didn't
| surface that and I moved on to the next job.
|
| Somehow despite coding html for 25 years I had either not
| seen the input form attribute or forgotten about it. I
| suspect the latter!
| chrismorgan wrote:
| Steps on finding this from the HTML spec:
|
| 1 Start at https://html.spec.whatwg.org/multipage/. Or
| https://html.spec.whatwg.org/ if you prefer, with
| everything in one page, but that's a _big_ document. You
| can also build it all locally yourself if you like. I have.
|
| 2 "The form element" sounds like a good place to look.
| https://html.spec.whatwg.org/multipage/forms.html#the-
| form-e...
|
| 3 Look through the DOM interface listed, _elements_ sounds
| promising. Find the explanation of that IDL attribute
| below: "The elements IDL attribute must return an
| HTMLFormControlsCollection rooted at the form element's
| root, whose filter matches listed elements whose form owner
| is the form element, with the exception of input elements
| whose type attribute is in the Image Button state, which
| must, for historical reasons, be excluded from this
| particular collection." Roll your eyes at the bizarre
| exclusion of <input type=image>, then focus on the term
| _form owner_ which sounds relevant. That links you to
| https://html.spec.whatwg.org/multipage/form-control-
| infrastr....
|
| 4 Hmm... null, parser inserted flag, nearest ancestor form
| element, form attribute. Parser inserted flag sounds
| relevant (though it's just a flag, not the actual
| association link). Also the note "They are also complicated
| by rules in the HTML parser that, for historical reasons,
| can result in a form-associated element being associated
| with a form element that is not its ancestor."
|
| 5 This is where having the _whole_ spec open, rather than
| the multipage version, is handy: you can search the entire
| document for the term "parser inserted flag" to see where
| that gets set. You can also guess that it's going to be in
| §13.2 _Parsing HTML documents_ (parsing.html). In the end,
| it's https://html.spec.whatwg.org/multipage/parsing.html#cr
| eating...: "... then associate element with the form
| element pointed to by the form element pointer and set
| element's parser inserted flag." Ah hah!
|
| 6 You have found the concept in the parser: "form element
| pointer". You can then look through where it's used and
| quickly see how it's set on <form> and unset on </form>,
| thus deliberately handling the missing-</form> case.
|
| You develop a feeling for this kind of thing over time. I
| didn't know about the form element pointer (though I feel I
| should have known about it), but this is a loose
| description of what I did, though I was able to speed
| through some of the steps, and I really should have just
| started by looking at "An end tag whose tag name is
| "form"", but at first I thought the claim was bogus.
| samwillis wrote:
| I think I got to point 2, found no reference in the form
| tag section, and gave up.
|
| But what's fascinating is that it describes the html
| parser effectively implementing "overlapping markup", as
| in the Wikipedia article, for this edge case for
| backwards compatibility.
| lordnacho wrote:
| Why didn't they go the grammar nazi route? Define a spec and if
| the page doesn't conform, draw an error message.
|
| It's really annoying to have this kind of undefined behaviour
| that might end up being relied upon.
| jraph wrote:
| This seems strange, how is that represented in the DOM, which
| is strictly a tree?
| samwillis wrote:
| It's not in the DOM, from memory chrome dev tools even shows
| a closing form tag where it's been inserted. I have no idea
| how it's implemented internally.
|
| Confused me for a while when debugging a legacy website. It
| had actually been done intentionally to work around a rather
| complex architecture.
| laszlokorte wrote:
| There exists a "form"-attribute for input elements that can
| be used to associate input elements outside the form
| hierarchy to be included in the form submission.
|
| So the semantics of "form field outside the actual form"
| are available anyway. When parsing a not-closed <form> the
| browsers just make use of that.
| ggus wrote:
| Maybe it leverages the "form" optional attribute that can
| specify the form the <input> element belongs to.
___________________________________________________________________
(page generated 2022-12-12 23:00 UTC)