hngopher.com

       [HN Gopher] Solving the regex of madness, and snarky answers on ...
       ___________________________________________________________________
        
       Solving the regex of madness, and snarky answers on StackOverflow
       (2019)
        
       Author : monort
       Score  : 113 points
       Date   : 2021-05-09 07:48 UTC (15 hours ago)
        
 (HTM) web link (www.cargocultcode.com)
 (TXT) w3m dump (www.cargocultcode.com)
        
       | jll29 wrote:
       | > I think the flaw here is that HTML is a Chomsky Type 2 grammar
       | (context free grammar) and a regular expression is a Chomsky Type
       | 3 grammar (regular grammar).
       | 
       | Note that regarding formal language and complexity theory, while
       | it is correct that in general, arbitrary nested structures
       | require a context free grammar (type 2 in the Chomsky hierarchy)
       | and are thus beyond regular (type 3) [1], this statement is NOT
       | true _if_ you limit the nesting depth with a finite constant k.
       | 
       | For example, if you agree to an HTML tag maximum nesting depth
       | of, say, 100, then it can be modeled with a regular (type 3)
       | grammar, including correct required matching of opening and
       | closing tags, and hence you can write a regular expression that
       | matches it as well.
       | 
       | This debate is well-documented in the theoretical linguistics
       | literature, where some say human languages are not regular
       | because you can always embed yet another additional relative
       | clause in any sentence in principle without adversely affecting
       | grammaticality, whereas others say while you could you won't find
       | natural examples in human-written text documents where extreme
       | nesting depth is actually found. At that point psycholinguists
       | and theoretical linguists usually start a fight about whether
       | memory limits are important or "just performance as opposed to
       | competence".
       | 
       | (Goes to show how practical solid theory is.)
       | 
       | [1]
       | https://www.sciencedirect.com/science/article/pii/S001999585...
        
       | jancsika wrote:
       | I love seeing the weirdo CDATA thingy in there! CDATA ftw!
       | 
       | E.g., you've got this enormous spec for SVG which includes CSS,
       | but that CSS has syntax inside a style tag which could break
       | XHTML parsers.
       | 
       | Amateurs out there are probably thinking, "Well, why not just
       | compromise in the spec and tell implementers to do the same thing
       | that HTML does to parse style tags?" Well, professionals know
       | that _cannot_ work for myriad reasons you can read about if you
       | take out a college loan and remain sedentary for the required
       | duration.
       | 
       | The _right_ approach is to throw the CSS stuff inside CDATA tags
       | to tell the parser not to parse it so things don 't break. That
       | is the way sensible, educated professionals solve this problem.
       | 
       | I'm only kidding!
       | 
       | For inline SVGs the HTML5 parser simply says, "Parse this gunk as
       | HTML5, and use _sane_ defaults to interpret the parsed junk in
       | the correct svg namespace so that all the child thingies in that
       | namespace just work. "
       | 
       | Which it does.
       | 
       |  _Unless_ you 're going to grab the innerHTML of the inline SVG
       | and shove it into a file to be used later as an SVG image.
       | 
       | In _that_ case you cross the invisible county line into XHTML
       | territory where the sheriff is waiting to throw you in jail for
       | violating the CDATA rule. In that case the XHTML parser hidden in
       | the guts of the browser doles out the justice of an error in
       | place of your image. Because that is the way sensible, educated
       | professionals solve this problem. :)
       | 
       | My holy grail-- how do I use DOM methods to create a CDATA
       | element to shove my style into? If I could know this then I can
       | jump my Dodge Charger back and forth into XHTML without ever
       | getting caught.
        
         | camehere3saydis wrote:
         | >My holy grail-- how do I use DOM methods to create a CDATA
         | element to shove my style into? If I could know this then I can
         | jump my Dodge Charger back and forth into XHTML without ever
         | getting caught.
         | 
         | Does this help? https://developer.mozilla.org/en-
         | US/docs/Web/API/Document/cr...
        
           | jancsika wrote:
           | Ah, thanks!
           | 
           | In hindsight I probably could have guessed at
           | "document.create" and then just read the autocomplete
           | suggestions in devTools. :)
        
         | jameshart wrote:
         | The 'weirdo' CDATA thing is the only thing that makes XHTML
         | actually amenable to this approach, because XHTML is
         | tokenizable using a regular expression-based grammar, whereas
         | HTML without CDATA is not. As you're obviosuly aware, the
         | language inside <style> elements is not suitable for XML
         | parsing. Nor is the language inside <script> elements:
         | <script>          var y = "<!--";        </script>
         | <p>... a naive tokenizer thinks this is in a comment ...</p>
         | <script>          var z = "-->"        </script>
         | 
         | There's a weird interaction here between the Javascript and
         | HTML parsers, because <, !, and -- are all valid Javascript
         | operators, and you can, in theory, stack them up into
         | syntactically valid JavaScript expressions. The behavior of the
         | following script in a browser is... unpredictable:
         | <script>           var a = 1;           if (0<!--a) {
         | document.write('since --a is 0, and !0 is true, and 0 < true
         | (!), this should print'); }        </script>        <p>... and
         | this should not be in a comment</p>        <script>          if
         | (a-->-10) { document.write('a-- should still be > -10, so this
         | should print, too'); }        </script>
         | 
         | The regex in the article will miss the opening <p> tags here,
         | because it assumes that it's being given valid, tokenizable
         | XHTML.
        
           | jancsika wrote:
           | Part of problem and solution you describe was due to the
           | battle to define who is encapsulating whom, no? The W3C SVG
           | list archive was full of people essentially asking for the
           | ability to flow text and replicate a lot of HTML as part of
           | native SVG. In that dream, it's certainly important to have
           | well-defined behavior for javascript inside SVG since you
           | could have SVG user agents that aren't web browsers. And that
           | means CDATA to hold the non-XML scripting and styling data.
           | 
           | However, at the end of that history HTML was the clear
           | encapsulator and SVG exists either inside it or as a static
           | image in Inkscape, the browser, or some library. So today,
           | scripts inside an SVG are either a curiosity or security
           | nightmare that comes to life when the user clicks "View
           | Image" on an SVG image in their browser.
           | 
           | That leaves only the `<style>` tag content as a potential
           | ambiguity. So I'm curious-- are there examples where content
           | of a `<style>` tag inside an inline SVG causes unpredictable
           | behavior in modern browser HTML parsers? I'm guessing there
           | must be, but I'd like to play with a clear example.
        
       | lifthrasiir wrote:
       | The only part I agree in this writing is that you don't need to
       | be snarky to be correct. (I'd like to introduce the XY problem of
       | the second kind, where the answerer is so confident that it is
       | the answerer who have missed the actual question.)
       | 
       | Some regexes _can_ recognize a language beyond the regular
       | language. They are typically available in two flavors: recursive
       | references (Perl, Ruby, PCRE) and stackable captures (.NET). They
       | are obscure enough that I would not recommend them, but it is
       | patently false that regular expressions (EDIT: of the practical
       | interest) cannot be recursive.
       | 
       | It is possible to match individual HTML tags with regexes, but it
       | _is_ difficult. It cannot use a bare `\w` or `\s` because both
       | XML /XHTML and HTML5 parsers have peculiar definitions for tag
       | name characters and space characters. For example your `\s` will
       | typically match various Unicode space characters, while only
       | ASCII whitespaces are recognized in tags. There are also several
       | notable exceptions to the parser (and external states termed the
       | "tree construction"), so missing any of them would result in an
       | immediate XSS. If you think you can write a correct regex for
       | HTML tags, my quizzes [1] should make you concerned. Limiting the
       | question to XHTML does alleviate some but not all concerns.
       | 
       | The distinction between recognition and parsing is correct, but
       | parsing doesn't necessarily mean the reconstruction of parse
       | tree. Parsing means the access to constituent nonterminals, which
       | can be used to reconstruct parse tree but also directly used as
       | their own (e.g. calculators). Indeed in most regex
       | implementations you can't extract two or more strings out of each
       | capture (Raku is a notable exception), so you can match against
       | e.g. `(\w+)(?:,(\w+))*` but can't extract a list of comma-
       | separated words with it. Practically speaking this means you
       | can't extract a list of attributes with a single regex, making it
       | unsuitable for HTML parsing anyway.
       | 
       | [1] https://news.ycombinator.com/item?id=26355451
        
         | dataflow wrote:
         | > it is patently false that regular expressions cannot be
         | recursive.
         | 
         | No, it's more like the term "regular expression" has gotten
         | hijacked and nowadays gets abused to colloquially include,
         | shall we say, irregular expressions. i.e. people basically say
         | "regex" when they mean "some succinct pattern language with
         | syntax similarities to (classical) regular expressions".
        
           | lifthrasiir wrote:
           | Regular expressions in the formal language theory do not have
           | captures anyway. The name collision is unfortunate, but we
           | have already established that regexes in practice means a
           | pattern language largely modelled after theoretical regular
           | expressions and not the theoretical regular expressions
           | themselves. At the very least the writing could have
           | mentioned this discrepancy.
        
             | mannykannot wrote:
             | >At the very least the writing could have mentioned this
             | discrepancy.
             | 
             | The original meaning of 'regular expression' is very
             | specific, and has some significant implications which are
             | lost with the now-common and less well-defined usage.
             | Therefore, if anything needed having this discrepancy
             | mentioned, it was your original statement "it is patently
             | false that regular expressions cannot be recursive", as
             | this is an issue where the distinction is crucial. It is
             | good to see that you have now done so, though the way you
             | have done so suggests there is nothing of practical
             | interest in the formal definition, which, I suggest, would
             | be patently false.
        
               | lifthrasiir wrote:
               | I intentionally used the term "regex" elsewhere for that
               | reason, but I later realized that the indirect quotation
               | can be still problematic.
        
             | dataflow wrote:
             | Captures are a red herring here. They don't fundamentally
             | alter the nature of what a regex does, which is to
             | recognize regular languages. Pointing to them as if they're
             | some kind of justification is like calling pilots "drivers"
             | because drivers originally drove wagons, and wagons didn't
             | have rubber tires like cars and planes anyway. It's
             | completely missing that the point of the distinction
             | between a plane and a wagon has always been the land vs.
             | air travel, not modern features like tires or the
             | infotainment systems or what have you.
             | 
             | But yes I guess it'd have been better for the writing to
             | mention the discrepancy in any case.
        
               | nixpulvis wrote:
               | On a related note, lookaheads (and behinds) do somewhat
               | change the fundamental expressive power I believe.
        
               | lifthrasiir wrote:
               | > They don't fundamentally alter the nature of what a
               | regex does, which is to recognize regular languages.
               | 
               | It's a bit subjective but captures are harder than
               | recognition. Russ Cox has once noted [1] that the
               | extraction has to be run as a separate step after the
               | recognition and a fast DFA can't always be used for that,
               | suggesting they are related but different problems.
               | 
               | [1] https://swtch.com/~rsc/regexp/regexp3.html
        
               | mannykannot wrote:
               | Well, if you allow an arbitrary depth of capture-group
               | nesting, then that may be so, but it seems beside the
               | point here. It is not clear to me that this article makes
               | any point about extraction that is relevant to this
               | discussion.
        
       | asddubs wrote:
       | problem is that it doesn't reflect how browsers parse things. if
       | you were using this in a security context, e.g., here's an
       | example it won't detect (granted this is not technically valid,
       | but does it matter?):
       | 
       | <div "> put arbitrary html here as you please (using single
       | quotes for attributes)<div ">
        
       | NtrllyIntrstd wrote:
       | The author seems to be missing the point, in my opinion. While it
       | is certainly true that often one can solve simple, seemingly
       | innocent sub-problems within more general languages, the
       | transitions from "I see I can solve this simple program with
       | regex'es!" to "Then I can probably solve this other, almost
       | identical problem as well!" and have the problem explode right
       | into your face are subtle (almost imperceivable to a novice) and
       | it would be a more robust solution to go for the right tools
       | (i.e. an (x)html parser), as well as a good learning example. On
       | a side note: regular expressions can not - by definition - parse
       | recursive languages. A regular expression matcher that does is
       | not a regular expression parser but an ugly-duckling in the
       | family of context-free grammar matchers. People should learn when
       | and how to use those.
        
         | nixpulvis wrote:
         | The regex is surely faster for the specific case. I can't say
         | I've seen an XHTML parser off hand that allows me to stop
         | parsing after just the start tag. Perhaps a lazy parser could
         | start to compete, but I'm just guessing.
        
           | josefx wrote:
           | Aren't most XML parsers SAX or STaX based? Only time I ran
           | into a library that only offered a full DOM without the
           | underlying event based parser was whatever browsers consider
           | the JavaScript standard library.
        
             | [deleted]
        
             | nixpulvis wrote:
             | You're totally right! Many good stock parsers already
             | stream things (more or less).
             | 
             | Still, I'm just making a comment about the overhead... I
             | would hedge a guess that you're going to have a hard time
             | beating a regex with an HTML parser for speed, assuming
             | what you want can be done with both.
             | 
             | This is all irrelevant, because as the OP mentions, the SO
             | question at hand cannot be solved with standards compliant
             | parsers because self-closing tags will not be
             | distinguishable.
        
           | Akronymus wrote:
           | I believe you could build such a parser out of parsec.
           | Altough, I am not sure if that is exactly what you are going
           | for.
        
         | wodenokoto wrote:
         | The article goes as far as to say that a parser is not the
         | right tool.
         | 
         | > Not only can the task be solved with a regular expression -
         | regular expressions are basically the only practical way to
         | solve the problem. Which is why none of the clever answers
         | actually suggest another way to solve the problem.
         | 
         | So no, the author is not missing the point at all.
        
           | [deleted]
        
           | IshKebab wrote:
           | I mean that bit is clearly wrong. An XML/HTML parser is a
           | perfectly practical way to solve the problem.
           | 
           | However I completely agree that they didn't miss the point. A
           | regex to do this might be fine for hacky things that you
           | don't need to be robust (e.g. for searching for stuff,
           | measuring stats, one-off scripts etc.).
        
         | yoz-y wrote:
         | But the original SO question does not imply that they want to
         | solve a more complex problem. The SO asker explicitly asked for
         | opinions, so that's what they got. However, I absolutely think
         | it is the right choice to choose simpler tools to solve simpler
         | problems, as long as you are aware of the implications.
        
       | nooyurrsdey wrote:
       | This is one of those things that people will debate about
       | endlessly and ultimately it feels so silly.
       | 
       | The poster asked how to do it, and this person provided a
       | practical regex to cover most (if not all) cases.
       | 
       | Everything else is just pedantic debate.
        
         | h2odragon wrote:
         | The pedantic discussions can be fun and educational, but the
         | regex based hack gets the job done in a few minutes, while the
         | pedants are still wrestling with parsers and libraries.
         | 
         | ... and then there's the anticipated joy of seeing the pedants'
         | complicated, theoretically correct solution explode because the
         | input wasn't what they assumed, in the first place.
         | 
         | The pedants that have that experience either become
         | enlightened, or vociferously strident about the importance of
         | proper, theoretically correct solutions in place of quick
         | hacks.
         | 
         | Thus the meme status of the SO answer.
        
           | IshKebab wrote:
           | The issue is that _sometimes_ you should use a robust parser
           | and do it properly, and _sometimes_ a hacky regex is fine.
           | But people forget that when arguing about which you should
           | use.
        
       | SavantIdiot wrote:
       | Oh I've seen this many times in different forms. Especially with
       | regexes.
       | 
       | You know what this is a great example of? A case where hacking
       | makes a mess, and thinking before coding solves the problem.
       | 
       | The madness comes from using the wrong tool for the problem. Yes,
       | you can hack a regex to parse XHTML this might be "good enough",
       | but it is more robust, cleaner and easier to explain if you use a
       | lexical tokenizer and a grammar.
       | 
       | The lure is an illusion that comes from an initial effort
       | assessment. Where the effort to hack a quick-and-dirty regex
       | (call this Ehack) vs a "oh, man, you mean I gotta think about the
       | problem" (call this Ethink) appears as "Equick <<< Ethink."
       | However, it soon evolves to the scenario where "Equick >>>
       | Ethink," driven by the thought process, "I'm almost there, this
       | regex just needs one more tweak." Aka, the gambler's fallacy: it
       | comes into play and the sunk costs are ignored.
       | 
       | TL;DR - Use the right tool for the problem, even if it means a
       | slightly larger up-front effort investment.
        
         | Rapzid wrote:
         | Yeah, and you never actually end up _solving_ the _problem_.
         | You just end up solving every edge case that comes up :D
         | 
         | It's the same mentality that can lead to fixing but symptoms
         | without making any real progress on the underlying issues..
         | Sometimes that's good enough I guess.
        
       | tester756 wrote:
       | uhh?
       | 
       | just because you can doesn't mean you should
       | 
       | just take a look at proposed regex
       | 
       | >(
       | 
       | > # match all tags in XHTML but capture only opening tags
       | 
       | > <!-- . _? -- > # comment
       | 
       | > | <!\\[CDATA\\[ ._? \\]\\]> # CData section
       | 
       | > | <!DOCTYPE ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* >
       | 
       | > | <\? . _? \? > # xml declaration or processing instruction
       | 
       | > | < \w+ ( "" [^""]_ "" | ' [^']* ' | [^>/'""] )* /> # self-
       | closing tag
       | 
       | > | < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > #
       | opening tag - captured
       | 
       | > | </ \w+ \s* > # end tag
       | 
       | > )
       | 
       | it's ugly as hell
       | 
       | >Parsing typically uses (at least) two steps: Tokenization which
       | uses regular expressions to splits the input string into a
       | sequence of syntax elements
       | 
       | I don't use regex for tokenization, I'm doing something wrong?
       | 
       | But overall I think this is important post, even despite I
       | believe that regex is the best example of "good idea, shitty API"
        
       | BiteCode_dev wrote:
       | Weird article that basically says people are wrong then prove
       | they are right.
        
       | motoboi wrote:
       | The problem is: you cannot parse malformed (real, everyday) html
       | with regexes.
       | 
       | But if you need to parse html even malformed) generated by the
       | same template (like a scrapping situation), the whole file
       | becomes regular, which can be parsed by a regular expression.
       | 
       | But if you try to parse html in general, too bad because then
       | you'll need to take html in consideration and will need a
       | recursive descent parser, not a regex.
       | 
       | This question popped up so many times in forums in 2000's that
       | people got mad at that.
        
       | math-dev wrote:
       | Great article (assuming the solution provided works).
       | 
       | I do a lot of parsing in my projects, I find natural text based
       | input vital for power users who don't to point and click always.
       | 
       | What are some good parsing algorithms, theoretical articles etc
       | to help me become more professional in the parsing tools I write?
        
         | [deleted]
        
       | inopinatus wrote:
       | > The question is about finding opening tags in XHTML using a
       | regular expression
       | 
       | Bzzzt, wrong, sorry! The question is about finding open tags in
       | the presence of XHTML self-closing tags. That difference alone
       | places these interpretations gulfs apart. But there's more: it
       | does not specify that the input document is even XHTML, _only_
       | that XHTML-style self-closing elements may be present. In fact
       | the original question was barely minutes old and tagged merely
       | "regex" when that famous answer was written in 2009; the question
       | was not tagged with "xhtml" until 2012, and not by the original
       | author either.
       | 
       | Revealingly, then, if we review the broader context (i.e question
       | history) of the original question author, it's clear that yes
       | indeed _they were trying to fix a malformed document_ , and in
       | particular to normalise it _into_ XHTML, with focus on fixing up
       | any so-called "dangling tags". For this task, the suggestion of
       | "use a parser" is indeed sound advice.
       | 
       | The real moral here is, don't be a jerk about the precise
       | semantics of a question, look at what the person _needs_ , and
       | help them ask better questions.
       | 
       | Otherwise, you're just gonna discover that there's always a
       | bigger jerk, and they're on Stack Overflow, moderating your
       | stuff.
        
         | a1369209993 wrote:
         | > For this task, the suggestion of "use a parser" is indeed
         | sound advice.
         | 
         | Perhaps technically, but it's also useless advice because a
         | parser does not exist for their particular flavor of malformed
         | XHTML. XHTML parsers parse XHTML, which you yourself have said
         | it wasn't:
         | 
         | > they were trying to fix a malformed document
         | 
         | So in the absence of a reference to a particular malformed-
         | XHTML-recovering parser (which may or may not work on the
         | specific input they have, but "try this thing" is at least
         | actionable advice), "use a parser" amounts to "write a entire
         | parser yourself, then use it".
         | 
         | > don't be a jerk about the precise semantics of a question,
         | look at what the person needs
         | 
         | Pot, kettle.
        
           | [deleted]
        
           | inopinatus wrote:
           | > "use a parser" amounts to "write a entire parser yourself,
           | then use it"
           | 
           | "Use a parser" is a common answer, besides being the accepted
           | one, and with good reason: it'll work. The world is not short
           | of HTML parsers (although, who knows, perhaps PHP may have
           | been short of very good parsers back in 2009). Serializing
           | XML from the resulting memory structure, DOM or otherwise,
           | closes the loop, and this remains a conventional and
           | commonplace means to normalize some incoming HTML-like mush
           | into something that can be interpolated into XHTML (or, more
           | specifically I think, XML RSS) and a strict receiver will at
           | least accept.
           | 
           | > Pot, kettle.
           | 
           | Oh look, personal abuse! Good-day to you, too.
        
         | IshKebab wrote:
         | Read the regex. It handles self-closing tags fine.
         | 
         | Also you're doing the _intensely_ annoying thing that lots of
         | StackOverflow people do of imagining that the asker really
         | wanted to ask a different question. It happens sometimes. But
         | you shouldn 't just jump in and _assume_ that they don 't know
         | what they want and you're so much smarter than them so you know
         | what they really want.
         | 
         | Offer additional answers if you want, but _answer the question
         | they asked first_.
         | 
         | (Sorry, pet peeve.)
        
           | inopinatus wrote:
           | You're not the first to take that line, so I'll refer you to
           | my previous observations:
           | https://news.ycombinator.com/item?id=27097403
           | 
           | There's no wild assumption going on here. I just bothered to
           | keep reading, very carefully, everything the original author
           | actually wrote.
           | 
           | Then, please, further reflect that Stack Overflow is not
           | Codewars; it is a forum for practical, focused, and relevant
           | problem-solving advice, and at its best the moderation and
           | answer processes help folks to iteratively revise and improve
           | their questions. A crucial step is, therefore, analytically
           | clarifying both the parameters and the intended outcome.
           | 
           | Contextualisation and focusing of requirements is a familiar
           | and essential skill for any programmer being handed a
           | requirements statement, and answering S.O. questions involves
           | the same exercise, just in vignette.
           | 
           | The handling of this question was not, in this sense, the
           | best. But expecting anything less, would just be a load of
           | fizzbuzz.
        
         | goto11 wrote:
         | The problem is the answers saying "this is not possible".
         | 
         | The OP asks a perfectly reasonable question. The answers assume
         | the OP _actually_ meant to ask a different question, and they
         | they ridicule the OP for this imagined question.
         | 
         | The question they _imagine_ the OP asks was:  "How do I parse
         | an arbitrary HTML document into an element-tree using only a
         | single regular expression and not using any auxiliary data
         | structures, not even the call stack?"
         | 
         | Yes, this is indeed not possible, given the limitations of
         | vanilla regular expressions.
         | 
         | But that was not the question asked.
         | 
         | Of course "use a parser" is perfectly sound advice (if a parser
         | exists which can solve the OP's problem). But saying what the
         | OP is attempting to do (tokenize xhtml) is _impossible_ is
         | absurd, since then it would also be impossible to write a
         | parser!
        
           | inopinatus wrote:
           | They're trying to tokenize HTML into two arrays, one being an
           | array of opening tags and one of closing tags, with the hope
           | being to pairwise compare and reconcile the elements of these
           | arrays.
           | 
           | The necessary clarification appears further down the page.
           | 
           | As for whether it's a reasonable question; as written I beg
           | to differ, it's the opposite, since it does not convey their
           | problem except by a very careful and nuanced reading. The
           | proper response of the S.O. community should've been to aid
           | the OP in clarifying their intentions and amending the
           | question accordingly to bring focus onto their actual
           | problem.
           | 
           | That is not what happened. Instead they are indeed on the
           | receiving end of some ridicule, which is shameful, but the
           | advice of "use a parser" would still likely have been a top
           | answer in some fashion had the OP's scattered ancillary
           | questions and clarifications been incorporated.
           | 
           | Sadly, only one correspondent of many seems to have seen fit
           | to ask.
        
         | 1vuio0pswjnm7 wrote:
         | What if we looked at the XHTML parsers trusted by the people
         | who mindlessly dismiss the utility of regular expressions and
         | found they were constructed using a lexer that relied on
         | regular expressions.
        
           | inopinatus wrote:
           | Well, since I've learned at least four assembly languages,
           | and wrote my first toy lexer at least 25 years ago, probably
           | in Standard ML, by this standard I seem qualified to comment.
           | 
           | It doesn't matter what the parser looks like under the
           | bonnet. What matters is the utility it provides.
           | 
           | One might otherwise similarly offer the advice to mine your
           | own copper and grow your own silicon, these being equally
           | essential activities for anyone seeking the ultimate in
           | mechanical sympathy.
        
           | [deleted]
        
         | swiley wrote:
         | Here's the original text for reference:
         | Locked. Comments on this question have been    disabled, but it
         | is still accepting new answers and    other interactions. Learn
         | more.                  I need to match all of these opening
         | tags:                  <p>         <a href="foo">
         | But not these:                  <br />         <hr class="foo"
         | />                  I came up with this and wanted to make sure
         | I've got it right. I am only capturing the a-z.
         | <([a-z]+) *[^/]*?>                  I believe it says:
         | Find a less-than, then             Find (and capture) a-z one
         | or more times, then             Find zero or more spaces, then
         | Find any character zero or more times, greedy, except /, then
         | Find a greater-than                  Do I have that right? And
         | more importantly, what do    you think?         html
         | regex         xhtml         Share         Improve this question
         | edited May 26 '12 at 20:37         community wiki
         | 11 revs, 7 users 58%         Jeff
        
           | inopinatus wrote:
           | Not sure which revision that's pasted from, but it is not the
           | original text, which can be found at
           | https://stackoverflow.com/revisions/1732348/1 and I strongly
           | advise against neglecting the title; doing so is how some
           | folks blunder into misinterpretation, since the wording of
           | that title is telegraphing a quite different underlying need
           | to "please solve this regex puzzle".
        
             | DangitBobby wrote:
             | Reading both, I see little to no difference. Is there an
             | important change between the two revisions that I am
             | missing? Also, my reading of the title does not, in fact,
             | telegraph anything other than "help me with this regex
             | puzzle."
        
               | inopinatus wrote:
               | I recommend consulting my other remarks within this
               | topic, but on this specific point:
               | 
               | The title's inconsistency with the body text, i.e. "open
               | tags" vs "opening tags", is especially and immediately
               | notable because they are (in context) grammatically
               | interchangeable but have dissimilar meanings. This is
               | immediately suggestive of (but _not_ diagnostic of) a
               | writer revealing context and then switching to detail. As
               | a longtime reader of requirements documents and S.O.
               | questions, a mental flag to check the intended meaning of
               | both is already raised at this point.
               | 
               | The reference to XHTML is ambiguous, since it speaks to
               | working around XHTML that is already present, rather than
               | defining an input or output document, which means this is
               | something thinking at the character level rather than in
               | terms of DOM. This impression is verified by the
               | comparison pairs in the question body, leaving the
               | essential context question of _what are the desired
               | inputs and outputs? "_ glaringly unanswered.
               | 
               | The early construct, "I need to ..." is a secondary flag
               | that immediately reinforces the likelihood of a critical
               | gap in context, since it does not explain _why_ the need
               | has arisen.
               | 
               | To anyone familiar with the structure & semantics of the
               | HTMLs, the omission of _any_ mention of tag closing,
               | having discussed  "opening tags", boosts the sense that
               | something relevant is missing from the question. The
               | obviously flawed regular expression then attracts a
               | "probable novice" qualifier, which is amplified by both
               | the closing "what do you think?" and the original's
               | emoticon.
               | 
               | At this point, we're about ten seconds into the read-
               | through and there's already a ton of labels pointing to a
               | beginner who may be conflating dissimilar concepts and
               | (through inexperience) choosing the wrong tool for the
               | task, whatever that may be. Another few seconds to review
               | comments, at which point it appears that author _does_
               | understand the distinction between an  "open tag" and an
               | "opening tag", and this reweights the XHTML reference in
               | the title toward being a need to _output_ strict XHTML.
               | 
               | Given their apparent beginner standing it's likely this
               | is either their first S.O. post, or one of slightly too
               | many, and a check in their profile for proximate
               | questions immediately reveals the latter case. References
               | to balancing of "dangling tags" abound, collapsing all
               | preceding problem variants and likelihoods into the only
               | focused understanding that fits: it's a PHP novice,
               | bright but inexperienced, who wants to normalise a
               | nonconforming document into XHTML by balancing the tags,
               | and that the mostly likely use case is injecting an
               | existing HTML fragment into a RSS XML feed.
               | 
               | Elapsed time ca.90 seconds.
               | 
               | Another minute to scroll through surrounding Q&A
               | materials, to allow alternative options to appear (they
               | don't).
               | 
               | A few seconds for a chuckle at that old friend of an
               | amusing answer, and recognise that the advice within is
               | sound: what they really need is, indeed, a parser. Then,
               | a pause to consider a contrasting notion, that what
               | they're trying to do is _write_ a parser. Consider
               | evidence that author has taken at least one step towards
               | inventing stacks on their own from first principles;
               | prospect loses to existing top answer because the
               | experience level required to write parsers cannot,
               | ultimately, be acquired from a single S.O. question.
               | 
               | Last but not least; the final boss: convincing the chorus
               | of Hacker News to accept this interpretation. That takes
               | _much_ longer.
        
               | DangitBobby wrote:
               | Consider alternatively that this is a tiny piece in the
               | much broader puzzle of what they are trying to
               | accomplish, that they are aware both of their own
               | beginner status but also that, in this case, good enough
               | will be good enough, or that they don't have the time or
               | inclination to switch to a real parser and that's why
               | they didn't ask about it. Or maybe even that the goal
               | here is to specifically learn to use regex to solve a
               | real software problem that they have. It's all a matter
               | of interpretation of context that we don't have. As an
               | asker of questions on SO where people provide answers to
               | questions I specifically intended to not ask, this
               | pattern of reading deeply into context we don't have is
               | actually kind of annoying. Unless you provide both the
               | answer to the actual question and then some guidance on
               | how you _really_ should be doing something. Those answers
               | are very, very helpful.
        
               | inopinatus wrote:
               | These objections may seem relevant to your personal
               | experience, but they don't pertain to the case at hand,
               | for which ample context is available.
               | 
               | Dealing with responses from people who've rushed to write
               | something without bothering to properly consider the
               | problem is, of course, internet 101, but such answers
               | generally take the short road to content oblivion.
               | 
               | For those answers of any quality, however, note this: if
               | people are answering, for free, on their own time,
               | questions that you don't think you asked, but they
               | evidently do, then consider taking some personal
               | responsibility for that outcome. Complaining about it
               | would seem wholly entitled and ungrateful, and
               | unconstructive besides. If they are well-meaning
               | beginners, _coach them_ : it's how the stone soup gets
               | made.
        
               | DangitBobby wrote:
               | I also both ask and respond to questions. Mostly respond.
               | For free. On my own time. People who answer on SO are not
               | a unique special breed in the sense that they are willing
               | to help others with their problems. I have read your
               | reading of the thread, gone through it myself with that
               | in mind, and disagree wholeheartedly that you have
               | meaningfully extrapolated the context you claim.
        
               | inopinatus wrote:
               | That much is quite evident.
               | 
               | Perhaps we'll never know if Jeff's XML RSS feed was
               | conforming or not, but I like to think they got there.
        
           | [deleted]
        
       | funyunpowder wrote:
       | based on the stackoverflow thread, and then the comments here, an
       | interesting research paper topic would be 'why do people get so
       | passionate about regex'
        
       | AzzieElbab wrote:
       | Long regexes are the root of all evil
        
         | crispyambulance wrote:
         | Long regex's are not the root of all evil but they're certainly
         | a tendril of bad-practice.
         | 
         | I would say that the OP's usage, a home-made concoction to find
         | all opening tags in an xhtml doc, crosses the line into bad
         | practice. Get a room and use a parser, people!
        
           | goto11 wrote:
           | So if using regular expressions is "bad practice", how should
           | one write the tokenizer or lexer stage of a parser?
        
         | jrockway wrote:
         | Regexes are computer programs just like any other computer
         | program. A long program, terse syntax, embedding one
         | programming language in another programming language, etc. are
         | risk factors for unmaintainable code, but are manageable in the
         | same way that you manage the risk inherent in other complex
         | computer programs.
         | 
         | I did always like elisp's "rx"[1] library for removing some of
         | the terseness. Not enough to ever use it, of course, but it's
         | an interesting idea.
         | 
         | [1] https://www.emacswiki.org/emacs/rx
        
         | goto11 wrote:
         | If the task at hand is something which _can_ be solved with a
         | regex, then any other solution (e.g manual string scanning)
         | would be far more complex.
        
           | AzzieElbab wrote:
           | I am guilty of this type of thinking as well. I saw the
           | errors of my ways when I had to fix someone's else regular
           | expressions
        
             | goto11 wrote:
             | So what do you use instead of regular expressions for such
             | tasks?
        
               | hansvm wrote:
               | IMO the biggest problem with regexes as seen in the wild
               | is a lack of composability. If you need some kind of
               | pattern like "[setA][setA+setB]{0,n}" then you'll copy-
               | paste the definition of setA in both places. If you need
               | to re-use that entire regex you'll copy-paste it again
               | and construct a monstrous string with a really well-
               | defined structure that isn't even slightly apparent
               | without a reverse engineering session.
               | 
               | Up to a point you can solve that by just giving names to
               | relevant sub-expressions, using a regex builder, etc, but
               | in my experience if I'm going to write even a moderately
               | complicated regex I'll probably be better served with
               | something like parsec (a python implementation here [0])
               | in whichever language I'm currently using.
               | 
               | That isn't to say that regexes don't have their place --
               | anything quick and dirty, situations where you need to
               | handle unsanitized input (mind you, the builtin regex
               | engine is probably vulnerable to exponential runtime
               | attacks) and don't want to execute a turing-complete
               | language, etc.... I just think it has bad ergonomics for
               | any parser you might use more than once, and I haven't
               | yet regretted using parsec in situations where a complex
               | regex would have sufficed.
               | 
               | [0] https://pythonhosted.org/parsec/
        
               | User23 wrote:
               | Perl is great for this. It's been a long time since I've
               | written any, but with the right flags and use of qr// you
               | can write extremely readable perlre.
        
       | 03b17999-4268 wrote:
       | I haven't read the rest of the thread, but the article is even
       | wronger than the glib replies there. HTML needn't be well formed.
       | There are adhoc rules which let major browsers parse broken HTML.
       | If you do not follow the spec to the letter you will have your
       | tooling break on input that every browser thinks is acceptable.
       | 
       | Which is why you always use whatever html parsing library comes
       | with your language. There is no simple answer in the thread
       | because there is no simple answer in the real world.
       | 
       | That said, anyone who says:
       | 
       | >It is quite possible and not even that difficult:
       | (              # match all tags in XHTML but capture only opening
       | tags             <!-- .*? --> # comment              |
       | <!\[CDATA\[ .*? \]\]> # CData section           | <!DOCTYPE ( ""
       | [^""]* "" | ' [^']* ' | [^>/'""] )* >            | <\? .*? \?> #
       | xml declaration or processing instruction             | < \w+ (
       | "" [^""]* "" | ' [^']* ' | [^>/'""] )* /> # self-closing tag
       | | < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > #
       | opening tag - captured             | </ \w+ \s* > # end tag
       | )
       | 
       | Should be seriously mentored by someone.
        
         | megous wrote:
         | > Should be seriously mentored by someone.
         | 
         | Maybe people who think a basic regex such as this is difficult,
         | need some mentoring.
         | 
         | It doesn't even use lookahead/lookbehind or other more
         | complicated features. It only uses non-greedy matching 2 times,
         | which is the only thing that's more complicated than the basic
         | AND/OR logic expressed by the rest of the expression.
        
         | tannhaeuser wrote:
         | You make this more complex than it actually is. HTML is
         | basically SGML with a DTD declaring rules for tag
         | omission/inference, empty elements, and attribute shortforms.
         | This is what the majority of XML heads seem to struggle with,
         | but which nevertheless is every bit as formal as XML is, XML
         | being just a proper subset of SGML, though too small for
         | parsing HTML.
         | 
         | Then HTML has a number of historical quirks including:
         | 
         | - the style and script elements have special rules for comments
         | that would make older pre-CSS, pre-Javascript browsers just see
         | markup comments and ignore those
         | 
         | - the URL syntax using & ampersand characters needs to be
         | treated special because & starts an entity reference in SGML's
         | default concrete syntax
         | 
         | - the HTML 5 spec added a procedural parsing algorithm (in
         | addition to specifying a grammar-based spec) that basically
         | parses every byte sequence as HTML via fallback rules; for most
         | intents and purposes, the language recognized by these rules,
         | taken to the extreme, is not what's commonly understood as HTML
         | 
         | - WHATWG have added a number of element content rules on top of
         | the HTML 4.01/HTML 5.0 baseline with ill-defined parsing rules
         | (such as the content models for tables and description lists);
         | the reason is precisely that WHATWG, once Ian Hickson had
         | distilled the HTML DTD 4.01 grammar rules into the HTML 5
         | grammar presentation as prose, a formal basis was no longer
         | used for vocabulary extension
        
         | dgrunwald wrote:
         | But both the original question and that article were about
         | XHTML. The non-well-formed mess only matters for HTML, not
         | XHTML. The regex is a valid answer to the stack overflow
         | question.
        
           | WesolyKubeczek wrote:
           | Sorry, but no.
           | 
           | The original question only mentions the author wants to
           | ignore XHTML-style self-closing tags which is in no way
           | implying the input to be well-formed XHTML.
        
         | Rapzid wrote:
         | The CDATA section doesn't appear to match this first example
         | from Wikipedia:                   <![CDATA[<sender>John
         | Smith</sender>]]>
         | 
         | So it's either bugged due to the spaces or there is something
         | else going on I don't understand; complexity.
         | 
         | Agree on the sentiment. Through a "Simple made easy" lens, easy
         | or hard, there is a lot of complexity in the task and this
         | solution..
        
       | dataflow wrote:
       | Try that regex on                 < script>
       | console.log("<script2>"); </script>
       | 
       | Edit 1: I'm unsure if the inner <script2> is valid (X)HTML, so it
       | might not be an issue of being unable to parse correct (X)HTML,
       | but rather an issue of being unable to detect invalid (X)HTML.
       | (Can someone verify?)
       | 
       | Edit 2: It seems Chrome chokes on the space... does anyone know
       | if the initial space is valid? I'm pretty sure I've seen parsers
       | that accept it...
        
         | adament wrote:
         | But it should be easy based on this example to include correct
         | HTML tags in the script which the regular expression will emit.
         | Or if you want to recognise HTML tags in the script, you can
         | easily obfuscate construction of in the script using string
         | concatenation.
        
         | eCa wrote:
         | I don't think any part of that is valid XML. There cant be
         | space between < and the tag name, and I believe content
         | containing tags should be in a CDATA section.
        
         | goto11 wrote:
         | The question is about XHTML, not HTML. HTML does not even have
         | the self-closing tags the question is concerned about.
         | 
         | In XHTML, either the opening angle bracket must be escaped, or
         | the script should be in a CDATA section.
        
         | capableweb wrote:
         | > Edit 2: It seems Chrome chokes on the space... does anyone
         | know if the initial space is valid? I'm pretty sure I've seen
         | parsers that accept it...
         | 
         | Most browsers probably can deal with it, but it's not valid
         | xml/html. Try passing it through a validator, it'll complain
         | about foreign characters after `<` and then complain about a
         | trailing `</script>` as <script> was never opened in the first
         | place.
        
       | chubot wrote:
       | This conversation would be a lot clearer with a distinction
       | between "regexes" and "regular languages". The former is what
       | Perl/Python/etc. have, and the latter is a mathematical concept
       | (and automata-based non-backtracking engines like RE2, re2c, and
       | rust/regex are closer to this set-based definition).
       | 
       | https://www.oilshell.org/blog/2020/07/eggex-theory.html
       | 
       | With those definitions, this part of the snarky answer is wrong:
       | 
       |  _HTML is not a regular language and hence cannot be parsed by
       | regular expressions_
       | 
       | That is, regular expressions as found in the wild can parse more
       | than regular languages. (And that does happen to be useful in the
       | HTML case!)
       | 
       | This answer is also irrelevant, since the poster is asking for a
       | solution with regexes, NOT regular languages:
       | 
       |  _I think the flaw here is that HTML is a Chomsky Type 2 grammar
       | (context free grammar) and a regular expression is a Chomsky Type
       | 3 grammar (regular grammar)._
       | 
       | In this post, the example given IS a regex, but it IS NOT a
       | regular language:                   <!-- .*? --> # comment
       | 
       | The nongreedy match of .*? isn't a mathematical construct; it
       | implies a backtracking engine.
       | 
       | I gave my analysis here and listed 3 or 4 caveats:
       | https://news.ycombinator.com/item?id=26359556
       | 
       | I prefer to use regular languages and an explicit stack, but this
       | is not really what the original question was asking.
        
         | a1369209993 wrote:
         | > This conversation would be a lot clearer with a distinction
         | between "regexes" and "regular languages".
         | 
         | Very much so.
         | 
         | > In this post, the example given IS a regex, but it IS NOT a
         | regular language: `<!-- .*? --> # comment` The nongreedy match
         | of .*? isn't a mathematical construct; it implies a
         | backtracking engine.
         | 
         | Actually, that's [edit: "it IS NOT a regular language"] wrong,
         | at least in principle. If you're limiting it to only the
         | shortest match (which is how HTML (and most other) comments
         | actually work), then (just like `(abc)+` is shorthand for
         | `(abc)(abc)*`) `<!-- .*? -->` is shorthand for (assuming I
         | haven't made a mistake in the for-lack-of-a-better-word-
         | arithmetic):                 <!-- ([^ ]| [^-]| -[^-]| --[^>])*
         | -->
         | 
         | That is, shortest-repetition-only can be implemented in a
         | purely regular system.
         | 
         | On the other hand, if you want to allow longer matches when
         | actually needed, then for purely-regular purposes (where it
         | either matches or not) `<!-- .*? -->` is just a wierd way of
         | writing `<!-- .* -->` (which is quite obviously a regular
         | language).
        
           | goto11 wrote:
           | Does your regex assumes that "-->" must be prefixed by a
           | space? This is not the case in XML. (Also the string "--"
           | must not occur inside a comment, so the last clause is not
           | necessary.)
        
             | a1369209993 wrote:
             | > Does your regex assumes that "-->" must be prefixed by a
             | space?
             | 
             | Yep, because the quoted regex assumed the same thing, and I
             | didn't see a point in editorializing more than necessary.
             | 
             | > Also the string "--" must not occur inside a comment, so
             | the last clause is not necessary.
             | 
             | "Must not" seems unreliable in webpage parsing. What page
             | does your XHTML parser produce when fed text of the form
             | `<!-- --is this a comment? <a href=-->`, for example?
        
           | chubot wrote:
           | Yeah I think you are right -- the nongreedy match can be
           | simply written as a more awkward pattern. I think regular
           | language negation and intersection also help (which the rarer
           | derivatives-based implementations seem to have). They are
           | still regular and equivalent in power, but there's a richer
           | set of operators for writing patterns.
           | 
           | I would also divide it into the "recognition problem" and the
           | "parsing/capturing problem".
           | 
           | Recognition is just yes/no -- does it match? And notably that
           | excludes common regex APIs that either return the end
           | position (Python's re.match) or search (re.search). It is
           | more like math -- is it in the set or not?
           | 
           | For the recognition problem, there's no difference between
           | greedy and non-greedy, in terms of the result. (It does
           | matter in terms of performance, which shows up in practice in
           | backtracking engines!)
           | 
           | But parsing / capturing is more relevant to programmers. I
           | don't remember all the details but there is some discussion
           | on the interaction between greediness and capturing here:
           | https://swtch.com/~rsc/regexp/regexp2.html
           | 
           | It looks like it can all be done in linear time and RE2
           | supports it fine.
           | 
           | So in that case maybe the 2 questions are closer than I
           | thought. I have a very similar set of regexes that I use for
           | parsing (my own) HTML, but I use one regex for each kind of
           | token, and I do it with the help of an explicit stack.
           | 
           | I'm using Python's engine so I think it is better to separate
           | each case, for performance reasons. But maybe with RE2 it
           | would be fine to combine them all into one big expression as
           | is done here, for "parallel" matching. The expression is a
           | little weird because it's solving a specific problem and not
           | something more general (which is useful and natural).
        
             | a1369209993 wrote:
             | > common regex APIs that either return the end position
             | (Python's re.match) or search (re.search).
             | 
             | Yeah, I probably should have explicitly said that's what
             | the first translation (`[^a]|a[^b]|ab[^c]|...`) was for;
             | it's a optimization (possibly-backtracking-parser ->
             | [ND]FA) I've used a couple of times to beat things into
             | guaranteed O(N) time.
             | 
             | > But parsing / capturing is more relevant to programmers.
             | 
             | I'd debate "more", since that's a additional thing _on top
             | of_ matching and searching. Any case where you need the
             | former, you also need the latter to even know what _to_
             | capture. But it 's definitely something you _do_ frequently
             | need.
        
       | amelius wrote:
       | 1. determine size of XHTML input
       | 
       | 2. build regex that works up to the size determined in step 1.
       | 
       | 3. apply regex
        
       | shmageggy wrote:
       | It's not explicitly stated, but I believe the author's point is
       | that the original question didn't require a recursive solution
       | (because it's only asking about individual tags, not matching
       | opening tags with their closing partners)
       | 
       | Edit: yes looking at the answers, someone pointed this out in a
       | comment response to the"Chomsky" answer:
       | 
       | > _The OP is asking to parse a very limited subset of XHTML:
       | start tags. What makes (X)HTML a CFG is its potential to have
       | elements between the start and end tags of other elements (as in
       | a grammar rule A - > s A e). (X)HTML does not have this property
       | within a start tag: a start tag cannot contain other start tags.
       | The subset that the OP is trying to parse is not a CFG. - LarsH
       | Mar 2 '12 at 8:43_
        
         | inopinatus wrote:
         | Look again. The context is definitely one of identifying tags
         | that should be closed, but aren't, and being identified thus in
         | order to normalise a malformed document into well-formed XHTML.
         | It is not prominently stated, but it is nevertheless the case.
         | Consequently, balance matters. The clue is in the title: "match
         | open tags", for which the body text has only one part of the
         | author's train of thought (i.e identifying the opening tag).
         | Further discussion of matching the closing tag is omitted from
         | the body of the question, and many people forget (or disregard)
         | the nuanced difference in the title at this point in their
         | read-through, not stopping to ponder "why the heck would they
         | just want the opening tags? what have I missed here about the
         | context?" and instead treating it like some particularly awful
         | and weirdly contrived exam question, as in the article at the
         | top.
         | 
         | The final confirmation of all this is in the question history
         | of the question author around the same date: it's definitely
         | what they were working on.
         | 
         | Consequently, our zalgo-spewing correspondent has it right,
         | they should use a parser, and in addition the question
         | should've received feedback early to help them describe the
         | context more clearly.
         | 
         | That all this was never properly clarified is a failure of
         | moderation, and further a demonstration of how many folks
         | struggle to challenge (or even identify) their own assumptions.
        
           | goto11 wrote:
           | "Match open tags" obviously refers to writing a regular
           | expression which matches opening tags, not to pair opening
           | tags with end tags.
           | 
           | If you look at the regex which the OP themself suggests, it
           | is clearly only intended to match opening tags (excluding
           | self-closing tags), not search for corresponding end-tags.
        
             | inopinatus wrote:
             | Well, as now expressed at hopefully sufficient length, it
             | _doesn't_ just say solely that, unless one a) disregards
             | the difference in phrasing, and then b) disengages any
             | sense of purpose and practical utility and instead treats
             | it like a badly worded test question.
             | 
             | It's kind of a shibboleth, in a way, for developer
             | sensitivity to actual needs, as opposed to getting hung up
             | on how clever they are.
        
               | jcelerier wrote:
               | > disengages any sense of purpose and practical utility
               | and instead treats it like a badly worded test question.
               | 
               | Who are you to know the purpose and utility better than
               | the person who asked the question ?
        
               | inopinatus wrote:
               | Oh, are they here? It'd be fascinating to hear from them.
               | 
               | Alternatively, perhap, that hostile tone is suggesting
               | I'm personally unqualified to interpret loosely framed
               | questions? I suppose, since I've only been doing it for a
               | few decades, I'm definitely a novice by any standard, and
               | my tendency to observe and follow up on anomalous,
               | incomplete, subtly conflicting, or otherwise inexplicable
               | requirements by investigating both the timeline and
               | substance of the original context, and the apparent
               | motivations and outcome preferences of the author, is
               | sheer beginners luck, and any uptick in stakeholder
               | utility that from time to time accompanies amending
               | recommendations following such investigative and
               | analytical activity a sheer coincidence! So that must be
               | it - as you can probably tell from all this smug, empty
               | bravado, I'm really just sharing pure speculation, wild
               | guesswork, total fluke, impertinent leaps of inferential
               | faith, only just grasping at the vague outline of my own
               | blind spots et cetera et cetera, and consequently _yes_ ,
               | I'd love to hear the original intent restated from the
               | horse's mouth, too; but, for the meantime, I'll read the
               | tealeaves, systematically analyse and synthesise to the
               | best of my ability, describe and discuss any substantive
               | points of comprehension that I think might help enrich,
               | or at any rate challenge, a reader's perspective
               | (including my own), and cross my fingers hoping to read,
               | mark, and inwardly digest what new understanding or
               | revealed wisdom as I can - even when it comes, as it has
               | there and here, via diacritical allegory and dialectical
               | hellfire.
               | 
               | Or, finally, if you just want the TL;DR version, the
               | reason I feel unshakeably comfortable asserting that the
               | question author's actual purpose is normalising a
               | nonconforming document into XHTML, by balancing open &
               | close tags, is _because they said so_.
               | 
               | It's in an answer comment about halfway down the first
               | page.
               | 
               | > "Can you provide a little more information on the
               | problem you're trying to solve"
               | 
               |  _" Determining all the tags that are currently open,
               | then compare that against the closed tags in a separate
               | array"_
               | 
               | If that's not enough, here are quotes from their other
               | questions, posted in the minutes and hours prior:
               | 
               |  _" How to balance tags with PHP"_, and
               | 
               |  _" I need to write a function which closes the opened
               | HTML tags."_
               | 
               | Sometimes, we just have to bother reading what's in front
               | of us.
        
               | [deleted]
        
               | jcelerier wrote:
               | > Sometimes, we just have to bother reading what's in
               | front of us.
               | 
               | You are making an absolutely great example of that. I was
               | absolutely not talking about the original SO post, but
               | about the generally extremely entitled answers which
               | assume the existence of a very specific X to the Y of a
               | post.
        
               | inopinatus wrote:
               | If I understand correctly, you're suggesting "Who are
               | you" wasn't directed at me personally, "the person"
               | wasn't referring to the OP but all possible authors, and
               | "the question" wasn't referring to, well, the original SO
               | question at hand, but the class of all possible
               | questions.
               | 
               | If so, then I see, I think: perhaps it was more intended
               | as "Who is anyone to know the purpose and utility _(of a
               | question)_ better than the person who asks that question?
               | "
               | 
               | I can't answer that, since I agree with the sentiment.
               | I'm only really discussing this one specific question,
               | possible because there's so much extra context, and so
               | much ambiguity in the original text. My personal hubris
               | doesn't generalise to the class of all possible
               | questions, and I'd struggle to sympathise with anyone
               | making such an ambit claim.
               | 
               | But then, maybe it wasn't even a question at all.
        
               | shmageggy wrote:
               | I have to ask, is this series of comments some kind of
               | performance art or some sort of social experiment? Or is
               | this unironically how you write/speak/act?
        
               | Y_Y wrote:
               | You don't have to ask. Isn't there a rule about fake
               | curiosity here?
               | 
               | I hope this is the last comment in this bad bad thread.
        
               | inopinatus wrote:
               | To someone paying enough attention to crystallize the
               | trichotomy, all I can say is, thank you for reading.
        
       | adament wrote:
       | Does the proposed regular expression really handle embedded
       | script content correctly? From my limited understanding of HTML,
       | pretty much only </script> counts as closing the script contents
       | and everything else is treated as part of the script.
        
         | goto11 wrote:
         | The question is about XHTML though, not HTML which have a more
         | complex syntax.
        
           | jontro wrote:
           | To me it's a bit ambiguous if the original question is about
           | both html and xhtml. It's tagged with both
        
             | goto11 wrote:
             | The headline says XHTML. HTML does not even have the self-
             | closing tags the question is concerned about.
        
         | megous wrote:
         | Does it matter? You can create regex with lookahead for
         | </script>. The point is that it's possible to solve the problem
         | from SO this way, due to the nature of the problem, not that
         | this particular expression is perfectly correct.
        
           | anaerobicover wrote:
           | It seems the article does defend "it's possible" for strictly
           | regular expressions. To allow lookahead will be context-
           | sensitive, so not in that spirit.
        
       ___________________________________________________________________
       (page generated 2021-05-09 23:02 UTC)