[HN Gopher] Solving the regex of madness, and snarky answers on ...
___________________________________________________________________
Solving the regex of madness, and snarky answers on StackOverflow
(2019)
Author : monort
Score : 113 points
Date : 2021-05-09 07:48 UTC (15 hours ago)
(HTM) web link (www.cargocultcode.com)
(TXT) w3m dump (www.cargocultcode.com)
| jll29 wrote:
| > I think the flaw here is that HTML is a Chomsky Type 2 grammar
| (context free grammar) and a regular expression is a Chomsky Type
| 3 grammar (regular grammar).
|
| Note that regarding formal language and complexity theory, while
| it is correct that in general, arbitrary nested structures
| require a context free grammar (type 2 in the Chomsky hierarchy)
| and are thus beyond regular (type 3) [1], this statement is NOT
| true _if_ you limit the nesting depth with a finite constant k.
|
| For example, if you agree to an HTML tag maximum nesting depth
| of, say, 100, then it can be modeled with a regular (type 3)
| grammar, including correct required matching of opening and
| closing tags, and hence you can write a regular expression that
| matches it as well.
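|
| A minimal sketch of the bounded-depth claim, assuming a toy
| markup with a single tag name `d` and no attributes (the names
| `nested_pattern` and `is_balanced` are made up for illustration).
| The pattern is built level by level from concatenation,
| character classes, and Kleene star only, so it uses no recursion
| or backreference features:

```python
import re


def nested_pattern(depth: int) -> str:
    """Regex source accepting <d>...</d> nested at most `depth` levels deep."""
    pattern = r"[^<>]*"  # depth 0: plain text, no tags at all
    for _ in range(depth):
        # One more level: text, then any number of <d>inner</d> blocks
        # (each followed by text), where inner is the previous level.
        pattern = rf"[^<>]*(?:<d>{pattern}</d>[^<>]*)*"
    return pattern


def is_balanced(text: str, depth: int) -> bool:
    """True if `text` is correctly nested and no deeper than `depth`."""
    return re.fullmatch(nested_pattern(depth), text) is not None


print(is_balanced("<d><d>x</d></d>", 2))  # nesting depth 2 fits the limit
print(is_balanced("<d><d>x</d></d>", 1))  # too deep for a depth-1 pattern
print(is_balanced("<d>x", 3))             # unclosed tag
```

| With one tag name the pattern grows linearly with the depth
| limit. With n distinct tag names it grows by roughly a factor of
| n per level, which is one reason this is more of a theoretical
| point than a practical recipe.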
|
| This debate is well documented in the theoretical linguistics
| literature. Some say human languages are not regular because, in
| principle, you can always embed one more relative clause in any
| sentence without affecting grammaticality. Others reply that,
| although you could, you won't find extreme nesting depth in
| naturally occurring human-written text. At that point
| psycholinguists and theoretical linguists usually start a fight
| about whether memory limits are important or "just performance
| as opposed to competence".
|
| (Goes to show how practical solid theory is.)
|
| [1]
| https://www.sciencedirect.com/science/article/pii/S001999585...
| jancsika wrote:
| I love seeing the weirdo CDATA thingy in there! CDATA ftw!
|
| E.g., you've got this enormous spec for SVG which includes CSS,
| but that CSS has syntax inside a style tag which could break
| XHTML parsers.
|
| Amateurs out there are probably thinking, "Well, why not just
| compromise in the spec and tell implementers to do the same thing
| that HTML does to parse style tags?" Well, professionals know
| that _cannot_ work for myriad reasons you can read about if you
| take out a college loan and remain sedentary for the required
| duration.
|
| The _right_ approach is to throw the CSS stuff inside CDATA tags
| to tell the parser not to parse it so things don't break. That
| is the way sensible, educated professionals solve this problem.
|
| I'm only kidding!
|
| For inline SVGs the HTML5 parser simply says, "Parse this gunk as
| HTML5, and use _sane_ defaults to interpret the parsed junk in
| the correct svg namespace so that all the child thingies in that
| namespace just work."
|
| Which it does.
|
| _Unless_ you're going to grab the innerHTML of the inline SVG
| and shove it into a file to be used later as an SVG image.
|
| In _that_ case you cross the invisible county line into XHTML
| territory where the sheriff is waiting to throw you in jail for
| violating the CDATA rule. In that case the XHTML parser hidden in
| the guts of the browser doles out the justice of an error in
| place of your image. Because that is the way sensible, educated
| professionals solve this problem. :)
|
| My holy grail-- how do I use DOM methods to create a CDATA
| element to shove my style into? If I could know this then I can
| jump my Dodge Charger back and forth into XHTML without ever
| getting caught.
| camehere3saydis wrote:
| >My holy grail-- how do I use DOM methods to create a CDATA
| element to shove my style into? If I could know this then I can
| jump my Dodge Charger back and forth into XHTML without ever
| getting caught.
|
| Does this help?
| https://developer.mozilla.org/en-US/docs/Web/API/Document/cr...
| jancsika wrote:
| Ah, thanks!
|
| In hindsight I probably could have guessed at
| "document.create" and then just read the autocomplete
| suggestions in devTools. :)
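|
| For reference, the DOM method for this is
| `Document.createCDATASection()`. In browsers it is only
| available on XML documents, not HTML ones. The sketch below
| uses Python's `xml.dom.minidom`, which exposes a method of the
| same name, to show the idea; the CSS rule is a made-up
| placeholder and this is not the browser API itself:

```python
from xml.dom.minidom import getDOMImplementation

SVG_NS = "http://www.w3.org/2000/svg"

# Build a tiny SVG document through DOM calls only.
doc = getDOMImplementation().createDocument(SVG_NS, "svg", None)
root = doc.documentElement
root.setAttribute("xmlns", SVG_NS)

# Wrap the stylesheet text in a CDATA section so characters such as
# ">" do not have to be escaped for an XML parser.
style = doc.createElement("style")
style.appendChild(doc.createCDATASection("a > b { fill: red; }"))
root.appendChild(style)

svg_text = root.toxml()
print(svg_text)
```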
| jameshart wrote:
| The 'weirdo' CDATA thing is the only thing that makes XHTML
| actually amenable to this approach, because XHTML is
| tokenizable using a regular expression-based grammar, whereas
| HTML without CDATA is not. As you're obviously aware, the
| language inside <style> elements is not suitable for XML
| parsing. Nor is the language inside <script> elements:
| <script> var y = "<!--"; </script>
| <p>... a naive tokenizer thinks this is in a comment ...</p>
| <script> var z = "-->" </script>
|
| There's a weird interaction here between the Javascript and
| HTML parsers, because <, !, and -- are all valid Javascript
| operators, and you can, in theory, stack them up into
| syntactically valid JavaScript expressions. The behavior of the
| following script in a browser is... unpredictable:
|     <script>
|       var a = 1;
|       if (0<!--a) {
|         document.write('since --a is 0, and !0 is true, and 0 < true (!), this should print');
|       }
|     </script>
|     <p>... and this should not be in a comment</p>
|     <script>
|       if (a-->-10) {
|         document.write('a-- should still be > -10, so this should print, too');
|       }
|     </script>
|
| The regex in the article will miss the opening <p> tags here,
| because it assumes that it's being given valid, tokenizable
| XHTML.
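|
| A rough illustration of the first example, assuming a
| deliberately naive tokenizer with only two alternatives
| (comment, or tag with a captured name). `TOKEN` and
| `opening_tags` are hypothetical names and this is a cut-down
| stand-in, not the article's full regex:

```python
import re

# Comment alternative first, then any tag whose name is captured.
TOKEN = re.compile(r"<!--.*?-->|<(?P<tag>\w+)[^>]*>", re.DOTALL)


def opening_tags(markup: str) -> list:
    """Names of the tags the naive tokenizer sees outside 'comments'."""
    return [m.group("tag") for m in TOKEN.finditer(markup) if m.group("tag")]


sample = (
    '<script> var y = "<!--"; </script>\n'
    "<p>... a naive tokenizer thinks this is in a comment ...</p>\n"
    '<script> var z = "-->" </script>'
)

# The "<!--" inside the first string literal opens a comment that
# only ends at the "-->" inside the second string literal.
print(opening_tags(sample))  # ['script']
```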
| jancsika wrote:
| Part of the problem and solution you describe was due to the
| battle to define who is encapsulating whom, no? The W3C SVG
| list archive was full of people essentially asking for the
| ability to flow text and replicate a lot of HTML as part of
| native SVG. In that dream, it's certainly important to have
| well-defined behavior for javascript inside SVG since you
| could have SVG user agents that aren't web browsers. And that
| means CDATA to hold the non-XML scripting and styling data.
|
| However, at the end of that history HTML was the clear
| encapsulator and SVG exists either inside it or as a static
| image in Inkscape, the browser, or some library. So today,
| scripts inside an SVG are either a curiosity or security
| nightmare that comes to life when the user clicks "View
| Image" on an SVG image in their browser.
|
| That leaves only the `<style>` tag content as a potential
| ambiguity. So I'm curious-- are there examples where content
| of a `<style>` tag inside an inline SVG causes unpredictable
| behavior in modern browser HTML parsers? I'm guessing there
| must be, but I'd like to play with a clear example.
| lifthrasiir wrote:
| The only part of this article I agree with is that you don't
| need to be snarky to be correct. (I'd like to introduce the XY
| problem of the second kind, where the answerer is so confident
| that it is the answerer who has missed the actual question.)
|
| Some regexes _can_ recognize languages beyond the regular
| languages. They typically come in two flavors: recursive
| references (Perl, Ruby, PCRE) and stackable captures (.NET). They
| are obscure enough that I would not recommend them, but it is
| patently false that regular expressions (EDIT: of practical
| interest) cannot be recursive.
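|
| Python's standard `re` module has neither recursion nor
| .NET-style balancing groups, so the sketch below swaps in a
| different extension, backreferences, to make the same general
| point: a practical regex engine can accept a non-regular
| language. The language here is a^n b a^n, and `is_mirror` is a
| made-up helper name:

```python
import re

# \1 must repeat exactly what group 1 matched, so the run of a's
# after the b has to be as long as the run before it.
MIRROR = re.compile(r"(a+)b\1")


def is_mirror(text: str) -> bool:
    return MIRROR.fullmatch(text) is not None


print(is_mirror("aaabaaa"))  # True
print(is_mirror("aaabaa"))   # False
```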
|
| It is possible to match individual HTML tags with regexes, but it
| _is_ difficult. It cannot use a bare `\w` or `\s` because both
| XML/XHTML and HTML5 parsers have peculiar definitions for tag
| name characters and space characters. For example your `\s` will
| typically match various Unicode space characters, while only
| ASCII whitespace is recognized in tags. There are also several
| notable exceptions to the parser (and external states termed the
| "tree construction"), so missing any of them would result in an
| immediate XSS. If you think you can write a correct regex for
| HTML tags, my quizzes [1] should make you concerned. Limiting the
| question to XHTML does alleviate some but not all concerns.
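|
| A small sketch of the `\s` pitfall, using Python's `re` as the
| example engine (other engines differ in detail). The explicit
| class `HTML_SPACE` lists the five ASCII whitespace characters
| HTML treats as whitespace inside tags; the name is an assumption
| made for this example:

```python
import re

NBSP = "\u00a0"  # no-break space: Unicode whitespace, but not ASCII

# Tab, line feed, form feed, carriage return, space.
HTML_SPACE = r"[ \t\n\f\r]"

unicode_ws = re.compile(r"<a\s+href")            # \s is Unicode-aware for str
ascii_ws = re.compile(rf"<a{HTML_SPACE}+href")   # explicit ASCII-only class

text = f"<a{NBSP}href"

print(bool(unicode_ws.match(text)))  # True: \s accepts the no-break space
print(bool(ascii_ws.match(text)))    # False: not whitespace for a tag
```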
|
| The distinction between recognition and parsing is correct, but
| parsing doesn't necessarily mean reconstructing a parse tree.
| Parsing means access to the constituent nonterminals, which can
| be used to reconstruct a parse tree but can also be used
| directly on their own (e.g. calculators). Indeed in most regex
| implementations you can't extract two or more strings out of each
| capture (Raku is a notable exception), so you can match against
| e.g. `(\w+)(?:,(\w+))*` but can't extract a list of comma-
| separated words with it. Practically speaking this means you
| can't extract a list of attributes with a single regex, making it
| unsuitable for HTML parsing anyway.
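|
| The capture limitation can be seen with the exact pattern from
| the paragraph above, again using Python's `re` as the example
| engine. The two-step alternative at the end is just one
| possible workaround:

```python
import re

match = re.fullmatch(r"(\w+)(?:,(\w+))*", "a,b,c")

# The whole string is recognized, but the repeated group keeps
# only its last iteration, so "b" is lost.
print(match.groups())  # ('a', 'c')

# Getting every word needs a second step, for example a split.
words = match.group(0).split(",")
print(words)  # ['a', 'b', 'c']
```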
|
| [1] https://news.ycombinator.com/item?id=26355451
| dataflow wrote:
| > it is patently false that regular expressions cannot be
| recursive.
|
| No, it's more like the term "regular expression" has gotten
| hijacked and nowadays gets abused to colloquially include,
| shall we say, irregular expressions. i.e. people basically say
| "regex" when they mean "some succinct pattern language with
| syntax similarities to (classical) regular expressions".
| lifthrasiir wrote:
| Regular expressions in formal language theory do not have
| captures anyway. The name collision is unfortunate, but we
| have already established that regexes in practice mean a
| pattern language largely modelled after theoretical regular
| expressions and not the theoretical regular expressions
| themselves. At the very least the writing could have
| mentioned this discrepancy.
| mannykannot wrote:
| >At the very least the writing could have mentioned this
| discrepancy.
|
| The original meaning of 'regular expression' is very
| specific, and has some significant implications which are
| lost with the now-common and less well-defined usage.
| Therefore, if anything needed having this discrepancy
| mentioned, it was your original statement "it is patently
| false that regular expressions cannot be recursive", as
| this is an issue where the distinction is crucial. It is
| good to see that you have now done so, though the way you
| have done so suggests there is nothing of practical
| interest in the formal definition, which, I suggest, would
| be patently false.
| lifthrasiir wrote:
| I intentionally used the term "regex" elsewhere for that
| reason, but I later realized that the indirect quotation
| can be still problematic.
| dataflow wrote:
| Captures are a red herring here. They don't fundamentally
| alter the nature of what a regex does, which is to
| recognize regular languages. Pointing to them as if they're
| some kind of justification is like calling pilots "drivers"
| because drivers originally drove wagons, and wagons didn't
| have rubber tires like cars and planes anyway. It's
| completely missing that the point of the distinction
| between a plane and a wagon has always been the land vs.
| air travel, not modern features like tires or the
| infotainment systems or what have you.
|
| But yes I guess it'd have been better for the writing to
| mention the discrepancy in any case.
| nixpulvis wrote:
| On a related note, lookaheads (and behinds) do somewhat
| change the fundamental expressive power I believe.
| lifthrasiir wrote:
| > They don't fundamentally alter the nature of what a
| regex does, which is to recognize regular languages.
|
| It's a bit subjective, but captures are harder than
| recognition. Russ Cox once noted [1] that extraction has
| to be run as a separate step after recognition and that a
| fast DFA can't always be used for it, suggesting they are
| related but different problems.
|
| [1] https://swtch.com/~rsc/regexp/regexp3.html
| mannykannot wrote:
| Well, if you allow an arbitrary depth of capture-group
| nesting, then that may be so, but it seems beside the
| point here. It is not clear to me that this article makes
| any point about extraction that is relevant to this
| discussion.
| asddubs wrote:
| The problem is that it doesn't reflect how browsers parse things.
| If you were using this in a security context, here's an example
| it won't detect (granted, this is not technically valid, but
| does it matter?):
|
| <div "> put arbitrary html here as you please (using single
| quotes for attributes)<div ">
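|
| A sketch of why this input slips through, assuming only the
| opening-tag alternative of the article's regex, ported to Python
| syntax (`OPEN_TAG`, `opening_tags`, and the `x.js` payload are
| illustrative names). The regex reads everything between the two
| double quotes as one quoted attribute value, whereas a browser,
| as the comment notes, ends the first tag at the first `>`:

```python
import re

# Opening-tag alternative only: name, then quoted values or other
# characters except > / ' ", then the closing >.
OPEN_TAG = re.compile(r"""<(?P<tag>\w+)(?:"[^"]*"|'[^']*'|[^>/'"])*>""")


def opening_tags(markup: str) -> list:
    return [m.group("tag") for m in OPEN_TAG.finditer(markup)]


payload = """<div "> <script src='x.js'></script> <div ">"""

# The whole payload is consumed as a single <div ...> tag, so the
# <script> element in the middle is never reported.
print(opening_tags(payload))  # ['div']
```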
| NtrllyIntrstd wrote:
| The author seems to be missing the point, in my opinion. While it
| is certainly true that one can often solve simple, seemingly
| innocent sub-problems within more general languages, the
| transition from "I see I can solve this simple problem with
| regexes!" to "Then I can probably solve this other, almost
| identical problem as well!", only to have the problem explode
| right in your face, is subtle (almost imperceptible to a novice).
| It would be a more robust solution to go for the right tool
| (i.e. an (X)HTML parser), as well as a good learning example. On
| a side note: regular expressions cannot, by definition, parse
| recursive languages. A regular expression matcher that does is
| not a regular expression parser but an ugly duckling in the
| family of context-free grammar matchers. People should learn when
| and how to use those.
| nixpulvis wrote:
| The regex is surely faster for the specific case. I can't say
| I've seen an XHTML parser off hand that allows me to stop
| parsing after just the start tag. Perhaps a lazy parser could
| start to compete, but I'm just guessing.
| josefx wrote:
| Aren't most XML parsers SAX or STaX based? Only time I ran
| into a library that only offered a full DOM without the
| underlying event based parser was whatever browsers consider
| the JavaScript standard library.
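|
| As one illustration of the event-based style, here is a sketch
| with Python's standard `xml.etree.ElementTree.iterparse`, which
| yields events as the input is consumed and lets the caller stop
| after the first start tag. `first_start_tag` is a made-up helper,
| and the input is assumed to be XML held in memory:

```python
import io
import xml.etree.ElementTree as ET


def first_start_tag(xml_bytes: bytes):
    """Return the name of the first element and stop consuming events."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start",))
    for _event, element in events:
        return element.tag  # leave the loop at the first start event
    return None


print(first_start_tag(b"<root><a/><b>text</b></root>"))  # root
```

| Note that the underlying reader still pulls input in chunks, so
| "stopping early" here means not handling further events rather
| than not reading another byte.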
| [deleted]
| nixpulvis wrote:
| You're totally right! Many good stock parsers already
| stream things (more or less).
|
| Still, I'm just making a comment about the overhead... I
| would hazard a guess that you're going to have a hard time
| beating a regex with an HTML parser for speed, assuming
| what you want can be done with both.
|
| This is all irrelevant, because as the OP mentions, the SO
| question at hand cannot be solved with standards compliant
| parsers because self-closing tags will not be
| distinguishable.
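|
| A quick way to see the self-closing point, using Python's
| `xml.etree.ElementTree` as one example of a conforming XML
| parser (other parsers may expose more detail): once parsed,
| `<br/>` and `<br></br>` produce the same tree, so the original
| spelling cannot be recovered from it.

```python
import xml.etree.ElementTree as ET

self_closed = ET.fromstring("<r><br/></r>")
explicit = ET.fromstring("<r><br></br></r>")

# Both spellings serialize identically after parsing.
same_output = ET.tostring(self_closed) == ET.tostring(explicit)
print(same_output)  # True

# Neither <br> element carries any text or children.
print(self_closed[0].text, explicit[0].text, len(self_closed[0]), len(explicit[0]))
```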
| Akronymus wrote:
| I believe you could build such a parser out of parsec.
| Although, I am not sure if that is exactly what you are going
| for.
| wodenokoto wrote:
| The article goes as far as to say that a parser is not the
| right tool.
|
| > Not only can the task be solved with a regular expression -
| regular expressions are basically the only practical way to
| solve the problem. Which is why none of the clever answers
| actually suggest another way to solve the problem.
|
| So no, the author is not missing the point at all.
| [deleted]
| IshKebab wrote:
| I mean that bit is clearly wrong. An XML/HTML parser is a
| perfectly practical way to solve the problem.
|
| However I completely agree that they didn't miss the point. A
| regex to do this might be fine for hacky things that you
| don't need to be robust (e.g. for searching for stuff,
| measuring stats, one-off scripts etc.).
| yoz-y wrote:
| But the original SO question does not imply that they want to
| solve a more complex problem. The SO asker explicitly asked for
| opinions, so that's what they got. However, I absolutely think
| it is the right choice to choose simpler tools to solve simpler
| problems, as long as you are aware of the implications.
| nooyurrsdey wrote:
| This is one of those things that people will debate about
| endlessly and ultimately it feels so silly.
|
| The poster asked how to do it, and this person provided a
| practical regex to cover most (if not all) cases.
|
| Everything else is just pedantic debate.
| h2odragon wrote:
| The pedantic discussions can be fun and educational, but the
| regex based hack gets the job done in a few minutes, while the
| pedants are still wrestling with parsers and libraries.
|
| ... and then there's the anticipated joy of seeing the pedants'
| complicated, theoretically correct solution explode because the
| input wasn't what they assumed, in the first place.
|
| The pedants that have that experience either become
| enlightened, or vociferously strident about the importance of
| proper, theoretically correct solutions in place of quick
| hacks.
|
| Thus the meme status of the SO answer.
| IshKebab wrote:
| The issue is that _sometimes_ you should use a robust parser
| and do it properly, and _sometimes_ a hacky regex is fine.
| But people forget that when arguing about which you should
| use.
| SavantIdiot wrote:
| Oh I've seen this many times in different forms. Especially with
| regexes.
|
| You know what this is a great example of? A case where hacking
| makes a mess, and thinking before coding solves the problem.
|
| The madness comes from using the wrong tool for the problem. Yes,
| you can hack a regex to parse XHTML, and this might be "good
| enough", but it is more robust, cleaner and easier to explain if
| you use a lexical tokenizer and a grammar.
|
| The lure is an illusion that comes from an initial effort
| assessment, where the effort to hack a quick-and-dirty regex
| (call this Equick) vs. an "oh, man, you mean I gotta think about
| the problem" (call this Ethink) appears as "Equick <<< Ethink."
| However, it soon evolves into the scenario where "Equick >>>
| Ethink," driven by the thought process, "I'm almost there, this
| regex just needs one more tweak." Aka the gambler's fallacy
| comes into play and the sunk costs are ignored.
|
| TL;DR - Use the right tool for the problem, even if it means a
| slightly larger up-front effort investment.
| Rapzid wrote:
| Yeah, and you never actually end up _solving_ the _problem_.
| You just end up solving every edge case that comes up :D
|
| It's the same mentality that can lead to fixing but symptoms
| without making any real progress on the underlying issues..
| Sometimes that's good enough I guess.
| tester756 wrote:
| uhh?
|
| just because you can doesn't mean you should
|
| just take a look at proposed regex
|
| >(
| >   # match all tags in XHTML but capture only opening tags
| >   <!-- .*? --> # comment
| > | <!\\[CDATA\\[ .*? \\]\\]> # CData section
| > | <!DOCTYPE ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* >
| > | <\? .*? \?> # xml declaration or processing instruction
| > | < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* /> # self-closing tag
| > | < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > # opening tag - captured
| > | </ \w+ \s* > # end tag
| > )
|
| it's ugly as hell
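|
| For anyone who wants to poke at it, here is a rough port of the
| quoted regex to Python's `re` in verbose mode. This is a sketch
| under assumptions: the doubled quotes are read as escaped quotes
| of a verbatim string literal, the named group uses Python's
| `(?P<tag>...)` spelling, and `XHTML_TOKEN` and `opening_tags`
| are names invented here.

```python
import re

XHTML_TOKEN = re.compile(
    r"""
      <!-- .*? -->                                           # comment
    | <!\[CDATA\[ .*? \]\]>                                  # CDATA section
    | <!DOCTYPE (?: "[^"]*" | '[^']*' | [^>/'"] )* >         # doctype
    | <\? .*? \?>                                            # XML declaration or PI
    | < \w+ (?: "[^"]*" | '[^']*' | [^>/'"] )* />            # self-closing tag
    | < (?P<tag> \w+ ) (?: "[^"]*" | '[^']*' | [^>/'"] )* >  # opening tag, captured
    | </ \w+ \s* >                                           # end tag
    """,
    re.VERBOSE | re.DOTALL,
)


def opening_tags(markup: str) -> list:
    """Names of opening tags; every other token is matched and skipped."""
    return [m.group("tag") for m in XHTML_TOKEN.finditer(markup) if m.group("tag")]


sample = '<p><a href="foo"><br /><hr class="foo" /></a></p><!-- <b> -->'
print(opening_tags(sample))  # ['p', 'a']
```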
|
| >Parsing typically uses (at least) two steps: Tokenization which
| uses regular expressions to split the input string into a
| sequence of syntax elements
|
| I don't use regex for tokenization; am I doing something wrong?
|
| But overall I think this is an important post, even though I
| believe that regex is the best example of "good idea, shitty API"
| BiteCode_dev wrote:
| Weird article that basically says people are wrong, then proves
| they are right.
| motoboi wrote:
| The problem is: you cannot parse malformed (real, everyday) html
| with regexes.
|
| But if you need to parse html (even malformed) generated by the
| same template (like a scraping situation), the whole file
| becomes regular, which can be parsed by a regular expression.
|
| But if you try to parse html in general, too bad, because then
| you'll need to take html into consideration and will need a
| recursive descent parser, not a regex.
|
| This question popped up so many times in forums in the 2000s that
| people got mad at that.
| math-dev wrote:
| Great article (assuming the solution provided works).
|
| I do a lot of parsing in my projects; I find natural text-based
| input vital for power users who don't want to point and click
| all the time.
|
| What are some good parsing algorithms, theoretical articles etc
| to help me become more professional in the parsing tools I write?
| [deleted]
| inopinatus wrote:
| > The question is about finding opening tags in XHTML using a
| regular expression
|
| Bzzzt, wrong, sorry! The question is about finding open tags in
| the presence of XHTML self-closing tags. That difference alone
| places these interpretations gulfs apart. But there's more: it
| does not specify that the input document is even XHTML, _only_
| that XHTML-style self-closing elements may be present. In fact
| the original question was barely minutes old and tagged merely
| "regex" when that famous answer was written in 2009; the question
| was not tagged with "xhtml" until 2012, and not by the original
| author either.
|
| Revealingly, then, if we review the broader context (i.e. question
| history) of the original question author, it's clear that yes
| indeed _they were trying to fix a malformed document_, and in
| particular to normalise it _into_ XHTML, with focus on fixing up
| any so-called "dangling tags". For this task, the suggestion of
| "use a parser" is indeed sound advice.
|
| The real moral here is, don't be a jerk about the precise
| semantics of a question, look at what the person _needs_, and
| help them ask better questions.
|
| Otherwise, you're just gonna discover that there's always a
| bigger jerk, and they're on Stack Overflow, moderating your
| stuff.
| a1369209993 wrote:
| > For this task, the suggestion of "use a parser" is indeed
| sound advice.
|
| Perhaps technically, but it's also useless advice because a
| parser does not exist for their particular flavor of malformed
| XHTML. XHTML parsers parse XHTML, which you yourself have said
| it wasn't:
|
| > they were trying to fix a malformed document
|
| So in the absence of a reference to a particular malformed-
| XHTML-recovering parser (which may or may not work on the
| specific input they have, but "try this thing" is at least
| actionable advice), "use a parser" amounts to "write an entire
| parser yourself, then use it".
|
| > don't be a jerk about the precise semantics of a question,
| look at what the person needs
|
| Pot, kettle.
| [deleted]
| inopinatus wrote:
| > "use a parser" amounts to "write an entire parser yourself,
| then use it"
|
| "Use a parser" is a common answer, besides being the accepted
| one, and with good reason: it'll work. The world is not short
| of HTML parsers (although, who knows, perhaps PHP may have
| been short of very good parsers back in 2009). Serializing
| XML from the resulting memory structure, DOM or otherwise,
| closes the loop, and this remains a conventional and
| commonplace means to normalize some incoming HTML-like mush
| into something that can be interpolated into XHTML (or, more
| specifically I think, XML RSS) and a strict receiver will at
| least accept.
|
| > Pot, kettle.
|
| Oh look, personal abuse! Good-day to you, too.
| IshKebab wrote:
| Read the regex. It handles self-closing tags fine.
|
| Also you're doing the _intensely_ annoying thing that lots of
| StackOverflow people do of imagining that the asker really
| wanted to ask a different question. It happens sometimes. But
| you shouldn't just jump in and _assume_ that they don't know
| what they want and you're so much smarter than them so you know
| what they really want.
|
| Offer additional answers if you want, but _answer the question
| they asked first_.
|
| (Sorry, pet peeve.)
| inopinatus wrote:
| You're not the first to take that line, so I'll refer you to
| my previous observations:
| https://news.ycombinator.com/item?id=27097403
|
| There's no wild assumption going on here. I just bothered to
| keep reading, very carefully, everything the original author
| actually wrote.
|
| Then, please, further reflect that Stack Overflow is not
| Codewars; it is a forum for practical, focused, and relevant
| problem-solving advice, and at its best the moderation and
| answer processes help folks to iteratively revise and improve
| their questions. A crucial step is, therefore, analytically
| clarifying both the parameters and the intended outcome.
|
| Contextualisation and focusing of requirements is a familiar
| and essential skill for any programmer being handed a
| requirements statement, and answering S.O. questions involves
| the same exercise, just in vignette.
|
| The handling of this question was not, in this sense, the
| best. But expecting anything less, would just be a load of
| fizzbuzz.
| goto11 wrote:
| The problem is the answers saying "this is not possible".
|
| The OP asks a perfectly reasonable question. The answers assume
| the OP _actually_ meant to ask a different question, and then
| they ridicule the OP for this imagined question.
|
| The question they _imagine_ the OP asks was: "How do I parse
| an arbitrary HTML document into an element-tree using only a
| single regular expression and not using any auxiliary data
| structures, not even the call stack?"
|
| Yes, this is indeed not possible, given the limitations of
| vanilla regular expressions.
|
| But that was not the question asked.
|
| Of course "use a parser" is perfectly sound advice (if a parser
| exists which can solve the OP's problem). But saying what the
| OP is attempting to do (tokenize xhtml) is _impossible_ is
| absurd, since then it would also be impossible to write a
| parser!
| inopinatus wrote:
| They're trying to tokenize HTML into two arrays, one being an
| array of opening tags and one of closing tags, with the hope
| being to pairwise compare and reconcile the elements of these
| arrays.
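|
| For contrast with the two-array idea, here is a sketch of the
| usual stack-based way to find unbalanced tags. It assumes a
| crude tag pattern that ignores comments, CDATA, and quoted `>`
| characters; `TAG` and `dangling_tags` are hypothetical names,
| not anything from the original question:

```python
import re

# Crude tag matcher: optional "/", a name, attribute text, optional "/".
TAG = re.compile(r"<(?P<close>/)?(?P<name>\w+)[^<>]*?(?P<self>/)?>")


def dangling_tags(markup: str):
    """Return (unclosed opening tags, stray closing tags)."""
    stack, stray = [], []
    for m in TAG.finditer(markup):
        name = m.group("name")
        if m.group("self"):
            continue  # <br /> style tags never need a partner
        if m.group("close"):
            if stack and stack[-1] == name:
                stack.pop()  # closes the most recently opened tag
            else:
                stray.append(name)  # nothing suitable is open
        else:
            stack.append(name)
    return stack, stray


print(dangling_tags("<div><p>text<br/></p>"))  # (['div'], [])
```

| Two flat arrays lose the ordering between opening and closing
| tags, which is the information the stack keeps.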
|
| The necessary clarification appears further down the page.
|
| As for whether it's a reasonable question; as written I beg
| to differ, it's the opposite, since it does not convey their
| problem except by a very careful and nuanced reading. The
| proper response of the S.O. community should've been to aid
| the OP in clarifying their intentions and amending the
| question accordingly to bring focus onto their actual
| problem.
|
| That is not what happened. Instead they are indeed on the
| receiving end of some ridicule, which is shameful, but the
| advice of "use a parser" would still likely have been a top
| answer in some fashion had the OP's scattered ancillary
| questions and clarifications been incorporated.
|
| Sadly, only one correspondent of many seems to have seen fit
| to ask.
| 1vuio0pswjnm7 wrote:
| What if we looked at the XHTML parsers trusted by the people
| who mindlessly dismiss the utility of regular expressions and
| found they were constructed using a lexer that relied on
| regular expressions?
| inopinatus wrote:
| Well, since I've learned at least four assembly languages,
| and wrote my first toy lexer at least 25 years ago, probably
| in Standard ML, by this standard I seem qualified to comment.
|
| It doesn't matter what the parser looks like under the
| bonnet. What matters is the utility it provides.
|
| One might otherwise similarly offer the advice to mine your
| own copper and grow your own silicon, these being equally
| essential activities for anyone seeking the ultimate in
| mechanical sympathy.
| [deleted]
| swiley wrote:
| Here's the original text for reference:
|
|     I need to match all of these opening tags:
|
|         <p>
|         <a href="foo">
|
|     But not these:
|
|         <br />
|         <hr class="foo" />
|
|     I came up with this and wanted to make sure I've got it
|     right. I am only capturing the a-z.
|
|         <([a-z]+) *[^/]*?>
|
|     I believe it says:
|
|         Find a less-than, then
|         Find (and capture) a-z one or more times, then
|         Find zero or more spaces, then
|         Find any character zero or more times, greedy, except /, then
|         Find a greater-than
|
|     Do I have that right? And more importantly, what do you
|     think?
|
|     Tags: html regex xhtml
|     edited May 26 '12 at 20:37, community wiki
|     11 revs, 7 users 58% Jeff
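|
| The asker's pattern can be tried directly. A small sketch with
| Python's `re` (the test strings beyond the four tags from the
| question are made up): it does separate the four examples, but
| it also rejects any opening tag whose attributes contain a
| slash.

```python
import re

ASKER_PATTERN = re.compile(r"<([a-z]+) *[^/]*?>")

wanted = '<p><a href="foo">'
unwanted = '<br /><hr class="foo" />'

print(ASKER_PATTERN.findall(wanted))    # ['p', 'a']
print(ASKER_PATTERN.findall(unwanted))  # []

# An ordinary opening tag with a slash in an attribute value is
# rejected too, because [^/] can never step over the "/".
print(ASKER_PATTERN.findall('<a href="/foo">'))  # []
```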
| inopinatus wrote:
| Not sure which revision that's pasted from, but it is not the
| original text, which can be found at
| https://stackoverflow.com/revisions/1732348/1 and I strongly
| advise against neglecting the title; doing so is how some
| folks blunder into misinterpretation, since the wording of
| that title is telegraphing a quite different underlying need
| to "please solve this regex puzzle".
| DangitBobby wrote:
| Reading both, I see little to no difference. Is there an
| important change between the two revisions that I am
| missing? Also, my reading of the title does not, in fact,
| telegraph anything other than "help me with this regex
| puzzle."
| inopinatus wrote:
| I recommend consulting my other remarks within this
| topic, but on this specific point:
|
| The title's inconsistency with the body text, i.e. "open
| tags" vs "opening tags", is especially and immediately
| notable because they are (in context) grammatically
| interchangeable but have dissimilar meanings. This is
| immediately suggestive of (but _not_ diagnostic of) a
| writer revealing context and then switching to detail. As
| a longtime reader of requirements documents and S.O.
| questions, a mental flag to check the intended meaning of
| both is already raised at this point.
|
| The reference to XHTML is ambiguous, since it speaks to
| working around XHTML that is already present, rather than
| defining an input or output document, which means this is
| someone thinking at the character level rather than in
| terms of DOM. This impression is verified by the
| comparison pairs in the question body, leaving the
| essential context question of _"what are the desired
| inputs and outputs?"_ glaringly unanswered.
|
| The early construct, "I need to ..." is a secondary flag
| that immediately reinforces the likelihood of a critical
| gap in context, since it does not explain _why_ the need
| has arisen.
|
| To anyone familiar with the structure & semantics of the
| HTMLs, the omission of _any_ mention of tag closing,
| having discussed "opening tags", boosts the sense that
| something relevant is missing from the question. The
| obviously flawed regular expression then attracts a
| "probable novice" qualifier, which is amplified by both
| the closing "what do you think?" and the original's
| emoticon.
|
| At this point, we're about ten seconds into the read-
| through and there's already a ton of labels pointing to a
| beginner who may be conflating dissimilar concepts and
| (through inexperience) choosing the wrong tool for the
| task, whatever that may be. Another few seconds to review
| comments, at which point it appears that the author _does_
| understand the distinction between an "open tag" and an
| "opening tag", and this reweights the XHTML reference in
| the title toward being a need to _output_ strict XHTML.
|
| Given their apparent beginner standing it's likely this
| is either their first S.O. post, or one of slightly too
| many, and a check in their profile for proximate
| questions immediately reveals the latter case. References
| to balancing of "dangling tags" abound, collapsing all
| preceding problem variants and likelihoods into the only
| focused understanding that fits: it's a PHP novice,
| bright but inexperienced, who wants to normalise a
| nonconforming document into XHTML by balancing the tags,
| and the most likely use case is injecting an existing
| HTML fragment into an RSS XML feed.
|
| Elapsed time ca. 90 seconds.
|
| Another minute to scroll through surrounding Q&A
| materials, to allow alternative options to appear (they
| don't).
|
| A few seconds for a chuckle at that old friend of an
| amusing answer, and recognise that the advice within is
| sound: what they really need is, indeed, a parser. Then,
| a pause to consider a contrasting notion, that what
| they're trying to do is _write_ a parser. Consider
| evidence that the author has taken at least one step towards
| inventing stacks on their own from first principles;
| prospect loses to existing top answer because the
| experience level required to write parsers cannot,
| ultimately, be acquired from a single S.O. question.
|
| Last but not least; the final boss: convincing the chorus
| of Hacker News to accept this interpretation. That takes
| _much_ longer.
| DangitBobby wrote:
| Consider alternatively that this is a tiny piece in the
| much broader puzzle of what they are trying to
| accomplish, that they are aware both of their own
| beginner status but also that, in this case, good enough
| will be good enough, or that they don't have the time or
| inclination to switch to a real parser and that's why
| they didn't ask about it. Or maybe even that the goal
| here is to specifically learn to use regex to solve a
| real software problem that they have. It's all a matter
| of interpretation of context that we don't have. As an
| asker of questions on SO where people provide answers to
| questions I specifically intended to not ask, this
| pattern of reading deeply into context we don't have is
| actually kind of annoying. Unless you provide both the
| answer to the actual question and then some guidance on
| how you _really_ should be doing something. Those answers
| are very, very helpful.
| inopinatus wrote:
| These objections may seem relevant to your personal
| experience, but they don't pertain to the case at hand,
| for which ample context is available.
|
| Dealing with responses from people who've rushed to write
| something without bothering to properly consider the
| problem is, of course, internet 101, but such answers
| generally take the short road to content oblivion.
|
| For those answers of any quality, however, note this: if
| people are answering, for free, on their own time,
| questions that you don't think you asked, but they
| evidently do, then consider taking some personal
| responsibility for that outcome. Complaining about it
| would seem wholly entitled and ungrateful, and
| unconstructive besides. If they are well-meaning
| beginners, _coach them_: it's how the stone soup gets
| made.
| DangitBobby wrote:
| I also both ask and respond to questions. Mostly respond.
| For free. On my own time. People who answer on SO are not
| a unique special breed in the sense that they are willing
| to help others with their problems. I have read your
| reading of the thread, gone through it myself with that
| in mind, and disagree wholeheartedly that you have
| meaningfully extrapolated the context you claim.
| inopinatus wrote:
| That much is quite evident.
|
| Perhaps we'll never know if Jeff's XML RSS feed was
| conforming or not, but I like to think they got there.
| [deleted]
| funyunpowder wrote:
| based on the stackoverflow thread, and then the comments here, an
| interesting research paper topic would be 'why do people get so
| passionate about regex'
| AzzieElbab wrote:
| Long regexes are the root of all evil
| crispyambulance wrote:
| Long regexes are not the root of all evil but they're certainly
| a tendril of bad-practice.
|
| I would say that the OP's usage, a home-made concoction to find
| all opening tags in an xhtml doc, crosses the line into bad
| practice. Get a room and use a parser, people!
| goto11 wrote:
| So if using regular expressions is "bad practice", how should
| one write the tokenizer or lexer stage of a parser?
| jrockway wrote:
| Regexes are computer programs just like any other computer
| program. A long program, terse syntax, embedding one
| programming language in another programming language, etc. are
| risk factors for unmaintainable code, but are manageable in the
| same way that you manage the risk inherent in other complex
| computer programs.
|
| I did always like elisp's "rx"[1] library for removing some of
| the terseness. Not enough to ever use it, of course, but it's
| an interesting idea.
|
| [1] https://www.emacswiki.org/emacs/rx
| goto11 wrote:
| If the task at hand is something which _can_ be solved with a
| regex, then any other solution (e.g. manual string scanning)
| would be far more complex.
| AzzieElbab wrote:
| I am guilty of this type of thinking as well. I saw the
| error of my ways when I had to fix someone else's regular
| expressions.
| goto11 wrote:
| So what do you use instead of regular expressions for such
| tasks?
| hansvm wrote:
| IMO the biggest problem with regexes as seen in the wild
| is a lack of composability. If you need some kind of
| pattern like "[setA][setA+setB]{0,n}" then you'll copy-
| paste the definition of setA in both places. If you need
| to re-use that entire regex you'll copy-paste it again
| and construct a monstrous string with a really well-
| defined structure that isn't even slightly apparent
| without a reverse engineering session.
|
| Up to a point you can solve that by just giving names to
| relevant sub-expressions, using a regex builder, etc, but
| in my experience if I'm going to write even a moderately
| complicated regex I'll probably be better served with
| something like parsec (a python implementation here [0])
| in whichever language I'm currently using.
|
| That isn't to say that regexes don't have their place --
| anything quick and dirty, situations where you need to
| handle unsanitized input (mind you, the builtin regex
| engine is probably vulnerable to exponential runtime
| attacks) and don't want to execute a turing-complete
| language, etc.... I just think it has bad ergonomics for
| any parser you might use more than once, and I haven't
| yet regretted using parsec in situations where a complex
| regex would have sufficed.
|
| [0] https://pythonhosted.org/parsec/
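
A minimal sketch of the "give names to sub-expressions" approach mentioned above, using only Python's standard `re` module. The character sets `SET_A` and `SET_B`, the helper `identifier_pattern`, and the `ASSIGNMENT` pattern are hypothetical names chosen for illustration; they stand in for the "[setA][setA+setB]{0,n}" shape from the comment and are not taken from any real code base.

```python
import re

# Hypothetical character sets, each defined exactly once and then reused.
SET_A = r"A-Za-z_"
SET_B = r"0-9"


def identifier_pattern(max_extra):
    """Build the "[setA][setA+setB]{0,n}" pattern from the named pieces."""
    return rf"[{SET_A}][{SET_A}{SET_B}]{{0,{max_extra}}}"


IDENT = identifier_pattern(31)

# Reuse the whole sub-pattern inside a larger, commented expression.
ASSIGNMENT = re.compile(
    rf"""
    (?P<name>  {IDENT} )   # left-hand side
    \s* = \s*
    (?P<value> {IDENT} )   # right-hand side
    """,
    re.VERBOSE,
)

m = ASSIGNMENT.fullmatch("total_1 = subtotal_2")
```

Because each set is defined in one place, changing `SET_A` updates every use of it. This only goes so far: the result is still one opaque string at match time, which is the limitation the comment describes.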
| User23 wrote:
| Perl is great for this. It's been a long time since I've
| written any, but with the right flags and use of qr// you
| can write extremely readable perlre.
| 03b17999-4268 wrote:
| I haven't read the rest of the thread, but the article is even
| wronger than the glib replies there. HTML needn't be well formed.
| There are adhoc rules which let major browsers parse broken HTML.
| If you do not follow the spec to the letter you will have your
| tooling break on input that every browser thinks is acceptable.
|
| Which is why you always use whatever html parsing library comes
| with your language. There is no simple answer in the thread
| because there is no simple answer in the real world.
|
| That said, anyone who says:
|
| >It is quite possible and not even that difficult:
|     (   # match all tags in XHTML but capture only opening tags
|         <!-- .*? -->   # comment
|       | <!\[CDATA\[ .*? \]\]>   # CData section
|       | <!DOCTYPE ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* >
|       | <\? .*? \?>   # xml declaration or processing instruction
|       | < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* />   # self-closing tag
|       | < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* >   # opening tag - captured
|       | </ \w+ \s* >   # end tag
|     )
|
| Should be seriously mentored by someone.
| megous wrote:
| > Should be seriously mentored by someone.
|
| Maybe people who think a basic regex such as this is difficult,
| need some mentoring.
|
| It doesn't even use lookahead/lookbehind or other more
| complicated features. It only uses non-greedy matching 2 times,
| which is the only thing that's more complicated than the basic
| AND/OR logic expressed by the rest of the expression.
| tannhaeuser wrote:
| You make this more complex than it actually is. HTML is
| basically SGML with a DTD declaring rules for tag
| omission/inference, empty elements, and attribute shortforms.
| This is what the majority of XML heads seem to struggle with,
| but which nevertheless is every bit as formal as XML is, XML
| being just a proper subset of SGML, though too small for
| parsing HTML.
|
| Then HTML has a number of historical quirks including:
|
| - the style and script elements have special rules for comments
| that would make older pre-CSS, pre-Javascript browsers just see
| markup comments and ignore those
|
| - the URL syntax using & ampersand characters needs to be
| treated special because & starts an entity reference in SGML's
| default concrete syntax
|
| - the HTML 5 spec added a procedural parsing algorithm (in
| addition to specifying a grammar-based spec) that basically
| parses every byte sequence as HTML via fallback rules; for most
| intents and purposes, the language recognized by these rules,
| taken to the extreme, is not what's commonly understood as HTML
|
| - WHATWG have added a number of element content rules on top of
| the HTML 4.01/HTML 5.0 baseline with ill-defined parsing rules
| (such as the content models for tables and description lists);
| the reason is precisely that, once Ian Hickson had distilled
| the HTML 4.01 DTD grammar rules into the HTML 5 grammar
| presentation as prose, WHATWG no longer used a formal basis
| for vocabulary extension
| dgrunwald wrote:
| But both the original question and that article were about
| XHTML. The non-well-formed mess only matters for HTML, not
| XHTML. The regex is a valid answer to the stack overflow
| question.
| WesolyKubeczek wrote:
| Sorry, but no.
|
| The original question only mentions the author wants to
| ignore XHTML-style self-closing tags which is in no way
| implying the input to be well-formed XHTML.
| Rapzid wrote:
| The CDATA section doesn't appear to match this first example
| from Wikipedia:
|
|     <![CDATA[<sender>John Smith</sender>]]>
|
| So it's either bugged due to the spaces or there is something
| else going on I don't understand; complexity.
|
| Agree on the sentiment. Through a "Simple made easy" lens, easy
| or hard, there is a lot of complexity in the task and this
| solution..
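
One way to check the CDATA concern is to port the quoted pattern to Python and run it. The sketch below rests on two assumptions that the thread does not confirm: that the pattern is meant to be read in an extended, whitespace-ignoring mode (the inline `#` comments suggest this), and that the doubled `""` comes from a .NET verbatim string, so it becomes a single `"` here and `(?<tag>...)` becomes `(?P<tag>...)`. The names `ATTRS` and `TAG_RE` are made up for this example, and `re.DOTALL` is an added choice so comments may span lines.

```python
import re

# Attribute soup shared by three alternatives: quoted strings or any
# character that is not '>', '/', or a quote.
ATTRS = r"""(?: "[^"]*" | '[^']*' | [^>/'"] )*"""

TAG_RE = re.compile(
    rf"""
      <!-- .*? -->                 # comment
    | <!\[CDATA\[ .*? \]\]>        # CDATA section
    | <!DOCTYPE {ATTRS} >          # doctype
    | <\? .*? \?>                  # XML declaration or processing instruction
    | < \w+ {ATTRS} />             # self-closing tag
    | < (?P<tag> \w+ ) {ATTRS} >   # opening tag, the only captured case
    | </ \w+ \s* >                 # end tag
    """,
    re.VERBOSE | re.DOTALL,
)

cdata_example = "<![CDATA[<sender>John Smith</sender>]]>"
cdata_matches = TAG_RE.fullmatch(cdata_example) is not None

doc = '<!-- <i> --><p class="x"><br />' + cdata_example + "</p>"
open_tags = [m.group("tag") for m in TAG_RE.finditer(doc) if m.group("tag")]
```

Under these assumptions the spaces in the pattern are layout only, so the Wikipedia example is consumed as a single CDATA token and the `<sender>` inside it is never reported as an opening tag.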
| dataflow wrote:
| Try that regex on
|
|     < script> console.log("<script2>"); </script>
|
| Edit 1: I'm unsure if the inner <script2> is valid (X)HTML, so it
| might not be an issue of being unable to parse correct (X)HTML,
| but rather an issue of being unable to detect invalid (X)HTML.
| (Can someone verify?)
|
| Edit 2: It seems Chrome chokes on the space... does anyone know
| if the initial space is valid? I'm pretty sure I've seen parsers
| that accept it...
| adament wrote:
| But it should be easy based on this example to include correct
| HTML tags in the script which the regular expression will emit.
| Or if you want to recognise HTML tags in the script, you can
| easily obfuscate their construction in the script using string
| concatenation.
| eCa wrote:
| I don't think any part of that is valid XML. There can't be
| space between < and the tag name, and I believe content
| containing tags should be in a CDATA section.
| goto11 wrote:
| The question is about XHTML, not HTML. HTML does not even have
| the self-closing tags the question is concerned about.
|
| In XHTML, either the opening angle bracket must be escaped, or
| the script should be in a CDATA section.
| capableweb wrote:
| > Edit 2: It seems Chrome chokes on the space... does anyone
| know if the initial space is valid? I'm pretty sure I've seen
| parsers that accept it...
|
| Most browsers probably can deal with it, but it's not valid
| xml/html. Try passing it through a validator, it'll complain
| about foreign characters after `<` and then complain about a
| trailing `</script>` as <script> was never opened in the first
| place.
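
The well-formedness claims in this sub-thread can be checked with a strict XML parser. This is a small sketch using Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` and the two repaired variants are illustrative additions, following the suggestions above to escape the opening angle bracket or to wrap the script in a CDATA section.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the strict XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


original = '< script> console.log("<script2>"); </script>'
escaped = '<script>console.log("&lt;script2>");</script>'
cdata = '<script><![CDATA[console.log("<script2>");]]></script>'

results = {
    name: is_well_formed(text)
    for name, text in [("original", original), ("escaped", escaped), ("cdata", cdata)]
}
```

The original snippet is rejected, while both repaired variants parse and yield the same script text. This says nothing about what a lenient HTML parser in a browser would do with the same input.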
| chubot wrote:
| This conversation would be a lot clearer with a distinction
| between "regexes" and "regular languages". The former is what
| Perl/Python/etc. have, and the latter is a mathematical concept
| (and automata-based non-backtracking engines like RE2, re2c, and
| rust/regex are closer to this set-based definition).
|
| https://www.oilshell.org/blog/2020/07/eggex-theory.html
|
| With those definitions, this part of the snarky answer is wrong:
|
| _HTML is not a regular language and hence cannot be parsed by
| regular expressions_
|
| That is, regular expressions as found in the wild can parse more
| than regular languages. (And that does happen to be useful in the
| HTML case!)
|
| This answer is also irrelevant, since the poster is asking for a
| solution with regexes, NOT regular languages:
|
| _I think the flaw here is that HTML is a Chomsky Type 2 grammar
| (context free grammar) and a regular expression is a Chomsky Type
| 3 grammar (regular grammar)._
|
| In this post, the example given IS a regex, but it IS NOT a
| regular language:
|
|     <!-- .*? --> # comment
|
| The nongreedy match of .*? isn't a mathematical construct; it
| implies a backtracking engine.
|
| I gave my analysis here and listed 3 or 4 caveats:
| https://news.ycombinator.com/item?id=26359556
|
| I prefer to use regular languages and an explicit stack, but this
| is not really what the original question was asking.
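
A rough sketch of the "regular patterns plus an explicit stack" style described above: the regex only recognises individual tokens, and nesting is tracked in ordinary code. The names `TOKEN_RE` and `unclosed_tags` are hypothetical, and attribute handling is deliberately simplified (it assumes no `<` or `>` inside attribute values), so this is an illustration rather than a conforming XHTML tokenizer.

```python
import re

# One alternative per token kind; none of them tries to match nesting.
TOKEN_RE = re.compile(
    r"""
      <!--(?:[^-]|-[^-])*-->           # comment (skipped)
    | <(?P<selfclose>\w+)[^<>]*/>      # self-closing tag
    | <(?P<open>\w+)[^<>]*>            # opening tag
    | </(?P<close>\w+)\s*>             # end tag
    """,
    re.VERBOSE,
)


def unclosed_tags(markup):
    """Return the opening tags that are still open at the end of the input."""
    stack = []
    for m in TOKEN_RE.finditer(markup):
        if m.group("open"):
            stack.append(m.group("open"))
        elif m.group("close"):
            # Pop only when the end tag matches the innermost open tag;
            # mismatched end tags are ignored in this sketch.
            if stack and stack[-1] == m.group("close"):
                stack.pop()
    return stack


still_open = unclosed_tags("<div><p>text<br/><b>bold</b>")
```

Here the stack does the work a context-free grammar would otherwise do, which is why each token pattern can stay simple.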
| a1369209993 wrote:
| > This conversation would be a lot clearer with a distinction
| between "regexes" and "regular languages".
|
| Very much so.
|
| > In this post, the example given IS a regex, but it IS NOT a
| regular language: `<!-- .*? --> # comment` The nongreedy match
| of .*? isn't a mathematical construct; it implies a
| backtracking engine.
|
| Actually, that's [edit: "it IS NOT a regular language"] wrong,
| at least in principle. If you're limiting it to only the
| shortest match (which is how HTML (and most other) comments
| actually work), then (just like `(abc)+` is shorthand for
| `(abc)(abc)*`) `<!-- .*? -->` is shorthand for (assuming I
| haven't made a mistake in the for-lack-of-a-better-word-
| arithmetic):
|
|     <!-- ([^ ]| [^-]| -[^-]| --[^>])* -->
|
| That is, shortest-repetition-only can be implemented in a
| purely regular system.
|
| On the other hand, if you want to allow longer matches when
| actually needed, then for purely-regular purposes (where it
| either matches or not) `<!-- .*? -->` is just a weird way of
| writing `<!-- .* -->` (which is quite obviously a regular
| language).
| goto11 wrote:
| Does your regex assume that "-->" must be prefixed by a
| space? This is not the case in XML. (Also the string "--"
| must not occur inside a comment, so the last clause is not
| necessary.)
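
If "--" may not occur inside a comment, as the XML 1.0 comment production requires, a comment can be matched without a non-greedy quantifier at all. The sketch below compares such a pattern with the lazy one; `STRICT_COMMENT` and `LAZY_COMMENT` are names made up for this example, and the strict pattern ignores the XML restriction on certain control characters.

```python
import re

# Mirrors: Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
STRICT_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")
# The lazy form used in the quoted regex, without the layout spaces.
LAZY_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

sample = "<a><!-- first --><b><!-- second\nline --></b></a>"
strict = STRICT_COMMENT.findall(sample)
lazy = LAZY_COMMENT.findall(sample)

# A comment containing "--" is not well-formed XML; only the lazy form accepts it.
malformed = "<!-- a -- b -->"
strict_on_malformed = STRICT_COMMENT.findall(malformed)
lazy_on_malformed = LAZY_COMMENT.findall(malformed)
```

On well-formed comments the two agree; they differ only on input that the XML grammar already rules out.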
| a1369209993 wrote:
| > Does your regex assumes that "-->" must be prefixed by a
| space?
|
| Yep, because the quoted regex assumed the same thing, and I
| didn't see a point in editorializing more than necessary.
|
| > Also the string "--" must not occur inside a comment, so
| the last clause is not necessary.
|
| "Must not" seems unreliable in webpage parsing. What page
| does your XHTML parser produce when fed text of the form
| `<!-- --is this a comment? <a href=-->`, for example?
| chubot wrote:
| Yeah I think you are right -- the nongreedy match can be
| simply written as a more awkward pattern. I think regular
| language negation and intersection also help (which the rarer
| derivatives-based implementations seem to have). They are
| still regular and equivalent in power, but there's a richer
| set of operators for writing patterns.
|
| I would also divide it into the "recognition problem" and the
| "parsing/capturing problem".
|
| Recognition is just yes/no -- does it match? And notably that
| excludes common regex APIs that either return the end
| position (Python's re.match) or search (re.search). It is
| more like math -- is it in the set or not?
|
| For the recognition problem, there's no difference between
| greedy and non-greedy, in terms of the result. (It does
| matter in terms of performance, which shows up in practice in
| backtracking engines!)
|
| But parsing / capturing is more relevant to programmers. I
| don't remember all the details but there is some discussion
| on the interaction between greediness and capturing here:
| https://swtch.com/~rsc/regexp/regexp2.html
|
| It looks like it can all be done in linear time and RE2
| supports it fine.
|
| So in that case maybe the 2 questions are closer than I
| thought. I have a very similar set of regexes that I use for
| parsing (my own) HTML, but I use one regex for each kind of
| token, and I do it with the help of an explicit stack.
|
| I'm using Python's engine so I think it is better to separate
| each case, for performance reasons. But maybe with RE2 it
| would be fine to combine them all into one big expression as
| is done here, for "parallel" matching. The expression is a
| little weird because it's solving a specific problem and not
| something more general (which is useful and natural).
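
The difference between recognition and capturing can be seen directly in Python's `re`. In this small sketch (the sample strings are arbitrary), greedy and lazy patterns give the same yes/no answer for whole-string matching, but report different end positions when used as a prefix match.

```python
import re

text = "<!-- a --> b -->"
greedy = re.compile(r"<!--.*-->", re.DOTALL)
lazy = re.compile(r"<!--.*?-->", re.DOTALL)

# Recognition: is the whole string in the language? Both patterns agree.
same_language = all(
    bool(greedy.fullmatch(s)) == bool(lazy.fullmatch(s))
    for s in [text, "<!-- a -->", "<!-- a", "x<!-- a -->"]
)

# Capturing: where does the match end? Here the two patterns disagree.
greedy_end = greedy.match(text).end()
lazy_end = lazy.match(text).end()
```

The greedy pattern runs to the last "-->" and the lazy one stops at the first, even though both describe the same set of strings.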
| a1369209993 wrote:
| > common regex APIs that either return the end position
| (Python's re.match) or search (re.search).
|
| Yeah, I probably should have explicitly said that's what
| the first translation (`[^a]|a[^b]|ab[^c]|...`) was for;
| it's an optimization (possibly-backtracking-parser ->
| [ND]FA) I've used a couple of times to beat things into
| guaranteed O(N) time.
|
| > But parsing / capturing is more relevant to programmers.
|
| I'd debate "more", since that's an additional thing _on top
| of_ matching and searching. Any case where you need the
| former, you also need the latter to even know what _to_
| capture. But it's definitely something you _do_ frequently
| need.
| amelius wrote:
| 1. determine size of XHTML input
|
| 2. build regex that works up to the size determined in step 1.
|
| 3. apply regex
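
A toy version of this recipe, using parentheses rather than XHTML tags to keep it short: once the nesting depth is bounded, a plain regular pattern can be generated mechanically. The function `nested_parens` is a hypothetical helper written for this illustration.

```python
import re


def nested_parens(max_depth):
    """Pattern for text whose parentheses nest at most max_depth levels deep."""
    pattern = r"[^()]*"  # depth 0: no parentheses at all
    for _ in range(max_depth):
        # Each round wraps the previous pattern in one more allowed level.
        pattern = rf"(?:[^()]|\({pattern}\))*"
    return pattern


depth1 = re.compile(nested_parens(1))
depth2 = re.compile(nested_parens(2))

sample = "(a(b)c)"
fits_depth1 = depth1.fullmatch(sample) is not None
fits_depth2 = depth2.fullmatch(sample) is not None
```

The generated pattern uses no recursion or backreferences, so it stays regular; the price is that the pattern must be rebuilt whenever deeper input is expected.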
| shmageggy wrote:
| It's not explicitly stated, but I believe the author's point is
| that the original question didn't require a recursive solution
| (because it's only asking about individual tags, not matching
| opening tags with their closing partners)
|
| Edit: yes looking at the answers, someone pointed this out in a
| comment response to the "Chomsky" answer:
|
| > _The OP is asking to parse a very limited subset of XHTML:
| start tags. What makes (X)HTML a CFG is its potential to have
| elements between the start and end tags of other elements (as in
| a grammar rule A -> s A e). (X)HTML does not have this property
| within a start tag: a start tag cannot contain other start tags.
| The subset that the OP is trying to parse is not a CFG. - LarsH
| Mar 2 '12 at 8:43_
| inopinatus wrote:
| Look again. The context is definitely one of identifying tags
| that should be closed, but aren't, and being identified thus in
| order to normalise a malformed document into well-formed XHTML.
| It is not prominently stated, but it is nevertheless the case.
| Consequently, balance matters. The clue is in the title: "match
| open tags", for which the body text has only one part of the
| author's train of thought (i.e. identifying the opening tag).
| Further discussion of matching the closing tag is omitted from
| the body of the question, and many people forget (or disregard)
| the nuanced difference in the title at this point in their
| read-through, not stopping to ponder "why the heck would they
| just want the opening tags? what have I missed here about the
| context?" and instead treating it like some particularly awful
| and weirdly contrived exam question, as in the article at the
| top.
|
| The final confirmation of all this is in the question history
| of the question author around the same date: it's definitely
| what they were working on.
|
| Consequently, our zalgo-spewing correspondent has it right,
| they should use a parser, and in addition the question
| should've received feedback early to help them describe the
| context more clearly.
|
| That all this was never properly clarified is a failure of
| moderation, and further a demonstration of how many folks
| struggle to challenge (or even identify) their own assumptions.
| goto11 wrote:
| "Match open tags" obviously refers to writing a regular
| expression which matches opening tags, not to pair opening
| tags with end tags.
|
| If you look at the regex which the OP themself suggests, it
| is clearly only intended to match opening tags (excluding
| self-closing tags), not search for corresponding end-tags.
| inopinatus wrote:
| Well, as now expressed at hopefully sufficient length, it
| _doesn't_ just say solely that, unless one a) disregards
| the difference in phrasing, and then b) disengages any
| sense of purpose and practical utility and instead treats
| it like a badly worded test question.
|
| It's kind of a shibboleth, in a way, for developer
| sensitivity to actual needs, as opposed to getting hung up
| on how clever they are.
| jcelerier wrote:
| > disengages any sense of purpose and practical utility
| and instead treats it like a badly worded test question.
|
| Who are you to know the purpose and utility better than
| the person who asked the question?
| inopinatus wrote:
| Oh, are they here? It'd be fascinating to hear from them.
|
| Alternatively, perhaps, that hostile tone is suggesting
| I'm personally unqualified to interpret loosely framed
| questions? I suppose, since I've only been doing it for a
| few decades, I'm definitely a novice by any standard, and
| my tendency to observe and follow up on anomalous,
| incomplete, subtly conflicting, or otherwise inexplicable
| requirements by investigating both the timeline and
| substance of the original context, and the apparent
| motivations and outcome preferences of the author, is
| sheer beginner's luck, and any uptick in stakeholder
| utility that from time to time accompanies amending
| recommendations following such investigative and
| analytical activity a sheer coincidence! So that must be
| it - as you can probably tell from all this smug, empty
| bravado, I'm really just sharing pure speculation, wild
| guesswork, total fluke, impertinent leaps of inferential
| faith, only just grasping at the vague outline of my own
| blind spots et cetera et cetera, and consequently _yes_,
| I'd love to hear the original intent restated from the
| horse's mouth, too; but, for the meantime, I'll read the
| tealeaves, systematically analyse and synthesise to the
| best of my ability, describe and discuss any substantive
| points of comprehension that I think might help enrich,
| or at any rate challenge, a reader's perspective
| (including my own), and cross my fingers hoping to read,
| mark, and inwardly digest what new understanding or
| revealed wisdom as I can - even when it comes, as it has
| there and here, via diacritical allegory and dialectical
| hellfire.
|
| Or, finally, if you just want the TL;DR version, the
| reason I feel unshakeably comfortable asserting that the
| question author's actual purpose is normalising a
| nonconforming document into XHTML, by balancing open &
| close tags, is _because they said so_.
|
| It's in an answer comment about halfway down the first
| page.
|
| > "Can you provide a little more information on the
| problem you're trying to solve"
|
| _"Determining all the tags that are currently open,
| then compare that against the closed tags in a separate
| array"_
|
| If that's not enough, here are quotes from their other
| questions, posted in the minutes and hours prior:
|
| _"How to balance tags with PHP"_, and
|
| _"I need to write a function which closes the opened
| HTML tags."_
|
| Sometimes, we just have to bother reading what's in front
| of us.
| [deleted]
| jcelerier wrote:
| > Sometimes, we just have to bother reading what's in
| front of us.
|
| You are making an absolutely great example of that. I was
| absolutely not talking about the original SO post, but
| about the generally extremely entitled answers which
| assume the existence of a very specific X to the Y of a
| post.
| inopinatus wrote:
| If I understand correctly, you're suggesting "Who are
| you" wasn't directed at me personally, "the person"
| wasn't referring to the OP but all possible authors, and
| "the question" wasn't referring to, well, the original SO
| question at hand, but the class of all possible
| questions.
|
| If so, then I see, I think: perhaps it was more intended
| as "Who is anyone to know the purpose and utility _(of a
| question)_ better than the person who asks that question?"
|
| I can't answer that, since I agree with the sentiment.
| I'm only really discussing this one specific question,
| possible because there's so much extra context, and so
| much ambiguity in the original text. My personal hubris
| doesn't generalise to the class of all possible
| questions, and I'd struggle to sympathise with anyone
| making such an ambit claim.
|
| But then, maybe it wasn't even a question at all.
| shmageggy wrote:
| I have to ask, is this series of comments some kind of
| performance art or some sort of social experiment? Or is
| this unironically how you write/speak/act?
| Y_Y wrote:
| You don't have to ask. Isn't there a rule about fake
| curiosity here?
|
| I hope this is the last comment in this bad bad thread.
| inopinatus wrote:
| To someone paying enough attention to crystallize the
| trichotomy, all I can say is, thank you for reading.
| adament wrote:
| Does the proposed regular expression really handle embedded
| script content correctly? From my limited understanding of HTML,
| pretty much only </script> counts as closing the script contents
| and everything else is treated as part of the script.
| goto11 wrote:
| The question is about XHTML though, not HTML, which has a more
| complex syntax.
| jontro wrote:
| To me it's a bit ambiguous if the original question is about
| both html and xhtml. It's tagged with both.
| goto11 wrote:
| The headline says XHTML. HTML does not even have the self-
| closing tags the question is concerned about.
| megous wrote:
| Does it matter? You can create regex with lookahead for
| </script>. The point is that it's possible to solve the problem
| from SO this way, due to the nature of the problem, not that
| this particular expression is perfectly correct.
| anaerobicover wrote:
| It seems the article does defend "it's possible" for strictly
| regular expressions. To allow lookahead will be context-
| sensitive, so not in that spirit.
___________________________________________________________________
(page generated 2021-05-09 23:02 UTC)