[HN Gopher] You can't parse [X]HTML with regex (2009)
       ___________________________________________________________________
        
       You can't parse [X]HTML with regex (2009)
        
       Author : BerislavLopac
       Score  : 117 points
       Date   : 2021-03-05 09:57 UTC (13 hours ago)
        
 (HTM) web link (stackoverflow.com)
 (TXT) w3m dump (stackoverflow.com)
        
       | vbezhenar wrote:
       | You can't in the general case. But you can in lots of typical cases.
       | 
       | Actually real world HTML usually can't be parsed by any strict
       | parser, as it's not valid. It's just a machine-generated text
       | which pretends to be similar to HTML. So extracting some bits of
       | information with regexes often works.
        
         | syrrim wrote:
         | The html parser spec defines what every sequence of bytes
         | should parse into. It defines certain such sequences as
         | containing "errors", but it still defines exactly how they
         | should be parsed. There is no invalid html. Every browser
         | follows the spec, so every browser will parse the same html to
         | the same thing. This is true even if the html contains
         | "errors". The only checking most html receives is to make sure
         | it renders correctly in a browser. If you are writing your own
         | parser, you likely want it to do the same thing as every
         | browser. In that case, you should use a parser that conforms to
         | the spec.
        
         | retox wrote:
         | Exactly my experience on this. In a past life I've had to parse
         | valid HTML that was generated by a forum system; the user
         | submitted something akin to bbcode [b]this sort of thing[/b]
         | that was pre-parsed and converted to valid HTML, and then I had
         | to parse that fragment again after the fact.
         | 
         | Given the constraints it's entirely possible to parse (a subset
         | of) irregular grammar with regular expressions. Asking a
         | question along those lines on SO would have only elicited
         | responses that I/someone was doing it wrong.
         | 
         | I won't argue that it was or wasn't the wrong thing to do, but you
         | don't always get to pick your client.
        
         | lifthrasiir wrote:
         | I believe you really meant that you are frequently dealing with
         | HTML whose structure is already known in advance, not the
         | general HTML. Because...
         | 
         | > [...] real world HTML usually can't be parsed by any strict
         | parser [...]
         | 
         | There is the literal standard for parsing HTML [1]. Any
         | conformant implementation (and there are plenty) can of course
         | parse the real world HTML by definition. Just that you don't
         | always need the full HTML parser to do your job.
         | 
         | [1] https://html.spec.whatwg.org/multipage/parsing.html
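
A minimal sketch of the "use a tolerant parser" point, assuming Python and its
standard-library html.parser module. That module is forgiving of sloppy markup,
but it is not claimed here to implement the WHATWG algorithm linked above. The
names LinkCollector and extract_links are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    """Return every href found in the markup, in document order."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Unclosed <p>, unquoted attribute, stray </b>, unclosed <a>:
# none of these aborts parsing.
sloppy = '<p>Intro <a href=one.html>one</a></b><P><A HREF="two.html">two'
print(extract_links(sloppy))  # ['one.html', 'two.html']
```

For results that match what browsers do, a parser that follows the spec would
be the safer choice, as the comments above argue.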
        
           | karatinversion wrote:
           | > There is the literal standard for parsing HTML [1]. Any
           | conformant implementation (and there are plenty) can of
           | course parse the real world HTML by definition.
           | 
           | I believe GP was alluding to the fact that many actual
           | resources that declare themselves HTML are not spec
           | conformant, and thus can't be parsed by a parser that only
           | accepts valid HTML.
        
             | lifthrasiir wrote:
             | The distinction between "valid" and "invalid" HTML used to
             | matter once upon a time, but it no longer does at least for
             | agents (authors can still benefit from error-free HTMLs
             | because errors can distort their intents). Pretty much
             | every string can be parsed to HTML since HTML5 and all
             | errors are non-fatal, so many modern HTML parsers default
             | to ignore errors. There are parsers that can be configured
             | to abort on any error, but I don't think the GP intended
             | that.
        
               | tannhaeuser wrote:
               | True, and it's worth noting that since WHATWG HTML 5 has
               | usurped HTML and taken it ad absurdum, WHATWG's parsing
               | spec isn't actually useful nor representative at all of
               | what people usually think HTML is. Nor do people have to
               | follow WHATWG's (= bunch of Chrome developers) idea of
               | HTML any more than WHATWG followed others'.
        
               | chrismorgan wrote:
               | WHATWG's HTML spec is the _only_ thing that matters when
               | considering what HTML is, because it's what every browser
               | uses, which is the primary target of HTML.
        
               | detaro wrote:
               | WHATWG is not a "bunch of Chrome developers", and if you
               | want to understand what a browser does with HTML, it's
               | the place to look. "HTML, but not the HTML web browsers
               | mean" is a fairly niche concern.
        
           | tomashubelbauer wrote:
           | I think this is only true for HTML5, but previous versions of
           | HTML supposedly weren't specced well enough to write a
           | perfect parser. Fixing this was one of the goals of the HTML5
           | revision if I'm not mistaken.
        
             | lifthrasiir wrote:
             | You are correct, but I don't think they are even slightly
             | relevant in 2021.
        
             | bawolff wrote:
             | Web browsers exist, therefore it's possible to parse html,
             | even earlier versions.
             | 
             | You're right that the de jure spec did not match de facto
             | html, and browsers didn't necessarily agree with each
             | other. But that's always true. GCC has language extensions
             | that aren't part of the C spec, but you wouldn't say that C
             | is impossible to parse. Old html may have taken it up to
             | 11, but it's not fair to say it's impossible to parse.
        
               | uuidgen wrote:
               | No, not really. The browsers did guess a lot and did
               | standard-deviating parsing because the typical uses were
               | wrong and they had to work. Nobody would switch to a new
               | browser that doesn't work with existing pages.
               | 
               | Modern example - mXSS. Even though modern html has to be
               | valid xml the browser will, instead of giving an error
               | when served invalid html, transform what's given to make
               | it standard-compliant.
        
               | bawolff wrote:
               | Modern html by definition is not valid XML, unless you
               | are using the xml serialization of html5 which isn't
               | really recommended and nobody does.
               | 
               | Really no version of the official html spec was valid xml
               | other than XHTML which was never particularly popular.
               | 
               | But I don't really see your point. An implementation
               | having a different idea how to parse html than you think
               | is correct is not the same thing as something being
               | unparsable. It's a tautology that if there exists a
               | computer program to parse something then it is possible
               | to parse it with a computer program.
        
               | chrismorgan wrote:
               | Perl is famously unparseable: it's impossible to
               | determine a parse tree without executing the code.
               | 
               | (HTML, however, was never unparseable, merely
               | insufficiently defined.)
        
             | anoncake wrote:
             | Previous versions of HTML were based on SGML. You can write
             | a perfect SGML parser, but the developers of web browsers
             | couldn't be bothered.
        
       | ricardo81 wrote:
       | I think we've all (mostly?) tried it. It really is the Wild West
       | of the web when you're trying to parse other people's HTML,
       | though.
       | 
       | I've played around with this parser which is extremely quick.
       | https://github.com/lexbor/lexbor
        
       | Corazoor wrote:
       | Well, as stated that particular answer is both right and wrong...
       | 
       | Yes, you can not use "true" regular expressions to parse
       | recursive structures.
       | 
       | But the libraries that get used for regular expressions quite
       | often include non-regular extensions (and confusingly call the
       | resulting expressions still "regular").
       | 
       | Most notably, PCRE allows for recursive patterns via "(?R)". You
       | can absolutely parse arbitrary HTML with it.
       | 
       | In fact you can parse anything with that, including binary
       | formats. You just can't do it without recursively applying the
       | same "regex" again and again...
       | 
       | And precise error handling is basically impossible without
       | writing a proper lexer anyway, since your regex won't (can't,
       | really) tell you where it was thrown off. It either works or
       | doesn't, the "why" is left to the program to figure out...
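
A rough sketch of the lexer point above, assuming Python. Python's built-in re
module has no "(?R)" recursion, so this is not the PCRE approach the comment
describes. It shows the alternative: a plain regex used only as a tokenizer,
with a small stack tracking nesting so the program can report where the input
went wrong. The name check_nesting and the simplified tag pattern (no
attributes, comments, or void elements) are assumptions made for illustration.

```python
import re

# Tokenizer: bare opening or closing tags such as <div> or </div>.
# Deliberately simplistic: no attributes, comments, or void elements.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>")


def check_nesting(markup):
    """Return None if tags nest properly, else (offset, message)."""
    stack = []  # (tag name, offset) for each tag that is still open
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append((name, match.start()))
        elif not stack:
            return match.start(), f"unexpected </{name}>"
        else:
            open_name, _ = stack.pop()
            if open_name != name:
                return match.start(), f"expected </{open_name}>, found </{name}>"
    if stack:
        name, offset = stack[-1]
        return offset, f"<{name}> is never closed"
    return None


print(check_nesting("<div><b>x</b></div>"))  # None
print(check_nesting("<div><b>x</div></b>"))  # (9, 'expected </b>, found </div>')
```

The regex alone cannot count nesting depth. The stack supplies that memory, and
the match offsets supply the "where" that a single monolithic pattern cannot
report.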
        
       | Doctor_Fegg wrote:
       | "This post is locked to prevent inappropriate edits to its
       | content. The post looks exactly as it is supposed to look - there
       | are no problems with its content. Please do not flag it for our
       | attention."
       | 
       | Stack Overflow can be remarkably humourless at times.
        
         | captainmuon wrote:
         | Note that you cannot even vote on it, _and_ it is marked CW, so
         | it can doubly not give reputation. This reputation resentment
         | makes me sad...
        
         | kroltan wrote:
         | That is a defense against the humourless people that tried to
         | edit the post down into a more objective answer years after the
         | fact:
         | 
         | https://stackoverflow.com/posts/1732454/revisions?page=2
        
           | Doctor_Fegg wrote:
           | Yes, that was my point.
        
             | kroltan wrote:
             | Ah I see, I misunderstood your comment as referring to the
             | quote's clinical wording. Sorry!
        
       | imedadel wrote:
       | I've seen many people complain about StackOverflow, but this is
       | the best example I've ever encountered.
       | 
         | The question: How to match <p> <a href="foo">
       | 
       | The answers: _rants about how RegEx is not suitable for parsing
       | entire HTML._
       | 
       | Only the 5th answer starts to actually answer the question.
        
         | ajanuary wrote:
         | According to the post, the more important part of the question
         | is "what do you think", to which "I think you shouldn't,
         | because..." is a good answer.
        
         | chrismorgan wrote:
         | "You're asking the wrong question" is a valid response.
        
           | anoncake wrote:
           | But not a valid answer. That's what comments are for.
        
           | tomashubelbauer wrote:
           | Yes, but I wish people who like to assume and answer saying
           | so would still answer the question they think is wrong.
           | Context matters and I don't think you can determine that with
           | certainty nearly as often as some people online like to
           | think.
        
             | shawnz wrote:
             | In this case the answer is correct given the parameters of
             | the question: There is no way to have a regex that only
             | matches the things which OP wants to match, but not any of
             | the things OP doesn't want to match.
             | 
             | Given a specific situation, like a particular page or
             | something, sure, regexes are still a possibility for
             | solving the problem. The 2nd highest answer on the page
             | details exactly that. So what is the problem? Is every
             | single contributor obligated to artificially entertain the
             | OP's preconceptions before giving the advice which they
             | believe actually helps best? For example, if I were
             | knowledgeable about XML but not regex, should I just not
             | contribute in such a situation?
        
           | dgellow wrote:
           | Really, it depends on the context. You might be aware that's not
           | something to generally do and still want to know the answer
           | to the actual question.
        
           | josefx wrote:
           | Do you educate people about the complexity of the physics and
           | bureaucracy involved with defining the current time every
           | time someone asks you "what time is it?" Or do you avoid going
           | onto irrelevant tangents that get you labeled as crazy and
           | just tell them the current time?
        
             | IgorPartola wrote:
             | "What time is it?" isn't an invalid question. "How do I make
             | my hamster grow wings and fly?" is. How to parse HTML with
             | a RegEx is an in-between. For a specialized case, why not?
             | Answer that question, then provide a counter example to
             | show how it will be very fragile, then explain the theory,
             | then show a better way. IME that tends to work better to
             | teach someone what you think they should know.
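
A sketch of the "answer, then counterexample" approach suggested above, in
Python. It assumes the goal is to match opening tags such as <p> and
<a href="foo"> while skipping self-closing ones such as <br />. The pattern
OPEN_TAG is one illustrative attempt, not an answer taken from the Stack
Overflow page.

```python
import re

# A tag name, then anything up to the first ">", unless that ">" is
# immediately preceded by "/" (which marks a self-closing tag).
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>")

# Works on the kind of input the question shows.
simple = '<p><a href="foo"><br /><hr class="foo" />'
print(OPEN_TAG.findall(simple))  # ['p', 'a']

# Counterexample 1: a tag inside a comment is reported as if it were real.
# Counterexample 2: <br> is matched although it means the same as <br />.
fragile = '<!-- <p> --><br>'
print(OPEN_TAG.findall(fragile))  # ['p', 'br']
```

Whether those false positives matter depends on the markup being processed,
which is the detail the thread notes the question does not give.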
        
             | jasode wrote:
             | _> Do you educate people about the complexity of the
             | physics and bureaucracy involved with defining the current
             | time every time someone asks you "what time is it?"_
             | 
             | Maybe you're (inadvertently) making a caricature by using a
             | simple _"what time is it?"_ question but many user
             | questions are _under-specified_.
             | 
             | Because of that, Stackoverflow answerers in particular do
             | go into the extra complexities because it's part of its
             | _editorial DNA_ to restate the Q&A so it's a high-quality
             | _community knowledgebase_ instead of just answering the
             | direct question as stated. I tried to explain this hard-to-
             | grasp nuance previously:
             | https://news.ycombinator.com/item?id=21115438
             | 
             | But sometimes, this X-Y problem editorializing mechanism
             | gets so enthusiastic that it can detract from a correct
             | answer. Here's a famous example of a string bytes
             | extraction question with smart people arguing with the
             | correct answers from user541686 (was Mehrdad) and Michael
             | Buen:
             | 
             | + correct answer has lots of X-Y pushback in the comments:
             | https://stackoverflow.com/questions/472906/how-do-i-get-a-
             | co...
             | 
             | + another correct answer from Buen that emphasizes
             | user541686/Mehrdad works for broken unpaired surrogates:
             | https://stackoverflow.com/questions/472906/how-do-i-get-a-
             | co...
             | 
             | The meta layer issue is that the question is
             | _underspecified_ which causes 2 sides with very intelligent
             | people arguing whether or not it's an X-Y problem!
        
               | shawnz wrote:
               | I think the top answer in your example is highly
               | misleading and deserves to have the caveats highlighted
               | more clearly even though it's not "wrong". It is saying,
               | "you don't need to worry about encoding", but really the
               | point it is proving is "if you just use ONLY toCharArray
               | and BlockCopy on ONLY one system and framework version
               | then you can be sure they always use the same encoding as
               | one another, so in that situation you don't need to
               | worry".
               | 
               | So, the solution works, but only in specific situations
               | which are not clearly explained and might be totally
               | unrelated from OP's situation, and furthermore it doesn't
               | really address the second part of OP's question "why take
               | encoding into consideration?" I wouldn't necessarily call
               | the problems with that answer just "XY pushback".
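
A small illustration of why encoding matters when turning a string into bytes,
assuming Python. The answer under discussion is about C# and .NET; this is only
an analogy, not that answer's code. The variable names are illustrative.

```python
text = "h\u00e9llo"  # "héllo", five characters

# The same string becomes different byte sequences under different encodings.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
print(len(text), len(utf8), len(utf16))  # 5 6 10

# Bytes only round-trip when decoded with the encoding that produced them.
assert utf16.decode("utf-16-le") == text

# An unpaired surrogate is not valid text, so a strict encoder rejects it.
lone = "\ud800"
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# Copying the raw code units keeps it, but the result is only meaningful
# to a reader that makes the same assumption about the byte layout.
raw = lone.encode("utf-16-le", errors="surrogatepass")
print(raw)  # b'\x00\xd8'
```

This mirrors the caveat above: a raw copy of code units round-trips only while
both sides agree on the in-memory representation.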
        
           | anon946 wrote:
           | If the question were about full validated parsing of HTML
           | with a regex, then I'd agree that "You can't do that" might
           | be part of a valid answer. But finding tags is not doing a
           | full validating parse.
           | 
           | Note that the set of valid C programs is not a context-free
           | language. Yet it's common to use a context free-based
           | approach to parsing. You just add additional code to handle
           | the context-sensitive aspects (such as a symbol table).
        
           | bill_mon wrote:
           | I disagree, I see it as saying "you don't know what you
           | really want, but I can read your mind". It's disrespectful
           | and not giving the benefit of the doubt.
        
             | lordgrenville wrote:
             | Strongly disagree. The point of SO is for experts to answer
             | questions. They've learned things the hard way and would
             | like to help others do better. They're not being paid. As
             | such, telling the questioner that their whole approach is
             | wrong is appropriate and even preferable.
             | 
             | From what I've heard Jeff Atwood and Joel Spolsky had
             | different views on this and Spolsky's more tolerant, "no
             | such thing as a stupid question" approach won out within
             | the company, but is less popular among the people who write
             | answers.
        
             | shawnz wrote:
             | I don't think it is disrespectful to suggest someone is
             | falling victim to the XY problem.
             | 
             | Actually I think it is a common and expected outcome that
             | when investigating a new problem, we often get stuck in "XY
             | problem" traps while researching the solution.
             | 
             | I very much value any feedback that suggests I should
             | rethink the entire problem with a simpler model, because
             | without experience it's hard to know what the simplest
             | models are.
        
               | anaerobicover wrote:
               | Absolutely agree. In my experience this is one of the
               | more valuable features of asking someone to discuss a
               | problem I'm mired in. Because they haven't been looking
               | under every rock and studying the bark of every tree like
               | I have, they're very likely to quickly see when I've
               | wandered into entirely the wrong part of the forest.
        
             | ivanbakel wrote:
             | >It's disrespectful and not giving the benefit of the
             | doubt.
             | 
             | Unfortunately, a good number of users who post questions on
             | StackOverflow have _not_ earned the benefit of the doubt.
             | Browsing the site, you will occasionally come across
             | questions which are the tech equivalent of asking "Which
             | screwdriver is the right size to stick in this electrical
             | socket?"
             | 
             | Frame challenges are a necessary part of learning, so they
             | belong on a Q&A site. If a user doesn't want their problem
             | to be challenged, the onus is on them to clarify in the
             | question why their particular approach is the necessary
             | one. It's only possible to respond with alternative
             | solutions when the problem is not specified enough.
        
               | kemitche wrote:
               | The problem is, the answers are useful to more than just
               | the original questioner. Sure, the questioner may be
               | doing things vastly wrong - but the people who land on
               | that question's page via search may have legitimate
               | reasons for doing things a certain way.
               | 
               | The silent majority of viewers will benefit from an
               | answer that does both of (1) explaining why the answer is
               | probably not what is wanted, and (2) answering the
               | initial question _as written_ anyway, for future viewers.
        
               | ivanbakel wrote:
               | Then those other viewers will either benefit in the same
               | way from the frame challenge as a learning experience, or
               | they will have a sufficiently-specific problem that they
               | can ask their own questions with more justification for
               | taking a specific approach.
               | 
               | Answering the question as written has the risk that any
               | solution will be blindly applied without appreciating why
               | the approach itself should be avoided. This is especially
               | true for those users who see SO as a "write my code"
               | site, and copy-paste anything in backticks.
        
               | jstanley wrote:
               | > Which screwdriver is the right size to stick in this
               | electrical socket?
               | 
               | Note that this is a legitimate technique in UK sockets.
               | 
               | The live and neutral pins have a little gate over them
               | that is retracted when you insert the earth pin, so you
               | need to first stick a screwdriver into the earth pin in
               | order to get your fingers into the live pin.
        
               | jerven wrote:
               | I can't parse if this is humour or a mistake. Putting
               | your fingers on the live pin is not a great idea, trying
               | this to get a euro 15 plug into a UK socket, also not
               | great but in a different category.
        
               | Keyframe wrote:
               | Well, there's also a mains tester screwdriver which is a
               | legit tool that you stick into a socket and also
               | participate in the electric current loop for the light
               | on it to light up.
        
               | bill_mon wrote:
               | Good points.
               | 
               | I'm not so sure benefit of the doubt must be earned. More
               | like, any participant in a discussion forum must show it
               | when answering, and do proper research before asking
               | anything. If all questions are good questions, there's no
               | problem. But, as you say, they really aren't. I think
               | poor questions should be downvoted with a brief
               | explanation instead of trying to answer the "real"
               | question. Or moved to a Frame challenge forum.
               | 
               | Are we trying to answer the question or to solve the
               | problem?
        
               | shawnz wrote:
               | > do proper research before asking anything.
               | 
               | Asking on SO is itself research. It is good to review the
               | existing literature before taking contributors' time, of
               | course, but if the problem is not solved in the existing
               | literature, then perhaps the framing issue isn't
               | addressed by the existing literature either. In that case
               | how could the learner know the best way to frame the
               | problem in advance?
               | 
               | > I think poor questions should be downvoted with a brief
               | explanation instead of trying to answer the "real"
               | question. Or moved to a Frame challenge forum.
               | 
               | This precludes the possibility that some contributors
               | might want to address the framing problem, whereas others
               | might want to address the specific question as asked.
               | They may have different opinions about whether it is
               | framed wrong at all. It also means the OP is losing karma
               | or getting penalized for no fault of their own.
        
             | tester34 wrote:
             | >It's disrespectful and not giving the benefit of the
             | doubt.
             | 
             | So what?
             | 
             | If OP actually knows that this is a bad approach, then OP
             | will clarify that he's aware and yada yada.
             | 
             | What's the problem? lack of thick skin?
        
             | bryanrasmussen wrote:
             | unfortunately sometimes people who ask questions are really
             | junior, and need to be told they are going to have an
             | unpleasant surprise if they go down the path they are
             | planning on going.
             | 
             | sometimes people who ask questions know the pitfalls but
             | don't clarify that they know adequately because they are
             | pressed for time. in this case those people unfortunately
             | run the risk of being talked down to and they should accept
             | that.
             | 
             | on the other hand if they have clarified adequately that
             | they know what they're doing and they still want to do
             | something that might seem weird then I agree it is
             | disrespectful. Which is a thing you see often enough on
             | StackOverflow to be notable.
        
               | umanwizard wrote:
               | Maybe so, but what about the non-junior person who needs
               | to do something weird for an actual valid reason and
               | stumbles on the refusal to answer the question years
               | later? StackOverflow answers aren't just for the original
               | asker.
        
               | andybak wrote:
               | I think in that case - the new person should probably
               | post a new question.
               | 
               | The point is that the original question - as framed - was
               | better served by saying "if you go back a step and
               | reexamine your assumptions, you'll find there is a better
               | path to your intended goal".
               | 
               | The new person has a different goal or a different set of
               | constraints.
        
               | detaro wrote:
               | Because asking new questions and getting them closed as
               | duplicate because they sound vaguely similar to an
               | existing question is sooo helpful an experience...
        
               | andybak wrote:
               | Yeah - but I'm just playing with hypothetical ideal cases
               | here. "Annoying flawed habits of Stack Overflow
               | moderators" isn't something that's on my list of things
               | I'm thinking about. ;-)
               | 
               | EDIT - which got me thinking. Maybe the "correct" thing
               | to do is answer the original question _as asked_ but
               | gently point out to the person asking it that there is
               | probably a better solution for them if only they had
               | asked a different question.
               | 
               | The original question still stands and has an answer
               | useful for other people. The original questioner has the
               | opportunity to learn and ask the question they should
               | have asked in the first place.
               | 
               | It's going to be annoying for someone - so it should at
               | least be the person that kicked things off in the first
               | place.
        
             | andybak wrote:
             | > It's disrespectful
             | 
             | How do you respectfully tell someone you think they are
             | mistaken? I'd rather not be pussyfooted around by someone
             | if I'm in the role of "person who has asked a question
             | based on a faulty assumption". Don't be rude but don't
             | avoid trying to answer truthfully to the best of your
             | ability.
        
               | bill_mon wrote:
               | > How do you respectfully tell someone you think they are
               | mistaken?
               | 
               | How about "you're mistaken"?
               | 
               | The problem is with "You don't know what you're talking
               | about, but I do, so let me answer your real question".
        
               | andybak wrote:
               | The wording used was "You're asking the wrong question"
               | not "You don't know what you're talking about".
               | 
               | I find that perfectly fine. It was slightly disingenuous
               | to reword it.
        
           | andybak wrote:
           | https://en.wikipedia.org/wiki/XY_problem
           | 
           | https://xyproblem.info/
           | 
           | https://meta.stackexchange.com/questions/66377/what-is-
           | the-x...
        
           | mumblemumble wrote:
           | Only when you know enough about the person's context to be
           | able to tell them what question they should be asking
           | instead.
           | 
           | If you don't have that context, then the correct thing to do
           | is to ask for more information, or say, "did you consider
           | this", or find some other way to come up with a constructive
           | response. You don't just assume you know what the person
           | really wants to do and then try to mansplain it to them.
        
         | EamonnMR wrote:
         | I find this type of answer infinitely more palatable than "your
         | question is answered here" or "comments are not for extended
         | discussion, this conversation has been moved to chat"
        
         | armada651 wrote:
         | Actually, if you read his rant all the way to the end he does
         | offer a helpful suggestion:
         | 
         | > Have you tried using an XML parser instead?
        
           | andybak wrote:
           | Except he's wrong in this case. The OP could use a regex in
           | this specific scenario.
        
             | chrismorgan wrote:
             | Not true. The questioner has not provided anywhere near
             | enough detail to determine if regular expressions are
             | sufficient. For example: should <br> match, or not? Its
             | semantics are identical to <br />. To determine if regular
             | expressions are enough, you would need to know exactly what
             | markup you're dealing with, and that has not been provided.
        
               | andybak wrote:
               | Yeah. I guess in other parts of this discussion I'm
               | arguing for always probing hidden assumptions and missing
               | background whilst here I'm saying "let's interpret the
               | question in the most charitable way possible".
               | 
               | Plus - Stack Overflow is about trying to generalize any
               | given question to maximize its wider usefulness.
        
               | bhaak wrote:
               | > Plus - Stack Overflow is about trying to generalize any
               | given question to maximize its wider usefulness.
               | 
               | Since when? You don't get extra points if you write stuff
               | that doesn't concern OP's problem. Most SO problems don't
               | go viral and get lots of upvotes from other people. From
               | a game theory perspective, it doesn't make sense to add
               | more to an answer than to make it the accepted one.
               | 
               | If you have slightly different constraints you are
               | encouraged to open another question. Discussions are
               | frowned upon and sometimes even interrupted by admins so
               | you can't discuss if your situation is different from
               | OP's situation and so could warrant a different answer.
        
               | chrismorgan wrote:
               | This has always been the _intent_ of Stack Overflow, from
               | its very earliest days. One of the stock reasons for
               | closing a question used to be "too specific--this
               | question is unlikely to help anyone else" (or words to
               | that effect), though that has been removed now (I think
               | because it upset too many people who took its blunt
               | message the wrong way). People have always been nudged
               | towards adjusting questions so that they'll be generally
               | useful.
               | 
               | Discussions on questions are routinely about unrelated or
               | not-closely-related matters, and quite apart from that
               | Stack Overflow wants to be a Q&A platform, not a
               | discussion platform.
        
         | commandlinefan wrote:
         | https://www.aprogrammerlife.com/top-rated/stack-overflow-how...
        
         | omginternets wrote:
         | In all fairness, he does provide hints as to _why_ regex will
         | not work (namely: HTML is not a regular grammar).
         | 
         | Sure, it's somewhat obscured by the humorous rant, but it's not
         | that bad an answer, either.
         | 
         | More to the point: I'm not sure I want to suck the humor out of
         | everything. I agree that SO has problems, but humor and poetry
         | are worthwhile things in otherwise serious places. It's all
         | about quantity.
        
         | seiferteric wrote:
         | Yes, I have been down-voted and scolded for answering a
         | question as literally described simply because others would
         | rather assume the ignorance of the questioner. Yes I know
         | people will often ask a question due to not understanding what
         | they are doing, but when 10 other people have already responded
         | with "don't do it that way!" I think it can be useful to
         | actually answer the question as stated (if possible).
        
       | a1369209993 wrote:
       | You can't parse _fully-general_ HTML with regex, but unless
       | you're writing a web browser or something, that's not what you're
       | trying to do; you're trying to parse the particular _subset_ of
       | HTML that happens to be emitted by this particular website that
       | you got the HTML to be parsed from. And, much like the halting
       | problem or integer factorization, despite the general case being
       | difficult or impossible, the overwhelming majority of _specific_
       | cases are easy.
        
       | loxias wrote:
       | This never gets old.
        
       | JackeJR wrote:
       | Counter argument: Oh Yes You Can Use Regexes to Parse HTML!
       | 
       | https://stackoverflow.com/questions/4231382/what-to-do-regul...
       | 
       | Discussion: https://news.ycombinator.com/item?id=26357237
        
         | lucideer wrote:
         | Counter-counter-argument, one of the comments underneath that
         | answer:
         | 
         | > _what you have written is not really a regular expression
         | (modern, regular, or otherwise), but rather a Perl program that
         | uses regular expressions heavily. Does your post really support
         | the claim that regular expressions can parse HTML correctly? Or
         | is it more like evidence that Perl can parse HTML correctly?_
        
         | harveywi wrote:
         | Do the Halting Problem next.
        
       | xg15 wrote:
       | Yes, it's hopeless to try and parse arbitrary _HTML_ with regexes
       | (or with anything really. I believe the full HTML5 parsing
       | algorithm is not even type-2, it's more or less Turing-complete.
       | In-the-wild HTML can also interact with scripting in all kinds of
       | entertaining ways, to the point that a conforming HTML5 parser
       | has to be able to execute JavaScript _while parsing_ - and the
       | JavaScript can inject additional tokens into the parser's input
       | stream. It's possible to create an HTML5 document that only
       | validates on every second Thursday of the month.)
       | 
       | However, if we know the document is well-formed(!) XHTML,
       | shouldn't it be possible? This would mean the document is valid
       | XML and XML was specifically designed to be regex-friendly, I
       | believe.
       | 
       | At least, off the top of my head, the only gotchas that have to be
       | accounted for are comments and CDATA sections - those may contain
       | arbitrary unescaped text, including angle brackets. However they
       | also have unambiguous start and end markers and can't be nested,
       | so a regex could account for them.
       | 
       | Attribute values should not be a problem, as `<` must be escaped
       | inside those to be valid XML (a literal `>` is allowed there, but
       | only between the quotes).
       | 
       | I'm not sure about processing instructions and doctypes though.
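A minimal sketch of the tokenizer described above, assuming the input is well-formed XML with no DOCTYPE internal subset. The names `TOKEN` and `tokenize` are made up for illustration. Comments and CDATA sections are tried before tags, and the tag rule skips over quoted attribute values, because XML permits a literal `>` inside them.

```python
import re

# Alternation order matters: comments and CDATA are tried before tags, so
# angle brackets inside them are never mistaken for markup.
TOKEN = re.compile(
    r"""
      (?P<comment><!--.*?-->)
    | (?P<cdata><!\[CDATA\[.*?\]\]>)
    | (?P<tag><(?:[^<>"']|"[^"]*"|'[^']*')+>)
    | (?P<text>[^<]+)
    """,
    re.DOTALL | re.VERBOSE,
)


def tokenize(xml):
    """Yield (kind, text) pairs; raise ValueError where no rule matches."""
    pos = 0
    while pos < len(xml):
        m = TOKEN.match(xml, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()


sample = '<p title="a>b">x<!-- <i> --><![CDATA[1 < 2]]></p>'
tokens = list(tokenize(sample))
```

This only splits the input into tokens. Matching start tags to end tags still needs a stack on top of it.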
        
         | chubot wrote:
         | Yes that's basically what I wrote about here:
         | 
         | https://news.ycombinator.com/item?id=26359556
         | 
         | I listed 3 or 4 caveats. CDATA might be another one since I'm
         | not handling those... I've never used it so I left it out.
         | 
         | Actually I remember that regular languages weren't entirely
         | enough; I think the .*? non-greedy matching was useful, e.g.
         | for finding --> at the end of comments.
        
         | theandrewbailey wrote:
         | On my blog, I write my posts in markdown. After it's converted
         | to HTML, I use regex to search and replace images (for high-res
         | and alternative formats), and get the first paragraph (for a
         | 'preview'). I've been doing this for years, so the 'never use
         | regex over HTML' advice isn't holding up for me.
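For HTML you generate yourself, the "first paragraph" preview can be a one-regex job. This is a hedged sketch, not the commenter's actual code: `first_paragraph` is a hypothetical helper, and it assumes paragraphs are never nested and always have an explicit `</p>`.

```python
import re

# "<p" must be followed by whitespace or ">" so that <pre> is not matched.
FIRST_P = re.compile(r"<p(?:\s[^>]*)?>(.*?)</p>", re.DOTALL | re.IGNORECASE)


def first_paragraph(html):
    """Return the inner HTML of the first <p> element, or None."""
    m = FIRST_P.search(html)
    return m.group(1) if m else None


preview = first_paragraph("<h1>T</h1>\n<p>First <em>one</em>.</p>\n<p>Second.</p>")
```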
        
       | taylodl wrote:
       | 2nd answer: _" While arbitrary HTML with only a regex is
       | impossible, it's sometimes appropriate to use them for parsing a
       | limited, known set of HTML."_
       | 
       | This. So much this. Yes, you can't parse arbitrary, unknown XML
       | with regex. But I don't find myself parsing arbitrary, unknown
       | XML very often. Usually I know _exactly_ what I'm expecting and
       | if I can't find the information I need then it's a problem. Regex
       | parsing is _perfect_ for this scenario - and much, much faster. I
       | created a regex parser for Java that even handles namespaces and
       | relative paths. Can it parse _every_ XML file? No - you can't
       | parse XML with regex. But I can parse everything I need to parse
       | - and if I can't? I can always fall back on full-featured XML
       | parsers.
        
         | chubot wrote:
         | All you need to parse HTML is regular expressions (to recognize
         | tags) and a stack (to match tags).
         | 
         | Your programming language has a stack -- a call stack.
         | 
         | So in practice all you really need is regular expressions.
         | (Which I tend to call "regular languages" to make a distinction
         | with Perl-style regexes [1], although they work fine too in
         | practice for this case)
         | 
         | Using the call stack in a more functional style is nicer than
         | using the OOP style that's in the Python standard library,
         | which is probably inherited from Java, etc.
         | 
         | I have done this with a bunch of HTML processors for the Oil
         | blog and doc toolchain:
         | 
         | https://github.com/oilshell/oil/tree/master/doctools
         | 
         | It works well in practice and is correct and fast.
         | 
         | Big caveat: this style is only for HTML that I generate myself,
         | e.g. the blog and docs. There are a bunch of rules around
         | matching tags in HTML5 that are subtle. Although one of the
         | points here is that you don't have to do a full DOM-style parse
         | and can ignore those rules for many useful cases.
         | 
         | The other caveat is that HTML has a bunch of rules for what
         | happens when you see a stray < or > that isn't part of a tag.
         | This style makes it a hard syntax error, so it's really a
         | subset of HTML (which has no syntax errors). For my purposes
         | that is a feature rather than a bug, basically following
         | Postel's law.
         | 
         | I meant to write a blog post titled "why/when you can parse
         | HTML with regexes" about this but didn't get around to it.
         | 
         | There is also a caveat where parsing arbitrary name=value pairs
         | with regexes isn't ergonomic, because it's hard to capture a
         | variable number of pairs. However the point is that I wrote 5
         | or 6 useful and compact HTML processors that don't need that.
         | In practice when you parse HTML you often have a "fixed
         | schema".
         | 
         | Concrete examples are generating a TOC from <h1>, <h2>, etc.
         | and syntax highlighting <pre><code> blocks. Those all work
         | great with the regex + call stack style.
         | 
         | [1] http://www.oilshell.org/blog/2020/07/eggex-theory.html
         | 
         | edit: for completeness, another caveat is that the stack-based
         | style is particularly bad for C++/Rust and arbitrary input
         | because you could blow the stack, although we already limited
         | the problem to "HTML generated ourselves"
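The "regex + call stack" style might look roughly like the sketch below. It is not the code from the linked doctools repository. `TOKEN`, `lex`, `parse_children` and `ParseError` are illustrative names, and it assumes every tag is closed explicitly (no void elements such as `<br>`, no self-closing syntax). A stray `<` or `>` is a hard error, as in the comment.

```python
import re

# One regex recognizes a start tag, an end tag, or a run of text.
TOKEN = re.compile(
    r"<(?P<close>/?)(?P<name>[a-zA-Z][a-zA-Z0-9]*)[^<>]*>|(?P<text>[^<>]+)"
)


class ParseError(ValueError):
    pass


def lex(html):
    pos = 0
    while pos < len(html):
        m = TOKEN.match(html, pos)
        if m is None:
            raise ParseError(f"stray character at offset {pos}")
        yield m
        pos = m.end()


def parse_children(tokens, open_name=None):
    """Collect children until the end tag for open_name; recursion is the stack."""
    children = []
    for m in tokens:
        if m.group("text") is not None:
            children.append(m.group("text"))
        elif m.group("close"):
            if m.group("name") != open_name:
                raise ParseError(f"unexpected </{m.group('name')}>")
            return children
        else:
            name = m.group("name")
            children.append((name, parse_children(tokens, name)))
    if open_name is not None:
        raise ParseError(f"missing </{open_name}>")
    return children


def parse(html):
    return parse_children(lex(html))


tree = parse("<ul><li>one</li><li>two</li></ul>")
```

All recursive calls share one token generator, so each call resumes where its child stopped. Nesting depth is bounded by Python's recursion limit, which is the stack caveat from the edit above.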
        
           | aidenn0 wrote:
           | Yes, HTML is a deterministic context-free language, so you
           | can parse with a DPDA.
           | 
           | In addition, tokens are regular (as it is for many
           | languages), so you can use a regex for tokenization.
           | 
           | All you need for writing a recursive descent parser with
           | backtracking is a call-stack, so are all LL(k) and most
           | practical CFGs also parseable with regex?
        
             | chubot wrote:
             | I am not sure if you can say that precisely -- I added the
             | caveats about the rules for end tags, and the rules for
             | stray < and >.
             | 
             | But certainly a useful subset of HTML can be parsed with a
             | DPDA. (I'd be interested in more analysis of that;
             | arbitrary tags are another factor)
             | 
             | It's a matter of opinion but I would say recursive descent
             | is "nontrivial", whereas matching tags is "trivial".
             | 
             | Recursive descent involves some choice around lookahead or
             | backtracking. It can be slow if you do the wrong thing,
             | hard to debug, etc. It takes a little practice, and
             | correspondence with a grammar is important.
             | 
             | Whereas matching HTML tags requires no lookahead, and I
             | would say anyone can figure it out just with a simple code
             | reading. Even the "inverted" OOP style is "simple", but
             | annoying for me to read and correctly modify. The call
             | stack reads much better.
        
           | c-cube wrote:
           | What you're describing is using regex to _lex_ html, not
           | parse it. The parser is easy to build once you have a lexer
           | (after all most of XML's unpleasantness comes from escaping).
           | It still isn't the same thing as parsing with regular
           | expressions.
           | 
           | Similarly you can't parse S-expressions with regular
           | expressions, but if you have the lexer (e.g. with `lex` or
           | other languages' equivalent) then the parser on top of it
           | becomes absolutely trivial.
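To make the lexer/parser split concrete, here is a small sketch for S-expressions. The regex only produces tokens; the nesting is handled by a short recursive reader. `SEXP_TOKEN`, `lex` and `read` are names chosen for this example, and atoms are kept as plain strings.

```python
import re

# A token is a parenthesis or a run of non-space, non-parenthesis characters.
SEXP_TOKEN = re.compile(r"\s*([()]|[^\s()]+)")


def lex(src):
    src = src.rstrip()
    pos = 0
    while pos < len(src):
        m = SEXP_TOKEN.match(src, pos)
        yield m.group(1)
        pos = m.end()


def read(tokens, tok=None):
    """Read one expression from the token stream."""
    if tok is None:
        tok = next(tokens)
    if tok == ")":
        raise ValueError("unexpected )")
    if tok != "(":
        return tok  # an atom
    items = []
    for inner in tokens:
        if inner == ")":
            return items
        items.append(read(tokens, inner))
    raise ValueError("missing )")


expr = read(lex("(a (b c) d)"))
```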
        
             | chubot wrote:
             | Right exactly, in fact I call the library "lazylex" because
             | it only does the minimum to find the <> structure.
             | 
             | If you want to recognize more, then you invoke an attribute
             | "name=value" lexer on the tag, but usually you don't. This
             | makes it quite fast (and speed is useful because most doc
             | toolchains are slow).
             | 
             | The lexing is the "generic" part and parsing is specific to
             | the task at hand.
             | 
             | In the case of making an HTML table of contents, you
             | literally just find <h1>, <h2> with regexes, and don't do
             | all this DOM nonsense. It's easier to write than the SAX
             | style with an explicit stack.
             | 
             | So yes the point is that parsing HTML is trivial for most
             | purposes: use the call stack. It helps if your language has
             | exceptions to indicate errors.
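A table of contents along these lines can be sketched in a few lines. This is an illustration rather than the lazylex code: `table_of_contents` is a hypothetical helper, and it assumes headings are not nested inside each other and always carry a matching end tag.

```python
import re

# \1 forces the end tag to use the same level as the start tag.
HEADING = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.DOTALL | re.IGNORECASE)
INNER_TAG = re.compile(r"<[^>]+>")


def table_of_contents(html):
    """Return (level, plain-text title) pairs in document order."""
    return [
        (int(m.group(1)), INNER_TAG.sub("", m.group(2)).strip())
        for m in HEADING.finditer(html)
    ]


toc = table_of_contents(
    '<h1 id="top">Intro</h1><p>text</p><h2>Set <code>up</code></h2>'
)
```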
        
         | polote wrote:
         | When I was working on a link parser in Python for a crawler, I
         | had two choices:
         | 
         | 1. use some form of regex
         | 
         | 2. use libxml and find links
         | 
         | 1. was faster than 2. by a factor of two
         | 
         | Would a link-only parser have been faster? Yeah, probably,
         | but it is much more complex to do
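For comparison, here are both approaches in standard-library Python. This is a sketch under assumptions, not the crawler's code: `regex_links` only handles quoted `href` values, and `LinkCollector` uses the event-based `html.parser` module instead of libxml. No speed claim is implied.

```python
import re
from html.parser import HTMLParser

# Option 1: a regex that only understands quoted href values on <a> tags.
HREF = re.compile(r"""<a\s[^>]*?href\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def regex_links(html):
    return HREF.findall(html)


# Option 2: an event-based parser that ignores everything except <a href>.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def parser_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


page = '<p><a href="/a">A</a> <A HREF=\'b.html\'>B</A><a name="x">n</a></p>'
```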
        
           | benlivengood wrote:
           | What was the average difference in accuracy?
           | 
           | It's a fair tradeoff especially for a crawler where it's
           | never guaranteed to reach all documents anyway.
        
           | contravariant wrote:
           | Maybe I'm confused about the problem you're solving but
           | regexp was faster than just:
           | 
           |     from lxml import html
           |     doc = html.fromstring(response.content)
           |     doc.xpath('//a/@href')
           | 
           | ?
        
         | xurukefi wrote:
         | In my experience, a regex can be a much more robust solution
         | than an XML parser depending on the use case. Back in the days
         | when I did some webscraping I often had parsers throwing
         | exceptions because of invalid HTML. More often than not I
         | switched to regular expressions, which always worked out
         | flawlessly.
        
           | metalliqaz wrote:
           | "always flawless" is a bold claim
        
             | mumblemumble wrote:
             | It's hard to imagine any web scraping being described as
             | "flawless". But there's a whole lot of room for "worked
             | fine for my purposes."
             | 
             | I think that it's actually a pretty great example of a case
             | where capturing data from HTML may not be best modeled as a
             | parsing task. You might not need a whole parse tree just to
             | match some pattern and grab an associated string. And
             | skipping the parse may enable you to get useful data out of
             | a file that technically can't be parsed due to syntactic
             | errors. It's a fairly classic precision/recall tradeoff
             | situation.
        
             | oeiiooeieo wrote:
             | "I've never lost playing Russian Roulette"
        
           | btilly wrote:
           | Yes. HTML isn't XML.
           | 
           | As you discovered, HTML in the real world is allowed to be
           | malformed in various ways, while XML is not. A compliant
           | parser MUST barf on various kinds of malformed text. (Search
           | for "fatal error" in https://www.w3.org/TR/REC-xml/ to
           | verify.) This makes XML parsers inappropriate for HTML
           | parsing.
           | 
           | Interestingly, even perfectly valid HTML may not be valid
           | XML. To see that consider <b>this <i>example</b>
           | carefully</i>.
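The fatal-error behavior is easy to observe with Python's bundled XML parser. In this small sketch, `is_well_formed_xml` is a helper name invented for the example. It feeds the mis-nested fragment above, and a properly nested variant, to `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Return True if the fragment parses as a single XML element."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


nested_ok = is_well_formed_xml("<p><b>this <i>example</i></b> carefully</p>")
misnested = is_well_formed_xml("<b>this <i>example</b> carefully</i>")
```

An XML parser stops at the mismatched `</b>`, while browsers recover from the same input and still build a tree.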
        
         | dragontamer wrote:
         | I found a one-pass top-down "XML parser". Not like a proper SAX
         | parser, no... the XML had to be specially formatted, almost
         | like TOML.
         | 
         |     <whatever> <--- Parser parsed this as a new section
         |     <foo attrib=bar> <--- All "XML" had to be one-per-line
         |     </whatever> <---- parser ignored this
         | 
         | It was an "XML parser" per se, but it really was just a linear
         | one-pass parser that tricked me into thinking it was XML.
         | 
         | So really, it was more like TOML (or .INI files) than like XML.
         | But I guess the advantage of making it "bastard XML" instead of
         | TOML is that maybe this worked with XML-editors or something. I
         | dunno...
        
           | cbsks wrote:
           | I've written an XML parser like this for a toy project. I
           | passed the XML through a prettifier first so that it was in a
           | standard format. It wouldn't work for every XML file, but it
           | worked on the files that I needed it for.
           | 
           | I have also had success searching through html files with
           | grep after passing them through a prettifier. It's ugly, but
           | 90% of the time, it works every time!
        
             | dragontamer wrote:
             | The specific project I saw that did this had ALL of its
             | configuration files in this "Bastard XML" format.
             | 
             | It's ugly, but when 100% of the files you interpret match
             | the one-per-line (and other clearly made-up rules), then it
             | works 100% of the time!!
        
       | gostsamo wrote:
       | I get why there are so many complaints against the top answer. At
       | the same time one should appreciate its literary qualities in
       | regard to structure and style.
        
       | darkwater wrote:
       | My 5yo daughter can already write more or less ok-ish but she has
       | big problems reading that back, especially when she makes spelling
       | mistakes.
       | 
       | I feel more or less the same when I write regular expressions.
        
       | goldsteinq wrote:
       | You can't parse HTML with regex, but PCRE is not regex. I'm not
       | sure if you can parse HTML with PCRE.
        
       | mumblemumble wrote:
       | This pops up every so often, and it sort of irritates me every
       | time. Partially because it's overly simplistic, but, even more
       | so, because, while it's cute and humorous, it's not actually very
       | good advice and it doesn't actually answer the question. No, you
       | can't _parse_ html with regex. But go look at the question. The
       | author is just trying to detect some tags. That's not _exactly_
       | parsing.
       | 
       | It's true that there are some complications around things like
       | "What if > appears in an attribute's value?" If you know your
       | input well enough, or you don't need perfection, that might be a
       | problem you can ignore. Alternatively, you can still use regex,
       | if you use a sufficiently powerful regular expression tool.
       | .NET's regular expressions, for example, have a concept of
       | balancing groups that will let you do this.
       | 
       | I would also point out that a lot of open source HTML parsing
       | libraries are even more dangerous than regular expressions for
       | parsing unknown HTML, because they use recursive descent. Where
       | you have recursion, you have the potential for a stack overflow.
       | With a regex library, you do have to be careful about
       | catastrophic backtracking, but that's at least something you can
       | usually handle in your own code, or, in the worst case, defend
       | against with timeouts.
       | 
       | A parser that's capable of blowing the call stack, and has been
       | exposed to input from the Internet, though, is capable of taking
       | down your process in a way you can't defend against in most
       | languages. And it's difficult to patch up a parser like that
       | without more-or-less rewriting it. I absolutely have had to deal
       | with html handling code getting into situations like that in the
       | past. Malicious input is real. So is plain old bad input. Reading
       | the code before you use it is often a good idea.
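As an illustration of the "detect some tags" task, the sketch below finds opening tags that are not self-closing and tolerates a `>` inside a quoted attribute value. `OPEN_TAG` and `opening_tags` are names made up here. The pattern does not understand comments, CDATA or `<script>` contents, so it is only reasonable when you know your input. The alternatives inside the attribute group are mutually exclusive on their first character, which keeps backtracking bounded.

```python
import re

OPEN_TAG = re.compile(
    r"""
    <([a-zA-Z][a-zA-Z0-9]*)      # tag name
    (?: [^<>"'/]                 # ordinary attribute characters
      | "[^"]*"                  # double-quoted value, may contain >
      | '[^']*'                  # single-quoted value, may contain >
      | /(?!>)                   # a slash that does not start "/>"
    )*
    >
    """,
    re.VERBOSE,
)


def opening_tags(html):
    """Return the names of opening tags, skipping end and self-closing tags."""
    return [m.group(1) for m in OPEN_TAG.finditer(html)]


names = opening_tags('<p><a href="x>y">link</a><br /><img src=a.png></p>')
```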
        
       | h2odragon wrote:
       | This argument is common, and this is a good answer; but so often
       | people aren't "parsing" XML but extracting a few bits of it and
       | would have benefited from less cargo cult and more thought in the
       | answer cited.
       | 
       | As it is, I've seen this article used to scare people away from
       | "can i make a game behave differently?" efforts that would have
       | been trivial to do and likely given these people a gateway "i
       | _can_ try to be a programmer" experience.
        
       | csours wrote:
       | To me, this typifies working with technology and programming.
       | Computer programs only ever look like they are working, because
       | they have not encountered problem data or conditions.
       | 
       | Aka, it works on the happy path.
       | 
       | Software engineering is how we balance how much of the unhappy
       | path and corner cases we take care of, and how we handle them,
       | imo.
        
       | indymike wrote:
       | Funny thing. Email addresses need a rant like this too. Yes, you
       | can parse 99% or so with a regex, but like HTML or XML, you
       | really need an email address parser. RFC 2822 was designed to be
       | parsed using string processing (in C no less) and requires some
       | complexity that most regexes fail on. Here's a discussion about
       | using the simpler, older RFC (822) and regex:
       | https://stackoverflow.com/questions/20771794/mailrfc822addre...
        
         | mumblemumble wrote:
         | For most purposes, if you're trying to use parsing to achieve
         | email address validation perfection, you've already lost the
         | battle.
         | 
         | A valid email address typically isn't just a syntactically
         | correct one; it's also one that can be used to get an email to
         | the recipient. The only way to test that is to send an email
         | and see if it gets to the recipient. Which is why it's much
         | more common to see some minimal client-side validation that
         | uses a simple regex that will (ideally) match all valid email
         | addresses but only catch gross syntactic errors like typing #
         | where you meant @, more for the sake of decent UX than anything
         | else, and rely on asking the user to double-type their email
         | address and sending an activation email to deal with
         | finer-grained syntactic errors and the whole universe of
         | non-syntactic errors.
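A deliberately loose check of that kind could look like the sketch below. `LOOSE_EMAIL` and `looks_like_email` are hypothetical names. The pattern only requires exactly one `@` with something non-blank on each side, so it still rejects a few exotic but technically valid addresses (for example quoted local parts containing spaces). Real validation is left to the activation email.

```python
import re

# Catch gross mistakes only: one "@", no whitespace, something on both sides.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+")


def looks_like_email(address):
    return LOOSE_EMAIL.fullmatch(address) is not None


checks = {
    "user@example.com": looks_like_email("user@example.com"),
    "user#example.com": looks_like_email("user#example.com"),
}
```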
        
       | rswskg wrote:
       | Old, but gold.
        
       | 8lall0 wrote:
       | TONY THE PONY.
        
       | llimos wrote:
       | Would be interesting to know how many up and down votes HN is
       | sending that answer.
        
         | daniel-thompson wrote:
         | The answer had 4440 upvotes and 27 downvotes at the time it was
         | locked (click on it to reveal the breakdown, if you have
         | sufficient SO "reputation").
        
         | tomashubelbauer wrote:
         | The question is locked, so it cannot be voted on.
        
       | legec wrote:
       | Where have all the comments gone?
       | 
       | (note: I mean the comments on StackOverflow, not the comments
       | here in Hacker News...)
        
       | The_rationalist wrote:
       | Aren't html code highlighters using regex? Isn't vscode using
       | TextMate regex for color highlighting?
        
         | chrismorgan wrote:
         | Most code highlighters do. They're normally close enough to
         | accurate for highlighting purposes (though they will commonly
         | have some uncommon constructs that they get wrong), but they
         | tend to fall apart when you try to use that for much more; for
         | example, indentation when you use regular expressions to parse
         | your HTML tends to start falling apart if you take what XML
         | users might consider "shortcuts" (such as omitting optional end
         | tags).
        
           | The_rationalist wrote:
           | Valid HTML allow optional end tags? For example?
        
             | lifthrasiir wrote:
             | `<script>` is a usual example: you can't self-close it, and
             | it absolutely needs to be followed by `</script>` in HTML5.
             | 
             | In general though, a self-closing tag has no effect in HTML5
             | anyway, `<script>` is just an example where the usual
             | heuristic specified by HTML5 doesn't help you at all (since
             | it switches the lexer state).
        
             | [deleted]
        
         | layer8 wrote:
         | You can usually use regexes for tokenization, which is
         | sufficient for syntax highlighting, but you generally can't use
         | regexes for parsing (nested structures).
        
       | zihotki wrote:
       | Should 2011 (when the answer was first provided) or 2009 (when
       | the question was posted) be added to the title?
        
         | usrusr wrote:
         | Clearly it should be 20(?:09|11)
         | 
         | On the other hand, the inclusion of [X] in the title is more
         | than enough to establish the historical setting.
        
       | lifthrasiir wrote:
       | I had once tweeted related quizzes. Can you guess the parse trees
       | (or reserializations) for the following HTML fragments without
       | invoking browsers? Assume that everything gets pasted right after
       | the document body.
       | 
       |     1. <a b="42>c">d
       |     2. <a/b/c=d/e>f
       |     3. <a/="42>b
       |     4. <a x=&amp0>&amp0</a>
       |     5. a<!--->b<!--+->c<!-->d
       | 
       | Really, don't try to answer and just use compliant HTML parsers.
        
         | bhaak wrote:
         | What kind of HTML parser? A SGML one or a HTML5 one?
         | 
         | I'm really sad that they didn't go with a XML base for HTML5.
        
           | lifthrasiir wrote:
           | > What kind of HTML parser? A SGML one or a HTML5 one?
           | 
           | I intended the latter. In fact I'm a bit surprised that I
           | have ever been asked for this, I thought "HTML" nowadays
           | exclusively refers to HTML5...
        
             | bhaak wrote:
             | I would have expected the former, given that one of the
             | rationales of HTML5 was "simpler parsing". I'm obviously
             | not up to date with the HTML5 parsing.
             | 
             | But why should I? Who writes HTML by hand these days?
             | 
             | The SGML heritage of HTML 4.01 and earlier led to some
             | gruesome legal constructs that look surprisingly similar to
             | your examples. Looks like every generation has to make
             | their own mistakes.
        
           | lucideer wrote:
           | > _I'm really sad that they didn't go with a XML base for
           | HTML5._
           | 
           | I'm really sad that they didn't evangelise an XML base for
           | HTML5, and that many HTML5-ish tools don't explicitly support
           | XML, but it's not strictly true that they didn't go for an
           | XML base for HTML5[0][1]
           | 
           | [0] https://html.spec.whatwg.org/multipage/xhtml.html
           | 
           | [1] https://www.w3.org/TR/html-polyglot/
        
             | bhaak wrote:
             | That's just HTML5 rewritten as a well formed XML document.
             | HTML5 does not describe well formed XML.
             | 
             | That you can put the same information of a HTML5 document
             | into a XML document doesn't help much if most of the HTML5
             | documents out there are not polyglot.
        
               | lucideer wrote:
               | It doesn't help from a client perspective, but depending
               | on your page delivery pipeline could be of potential help
               | for some from a server perspective.
        
               | chrismorgan wrote:
               | But the point is that XML syntax is still a thing,
               | supported by all browsers (and it's reasonable to expect
               | that support to remain as long as HTML remains). See also
               | https://html.spec.whatwg.org/multipage/introduction.html#
               | htm...:
               | 
               | > _When a document is transmitted with an XML MIME type,
               | such as application/xhtml+xml, then it is treated as an
               | XML document by web browsers, to be parsed by an XML
               | processor. Authors are reminded that the processing for
               | XML and HTML differs; in particular, even minor syntax
               | errors will prevent a document labeled as XML from being
               | rendered fully, whereas they would be ignored in the HTML
               | syntax._
               | 
               | HTML _does_ use an XML base (elements, attributes,
               | namespaces, _&c._), it just doesn't use an XML _parser_
               | most of the time. But the XMLness is easily observed in
               | various DOM APIs.
        
       ___________________________________________________________________
       (page generated 2021-03-05 23:02 UTC)