[HN Gopher] You can't parse [X]HTML with regex (2009)
___________________________________________________________________
You can't parse [X]HTML with regex (2009)
Author : BerislavLopac
Score : 117 points
Date : 2021-03-05 09:57 UTC (13 hours ago)
(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)
| vbezhenar wrote:
| You can't in the general case. But you can in lots of typical cases.
|
| Actually real world HTML usually can't be parsed by any strict
| parser, as it's not valid. It's just a machine-generated text
| which pretends to be similar to HTML. So extracting some bits of
| information with regexes often works.
| syrrim wrote:
| The html parser spec defines what every sequence of bytes
| should parse into. It defines certain such sequences as
| containing "errors", but it still defines exactly how they
| should be parsed. There is no invalid html. Every browser
| follows the spec, so every browser will parse the same html to
| the same thing. This is true even if the html contains
| "errors". The only checking most html receives is to make sure
| it renders correctly in a browser. If you are writing your own
| parser, you likely want it to do the same thing as every
| browser. In that case, you should use a parser that conforms to
| the spec.
| retox wrote:
| Exactly my experience on this. In a past life I've had to parse
| valid HTML that was generated by a forum system; the user
| submitted something akin to bbcode [b]this sort of thing[/b]
| that was pre-parsed and converted to valid HTML, and then I had
| to parse that fragment again after the fact.
|
| Given the constraints it's entirely possible to parse (a subset
| of) irregular grammar with regular expressions. Asking a
| question along those lines on SO would have only elicited
| responses that I/someone was doing it wrong.
|
| I won't argue that it was or wasn't the wrong thing to do, but you
| don't always get to pick your client.
| lifthrasiir wrote:
| I believe you really meant that you are frequently dealing with
| HTML whose structure is already known in advance, not the
| general HTML. Because...
|
| > [...] real world HTML usually can't be parsed by any strict
| parser [...]
|
| There is the literal standard for parsing HTML [1]. Any
| conformant implementation (and there are plenty) can of course
| parse the real world HTML by definition. Just that you don't
| always need the full HTML parser to do your job.
|
| [1] https://html.spec.whatwg.org/multipage/parsing.html
| karatinversion wrote:
| > There is the literal standard for parsing HTML [1]. Any
| conformant implementation (and there are plenty) can of
| course parse the real world HTML by definition.
|
| I believe GP was alluding to the fact that many actual
| resources that declare themselves HTML are not spec
| conformant, and thus can't be parsed by a parser that only
| accepts valid HTML.
| lifthrasiir wrote:
| The distinction between "valid" and "invalid" HTML used to
| matter once upon a time, but it no longer does at least for
| agents (authors can still benefit from error-free HTML
| because errors can distort their intents). Pretty much
| every string can be parsed to HTML since HTML5 and all
| errors are non-fatal, so many modern HTML parsers default
| to ignoring errors. There are parsers that can be configured
| to abort on any error, but I don't think the GP intended
| that.
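|
| A minimal sketch of the "errors are non-fatal" behaviour, assuming
| Python's standard-library html.parser. That module is a lenient
| tokenizer, not a conforming implementation of the WHATWG tree
| builder, so it only illustrates that malformed input does not abort
| parsing. TagCollector and collect_tags are hypothetical names.
|

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start and end tags in the order the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_tags(markup):
    """Feed markup to the parser and return the recorded tag events."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


# Unclosed <p>, mis-nested <b>/<i>, and a stray </div>: nothing raises.
events = collect_tags("<p>one<p><b><i>two</b></i></div>")
```

| Note that html.parser reports tags as written and does not repair
| the nesting; a spec-conformant parser would also build the same
| tree a browser does.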
| tannhaeuser wrote:
| True, and it's worth noting that since WHATWG HTML 5 has
| usurped HTML and taken it ad absurdum, WHATWG's parsing
| spec isn't actually useful nor representative at all of
| what people usually think HTML is. Nor do people have to
| follow WHATWG's (= bunch of Chrome developers) idea of
| HTML any more than WHATWG followed anyone else's.
| chrismorgan wrote:
| WHATWG's HTML spec is the _only_ thing that matters when
| considering what HTML is, because it's what every browser
| uses, which is the primary target of HTML.
| detaro wrote:
| WHATWG is not a "bunch of Chrome developers", and if you
| want to understand what a browser does with HTML, it's
| the place to look. "HTML, but not the HTML web browsers
| mean" is a fairly niche concern.
| tomashubelbauer wrote:
| I think this is only true for HTML5, but previous versions of
| HTML supposedly weren't specced well enough to write a
| perfect parser. Fixing this was one of the goals of the HTML5
| revision if I'm not mistaken.
| lifthrasiir wrote:
| You are correct, but I don't think they are even slightly
| relevant in 2021.
| bawolff wrote:
| Web browsers exist, therefore it's possible to parse html,
| even earlier versions.
|
| You're right that the de jure spec did not match de facto
| html, and browsers didn't necessarily agree with each
| other. But that's always true. GCC has language extensions
| that aren't part of the c spec, but you wouldn't say that c
| is impossible to parse. Old html may have taken it up to
| 11, but it's not fair to say it's impossible to parse.
| uuidgen wrote:
| No, not really. The browsers did guess a lot and did
| standard-deviating parsing because the typical uses were
| wrong and they had to work. Nobody would switch to a new
| browser that doesn't work with existing pages.
|
| Modern example - mXSS. Even though modern html has to be
| valid xml the browser will, instead of giving an error
| when served invalid html, transform what's given to make
| it standard-compliant.
| bawolff wrote:
| Modern html by definition is not valid XML, unless you
| are using the xml serialization of html5 which isn't
| really recommended and nobody does.
|
| Really no version of the official html spec was valid xml
| other than XHTML which was never particularly popular.
|
| But i don't really see your point. An implementation
| having a different idea how to parse html than you think
| is correct is not the same thing as something being
| unparsable. It's a tautology that if there exists a
| computer program to parse something then it is possible
| to parse it with a computer program.
| chrismorgan wrote:
| Perl is famously unparseable: it's impossible to
| determine a parse tree without executing the code.
|
| (HTML, however, was never unparseable, merely
| insufficiently defined.)
| anoncake wrote:
| Previous versions of HTML were based on SGML. You can write
| a perfect SGML parser, but the developers of web browsers
| couldn't be bothered.
| ricardo81 wrote:
| I think we've all (mostly?) tried it. It really is the Wild West
| of the web when you're trying to parse other people's HTML,
| though.
|
| I've played around with this parser which is extremely quick.
| https://github.com/lexbor/lexbor
| [deleted]
| Corazoor wrote:
| Well, as stated that particular answer is both right and wrong...
|
| Yes, you can not use "true" regular expressions to parse
| recursive structures.
|
| But the libraries that get used for regular expressions quite
| often include non-regular extensions (and confusingly call the
| resulting expressions still "regular").
|
| Most notably, PCRE allows for recursive patterns via "(?R)". You
| can absolutely parse arbitrary HTML with it.
|
| In fact you can parse anything with that, including binary
| formats. You just can't do it without recursively applying the
| same "regex" again and again...
|
| And precise error handling is basically impossible without
| writing a proper lexer anyway, since your regex won't (can't,
| really) tell you where it was thrown off. It either works or
| doesn't, the "why" is left to the program to figure out...
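|
| A minimal sketch of why nesting defeats a non-recursive pattern.
| Python's standard re module has no "(?R)", so instead of PCRE
| recursion this swaps in an explicit depth counter around a regex
| that only finds the tags. outer_div_content is a hypothetical
| helper, and the input is assumed to use bare <div> tags without
| attributes.
|

```python
import re

# Non-recursive attempt: stops at the first </div> it sees.
NAIVE_DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)
# Tag finder used by the depth-counting version below.
DIV_TAG = re.compile(r"<(/?)div>")


def outer_div_content(markup):
    """Return the content of the first <div>, honouring nesting via a counter."""
    depth = 0
    start = None
    for match in DIV_TAG.finditer(markup):
        if match.group(1) == "":
            # Opening tag: remember where the outermost content begins.
            if depth == 0:
                start = match.end()
            depth += 1
        elif depth > 0:
            # Closing tag: the outermost <div> ends when depth returns to 0.
            depth -= 1
            if depth == 0:
                return markup[start:match.start()]
    return None  # no balanced <div> found


nested = "<div>outer <div>inner</div> tail</div>"
naive = NAIVE_DIV.search(nested).group(1)
balanced = outer_div_content(nested)
```

| Here the naive pattern captures only "outer <div>inner", while the
| counter recovers the full content of the outer element.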
| Doctor_Fegg wrote:
| "This post is locked to prevent inappropriate edits to its
| content. The post looks exactly as it is supposed to look - there
| are no problems with its content. Please do not flag it for our
| attention."
|
| Stack Overflow can be remarkably humourless at times.
| captainmuon wrote:
| Note that you cannot even vote on it, _and_ it is marked CW, so
| it can doubly not give reputation. This reputation resentment
| makes me sad...
| kroltan wrote:
| That is a defense against the humourless people that tried to
| edit the post down into a more objective answer years after the
| fact:
|
| https://stackoverflow.com/posts/1732454/revisions?page=2
| Doctor_Fegg wrote:
| Yes, that was my point.
| kroltan wrote:
| Ah I see, I misunderstood your comment as referring to the
| quote's clinical wording. Sorry!
| imedadel wrote:
| I've seen many people complain about StackOverflow, but this is
| the best example I've ever encountered.
|
| The question: How to match <p> <a href="foo">
|
| The answers: _rants about how RegEx is not suitable for parsing
| entire HTML._
|
| Only the 5th answer starts to actually answer the question.
| ajanuary wrote:
| According to the post, the more important part of the question
| is "what do you think", to which "I think you shouldn't,
| because..." is a good answer.
| chrismorgan wrote:
| "You're asking the wrong question" is a valid response.
| anoncake wrote:
| But not a valid answer. That's what comments are for.
| tomashubelbauer wrote:
| Yes, but I wish people who like to assume and answer saying
| so would still answer the question they think is wrong.
| Context matters and I don't think you can determine that with
| certainty nearly as often as some people online like to
| think.
| shawnz wrote:
| In this case the answer is correct given the parameters of
| the question: There is no way to have a regex that only
| matches the things which OP wants to match, but not any of
| the things OP doesn't want to match.
|
| Given a specific situation, like a particular page or
| something, sure, regexes are still a possibility for
| solving the problem. The 2nd highest answer on the page
| details exactly that. So what is the problem? Is every
| single contributor obligated to artificially entertain the
| OP's preconceptions before giving the advice which they
| believe actually helps best? For example, if I were
| knowledgeable about XML but not regex, should I just not
| contribute in such a situation?
| dgellow wrote:
| Really, it depends on the context. You might be aware that's not
| something to generally do and still want to know the answer
| to the actual question.
| josefx wrote:
| Do you educate people about the complexity of the physics and
| bureaucracy involved with defining the current time every
| time someone asks you "what time is it?" Or do you avoid going
| onto irrelevant tangents that get you labeled as crazy and
| just tell them the current time?
| IgorPartola wrote:
| What time is it isn't an invalid question. "How do I make
| my hamster grow wings and fly?" is. How to parse HTML with
| a RegEx is an in-between. For a specialized case, why not?
| Answer that question, then provide a counter example to
| show how it will be very fragile, then explain the theory,
| then show a better way. IME that tends to work better to
| teach someone what you think they should know.
| jasode wrote:
| _> Do you educate people about the complexity of the
| physics and bureaucracy involved with defining the current
| time every time someone asks you "what time is it?"_
|
| Maybe you're (inadvertently) making a caricature by using a
| simple _"what time is it?"_ question but many user
| questions are _under-specified_.
|
| Because of that, Stackoverflow answerers in particular do
| go into the extra complexities because it's part of its
| _editorial DNA_ to restate the Q&A so it's a high-quality
| _community knowledgebase_ instead of just answering the
| direct question as stated. I tried to explain this hard-to-
| grasp nuance previously:
| https://news.ycombinator.com/item?id=21115438
|
| But sometimes, this X-Y problem editorializing mechanism
| gets so enthusiastic that it can detract from a correct
| answer. Here's a famous example of a string bytes
| extraction question with smart people arguing with the
| correct answers from user541686 (was Mehrdad) and Michael
| Buen:
|
| + correct answer has lots of X-Y pushback in the comments:
| https://stackoverflow.com/questions/472906/how-do-i-get-a-
| co...
|
| + another correct answer from Buen that emphasizes
| user541686/Mehrdad works for broken unpaired surrogates:
| https://stackoverflow.com/questions/472906/how-do-i-get-a-
| co...
|
| The meta layer issue is that the question is
| _underspecified_ which causes 2 sides with very intelligent
| people arguing whether or not it's an X-Y problem!
| shawnz wrote:
| I think the top answer in your example is highly
| misleading and deserves to have the caveats highlighted
| more clearly even though it's not "wrong". It is saying,
| "you don't need to worry about encoding", but really the
| point it is proving is "if you just use ONLY toCharArray
| and BlockCopy on ONLY one system and framework version
| then you can be sure they always use the same encoding as
| one another, so in that situation you don't need to
| worry".
|
| So, the solution works, but only in specific situations
| which are not clearly explained and might be totally
| unrelated from OP's situation, and furthermore it doesn't
| really address the second part of OP's question "why take
| encoding into consideration?" I wouldn't necessarily call
| the problems with that answer just "XY pushback".
| anon946 wrote:
| If the question were about full validated parsing of HTML
| with a regex, then I'd agree that "You can't do that" might
| be part of a valid answer. But finding tags is not doing a
| full validating parse.
|
| Note that the set of valid C programs is not a context-free
| language. Yet it's common to use a context-free-based
| approach to parsing. You just add additional code to handle
| the context-sensitive aspects (such as a symbol table).
| bill_mon wrote:
| I disagree, I see it as saying "you don't know what you
| really want, but I can read your mind". It's disrespectful
| and not giving the benefit of the doubt.
| lordgrenville wrote:
| Strongly disagree. The point of SO is for experts to answer
| questions. They've learned things the hard way and would
| like to help others do better. They're not being paid. As
| such, telling the questioner that their whole approach is
| wrong is appropriate and even preferable.
|
| From what I've heard Jeff Atwood and Joel Spolsky had
| different views on this and Spolsky's more tolerant, "no
| such thing as a stupid question" approach won out within
| the company, but is less popular among the people who write
| answers.
| shawnz wrote:
| I don't think it is disrespectful to suggest someone is
| falling victim to the XY problem.
|
| Actually I think it is a common and expected outcome that
| when investigating a new problem, we often get stuck in "XY
| problem" traps while researching the solution.
|
| I very much value any feedback that suggests I should
| rethink the entire problem with a simpler model, because
| without experience it's hard to know what the simplest
| models are.
| anaerobicover wrote:
| Absolutely agree. In my experience this is one of the
| more valuable features of asking someone to discuss a
| problem I'm mired in. Because they haven't been looking
| under every rock and studying the bark of every tree like
| I have, they're very likely to quickly see when I've
| wandered into entirely the wrong part of the forest.
| ivanbakel wrote:
| >It's disrespectful and not giving the benefit of the
| doubt.
|
| Unfortunately, a good number of users who post questions on
| StackOverflow have _not_ earned the benefit of the doubt.
| Browsing the site, you will occasionally come across
| questions which are the tech equivalent of asking "Which
| screwdriver is the right size to stick in this electrical
| socket?"
|
| Frame challenges are a necessary part of learning, so they
| belong on a Q&A site. If a user doesn't want their problem
| to be challenged, the onus is on them to clarify in the
| question why their particular approach is the necessary
| one. It's only possible to respond with alternative
| solutions when the problem is not specified enough.
| kemitche wrote:
| The problem is, the answers are useful to more than just
| the original questioner. Sure, the questioner may be
| doing things vastly wrong - but the people who land on
| that question's page via search may have legitimate
| reasons for doing things a certain way.
|
| The silent majority of viewers will benefit from an
| answer that does both of (1) explaining why the answer is
| probably not what is wanted, and (2) answering the
| initial question _as written_ anyway, for future viewers.
| ivanbakel wrote:
| Then those other viewers will either benefit in the same
| way from the frame challenge as a learning experience, or
| they will have a sufficiently-specific problem that they
| can ask their own questions with more justification for
| taking a specific approach.
|
| Answering the question as written has the risk that any
| solution will be blindly applied without appreciating why
| the approach itself should be avoided. This is especially
| true for those users who see SO as a "write my code"
| site, and copy-paste anything in backticks.
| jstanley wrote:
| > Which screwdriver is the right size to stick in this
| electrical socket?
|
| Note that this is a legitimate technique in UK sockets.
|
| The live and neutral pins have a little gate over them
| that is retracted when you insert the earth pin, so you
| need to first stick a screwdriver into the earth pin in
| order to get your fingers into the live pin.
| jerven wrote:
| I can't parse if this is humour or a mistake. Putting
| your fingers on the live pin is not a great idea, trying
| this to get a euro 15 plug into a UK socket, also not
| great but in a different category.
| Keyframe wrote:
| Well, there's also a mains tester screwdriver which is a
| legit tool that you stick into a socket and also
| participate in the electric current loop for the light
| on it to light up.
| bill_mon wrote:
| Good points.
|
| I'm not so sure benefit of the doubt must be earned. More
| like, any participant in a discussion forum must show it
| when answering, and do proper research before asking
| anything. If all questions are good questions, there's no
| problem. But, as you say, they really aren't. I think
| poor questions should be downvoted with a brief
| explanation instead of trying to answer the "real"
| question. Or moved to a Frame challenge forum.
|
| Are we trying to answer the question or to solve the
| problem?
| shawnz wrote:
| > do proper research before asking anything.
|
| Asking on SO is itself research. It is good to review the
| existing literature before taking contributors' time, of
| course, but if the problem is not solved in the existing
| literature, then perhaps the framing issue isn't
| addressed by the existing literature either. In that case
| how could the learner know the best way to frame the
| problem in advance?
|
| > I think poor questions should be downvoted with a brief
| explanation instead of trying to answer the "real"
| question. Or moved to a Frame challenge forum.
|
| This precludes the possibility that some contributors
| might want to address the framing problem, whereas others
| might want to address the specific question as asked.
| They may have different opinions about whether it is
| framed wrong at all. It also means the OP is losing karma
| or getting penalized for no fault of their own.
| tester34 wrote:
| >It's disrespectful and not giving the benefit of the
| doubt.
|
| So what?
|
| If actually OP knows that this is bad approach, then OP
| will clarify that he's aware and yada yada.
|
| What's the problem? lack of thick skin?
| bryanrasmussen wrote:
| unfortunately sometimes people who ask questions are really
| junior, and need to be told they are going to have an
| unpleasant surprise if they go down the path they are
| planning on going.
|
| sometimes people who ask questions know the pitfalls but
| don't clarify that they know adequately because they are
| pressed for time. in this case those people unfortunately
| run the risk of being talked down to and they should accept
| that.
|
| on the other hand if they have clarified adequately that
| they know what they're doing and they still want to do
| something that might seem weird then I agree it is
| disrespectful. Which is a thing you see often enough on
| StackOverflow to be notable.
| umanwizard wrote:
| Maybe so, but what about the non-junior person who needs
| to do something weird for an actual valid reason and
| stumbles on the refusal to answer the question years
| later? StackOverflow answers aren't just for the original
| asker.
| andybak wrote:
| I think in that case - the new person should probably
| post a new question.
|
| The point is that the original question - as framed - was
| better served by saying "if you go back a step and
| reexamine your assumptions, you'll find there is a better
| path to your intended goal".
|
| The new person has a different goal or a different set of
| constraints.
| detaro wrote:
| Because asking new questions and getting them closed as
| duplicate because they sound vaguely similar to an
| existing question is sooo helpful an experience...
| andybak wrote:
| Yeah - but I'm just playing with hypothetical ideal cases
| here. "Annoying flawed habits of Stack Overflow
| moderators" isn't something that's on my list of things
| I'm thinking about. ;-)
|
| EDIT - which got me thinking. Maybe the "correct" thing
| to do is answer the original question _as asked_ but
| gently point out to the person asking it that there is
| probably a better solution for them if only they had
| asked a different question.
|
| The original question still stands and has an answer
| useful for other people. The original questioner has the
| opportunity to learn and ask the question they should
| have asked in the first place.
|
| It's going to be annoying for someone - so it should at
| least be the person that kicked things off in the first
| place.
| andybak wrote:
| > It's disrespectful
|
| How do you respectfully tell someone you think they are
| mistaken? I'd rather not be pussyfooted around by someone
| if I'm in the role of "person who has asked a question
| based on a faulty assumption". Don't be rude but don't
| avoid trying to answer truthfully to the best of your
| ability.
| bill_mon wrote:
| > How do you respectfully tell someone you think they are
| mistaken?
|
| How about "you're mistaken"?
|
| The problem is with "You don't know what you're talking
| about, but I do, so let me answer your real question".
| andybak wrote:
| The wording used was "You're asking the wrong question"
| not "You don't know what you're talking about".
|
| I find that perfectly fine. It was slightly disingenuous
| to reword it.
| andybak wrote:
| https://en.wikipedia.org/wiki/XY_problem
|
| https://xyproblem.info/
|
| https://meta.stackexchange.com/questions/66377/what-is-
| the-x...
| mumblemumble wrote:
| Only when you know enough about the person's context to be
| able to tell them what question they should be asking
| instead.
|
| If you don't have that context, then the correct thing to do
| is to ask for more information, or say, "did you consider
| this", or find some other way to come up with a constructive
| response. You don't just assume you know what the person
| really wants to do and then try to mansplain it to them.
| EamonnMR wrote:
| I find this type of answer infinitely more palatable than "your
| question is answered here" or "comments are not for extended
| discussion, this conversation has been moved to chat"
| armada651 wrote:
| Actually, if you read his rant all the way to the end he does
| offer a helpful suggestion:
|
| > Have you tried using an XML parser instead?
| andybak wrote:
| Except he's wrong in this case. The OP could use a regex in
| this specific scenario.
| chrismorgan wrote:
| Not true. The questioner has not provided anywhere near
| enough detail to determine if regular expressions are
| sufficient. For example: should <br> match, or not? Its
| semantics are identical to <br />. To determine if regular
| expressions are enough, you would need to know exactly what
| markup you're dealing with, and that has not been provided.
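|
| A sketch of what a direct answer to the question might look like,
| together with the cases where it misfires. OPEN_TAG is a
| hypothetical pattern written for illustration, not the one from
| any of the Stack Overflow answers.
|

```python
import re

# An opening tag whose last character before ">" is not "/".
# Hypothetical pattern: it knows nothing about comments or quoted attributes.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*(?<!/)>")


def opening_tags(markup):
    """Return the names of tags matched as 'opening, not self-closing'."""
    return OPEN_TAG.findall(markup)


# Works on the kind of input shown in the question.
simple = opening_tags('<p>text <a href="foo">link</a><br /><hr/></p>')

# <br> without the slash is matched, although it means the same as <br />.
bare_br = opening_tags("<br>")

# A ">" inside an attribute value ends the match too early.
truncated = OPEN_TAG.search('<a title="a > b">').group(0)

# Tags inside comments are matched as if they were real markup.
in_comment = opening_tags("<!-- <p> -->")
```

| Whether any of those misfires matter depends on exactly the detail
| the question leaves out: what markup is actually being processed.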
| andybak wrote:
| Yeah. I guess in other parts of this discussion I'm
| arguing for always probing hidden assumptions and missing
| background whilst here I'm saying "let's interpret the
| question in the most charitable way possible".
|
| Plus - Stack Overflow is about trying to generalize any
| given question to maximize its wider usefulness.
| bhaak wrote:
| > Plus - Stack Overflow is about trying to generalize any
| given question to maximize its wider usefulness.
|
| Since when? You don't get extra points if you write stuff
| that doesn't concern OP's problem. Most SO problems don't
| get viral and get lots of upvotes from other people. From
| a game theory perspective, it doesn't make sense to add
| more to an answer than to make it the accepted one.
|
| If you have slightly different constraints you are
| encouraged to open another question. Discussions are
| frowned upon and sometimes even interrupted by admins so
| you can't discuss if your situation is different from
| OP's situation and so could warrant a different answer.
| chrismorgan wrote:
| This has always been the _intent_ of Stack Overflow, from
| its very earliest days. One of the stock reasons for
| closing a question used to be "too specific--this
| question is unlikely to help anyone else" (or words to
| that effect), though that has been removed now (I think
| because it upset too many people who took its blunt
| message the wrong way). People have always been nudged
| towards adjusting questions so that they'll be generally
| useful.
|
| Discussions on questions are routinely about unrelated or
| not-closely-related matters, and quite apart from that
| Stack Overflow wants to be a Q&A platform, not a
| discussion platform.
| commandlinefan wrote:
| https://www.aprogrammerlife.com/top-rated/stack-overflow-how...
| omginternets wrote:
| In all fairness, he does provide hints as to _why_ regex will
| not work (namely: HTML is not a regular grammar).
|
| Sure, it's somewhat obscured by the humorous rant, but it's not
| that bad an answer, either.
|
| More to the point: I'm not sure I want to suck the humor out of
| everything. I agree that SO has problems, but humor and poetry
| are worthwhile things in otherwise serious places. It's all
| about quantity.
| seiferteric wrote:
| Yes, I have been down-voted and scolded for answering a
| question as literally described simply because others would
| rather assume the ignorance of the questioner. Yes I know
| people will often ask a question due to not understanding what
| they are doing, but when 10 other people have already responded
| with "don't do it that way!" I think it can be useful to
| actually answer the question as stated (if possible).
| a1369209993 wrote:
| You can't parse _fully-general_ HTML with regex, but unless
| you're writing a web browser or something, that's not what you're
| trying to do; you're trying to parse the particular _subset_ of
| HTML that happens to be emitted by this particular website that
| you got the HTML to be parsed from. And, much like the halting
| problem or integer factorization, despite the general case being
| difficult or impossible, the overwhelming majority of _specific_
| cases are easy.
| loxias wrote:
| This never gets old.
| JackeJR wrote:
| Counter argument: Oh Yes You Can Use Regexes to Parse HTML!
|
| https://stackoverflow.com/questions/4231382/what-to-do-regul...
|
| Discussion: https://news.ycombinator.com/item?id=26357237
| lucideer wrote:
| Counter-counter-argument, one of the comments underneath that
| answer:
|
| > _what you have written is not really a regular expression
| (modern, regular, or otherwise), but rather a Perl program that
| uses regular expressions heavily. Does your post really support
| the claim that regular expressions can parse HTML correctly? Or
| is it more like evidence that Perl can parse HTML correctly?_
| harveywi wrote:
| Do the Halting Problem next.
| xg15 wrote:
| Yes, it's hopeless to try and parse arbitrary _HTML_ with regexes
| (or with anything really. I believe, the full HTML5 parsing
| algorithm is not even type-2, it's more or less turing-complete.
| In-the-wild HTML can also interact with scripting in all kinds of
| entertaining ways, to the point that a conforming HTML5 parser
| has to be able to execute javascript _while parsing_ - and the
| javascript can inject additional tokens into the parser's input
| stream. It's possible to create a HTML5 document that only
| validates on every second thursday of the month.)
|
| However, if we know the document is well-formed(!) XHTML,
| shouldn't it be possible? This would mean the document is valid
| XML and XML was specifically designed to be regex-friendly, I
| believe.
|
| At least, off the top of my head, the only gotchas that have to be
| accounted for are comments and CDATA sections - those may contain
| arbitrary unescaped text, including angle brackets. However they
| also have unambiguous start and end markers and can't be nested,
| so a regex could account for them.
|
| Attribute values should not be a problem, as angle brackets must
| be escaped inside those to be valid XML.
|
| I'm not sure about processing instructions and doctypes though.
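|
| A minimal sketch of a tokenizer along those lines, assuming
| well-formed input. XML_TOKEN and tokenize are hypothetical names.
| One assumption differs from the paragraph on attributes above: XML
| only requires "<" to be escaped in attribute values, so a raw ">"
| may appear there, and the tag pattern therefore matches quoted
| values explicitly. Processing instructions and doctypes are not
| handled.
|

```python
import re

# Alternatives are tried in order, so comments and CDATA win over tags.
XML_TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<cdata><!\[CDATA\[.*?\]\]>)"
    r"|(?P<tag></?[A-Za-z_:][\w:.-]*"
    r'''(?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?>)'''
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)


def tokenize(xml):
    """Split markup into (kind, text) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group()) for m in XML_TOKEN.finditer(xml)]


doc = '<p><!-- <b>not a tag</b> --><![CDATA[1 < 2]]><a title="x > y">ok</a></p>'
tokens = tokenize(doc)
kinds = [kind for kind, _ in tokens]
```

| This only splits the input into tokens. Checking that the tags
| nest correctly still needs a stack or counter on top of it.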
| chubot wrote:
| Yes that's basically what I wrote about here:
|
| https://news.ycombinator.com/item?id=26359556
|
| I listed 3 or 4 caveats. CDATA might be another one since I'm
| not handling those... I've never used it so I left it out.
|
| Actually I remember that regular languages weren't entirely
| enough; I think the .*? non-greedy matching was useful, e.g.
| for finding --> at the end of comments.
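|
| A small sketch of the difference non-greedy matching makes when
| stripping comments, using Python's re module. The sample document
| is made up for illustration.
|

```python
import re

# Greedy: ".*" runs to the LAST "-->" in the input.
GREEDY_COMMENT = re.compile(r"<!--.*-->", re.DOTALL)
# Non-greedy: ".*?" stops at the FIRST "-->" after each "<!--".
LAZY_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

doc = "<!-- a --><p>kept</p><!-- b -->"

greedy_result = GREEDY_COMMENT.sub("", doc)  # swallows the paragraph too
lazy_result = LAZY_COMMENT.sub("", doc)      # removes only the two comments
```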
| theandrewbailey wrote:
| On my blog, I write my posts in markdown. After it's converted
| to HTML, I use regex to search and replace images (for high-res
| and alternative formats), and get the first paragraph (for a
| 'preview'). I've been doing this for years, so the 'never use
| regex over HTML' advice isn't holding up for me.
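|
| A rough sketch of that kind of post-processing, assuming the HTML
| comes from one's own markdown converter, so <p> carries no
| attributes and src is always double-quoted. The function names and
| the switch to a .webp extension are hypothetical, chosen only to
| illustrate the idea.
|

```python
import re

# First <p>...</p> in the document; relies on <p> having no attributes.
FIRST_P = re.compile(r"<p>(.*?)</p>", re.DOTALL)
# An <img> whose double-quoted src ends in .png, .jpg or .jpeg.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)\.(?:png|jpe?g)(")')


def first_paragraph(html):
    """Return the inner HTML of the first paragraph, or None if there is none."""
    match = FIRST_P.search(html)
    return match.group(1) if match else None


def use_webp(html):
    """Rewrite image sources to point at a .webp file with the same base name."""
    return IMG_SRC.sub(r"\1\2.webp\3", html)


post = '<p>First <em>para</em>.</p>\n<p><img alt="cat" src="cat.png"></p>'
preview = first_paragraph(post)
rewritten = use_webp(post)
```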
| taylodl wrote:
| 2nd answer: _"While arbitrary HTML with only a regex is
| impossible, it's sometimes appropriate to use them for parsing a
| limited, known set of HTML."_
|
| This. So much this. Yes, you can't parse arbitrary, unknown XML
| with regex. But I don't find myself parsing arbitrary, unknown
| XML very often. Usually I know _exactly_ what I'm expecting and
| if I can't find the information I need then it's a problem. Regex
| parsing is _perfect_ for this scenario - and much, much faster. I
| created a regex parser for Java that even handles namespaces and
| relative paths. Can it parse _every_ XML file? No - you can't
| parse XML with regex. But I can parse everything I need to parse
| - and if I can't? I can always fall back on full-featured XML
| parsers.
| chubot wrote:
| All you need to parse HTML is regular expressions (to recognize
| tags) and a stack (to match tags).
|
| Your programming language has a stack -- a call stack.
|
| So in practice all you really need is regular expressions.
| (Which I tend to call "regular languages" to make a distinction
| with Perl-style regexes [1], although they work fine too in
| practice for this case)
|
| Using the call stack in a more functional style is nicer than
| using the OOP style that's in the Python standard library,
| which is probably inherited from Java, etc.
|
| I have done this with a bunch of HTML processors for the Oil
| blog and doc toolchain:
|
| https://github.com/oilshell/oil/tree/master/doctools
|
| It works well in practice and is correct and fast.
|
| Big caveat: this style is only for HTML that I generate myself,
| e.g. the blog and docs. There are a bunch of rules around
| matching tags in HTML5 that are subtle. Although one of the
| points here is that you don't have to do a full DOM-style parse
| and can ignore those rules for many useful cases.
|
| The other caveat is that HTML has a bunch of rules for what
| happens when you see a stray < or > that isn't part of a tag.
| This style makes it a hard syntax error, so it's really a
| subset of HTML (which has no syntax errors). For my purposes
| that is a feature rather than a bug, basically following
| Postel's law.
|
| I meant to write a blog post titled "why/when you can parse
| HTML with regexes" about this but didn't get around to it.
|
| There is also a caveat where parsing arbitrary name=value pairs
| with regexes isn't ergonomic, because it's hard to capture a
| variable number of pairs. However the point is that I wrote 5
| or 6 useful and compact HTML processors that don't need that.
| In practice when you parse HTML you often have a "fixed
| schema".
|
| Concrete examples are generating a TOC from <h1>, <h2>, etc.
| and syntax highlighting <pre><code> blocks. Those all work
| great with the regex + call stack style.
|
| [1] http://www.oilshell.org/blog/2020/07/eggex-theory.html
|
| edit: for completeness, another caveat is that the stack-based
| style is particularly bad for C++/Rust and arbitrary input
| because you could blow the stack, although we already limited
| the problem to "HTML generated ourselves"
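|
| A minimal sketch of the regex + call stack style described above,
| not the Oil doctools code. It assumes self-generated HTML with no
| void elements (<br>, <img>), no stray < or >, and no > inside
| attribute values. The names TOKEN_RE, lex and parse_children are
| hypothetical.

```python
import re

# Token regex: an end tag, a start tag (attributes skipped), or a run of text.
# Assumption: input is HTML we generated ourselves (see the caveats above).
TOKEN_RE = re.compile(
    r"</(?P<end>[a-zA-Z][a-zA-Z0-9]*)\s*>"
    r"|<(?P<start>[a-zA-Z][a-zA-Z0-9]*)[^<>]*>"
    r"|(?P<text>[^<>]+)"
)


class ParseError(Exception):
    """Raised on anything outside the supported subset."""


def lex(html):
    """Yield (kind, value) tokens; any unmatched character is a hard error."""
    pos = 0
    while pos < len(html):
        m = TOKEN_RE.match(html, pos)
        if m is None:
            raise ParseError(f"unexpected character at offset {pos}")
        yield m.lastgroup, m.group(m.lastgroup)
        pos = m.end()


def parse_children(tokens, open_tag=None):
    """Collect nodes until the end tag for open_tag; recursion is the stack."""
    children = []
    for kind, value in tokens:
        if kind == "text":
            children.append(value)
        elif kind == "start":
            # The nested call consumes tokens up to the matching end tag.
            children.append((value, parse_children(tokens, value)))
        else:  # kind == "end"
            if value != open_tag:
                raise ParseError(f"unexpected </{value}>")
            return children
    if open_tag is not None:
        raise ParseError(f"missing </{open_tag}>")
    return children


def parse(html):
    return parse_children(lex(html))
```

| Here parse("<p>Hello <b>world</b></p>") gives a nested list of
| (tag, children) tuples, and a mismatched or stray tag raises
| ParseError instead of being repaired the way a browser would.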
| aidenn0 wrote:
| Yes, HTML is a deterministic context-free language, so you
| can parse with a DPDA.
|
| In addition, tokens are regular (as it is for many
| languages), so you can use a regex for tokenization.
|
| All you need for writing a recursive descent parser with
| backtracking is a call-stack, so are all LL(k) and most
| practical CFGs also parseable with regex?
| chubot wrote:
| I am not sure if you can say that precisely -- I added the
| caveats about the rules for end tags, and the rules for
| stray < and >.
|
| But certainly a useful subset of HTML can be parsed with a
| DPDA. (I'd be interested in more analysis of that;
| arbitrary tags are another factor)
|
| It's a matter of opinion but I would say recursive descent
| is "nontrivial", whereas matching tags is "trivial".
|
| Recursive descent involves some choice around lookahead or
| backtracking. It can be slow if you do the wrong thing,
| hard to debug, etc. It takes a little practice, and
| correspondence with a grammar is important.
|
| Whereas matching HTML tags requires no lookahead, and I
| would say anyone can figure it out just with a simple code
| reading. Even the "inverted" OOP style is "simple", but
| annoying for me to read and correctly modify. The call
| stack reads much better.
| c-cube wrote:
| What you're describing is using regex to _lex_ html, not
| parse it. The parser is easy to build once you have a lexer
| (after all most of XML's unpleasantness comes from escaping).
| It still isn't the same thing as parsing with regular
| expressions.
|
| Similarly you can't parse S-expressions with regular
| expressions, but if you have the lexer (e.g. with `lex` or
| other languages' equivalent) then the parser on top of it
| becomes absolutely trivial.
| chubot wrote:
| Right exactly, in fact I call the library "lazylex" because
| it only does the minimum to find the <> structure.
|
| If you want to recognize more, then you invoke an attribute
| "name=value" lexer on the tag, but usually you don't. This
| makes it quite fast (and speed is useful because most doc
| toolchains are slow).
|
| The lexing is the "generic" part and parsing is specific to
| the task at hand.
|
| In the case of making an HTML table of contents, you
| literally just find <h1>, <h2> with regexes, and don't do
| all this DOM nonsense. It's easier to write than the SAX
| style with an explicit stack.
|
| So yes the point is that parsing HTML is trivial for most
| purposes: use the call stack. It helps if your language has
| exceptions to indicate errors.
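|
| The table-of-contents case could look roughly like this. It is a
| sketch only, assuming well-formed headings that are not nested
| inside each other; HEADING_RE and table_of_contents are
| hypothetical names, not part of any library mentioned here.

```python
import re

# Assumption: headings are well-formed and contain only inline markup.
HEADING_RE = re.compile(
    r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")


def table_of_contents(html):
    """Return a list of (level, title) pairs in document order."""
    toc = []
    for m in HEADING_RE.finditer(html):
        level = int(m.group(1))
        # Drop inline tags such as <code> from the heading text.
        title = TAG_RE.sub("", m.group(2)).strip()
        toc.append((level, title))
    return toc
```

| The backreference \1 ties each closing tag to the level of its
| opening tag, which is all the "parsing" this task needs.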
| polote wrote:
| When I was working on a link parser in Python for a crawler, I
| had two choices:
|
| 1. use some form of regex
|
| 2. use libxml and find links
|
| 1. was faster than 2. by a factor of two
|
| Would a link-only parser have been faster? Yeah, probably,
| but it is much more complex to do.
| benlivengood wrote:
| What was the average difference in accuracy?
|
| It's a fair tradeoff especially for a crawler where it's
| never guaranteed to reach all documents anyway.
| contravariant wrote:
| Maybe I'm confused about the problem you're solving, but
| regexp was faster than just:
|
|     from lxml import html
|     doc = html.fromstring(response.content)
|     doc.xpath('//a/@href')
|
| ?
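|
| For comparison, a sketch of the two approaches side by side. It
| uses the standard library's html.parser as a stand-in for lxml so
| that it runs without third-party packages; HREF_RE and both
| function names are hypothetical, and no timing claim is implied.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: only handles quoted href values on <a> tags.
HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL
)


def links_regex(html):
    """Extract href values with a single regex pass."""
    return [m.group(2) for m in HREF_RE.finditer(html)]


class LinkCollector(HTMLParser):
    """Collects href attributes from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def links_parser(html):
    """Extract href values with a tokenizing parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

| The regex version silently skips an unquoted value such as
| <a href=/three>, which is the kind of accuracy gap to measure
| before trading the parser away for speed.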
| xurukefi wrote:
| In my experience, a regex can be a much more robust solution
| than an XML parser depending on the use case. Back in the days
| when I did some webscraping I often had parsers throwing
| exceptions because of invalid HTML. More often than not I
| switched to regular expressions, which always worked out
| flawlessly.
| metalliqaz wrote:
| "always flawless" is a bold claim
| mumblemumble wrote:
| It's hard to imagine any web scraping being described as
| "flawless". But there's a whole lot of room for "worked
| fine for my purposes."
|
| I think that it's actually a pretty great example of a case
| where capturing data from HTML may not be best modeled as a
| parsing task. You might not need a whole parse tree just to
| match some pattern and grab an associated string. And
| skipping the parse may enable you to get useful data out of
| a file that technically can't be parsed due to syntactic
| errors. It's a fairly classic precision/recall tradeoff
| situation.
| oeiiooeieo wrote:
| "I've never lost playing Russian Roulette"
| btilly wrote:
| Yes. HTML isn't XML.
|
| As you discovered, HTML in the real world is allowed to be
| malformed in various ways, while XML is not. A compliant
| parser MUST barf on various kinds of malformed text. (Search
| for "fatal error" in https://www.w3.org/TR/REC-xml/ to
| verify.) This makes XML parsers inappropriate for HTML
| parsing.
|
| Interestingly, even perfectly valid HTML may not be valid
| XML. To see that consider <b>this <i>example</b>
| carefully</i>.
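|
| The difference is easy to reproduce with Python's standard
| library. This is a sketch: xml.etree.ElementTree stands in for
| "a compliant XML parser", and html.parser is only a lenient
| tokenizer, not a full HTML5 tree builder. The helper names are
| hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

FRAGMENT = "<b>this <i>example</b> carefully</i>"


def xml_accepts(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        # Wrap in a root element so the fragment is a single document.
        ET.fromstring(f"<root>{text}</root>")
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Records tag events without checking that they nest properly."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(text):
    logger = TagLogger()
    logger.feed(text)
    logger.close()
    return logger.events
```

| xml_accepts(FRAGMENT) is False because of the mismatched tags,
| while html_events(FRAGMENT) simply reports the four tag events
| in the order they appear.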
| dragontamer wrote:
| I found a one-pass top-down "XML parser". Not like a proper SAX
| parser, no... the XML had to be specially formatted, almost
| like TOML.
|
|     <whatever>         <--- Parser parsed this as a new section
|     <foo attrib=bar>   <--- All "XML" had to be one-per-line
|     </whatever>        <--- parser ignored this
|
| It was an "XML parser" per se, but it really was just a linear
| one-pass parser that tricked me into thinking it was XML.
|
| So really, it was more like TOML (or .INI files) than like XML.
| But I guess the advantage of making it "bastard XML" instead of
| TOML is that maybe this worked with XML-editors or something. I
| dunno...
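|
| A guess at what such a one-pass, one-element-per-line reader
| might look like. The rules here (a tag without attributes opens
| a section, closing tags are skipped, values are unquoted) are
| assumptions made for illustration, not the actual project's
| format.

```python
import re

# Hypothetical rules for a one-element-per-line, XML-looking config format.
LINE_RE = re.compile(r"<(\w+)([^<>]*)>$")
ATTR_RE = re.compile(r"(\w+)=(\S+)")


def parse_config(text):
    """Return {section: [(element, attrs), ...]} in a single linear pass."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("</"):
            continue  # closing tags carry no information in this format
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError(f"not one element per line: {raw!r}")
        name = m.group(1)
        attrs = dict(ATTR_RE.findall(m.group(2)))
        if not attrs:
            # A bare tag starts (or re-opens) a section.
            current = sections.setdefault(name, [])
        elif current is None:
            raise ValueError(f"entry outside any section: {raw!r}")
        else:
            current.append((name, attrs))
    return sections
```

| There is no stack at all, which is why it behaves more like an
| INI reader than an XML parser.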
| cbsks wrote:
| I've written an XML parser like this for a toy project. I
| passed the XML through a prettifier first so that it was in a
| standard format. It wouldn't work for every XML file, but it
| worked on the files that I needed it for.
|
| I have also had success searching through html files with
| grep after passing them through a prettifier. It's ugly, but
| 90% of the time, it works every time!
| dragontamer wrote:
| The specific project I saw that did this had ALL of its
| configuration files in this "Bastard XML" format.
|
| It's ugly, but when 100% of your files you interpret match
| the one-per-line (and other clearly made-up rules), then it
| works 100% of the time!!
| gostsamo wrote:
| I get why there are complaints against the top answer. At the same
| time one should appreciate its literary qualities in regard to
| structure and style.
| darkwater wrote:
| My 5yo daughter can already write more or less ok-ish but she has
| big problems reading that back, especially when she makes spelling
| mistakes.
|
| I feel more or less the same when I write regular expressions.
| goldsteinq wrote:
| You can't parse HTML with regex, but PCRE is not regex. I'm not
| sure if you can parse HTML with PCRE.
| mumblemumble wrote:
| This pops up every so often, and it sort of irritates me every
| time. Partially because it's overly simplistic, but, even more
| so, because, while it's cute and humorous, it's not actually very
| good advice and it doesn't actually answer the question. No, you
| can't _parse_ html with regex. But go look at the question. The
| author is just trying to detect some tags. That's not _exactly_
| parsing.
|
| It's true that there are some complications around things like
| "What if > appears in an attribute's value?" If you know your
| input well enough, or you don't need perfection, that might be a
| problem you can ignore. Alternatively, you can still use regex,
| if you use a sufficiently powerful regular expression tool.
| .NET's regular expressions, for example, have a concept of
| balancing groups that will let you do this.
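|
| The "> inside an attribute value" problem can be shown with two
| patterns. This is a sketch in Python's re module rather than
| .NET; NAIVE_TAG, QUOTED_TAG and first_tag are hypothetical names,
| and the quote-aware pattern still covers only start tags with
| properly closed quotes.

```python
import re

# Naive: a tag ends at the first '>' seen, even inside a quoted value.
NAIVE_TAG = re.compile(r"<[^>]+>")

# Quote-aware: quoted strings are consumed whole, so a '>' inside them
# does not end the tag. The three alternatives never overlap on their
# first character, which keeps backtracking under control.
QUOTED_TAG = re.compile(r"""<[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>""")

SAMPLE = '<a title="x > y" href="/p">link</a>'


def first_tag(pattern, html):
    """Return the first tag the pattern finds, or None."""
    m = pattern.search(html)
    return m.group(0) if m else None
```

| On SAMPLE the naive pattern stops at the > inside the title
| attribute, while the quote-aware one returns the whole start tag.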
|
| I would also point out that a lot of open source HTML parsing
| libraries are even more dangerous than regular expressions for
| parsing unknown HTML, because they use recursive descent. Where
| you have recursion, you have the potential for a stack overflow.
| With a regex library, you do have to be careful about
| catastrophic backtracking, but that's at least something you can
| usually handle in your own code, or, in the worst case, defend
| against with timeouts.
|
| A parser that's capable of blowing the call stack, and has been
| exposed to input from the Internet, though, is capable of taking
| down your process in a way you can't defend against in most
| languages. And it's difficult to patch up a parser like that
| without more-or-less rewriting it. I absolutely have had to deal
| with html handling code getting into situations like that in the
| past. Malicious input is real. So is plain old bad input. Reading
| the code before you use it is often a good idea.
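|
| One way to avoid the call-stack problem is to keep the stack in
| an ordinary list and cap its size. A sketch, assuming input with
| no void elements; TAG_RE, max_depth and the default limit are
| hypothetical choices, not taken from any particular library.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")


def max_depth(html, limit=10_000):
    """Return the deepest nesting level, using a list instead of recursion."""
    stack = []
    deepest = 0
    for m in TAG_RE.finditer(html):
        closing, name = m.group(1), m.group(2).lower()
        if closing:
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()
        else:
            stack.append(name)
            # Reject hostile input with an ordinary, catchable error.
            if len(stack) > limit:
                raise ValueError(f"nesting deeper than {limit}")
            deepest = max(deepest, len(stack))
    if stack:
        raise ValueError(f"unclosed <{stack[-1]}>")
    return deepest
```

| Deeply nested input then costs heap memory rather than stack
| frames, and the failure mode is an exception the caller can
| handle.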
| h2odragon wrote:
| This argument is common, and this is a good answer; but so often
| people aren't "parsing" XML but extracting a few bits of it and
| would have benefited from less cargo cult and more thought in the
| answer cited.
|
| As it is, I've seen this article used to scare people away from
| "can i make a game behave differently?" efforts that would have
| been trivial to do and likely given these people a gateway "i
| _can_ try to be a programmer" experience.
| csours wrote:
| To me, this typifies working with technology and programming.
| Computer programs only ever look like they are working, because
| they have not encountered problem data or conditions.
|
| Aka, it works on the happy path.
|
| Software engineering is how we balance how much of the unhappy
| path and corner cases we take care of, and how we handle them,
| imo.
| indymike wrote:
| Funny thing. Email addresses need a rant like this too. Yes, you
| can parse 99% or so with a regex, but like HTML or XML, you
| really need an email address parser. RFC 2822 was designed to be
| parsed using string processing (in C no less) and requires some
| complexity that most regexes fail on. Here's a discussion about
| using the simpler, older RFC (822) and regex:
| https://stackoverflow.com/questions/20771794/mailrfc822addre...
| mumblemumble wrote:
| For most purposes, if you're trying to use parsing to achieve
| email address validation perfection, you've already lost the
| battle.
|
| A valid email address typically isn't just a syntactically
| correct one; it's also one that can be used to get an email to
| the recipient. The only way to test that is to send an email
| and see if it gets to the recipient. Which is why it's much
| more common to see some minimal client-side validation that
| uses a simple regex that will (ideally) match all valid email
| addresses but only catch gross syntactic errors like typing #
| where you meant @, more for the sake of decent UX than anything
| else, and rely on asking the user to double-type their email
| address and sending an activation email to deal with
| finer-grained syntactic errors and the whole universe of
| non-syntactic errors.
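|
| A deliberately loose check of the kind described might look like
| this. It is a sketch: LOOSE_EMAIL and looks_like_email are
| hypothetical names, and the pattern will still reject a few
| exotic but valid addresses (quoted local parts, dotless domains),
| so the activation email remains the real test.

```python
import re

# Deliberately loose: something, one '@', something containing a dot,
# and no whitespace anywhere. It only catches gross typing errors.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(text):
    """Cheap UX-level check; deliverability still needs a real email."""
    return LOOSE_EMAIL.fullmatch(text.strip()) is not None
```

| It accepts user@example.com and rejects user#example.com, which
| is about as much as a regex should be asked to decide.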
| rswskg wrote:
| Old, but gold.
| 8lall0 wrote:
| TONY THE PONY.
| llimos wrote:
| Would be interesting to know how many up and down votes HN is
| sending that answer.
| daniel-thompson wrote:
| The answer had 4440 upvotes and 27 downvotes at the time it was
| locked (click on it to reveal the breakdown, if you have
| sufficient SO "reputation").
| tomashubelbauer wrote:
| The question is locked, so it cannot be voted on.
| legec wrote:
| Where have all the comments gone?
|
| (note : I mean the comments on StackOverflow, not the comments
| here in Hacker News ... )
| The_rationalist wrote:
| Aren't html code highlighters using regex? Isn't vscode using
| TextMate regex for color highlighting?
| chrismorgan wrote:
| Most code highlighters do. They're normally close enough to
| accurate for highlighting purposes (though they will commonly
| have some uncommon constructs that they get wrong), but they
| tend to fall apart when you try to use that for much more; for
| example, indentation when you use regular expressions to parse
| your HTML tends to start falling apart if you take what XML
| users might consider "shortcuts" (such as omitting optional end
| tags).
| The_rationalist wrote:
| Valid HTML allows optional end tags? For example?
| lifthrasiir wrote:
| `<script>` is a usual example that you can't self-close
| and that absolutely needs to be followed by `</script>` in HTML5.
|
| In general though a self-closing tag has no effect in HTML5
| anyway, `<script>` is just an example where the usual
| heuristic specified by HTML5 doesn't help you at all (since
| it switches the lexer state).
| [deleted]
| layer8 wrote:
| You can usually use regexes for tokenization, which is
| sufficient for syntax highlighting, but you generally can't use
| regexes for parsing (nested structures).
| zihotki wrote:
| Should 2011 (when answer was first provided) or 2009 (when
| question was posted) be added to the title?
| usrusr wrote:
| Clearly it should be 20(?:09|11)
|
| On the other hand, the inclusion of [X] in the title is more
| than enough to establish the historical setting.
| lifthrasiir wrote:
| I once tweeted some related quizzes. Can you guess the parse trees
| (or reserializations) for the following HTML fragments without
| invoking browsers? Assume that everything gets pasted right after
| the document body.
|
|     1. <a b="42>c">d
|     2. <a/b/c=d/e>f
|     3. <a/="42>b
|     4. <a x=&0>&0</a>
|     5. a<!--->b<!--+->c<!-->d
|
| Really, don't try to answer and just use compliant HTML parsers.
| bhaak wrote:
| What kind of HTML parser? A SGML one or a HTML5 one?
|
| I'm really sad that they didn't go with a XML base for HTML5.
| lifthrasiir wrote:
| > What kind of HTML parser? A SGML one or a HTML5 one?
|
| I intended the latter. In fact I'm a bit surprised that I
| have ever been asked for this, I thought "HTML" nowadays
| exclusively refers to HTML5...
| bhaak wrote:
| I would have expected the former, given that one of the
| rationales of HTML5 was "simpler parsing". I'm obviously
| not up to date with HTML5 parsing.
|
| But why should I? Who writes HTML by hand these days?
|
| The SGML heritage of HTML 4.01 and earlier led to some
| gruesome legal constructs that look surprisingly similar to
| your examples. Looks like every generation has to make
| their own mistakes.
| lucideer wrote:
| > _I'm really sad that they didn't go with a XML base for
| HTML5._
|
| I'm really sad that they didn't evangelise an XML base for
| HTML5, and that many HTML5-ish tools don't explicitly support
| XML, but it's not strictly true that they didn't go for an
| XML base for HTML5[0][1]
|
| [0] https://html.spec.whatwg.org/multipage/xhtml.html
|
| [1] https://www.w3.org/TR/html-polyglot/
| bhaak wrote:
| That's just HTML5 rewritten as a well formed XML document.
| HTML5 does not describe well formed XML.
|
| That you can put the same information of a HTML5 document
| into a XML document doesn't help much if most of the HTML5
| documents out there are not polyglot.
| lucideer wrote:
| It doesn't help from a client perspective, but depending
| on your page delivery pipeline could be of potential help
| for some from a server perspective.
| chrismorgan wrote:
| But the point is that XML syntax is still a thing,
| supported by all browsers (and it's reasonable to expect
| that support to remain as long as HTML remains). See also
| https://html.spec.whatwg.org/multipage/introduction.html#
| htm...:
|
| > _When a document is transmitted with an XML MIME type,
| such as application/xhtml+xml, then it is treated as an
| XML document by web browsers, to be parsed by an XML
| processor. Authors are reminded that the processing for
| XML and HTML differs; in particular, even minor syntax
| errors will prevent a document labeled as XML from being
| rendered fully, whereas they would be ignored in the HTML
| syntax._
|
| HTML _does_ use an XML base (elements, attributes,
| namespaces, _&c._), it just doesn't use an XML _parser_
| most of the time. But the XMLness is easily observed in
| various DOM APIs.
___________________________________________________________________
(page generated 2021-03-05 23:02 UTC)