improve parsing whitespace after end tag names - webdump - HTML to plain-text converter for webpages
(HTM) git clone git://git.codemadness.org/webdump
---
commit 72b23084b7c64c298c6b90ae6ad9f53f497cec57
parent a0118e672fd3fa0004ccf2850eaef4ec4bc6fb39
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat, 29 Jun 2024 18:29:21 +0200
improve parsing whitespace after end tag names
Real site example:
https://www.gnupg.org/gph/en/manual.html
Has HTML such as:
<P
CLASS="COPYRIGHT"
>Copyright © 1999 by <SPAN
CLASS="HOLDER"
>The Free Software Foundation</SPAN
></P
>
...
The parser stopped at the whitespace after the end tag name, so the ">" on the
following line was incorrectly shown as data.
Reported by Jason Hood, thanks!
Diffstat:
M xml.c | 2 ++
1 file changed, 2 insertions(+), 0 deletions(-)
---
diff --git a/xml.c b/xml.c
@@ -386,6 +386,8 @@ xml_parse(XMLParser *x)
 else if (c == '>' || ISSPACE(c)) {
 	x->tag[x->taglen] = '\0';
 	if (isend) { /* end tag, starts with </ */
+		while (c != '>' && c != EOF) /* skip until > */
+			c = GETNEXT();
 		if (x->xmltagend)
 			x->xmltagend(x, x->tag, x->taglen, x->isshorttag);
 		x->tag[0] = '\0';
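
The effect of the two added lines can be illustrated outside the parser. The following is a minimal standalone sketch, not code from webdump: skip_end_tag() is a hypothetical helper that works on a string instead of the parser's GETNEXT() stream, and it assumes the input is positioned just after "</". It reads the tag name up to ">" or whitespace, then skips ahead to ">" so that nothing between the name and the ">" is treated as data.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of webdump): s points just after "</".
 * Copies the end tag name into name (truncated to fit namesiz) and
 * returns a pointer to the first character after the closing '>',
 * or to the terminating NUL if no '>' is found.
 */
static const char *
skip_end_tag(const char *s, char *name, size_t namesiz)
{
	size_t len = 0;

	/* collect the tag name up to '>' or whitespace */
	while (*s && *s != '>' && !isspace((unsigned char)*s)) {
		if (len + 1 < namesiz)
			name[len++] = *s;
		s++;
	}
	if (namesiz > 0)
		name[len] = '\0';

	/* skip until '>': mirrors the loop added in the commit */
	while (*s && *s != '>')
		s++;
	if (*s == '>')
		s++; /* consume the '>' itself */

	return s;
}
```

For the input "SPAN\n>The rest" this sketch stores "SPAN" in name and returns a pointer to "The rest". Without the second loop, the returned pointer would still be at the newline and the ">" would be left in the data, which is the behaviour the commit message describes.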