codemadness.org

       improve parsing whitespace after end tag names - webdump - HTML to plain-text converter for webpages
       git clone git://git.codemadness.org/webdump
       ---
       commit 72b23084b7c64c298c6b90ae6ad9f53f497cec57
       parent a0118e672fd3fa0004ccf2850eaef4ec4bc6fb39
       Author: Hiltjo Posthuma <hiltjo@codemadness.org>
       Date:   Sat, 29 Jun 2024 18:29:21 +0200
       
       improve parsing whitespace after end tag names
       
       Real site example:
       
               https://www.gnupg.org/gph/en/manual.html
       
       The page has HTML such as:
       
       <P
       CLASS="COPYRIGHT"
       >Copyright &copy; 1999 by <SPAN
       CLASS="HOLDER"
       >The Free Software Foundation</SPAN
       ></P
       >
       ...
       
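       Here the closing ">" of an end tag is separated from the tag name by
       whitespace (a newline). The parser stopped at that whitespace and
       incorrectly showed the ">" as data.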
       
       Reported by Jason Hood, thanks!
       
       Diffstat:
         M xml.c                               |       2 ++
       
       1 file changed, 2 insertions(+), 0 deletions(-)
       ---
       diff --git a/xml.c b/xml.c
       @@ -386,6 +386,8 @@ xml_parse(XMLParser *x)
                                                else if (c == '>' || ISSPACE(c)) {
                                                        x->tag[x->taglen] = '\0';
                                                        if (isend) { /* end tag, starts with </ */
       +                                                        while (c != '>' && c != EOF) /* skip until > */
       +                                                                c = GETNEXT();
                                                                if (x->xmltagend)
                                                                        x->xmltagend(x, x->tag, x->taglen, x->isshorttag);
                                                                x->tag[0] = '\0';
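
       The idea behind the two added lines can be sketched outside the parser.
       This is a minimal illustration, not the code from xml.c: the helper name
       skip_end_tag and its string-based interface are assumptions made for the
       example (the real parser reads characters one at a time with GETNEXT()
       and stops at EOF rather than at a NUL byte).

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of webdump.
 * s points just after "</". The tag name is copied into tag (tagsiz must
 * be at least 1) and the return value points to the first character after
 * the closing '>', or to the end of the string if there is no '>'.
 */
static const char *
skip_end_tag(const char *s, char *tag, size_t tagsiz)
{
	size_t len = 0;

	/* the tag name ends at '>' or at whitespace */
	while (*s && *s != '>' && !isspace((unsigned char)*s)) {
		if (len + 1 < tagsiz)
			tag[len++] = *s;
		s++;
	}
	tag[len] = '\0';

	/* the fix: skip whitespace (or anything else) until '>' */
	while (*s && *s != '>')
		s++;
	if (*s == '>')
		s++;

	return s;
}
```

       With the input "SPAN\n></P\n>" from the example above, the tag name
       is "SPAN" and the returned pointer is positioned at "</P\n>", so the
       ">" on the following line is consumed as part of the end tag instead
       of being treated as data.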