xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence - xmlparser - XML parser
 (HTM) git clone git://git.codemadness.org/xmlparser
       ---
 (DIR) commit 2e33c882b88eebdaefb0477658a9cbb79d57e2b1
 (DIR) parent 6d001c968814d93492e5925f63ede6aa94c12552
 (HTM) Author: Hiltjo Posthuma <hiltjo@codemadness.org>
       Date:   Fri, 22 Jan 2021 13:37:47 +0100
       
       xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence
       
       In sfeed a simple way to reproduce:
       
               printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8
       
       Result:
               iconv: (stdin):1:8: cannot convert
       
       Output result:
       
               printf '<item><title>&#xdc00;</title></item>' | sfeed
       
       Before:
       
       00000000  09 ed b0 80 09 09 09 09  09 09 09 0a              |............|
       0000000c
       
       After:
       
       00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
       00000010  0a                                                |.|
       00000011
       
       The entity is output as a literal string. This makes it easier to see what is
       wrong and to debug the feed, and it is consistent with the current behaviour
       for invalid named entities (&bla;). An alternative could be to output the
       Unicode replacement character (code point 0xfffd) encoded as UTF-8.
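
The two fallback options mentioned above can be compared with a minimal sketch. The helper name `invalidentityfallback` and its `usereplacement` flag are hypothetical and not part of xml.c; the sketch only assumes the caller already knows the entity could not be decoded.

```c
#include <string.h>

/* Hypothetical helper, not part of xml.c: fill buf with the text to emit
 * for an entity that could not be decoded. With usereplacement == 0 the
 * entity is copied literally (the behaviour chosen in this commit), else
 * U+FFFD is written. Returns the number of bytes written (excluding the
 * terminating NUL), or -1 if buf is too small. */
int
invalidentityfallback(const char *entity, int usereplacement,
	char *buf, size_t bufsiz)
{
	/* U+FFFD REPLACEMENT CHARACTER encoded as UTF-8 */
	const char *s = usereplacement ? "\xef\xbf\xbd" : entity;
	size_t len = strlen(s);

	if (len + 1 > bufsiz)
		return -1;
	memcpy(buf, s, len + 1); /* includes the NUL terminator */
	return (int)len;
}
```

The literal form keeps the 8 bytes of `&#xdc00;` visible in the output, which is what the "After" hexdump shows; the replacement form would hide which entity was wrong.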
       
       Reference: https://unicode.org/faq/utf_bom.html , specifically:
       
       "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
       "A: A different issue arises if an unpaired surrogate is encountered when
       converting ill-formed UTF-16 data. By representing such an unpaired surrogate
       on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
       ill-formed. While it faithfully reflects the nature of the input, Unicode
       conformance requires that encoding form conversion always results in a valid
       data stream. Therefore a converter must treat this as an error. [AF]"
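
The rule quoted above can be sketched as a UTF-8 encoder that refuses surrogate code points instead of emitting the ill-formed 3-byte form (`ed b0 80` for 0xdc00, as in the "Before" hexdump). The names `isscalarvalue` and `encodeutf8` are hypothetical; this is an illustration, not the `codepointtoutf8` function from xml.c.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if cp is a Unicode scalar value, that is
 * in the range 0..0x10ffff and outside the surrogate range 0xd800..0xdfff. */
int
isscalarvalue(long cp)
{
	return cp >= 0 && cp <= 0x10ffff && !(cp >= 0xd800 && cp <= 0xdfff);
}

/* Hypothetical encoder: writes the UTF-8 form of cp to buf (which must hold
 * at least 4 bytes) and returns its length, or 0 if cp is not a scalar
 * value, so the caller can treat it as an error. */
size_t
encodeutf8(long cp, unsigned char *buf)
{
	if (!isscalarvalue(cp))
		return 0;
	if (cp < 0x80) {
		buf[0] = cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = 0xc0 | (cp >> 6);
		buf[1] = 0x80 | (cp & 0x3f);
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = 0xe0 | (cp >> 12);
		buf[1] = 0x80 | ((cp >> 6) & 0x3f);
		buf[2] = 0x80 | (cp & 0x3f);
		return 3;
	}
	buf[0] = 0xf0 | (cp >> 18);
	buf[1] = 0x80 | ((cp >> 12) & 0x3f);
	buf[2] = 0x80 | ((cp >> 6) & 0x3f);
	buf[3] = 0x80 | (cp & 0x3f);
	return 4;
}
```

Without the `isscalarvalue` check, the 3-byte branch would encode 0xdc00 as the bytes ed b0 80, which a strict converter such as iconv rejects.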
       
       Diffstat:
         M xml.c                               |       3 ++-
       
       1 file changed, 2 insertions(+), 1 deletion(-)
       ---
 (DIR) diff --git a/xml.c b/xml.c
       @@ -287,7 +287,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
                else
                        l = strtol(e, &end, 10);
                /* invalid value or not a well-formed entity or invalid code point */
       -        if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
       +        if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
       +            (l >= 0xd800 && l <= 0xdfff))
                        return -1;
                len = codepointtoutf8(l, buf);
                buf[len] = '\0';
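
For experimenting with the check outside of xml.c, here is a minimal stand-alone sketch. The function name `parsenumericentity` and its interface are assumptions made for illustration: the input is taken to be the entity text after `&#`, for example `xdc00;` or `8364;`, and the surrogate range is taken as 0xd800 to 0xdfff.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-alone version of the validity check. s is the entity
 * text after "&#". Returns 0 and stores the code point in *cp on success,
 * or -1 if the value is malformed, out of range or a surrogate. */
int
parsenumericentity(const char *s, long *cp)
{
	char *end;
	long l;

	errno = 0;
	/* hex (16) or decimal (10) */
	if (*s == 'x')
		l = strtol(++s, &end, 16);
	else
		l = strtol(s, &end, 10);
	/* invalid value, not a well-formed entity or invalid code point */
	if (errno || s == end || *end != ';' || l < 0 || l > 0x10ffff ||
	    (l >= 0xd800 && l <= 0xdfff))
		return -1;
	*cp = l;
	return 0;
}
```

A caller that gets -1 can then fall back to printing the entity text unchanged, which matches the "After" output of the reproduction case.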