codemadness.org

       xml.html - www.codemadness.org - www.codemadness.org saait content files
 (HTM) git clone git://git.codemadness.org/www.codemadness.org
       ---
       xml.html (13108B)
       ---
            1 <!DOCTYPE html>
            2 <html dir="ltr" lang="en">
            3 <head>
            4         <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            5         <meta http-equiv="Content-Language" content="en" />
            6         <meta name="viewport" content="width=device-width" />
            7         <meta name="keywords" content="XML, XMHELL, RSS, Atom, parser" />
            8         <meta name="description" content="xml.c and xml.h: XML parser for some of my projects" />
            9         <meta name="author" content="Hiltjo" />
           10         <meta name="generator" content="Static content generated using saait: https://codemadness.org/saait.html" />
           11         <title>xml.c and xml.h: XML parser - Codemadness</title>
           12         <link rel="stylesheet" href="style.css" type="text/css" media="screen" />
           13         <link rel="stylesheet" href="print.css" type="text/css" media="print" />
           14         <link rel="alternate" href="atom.xml" type="application/atom+xml" title="Codemadness Atom Feed" />
           15         <link rel="alternate" href="atom_content.xml" type="application/atom+xml" title="Codemadness Atom Feed with content" />
           16         <link rel="icon" href="/favicon.png" type="image/png" />
           17 </head>
           18 <body>
           19         <nav id="menuwrap">
           20                 <table id="menu" width="100%" border="0">
           21                 <tr>
           22                         <td id="links" align="left">
           23                                 <a href="index.html">Blog</a> |
           24                                 <a href="/git/" title="Git repository with some of my projects">Git</a> |
           25                                 <a href="/releases/">Releases</a> |
           26                                 <a href="gopher://codemadness.org">Gopherhole</a>
           27                         </td>
           28                         <td id="links-contact" align="right">
           29                                 <span class="hidden"> | </span>
           30                                 <a href="/donate/">Donate</a> |
           31                                 <a href="feeds.html">Feeds</a> |
           32                                 <a href="pgp.asc">PGP</a> |
           33                                 <a href="mailto:hiltjo@AT@codemadness.DOT.org">Mail</a>
           34                         </td>
           35                 </tr>
           36                 </table>
           37         </nav>
           38         <hr class="hidden" />
           39         <main id="mainwrap">
           40                 <div id="main">
           41                         <article>
           42 <header>
           43         <h1>xml.c and xml.h: XML parser</h1>
           44         <p>
           45         <strong>Last modification on </strong> <time>2023-11-20</time>
           46         </p>
           47 </header>
           48 
           49 <h2>Why</h2>
           50 <p>This XML parser was first developed for use with my RSS/Atom parser
           51 <a href="https://codemadness.org/sfeed.html">sfeed</a>.</p>
           52 <p>The first few versions of sfeed didn't have a real XML parser: they just
           53 did a simple string search for the XML tag names.</p>
           54 <p>Then I changed it to use libexpat. One of the issues I ran into is that expat
           55 parses XML in a strict mode, while some RSS/Atom feeds have quirks or small
           56 XML errors. libexpat has also had many security vulnerabilities over time. Some
           57 examples:</p>
           58 <ul>
           59 <li><a href="https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat">https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat</a></li>
           60 <li>Section "security fixes" in the Changes file:<br />  
           61 <a href="https://github.com/libexpat/libexpat/blob/master/expat/Changes">https://github.com/libexpat/libexpat/blob/master/expat/Changes</a></li>
           62 </ul>
           63 <p>Other reasons are portability and reducing the number of dependencies. I also
           64 want to fully understand what most parts of my program are doing and to
           65 keep it relatively simple.</p>
           66 <p>I've used my XML parser in my projects for some time now, and there are also
           67 many written tests. Fuzzing was used to test strange pseudo-random input data.
           68 For fuzzing the tool <a href="https://lcamtuf.coredump.cx/afl/">afl-fuzz</a> was used.</p>
           69 <p>It was tested on different platforms which have different characteristics.
           70 But of course I'm only human and it will still have bugs: please report them!</p>
           71 <h2>What is it good for?</h2>
           72 <p>... <a href="https://www.youtube.com/embed/hZJRJpbGkG4">absolutely nothing</a>.
           73 Just kidding, as mentioned some of my projects use it:</p>
           74 <ul>
           75 <li><p><a href="https://codemadness.org/sfeed.html">sfeed</a>:<br />  
           76 It is used to parse RSS/Atom for newsfeeds.<br />  </p>
           77 <p>Repository: <a href="https://git.codemadness.org/sfeed/">https://git.codemadness.org/sfeed/</a><br />  </p>
           78 </li>
           79 <li><p>osm-zipcodes:<br />  
           80 A project to extract Dutch zipcodes and addresses and their latitude,
           81 longitude from the .osm XML.
           82 The code is quite ugly and it uses mmap() as a reader and ugly hacks
           83 to improve the speed of parsing the XML.<br />  </p>
           84 <p>Repository: <a href="https://git.codemadness.org/osm-zipcodes/">https://git.codemadness.org/osm-zipcodes/</a></p>
           85 </li>
           86 <li><p>webdump:<br />  
           87 It is used to parse HTML/XHTML. It has some modifications to handle
           88 HTML and a list of the many HTML named entities was added.<br />  </p>
           89 <p>Repository: <a href="https://git.codemadness.org/webdump/">https://git.codemadness.org/webdump/</a></p>
           90 </li>
           91 <li><p><a href="https://codemadness.org/idiotbox.html">Youtube HTML parser and front-ends</a>:<br />  
           92 It is used to parse HTML and extract the relevant JSON meta-data from the
           93 page. The Youtube HTML is (intentionally by Google) crapified auto-generated
           94 HTML. I guess it is a good benchmark for the crappy webworld we live in
           95 today :)<br />  </p>
           96 <p>Repository: <a href="https://git.codemadness.org/frontends/">https://git.codemadness.org/frontends/</a> in the "youtube/"
           97 directory.<br />  
           98 Also a link to my JSON parser: <a href="https://codemadness.org/json2tsv.html">https://codemadness.org/json2tsv.html</a></p>
           99 </li>
          100 <li><p>Dutch BAG Kadaster parser (extract/subset):<br />  
          101 "Basisregistratie Adressen en Gebouwen (BAG)". Translated from Dutch to English
          102 this means something like: "Base registration of addresses and buildings".</p>
          103 <p>Parse the publicly available Dutch BAG Kadaster XML files. It is used to extract
          104 the summary of BAG information of an address: in particular the address, the
          105 allocated purpose of the address and the area size (in square meters).</p>
          106 <p>Repository: <a href="https://git.codemadness.org/bag/">https://git.codemadness.org/bag/</a></p>
          107 <p>BAG Kadaster XML data source: <a href="https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract">https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract</a></p>
          108 </li>
          109 </ul>
          110 <h2>Features</h2>
          111 <ul>
          112 <li>Relatively small parser, not that many lines of code to understand.</li>
          113 <li>Pretty simple and easy to use callback-based API.</li>
          114 <li>Pretty fast.</li>
          115 <li>Portable and with few dependencies: it can be compiled with an ANSI C89
          116 compiler and works on many platforms and compilers.</li>
          117 <li>No dynamic memory allocation.</li>
          118 <li>Suitable for low-resource environments.</li>
          119 </ul>
          120 <h2>Supports</h2>
          121 <ul>
          122 <li>Tags in short-form (&lt;img src="lolcat.jpg" title="Meow" /&gt;).</li>
          123 <li>Tag attributes.</li>
          124 <li>Short attributes without an explicitly set value (&lt;input type="checkbox" checked /&gt;).</li>
          125 <li>Comments.</li>
          126 <li>CDATA sections.</li>
          127 <li>Helper function (xml_entitytostr) to convert XML 1.0 / HTML 2.0 named
          128 entities and numeric entities to UTF-8.</li>
          129 <li>Reading XML from a file descriptor, mmap, a string buffer or a custom
          130 reader: see XMLParser.getnext or the GETNEXT() macro.
          131 The reader function can be easily customized. This function is expected to
          132 return the next byte, or EOF on end-of-file or on an error. This way you can
          133 use getchar/getchar_unlocked, mmap(), a memory buffer or read in many
          134 other ways.</li>
          135 </ul>
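<p>As an illustration of the custom reader mentioned in the list above, here is a
minimal sketch of a reader that returns bytes from a memory buffer. The names
reader_init, reader_getnext and the rd_* variables are hypothetical and not part
of xml.c or xml.h. The only assumption taken from this page is that the reader
takes no arguments and returns an int, like getchar().</p>

```c
#include <stdio.h>  /* EOF */
#include <stddef.h> /* size_t */

/* Hypothetical reader state: not part of xml.h. */
static const unsigned char *rd_buf;
static size_t rd_len, rd_off;

/* Point the reader at a memory buffer of len bytes. */
static void
reader_init(const char *buf, size_t len)
{
	rd_buf = (const unsigned char *)buf;
	rd_len = len;
	rd_off = 0;
}

/* Return the next byte as a value in the range 0..255, or EOF once the
 * end of the buffer is reached. Takes no arguments, like getchar(). */
static int
reader_getnext(void)
{
	if (rd_off >= rd_len)
		return EOF;
	return rd_buf[rd_off++];
}
```

<p>To use such a reader it would be assigned to the getnext member of the
XMLParser context before calling xml_parse(), in the same way the example
program on this page assigns getchar.</p>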
          136 <h2>Parser design decisions</h2>
          137 <ul>
          138 <li>It supports parsing a subset of XML:
          139 It is not a fully compliant XML parser.</li>
          140 <li>There is no direct support for namespaces. For example a tag "ns:sometag" is
          141 just parsed as the tag name "ns:sometag".</li>
          142 <li>There is no resolving or loading of external DTDs for parsing the XML data.
          143 This is also for security and simplicity reasons.</li>
          144 <li>Entity expansions are not parsed, and neither are DOCTYPE, ATTLIST etc.
          145 It is not allowed to define or redefine entities. This prevents XML entity
          146 bombs or "billion laughs attack":
          147 <a href="https://en.wikipedia.org/wiki/Billion_laughs_attack">https://en.wikipedia.org/wiki/Billion_laughs_attack</a> and
          148 <a href="https://en.wikipedia.org/wiki/XML_external_entity_attack">https://en.wikipedia.org/wiki/XML_external_entity_attack</a>.</li>
          149 <li>There is no character-decoding for the input. It is assumed to be UTF-8
          150 compatible. The data can be decoded or translated to UTF-8 before parsing
          151 it. For example using iconv.
          152 <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html">https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html</a>.</li>
          153 </ul>
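<p>To make the last point above concrete: the text names iconv for converting
input to UTF-8. The sketch below deliberately swaps in a much simpler
hand-written converter that only handles ISO-8859-1 (Latin-1) input, just to
show what such a translation step does before the data reaches the parser. The
function latin1_to_utf8 is hypothetical and not part of xml.c or xml.h.</p>

```c
#include <stddef.h> /* size_t */

/* Hypothetical helper, not part of xml.c: convert ISO-8859-1 bytes to UTF-8.
 * Writes at most dstsiz bytes to dst, including the terminating NUL byte.
 * Returns the number of bytes written excluding the NUL byte, or (size_t)-1
 * if dst is too small. */
size_t
latin1_to_utf8(const char *src, size_t srclen, char *dst, size_t dstsiz)
{
	size_t i, o = 0;
	unsigned char c;

	for (i = 0; i < srclen; i++) {
		c = (unsigned char)src[i];
		if (c < 0x80) {
			/* ASCII: copied as-is, keep room for the NUL byte */
			if (o + 1 >= dstsiz)
				return (size_t)-1;
			dst[o++] = (char)c;
		} else {
			/* 0x80..0xFF: encoded as a 2-byte UTF-8 sequence */
			if (o + 2 >= dstsiz)
				return (size_t)-1;
			dst[o++] = (char)(0xC0 | (c >> 6));
			dst[o++] = (char)(0x80 | (c & 0x3F));
		}
	}
	if (o >= dstsiz)
		return (size_t)-1;
	dst[o] = '\0';
	return o;
}
```

<p>For anything other than a single-byte encoding like this one, a real
conversion tool or library such as iconv is the more sensible choice.</p>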
          154 <h2>Trade-offs</h2>
          155 <p>These are trade-offs and can be considered cons:</p>
          156 <ul>
          157 <li>Performance: data is buffered even if a handler is not set. To make the
          158 parsing faster you can change this code in xml.c if necessary.</li>
          159 <li>The XML is not checked for errors: the parser will just continue parsing
          160 the XML data. This is by design.</li>
          161 <li>Internally fixed-size buffers are used. Callbacks like XMLParser.xmldata are
          162 called multiple times for the same tag if the data size is bigger than the
          163 internal buffer size (sizeof(XMLParser.data)). To differentiate between new
          164 calls for data you can use the xml*start and xml*end handlers.</li>
          165 <li>It does not handle XML white-space rules for tag data. The raw values
          166 including white-space are passed. This is useful in some cases, like for
          167 HTML &lt;pre&gt; tags.</li>
          168 <li>The XML specification has no limits on tag and attribute names. For
          169 simplicity's and sanity's sake this XML parser takes some liberties. Tag and
          170 attribute names are truncated if they are excessively long.</li>
          171 </ul>
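<p>Because the data callback can be called several times for one piece of tag
data, a program that needs the complete value has to collect the chunks itself.
The following is a minimal sketch of one way to do that using the xmldatastart,
xmldata and xmldataend handlers. The text buffer and its size are hypothetical,
and the typedef is only a stand-in so the sketch compiles on its own: a real
program would include "xml.h" instead.</p>

```c
#include <stddef.h> /* size_t */
#include <string.h> /* memcpy */

/* Stand-in so this sketch compiles on its own: in a real program
 * remove this line and include "xml.h", which defines XMLParser. */
typedef struct xmlparser XMLParser;

/* Hypothetical accumulation buffer: not part of xml.h. */
static char text[4096];
static size_t textlen;

/* Start of a run of data: reset the buffer. */
void
xmldatastart(XMLParser *x)
{
	(void)x;
	textlen = 0;
	text[0] = '\0';
}

/* Can be called several times for one run of data: append each chunk
 * and silently drop whatever does not fit in the buffer. */
void
xmldata(XMLParser *x, const char *d, size_t dl)
{
	size_t avail = sizeof(text) - 1 - textlen;

	(void)x;
	if (dl > avail)
		dl = avail;
	memcpy(text + textlen, d, dl);
	textlen += dl;
	text[textlen] = '\0';
}

/* End of the run: text now holds the collected NUL-terminated value. */
void
xmldataend(XMLParser *x)
{
	(void)x;
	/* use text and textlen here */
}
```

<p>Truncating on overflow keeps with the no-dynamic-allocation style of the
parser. A program that cannot afford to lose data could instead process each
chunk directly in the data handler.</p>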
          172 <h2>Clone</h2>
          173 <pre><code>git clone git://git.codemadness.org/xmlparser
          174 </code></pre>
          175 <h2>Browse</h2>
          176 <p>You can browse the source-code at:</p>
          177 <ul>
          178 <li><a href="https://git.codemadness.org/xmlparser/">https://git.codemadness.org/xmlparser/</a></li>
          179 <li><a href="gopher://codemadness.org/1/git/xmlparser">gopher://codemadness.org/1/git/xmlparser</a></li>
          180 </ul>
          181 <h2>Example program</h2>
          182 <p>This is from skeleton.c in the repository. It can be used as a template file
          183 to quickly create a small program that parses XML.</p>
          184 <p>From: <a href="https://git.codemadness.org/xmlparser/file/skeleton.c.html">https://git.codemadness.org/xmlparser/file/skeleton.c.html</a></p>
          185 <pre><code>#include &lt;stdio.h&gt;
          186 
          187 #include "xml.h"
          188 
          189 void
          190 xmlattr(XMLParser *x, const char *t, size_t tl,
          191         const char *a, size_t al, const char *v, size_t vl)
          192 {
          193 }
          194 
          195 void
          196 xmlattrentity(XMLParser *x, const char *t, size_t tl,
          197               const char *a, size_t al, const char *v, size_t vl)
          198 {
          199         char buf[16];
          200         int len;
          201 
          202         /* try to translate entity, else just pass as data to
          203          * xmlattr handler. */
          204         if ((len = xml_entitytostr(v, buf, sizeof(buf))) &gt; 0)
          205                 xmlattr(x, t, tl, a, al, buf, (size_t)len);
          206         else
          207                 xmlattr(x, t, tl, a, al, v, vl);
          208 }
          209 
          210 void
          211 xmlattrend(XMLParser *x, const char *t, size_t tl,
          212            const char *a, size_t al)
          213 {
          214 }
          215 
          216 void
          217 xmlattrstart(XMLParser *x, const char *t, size_t tl,
          218              const char *a, size_t al)
          219 {
          220 }
          221 
          222 void
          223 xmlcdatastart(XMLParser *x)
          224 {
          225 }
          226 
          227 void
          228 xmlcdata(XMLParser *x, const char *d, size_t dl)
          229 {
          230 }
          231 
          232 void
          233 xmlcdataend(XMLParser *x)
          234 {
          235 }
          236 
          237 void
          238 xmlcommentstart(XMLParser *x)
          239 {
          240 }
          241 
          242 void
          243 xmlcomment(XMLParser *x, const char *c, size_t cl)
          244 {
          245 }
          246 
          247 void
          248 xmlcommentend(XMLParser *x)
          249 {
          250 }
          251 
          252 void
          253 xmldata(XMLParser *x, const char *d, size_t dl)
          254 {
          255 }
          256 
          257 void
          258 xmldataend(XMLParser *x)
          259 {
          260 }
          261 
          262 void
          263 xmldataentity(XMLParser *x, const char *d, size_t dl)
          264 {
          265         char buf[16];
          266         int len;
          267 
          268         /* try to translate entity, else just pass as data to
          269          * xmldata handler. */
          270         if ((len = xml_entitytostr(d, buf, sizeof(buf))) &gt; 0)
          271                 xmldata(x, buf, (size_t)len);
          272         else
          273                 xmldata(x, d, dl);
          274 }
          275 
          276 void
          277 xmldatastart(XMLParser *x)
          278 {
          279 }
          280 
          281 void
          282 xmltagend(XMLParser *x, const char *t, size_t tl, int isshort)
          283 {
          284 }
          285 
          286 void
          287 xmltagstart(XMLParser *x, const char *t, size_t tl)
          288 {
          289 }
          290 
          291 void
          292 xmltagstartparsed(XMLParser *x, const char *t, size_t tl, int isshort)
          293 {
          294 }
          295 
          296 int
          297 main(void)
          298 {
          299         XMLParser x = { 0 };
          300 
          301         x.xmlattr = xmlattr;
          302         x.xmlattrend = xmlattrend;
          303         x.xmlattrstart = xmlattrstart;
          304         x.xmlattrentity = xmlattrentity;
          305         x.xmlcdatastart = xmlcdatastart;
          306         x.xmlcdata = xmlcdata;
          307         x.xmlcdataend = xmlcdataend;
          308         x.xmlcommentstart = xmlcommentstart;
          309         x.xmlcomment = xmlcomment;
          310         x.xmlcommentend = xmlcommentend;
          311         x.xmldata = xmldata;
          312         x.xmldataend = xmldataend;
          313         x.xmldataentity = xmldataentity;
          314         x.xmldatastart = xmldatastart;
          315         x.xmltagend = xmltagend;
          316         x.xmltagstart = xmltagstart;
          317         x.xmltagstartparsed = xmltagstartparsed;
          318 
          319         x.getnext = getchar;
          320 
          321         xml_parse(&amp;x);
          322 
          323         return 0;
          324 }
          325 </code></pre>
          326 <p>As you can see the important functions of the parser itself are xml_parse()
          327 and xml_entitytostr().</p>
          328 <p>XMLParser is a structure of the context and it contains pointers to the
          329 callback functions.</p>
          330 <p>This is a verbose example. All the callbacks that are unused could be removed.
          331 If the callback is set to NULL then it is unused.</p>
          332 <h1>References</h1>
          333 <ul>
          334 <li>AFL (American fuzzy lop): afl-fuzz: <a href="https://lcamtuf.coredump.cx/afl/">https://lcamtuf.coredump.cx/afl/</a>.</li>
          335 <li>iconv: character-set conversion.</li>
          336 <li>sfeed: <a href="https://codemadness.org/sfeed.html">https://codemadness.org/sfeed.html</a>.</li>
          337 <li>libexpat XML parser: <a href="https://github.com/libexpat/libexpat/">https://github.com/libexpat/libexpat/</a>.</li>
          338 <li>libxml2 XML parser: <a href="https://github.com/GNOME/libxml2">https://github.com/GNOME/libxml2</a>.</li>
          339 </ul>
          340 <h1>End</h1>
          341 <p>I hope this write-up is useful or xml.{c,h} can be useful in your project.</p>
          342 
          343                         </article>
          344                 </div>
          345         </main>
          346 </body>
          347 </html>