## Why

This XML parser was first developed for use with my RSS/Atom parser
[sfeed](https://codemadness.org/sfeed.html).

The first few versions of sfeed didn't have a real XML parser: they just
did a simple string search for the XML tag names.

Then I changed it to use libexpat. One of the issues I ran into with expat is
that it parses XML in a strict mode, while some RSS/Atom feeds have quirks or
small XML errors. libexpat has also had many security vulnerabilities over
time. Some examples:

* <https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat>
* Section "security fixes" in the Changes file:  
  <https://github.com/libexpat/libexpat/blob/master/expat/Changes>

Other reasons are portability and reducing the number of dependencies. I also
want to fully understand what most of the parts of my program are doing and to
keep it relatively simple.

I've used my XML parser for some time now in my projects, and there are also
many written tests. Fuzzing was used to test strange pseudo-random input data.
For fuzzing the tool [afl-fuzz](https://lcamtuf.coredump.cx/afl/) was used.

It was tested on different platforms which have different characteristics.
But of course I'm only human and it will still have bugs: please report them!


## What is it good for?

... [absolutely nothing](https://www.youtube.com/embed/hZJRJpbGkG4).
Just kidding, as mentioned some of my projects use it:

* [sfeed](https://codemadness.org/sfeed.html):  
  It is used to parse RSS/Atom for newsfeeds.  

  Repository: <https://git.codemadness.org/sfeed/>  

* osm-zipcodes:  
  A project to extract Dutch zipcodes and addresses and their latitude and
  longitude from the .osm XML.
  The code is quite ugly and it uses mmap() as a reader and ugly hacks
  to improve the speed of parsing the XML.  

  Repository: <https://git.codemadness.org/osm-zipcodes/>

* webdump:  
  It is used to parse HTML/XHTML. It has some modifications to handle
  HTML and a list of the many HTML named entities was added.  

  Repository: <https://git.codemadness.org/webdump/>

* [YouTube HTML parser and front-ends](https://codemadness.org/idiotbox.html):  
  It is used to parse HTML and extract the relevant JSON meta-data from the
  page. The YouTube HTML is (intentionally by Google) crapified auto-generated
  HTML. I guess it is a good benchmark for the crappy webworld we live in
  today :)  

  Repository: <https://git.codemadness.org/frontends/> in the "youtube/"
  directory.  
  Also a link to my JSON parser: <https://codemadness.org/json2tsv.html>

* Dutch BAG Kadaster parser (extract/subset):  
  "Basisregistratie Adressen en Gebouwen (BAG)". Translated from Dutch to English
  this means something like: "Base registration of addresses and buildings".

  It parses the publicly available Dutch BAG Kadaster XML files. It is used to
  extract the summary of BAG information of an address. In particular the
  address, the allocated purpose of the address and the area size (in square
  meters).

  Repository: <https://git.codemadness.org/bag/>

  BAG Kadaster XML data source: <https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract>


## Features

* Relatively small parser, not that many lines of code to understand.
* Pretty simple and easy to use callback-based API.
* Pretty fast.
* Portable and with few dependencies: it can be compiled with an ANSI C89
  compiler and works on many platforms and compilers.
* No dynamic memory allocation.
* Suitable for low-resource environments.


## Supports

* Tags in short-form (<img src="lolcat.jpg" title="Meow" />).
* Tag attributes.
* Short attributes without an explicitly set value (<input type="checkbox" checked />).
* Comments.
* CDATA sections.
* Helper function (xml\_entitytostr) to convert XML 1.0 / HTML 2.0 named
  entities and numeric entities to UTF-8.
* Reading XML from a file descriptor, mmap, a string buffer or a custom
  reader: see XMLParser.getnext or the GETNEXT() macro.
  The reader function can be easily customized. It is expected to return the
  next byte, or EOF on end-of-file or an error. This way you can use
  getchar/getchar\_unlocked, mmap(), a memory buffer or read in many
  other ways.
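
A custom reader can be as small as the sketch below, which serves bytes from a
memory buffer. It assumes, going by the `x.getnext = getchar` line in the
example program, that the reader has the same shape as getchar(): no arguments,
returning the next byte as an int or EOF. The names reader\_set and
reader\_getnext and the file-scope state are made up for illustration and are
not part of xml.c.

```c
#include <stddef.h>
#include <stdio.h> /* EOF */

/* Hypothetical reader state. The reader takes no arguments (like
 * getchar), so the buffer and the read offset are kept at file scope. */
static const unsigned char *rd_buf;
static size_t rd_len, rd_off;

/* Point the reader at a memory buffer of len bytes (hypothetical helper). */
void
reader_set(const void *buf, size_t len)
{
	rd_buf = buf;
	rd_len = len;
	rd_off = 0;
}

/* Return the next byte as a value in the range 0..255, or EOF once the
 * buffer is exhausted. Reading through unsigned char makes sure bytes
 * >= 0x80 can never be mistaken for EOF. */
int
reader_getnext(void)
{
	if (rd_off >= rd_len)
		return EOF;
	return rd_buf[rd_off++];
}
```

With the parser this would presumably be hooked up as
`x.getnext = reader_getnext;` in place of getchar, after calling reader\_set()
on the data to parse.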


## Parser design decisions

* It supports parsing a subset of XML:
  It is not a fully compliant XML parser.
* There is no direct support for namespaces. For example a tag "ns:sometag" is
  just parsed as the tag name "ns:sometag".
* There is no resolving or loading of external DTDs for parsing the XML data.
  This is also for security and simplicity reasons.
* Entity expansions are not parsed, and neither are DOCTYPE, ATTLIST etc.
  It is not allowed to define or redefine entities. This prevents XML entity
  bombs (the "billion laughs attack":
  <https://en.wikipedia.org/wiki/Billion_laughs_attack>) and XML external
  entity attacks (<https://en.wikipedia.org/wiki/XML_external_entity_attack>).
* There is no character-decoding for the input. It is assumed to be UTF-8
  compatible. The data can be decoded or translated to UTF-8 before parsing
  it, for example using iconv:
  <https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html>.
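
Since a prefixed tag is passed to the callbacks as one name, a program that
cares about the local part can split the name at the colon itself. The helper
below is a hypothetical sketch (localname is not part of the parser). It takes
the name and its length in the same form as the t and tl callback arguments
and does not try to resolve the prefix to a namespace URI.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the local part of a tag name such as
 * "ns:sometag". Returns a pointer into t and stores the length of the
 * local part in *ll. A name without a colon is returned unchanged. */
const char *
localname(const char *t, size_t tl, size_t *ll)
{
	const char *p = memchr(t, ':', tl);

	if (p == NULL) {
		*ll = tl;
		return t;
	}
	p++; /* skip the colon itself */
	*ll = tl - (size_t)(p - t);
	return p;
}
```

In a tag callback it could be called as `localname(t, tl, &ll)`, after which
the returned pointer and ll are compared against the tag names of interest.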


## Trade-offs

These are trade-offs and can be considered cons:

* Performance: data is buffered even if a handler is not set. To make
  parsing faster you can change this code in xml.c if necessary.
* The XML is not checked for errors, so the parser will continue parsing the
  XML data. This is by design.
* Internally fixed-size buffers are used. Callbacks like XMLParser.xmldata are
  called multiple times for the same tag if the data size is bigger than the
  internal buffer size (sizeof(XMLParser.data)). To differentiate between new
  calls for data you can use the xml\*start and xml\*end handlers.
* It does not handle XML white-space rules for tag data. The raw value
  including white-space is passed. This is useful in some cases, like for
  HTML <pre> tags.
* The XML specification has no limits on tag and attribute names. For
  simplicity and sanity's sake this XML parser takes some liberties. Tag and
  attribute names are truncated if they are excessively long.
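
The multiple-calls behaviour can be handled by collecting the pieces between
the start and end handlers. The sketch below does this with a fixed-size
buffer, in keeping with the no-allocation approach. The handler signatures are
copied from the example program. The buffer name, its size and the choice to
silently drop data that does not fit are assumptions for illustration, and the
XMLParser typedef is only a stand-in so the sketch compiles without xml.h.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in so this sketch compiles on its own. With the real parser,
 * include "xml.h" instead and remove this line. */
typedef struct xmlparser XMLParser;

/* Hypothetical accumulator for one run of tag data. */
static char text[256];
static size_t textlen;

/* Start of a data run: reset the accumulator. */
void
xmldatastart(XMLParser *x)
{
	(void)x;
	textlen = 0;
	text[0] = '\0';
}

/* May be called several times per run: append the piece and keep the
 * buffer NUL-terminated. Data that does not fit is dropped. */
void
xmldata(XMLParser *x, const char *d, size_t dl)
{
	size_t room = sizeof(text) - 1 - textlen;

	(void)x;
	if (dl > room)
		dl = room;
	memcpy(text + textlen, d, dl);
	textlen += dl;
	text[textlen] = '\0';
}

/* End of the run: text now holds all the pieces joined together. */
void
xmldataend(XMLParser *x)
{
	(void)x;
}
```

Any processing of the complete value would then go in xmldataend, where text
and textlen describe the whole run rather than a single piece.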


## Clone

        git clone git://git.codemadness.org/xmlparser


## Browse

You can browse the source-code at:

* <https://git.codemadness.org/xmlparser/>
* <gopher://codemadness.org/1/git/xmlparser>


## Example program

This is from skeleton.c in the repository. It can be used as a template file
to quickly create a small program that parses XML.

From: <https://git.codemadness.org/xmlparser/file/skeleton.c.html>

        #include <stdio.h>

        #include "xml.h"

        void
        xmlattr(XMLParser *x, const char *t, size_t tl,
                const char *a, size_t al, const char *v, size_t vl)
        {
        }

        void
        xmlattrentity(XMLParser *x, const char *t, size_t tl,
                      const char *a, size_t al, const char *v, size_t vl)
        {
                char buf[16];
                int len;

                /* try to translate entity, else just pass as data to
                 * xmlattr handler. */
                if ((len = xml_entitytostr(v, buf, sizeof(buf))) > 0)
                        xmlattr(x, t, tl, a, al, buf, (size_t)len);
                else
                        xmlattr(x, t, tl, a, al, v, vl);
        }

        void
        xmlattrend(XMLParser *x, const char *t, size_t tl,
                   const char *a, size_t al)
        {
        }

        void
        xmlattrstart(XMLParser *x, const char *t, size_t tl,
                     const char *a, size_t al)
        {
        }

        void
        xmlcdatastart(XMLParser *x)
        {
        }

        void
        xmlcdata(XMLParser *x, const char *d, size_t dl)
        {
        }

        void
        xmlcdataend(XMLParser *x)
        {
        }

        void
        xmlcommentstart(XMLParser *x)
        {
        }

        void
        xmlcomment(XMLParser *x, const char *c, size_t cl)
        {
        }

        void
        xmlcommentend(XMLParser *x)
        {
        }

        void
        xmldata(XMLParser *x, const char *d, size_t dl)
        {
        }

        void
        xmldataend(XMLParser *x)
        {
        }

        void
        xmldataentity(XMLParser *x, const char *d, size_t dl)
        {
                char buf[16];
                int len;

                /* try to translate entity, else just pass as data to
                 * xmldata handler. */
                if ((len = xml_entitytostr(d, buf, sizeof(buf))) > 0)
                        xmldata(x, buf, (size_t)len);
                else
                        xmldata(x, d, dl);
        }

        void
        xmldatastart(XMLParser *x)
        {
        }

        void
        xmltagend(XMLParser *x, const char *t, size_t tl, int isshort)
        {
        }

        void
        xmltagstart(XMLParser *x, const char *t, size_t tl)
        {
        }

        void
        xmltagstartparsed(XMLParser *x, const char *t, size_t tl, int isshort)
        {
        }

        int
        main(void)
        {
                XMLParser x = { 0 };

                x.xmlattr = xmlattr;
                x.xmlattrend = xmlattrend;
                x.xmlattrstart = xmlattrstart;
                x.xmlattrentity = xmlattrentity;
                x.xmlcdatastart = xmlcdatastart;
                x.xmlcdata = xmlcdata;
                x.xmlcdataend = xmlcdataend;
                x.xmlcommentstart = xmlcommentstart;
                x.xmlcomment = xmlcomment;
                x.xmlcommentend = xmlcommentend;
                x.xmldata = xmldata;
                x.xmldataend = xmldataend;
                x.xmldataentity = xmldataentity;
                x.xmldatastart = xmldatastart;
                x.xmltagend = xmltagend;
                x.xmltagstart = xmltagstart;
                x.xmltagstartparsed = xmltagstartparsed;

                x.getnext = getchar;

                xml_parse(&x);

                return 0;
        }

As you can see the important functions of the parser itself are xml\_parse()
and xml\_entitytostr().
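
To give an idea of what converting a numeric entity involves: once the number
in the entity has been parsed, the resulting code point has to be encoded as
UTF-8. The function below is an independent sketch of that encoding step only.
It is not the code from xml.c, and codepoint\_to\_utf8 is a made-up name.

```c
#include <stddef.h>

/* Encode the Unicode code point cp as UTF-8 into buf, which must have
 * room for at least 4 bytes. Returns the number of bytes written, or 0
 * if cp is a surrogate half or lies above U+10FFFF. The output is not
 * NUL-terminated. */
size_t
codepoint_to_utf8(unsigned long cp, char *buf)
{
	if (cp < 0x80) { /* 1 byte: plain ASCII */
		buf[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) { /* 2 bytes: 110xxxxx 10xxxxxx */
		buf[0] = (char)(0xC0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) { /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0; /* surrogate halves are not valid */
		buf[0] = (char)(0xE0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) { /* 4 bytes: 11110xxx and three 10xxxxxx */
		buf[0] = (char)(0xF0 | (cp >> 18));
		buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

A UTF-8 sequence is at most 4 bytes long, so a result like this fits easily in
a small stack buffer such as the 16-byte buf used by the entity handlers in
the example.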

XMLParser is the context structure. It contains pointers to the callback
functions.

This is a verbose example. All the callbacks that are unused could be removed:
a callback that is set to NULL is simply not used.


# References

* AFL (American fuzzy lop): afl-fuzz: <https://lcamtuf.coredump.cx/afl/>.
* iconv: character-set conversion:
  <https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html>.
* sfeed: <https://codemadness.org/sfeed.html>.
* libexpat XML parser: <https://github.com/libexpat/libexpat/>.
* libxml2 XML parser: <https://github.com/GNOME/libxml2>.


# End

I hope this write-up or xml.{c,h} can be useful in your project.