            4 i# xml.c and xml.h: XML parser                codemadness.org        70
            5 i                codemadness.org        70
            6 iLast modification on 2023-11-20                codemadness.org        70
            7 i                codemadness.org        70
            8 i## Why                codemadness.org        70
            9 i                codemadness.org        70
           10 iThis XML parser was first developed for use with my RSS/Atom parser                codemadness.org        70
           11 hsfeed.        URL:https://codemadness.org/sfeed.html        codemadness.org        70
           12 i                codemadness.org        70
            13 iThe first few versions of sfeed had no real XML parser: they just did a                codemadness.org        70
            14 isimple string search for the XML tag names.                codemadness.org        70
           15 i                codemadness.org        70
            16 iThen I changed it to use libexpat. One of the issues I ran into with expat is                codemadness.org        70
            17 ithat it parses XML in a strict mode, while some RSS/Atom feeds have quirks or                codemadness.org        70
            18 ismall XML errors. libexpat has also had many security vulnerabilities over                codemadness.org        70
            19 itime. Some examples:                codemadness.org        70
           20 i                codemadness.org        70
           21 h* https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat        URL:https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat        codemadness.org        70
           22 i* Section "security fixes" in the Changes file:                  codemadness.org        70
           23 h  https://github.com/libexpat/libexpat/blob/master/expat/Changes        URL:https://github.com/libexpat/libexpat/blob/master/expat/Changes        codemadness.org        70
           24 i                codemadness.org        70
            25 iOther reasons are portability and reducing the number of dependencies.  I also                codemadness.org        70
            26 iwant to fully understand what most of the parts of my program are doing and to                codemadness.org        70
            27 ikeep it relatively simple.                codemadness.org        70
           28 i                codemadness.org        70
            29 iI've used my XML parser in my projects for some time now, and there are also                codemadness.org        70
            30 imany written tests.  Fuzzing was used to test strange pseudo-random input data.                codemadness.org        70
           31 hFor fuzzing the tool »afl-fuzz« was used.        URL:https://lcamtuf.coredump.cx/afl/        codemadness.org        70
           32 i                codemadness.org        70
           33 iIt was tested on different platforms which have different characteristics.                codemadness.org        70
           34 iBut of course I'm only human and it will still have bugs: please report them!                codemadness.org        70
           35 i                codemadness.org        70
           36 i                codemadness.org        70
           37 i## What is it good for?                codemadness.org        70
           38 i                codemadness.org        70
           39 h... absolutely nothing.        URL:https://www.youtube.com/embed/hZJRJpbGkG4        codemadness.org        70
           40 iJust kidding, as mentioned some of my projects use it:                codemadness.org        70
           41 i                codemadness.org        70
           42 h* sfeed:          URL:https://codemadness.org/sfeed.html        codemadness.org        70
           43 i  It is used to parse RSS/Atom for newsfeeds.                  codemadness.org        70
           44 i                  codemadness.org        70
           45 h  Repository: »https://git.codemadness.org/sfeed/«          URL:https://git.codemadness.org/sfeed/        codemadness.org        70
           46 i                  codemadness.org        70
           47 i* osm-zipcodes:                  codemadness.org        70
           48 i  A project to extract Dutch zipcodes and addresses and their latitude,                codemadness.org        70
           49 i  longitude from the .osm XML.                codemadness.org        70
           50 i  The code is quite ugly and it uses mmap() as a reader and ugly hacks                codemadness.org        70
           51 i  to improve the speed of parsing the XML.                  codemadness.org        70
           52 i                  codemadness.org        70
           53 h  Repository: »https://git.codemadness.org/osm-zipcodes/«        URL:https://git.codemadness.org/osm-zipcodes/        codemadness.org        70
           54 i                  codemadness.org        70
           55 i* webdump:                  codemadness.org        70
            56 i  It is used to parse HTML/XHTML. It has some modifications to handle                codemadness.org        70
            57 i  HTML and a list of the many HTML named entities was added.                  codemadness.org        70
           58 i                          codemadness.org        70
           59 h  Repository: »https://git.codemadness.org/webdump/«        URL:https://git.codemadness.org/webdump/        codemadness.org        70
           60 i                codemadness.org        70
           61 h* Youtube HTML parser and front-ends:          URL:https://codemadness.org/idiotbox.html        codemadness.org        70
           62 i  It is used to parse HTML and extract the relevant JSON meta-data from the                codemadness.org        70
           63 i  page. The Youtube HTML is (intentionally by Google) crapified auto-generated                codemadness.org        70
           64 i  HTML. I guess it is a good benchmark for the crappy webworld we live in                codemadness.org        70
           65 i  today :)                  codemadness.org        70
           66 i                  codemadness.org        70
           67 h  Repository: »https://git.codemadness.org/frontends/« in the "youtube/"        URL:https://git.codemadness.org/frontends/        codemadness.org        70
           68 i  directory.                  codemadness.org        70
           69 h  Also a link to my JSON parser: »https://codemadness.org/json2tsv.html«        URL:https://codemadness.org/json2tsv.html        codemadness.org        70
           70 i                codemadness.org        70
           71 i* Dutch BAG Kadaster parser (extract/subset):                  codemadness.org        70
           72 i  "Basisregistratie Adressen en Gebouwen (BAG)". Translated from Dutch to English                codemadness.org        70
            73 i  this means something like: "Base registration of addresses and buildings".                codemadness.org        70
           74 i                  codemadness.org        70
            75 i  Parse the publicly available Dutch BAG Kadaster XML files. It is used to extract                codemadness.org        70
            76 i  the summary of BAG information of an address.  In particular the address,                codemadness.org        70
            77 i  allocated purpose of the address and the area size (in square meters).                codemadness.org        70
           78 i                          codemadness.org        70
           79 h  Repository: »https://git.codemadness.org/bag/«        URL:https://git.codemadness.org/bag/        codemadness.org        70
           80 i                  codemadness.org        70
           81 h  BAG Kadaster XML data source: »https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract«        URL:https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract        codemadness.org        70
           82 i                codemadness.org        70
           83 i                codemadness.org        70
           84 i## Features                codemadness.org        70
           85 i                codemadness.org        70
           86 i* Relatively small parser, not that many lines of code to understand.                codemadness.org        70
           87 i* Pretty simple and easy to use callback-based API.                codemadness.org        70
           88 i* Pretty fast.                codemadness.org        70
           89 i* Portable and with few dependencies: it can be compiled with an ANSI C89                codemadness.org        70
           90 i  compiler and works on many platforms and compilers.                codemadness.org        70
           91 i* No dynamic memory allocation.                codemadness.org        70
           92 i* Suitable for low-resource environments.                codemadness.org        70
           93 i                codemadness.org        70
           94 i                codemadness.org        70
           95 i## Supports                codemadness.org        70
           96 i                codemadness.org        70
           97 i* Tags in short-form (<img src="lolcat.jpg" title="Meow" />).                codemadness.org        70
           98 i* Tag attributes.                codemadness.org        70
           99 i* Short attributes without an explicitly set value (<input type="checkbox" checked />).                codemadness.org        70
           100 i* Comments.                codemadness.org        70
          101 i* CDATA sections.                codemadness.org        70
          102 i* Helper function (xml_entitytostr) to convert XML 1.0 / HTML 2.0 named                codemadness.org        70
          103 i  entities and numeric entities to UTF-8.                codemadness.org        70
           104 i* Reading XML from a file descriptor, mmap, string buffer or a custom reader:                codemadness.org        70
           105 i  see XMLParser.getnext or the GETNEXT() macro.                codemadness.org        70
           106 i  The reader function can be easily customized. This function is expected to                codemadness.org        70
           107 i  return the next byte, or EOF on end of input or on an error. This way you can                codemadness.org        70
           108 i  use getchar/getchar_unlocked, mmap(), a memory buffer or read in many                codemadness.org        70
           109 i  other ways.                codemadness.org        70
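
As a minimal sketch of such a custom reader, the following reads bytes from a memory buffer. It assumes the reader has the same shape as getchar(), which is what the example program assigns to x.getnext. The names reader_set, reader_getnext and the rd_* variables are made up for illustration and are not part of xml.h.

```c
#include <stdio.h>   /* EOF */
#include <stddef.h>  /* size_t */

/* Hypothetical reader state: the buffer to parse and the read position.
 * These names are illustrative and not part of xml.h. */
static const unsigned char *rd_buf;
static size_t rd_len, rd_off;

/* Point the reader at a new buffer of len bytes. */
void
reader_set(const char *s, size_t len)
{
	rd_buf = (const unsigned char *)s;
	rd_len = len;
	rd_off = 0;
}

/* Return the next byte as a value in the range 0-255, or EOF at the end.
 * It has the same shape as getchar(), so it can be assigned the same way:
 *     x.getnext = reader_getnext; */
int
reader_getnext(void)
{
	if (rd_off >= rd_len)
		return EOF;
	return rd_buf[rd_off++];
}
```

The bytes are read as unsigned char so that a byte such as 0xFF is returned as 255 and cannot be confused with EOF.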
          110 i                codemadness.org        70
          111 i                codemadness.org        70
          112 i## Parser design decisions                codemadness.org        70
          113 i                codemadness.org        70
          114 i* It supports parsing a subset of XML:                codemadness.org        70
          115 i  It is not a fully compliant XML parser.                codemadness.org        70
          116 i* There is no direct support for namespaces. For example a tag "ns:sometag" is                codemadness.org        70
          117 i  just parsed as the tag name "ns:sometag".                codemadness.org        70
           118 i* There is no resolving or loading of external DTDs for parsing the XML data.                codemadness.org        70
          119 i  This is also for security and simplicity reasons.                codemadness.org        70
           120 i* Entity expansions are not parsed, nor are DOCTYPE, ATTLIST etc.                codemadness.org        70
           121 i  It is not allowed to define or redefine entities. This prevents XML entity                codemadness.org        70
           122 i  bombs or the "billion laughs attack":                codemadness.org        70
          123 h  »https://en.wikipedia.org/wiki/Billion_laughs_attack« and        URL:https://en.wikipedia.org/wiki/Billion_laughs_attack        codemadness.org        70
          124 h  https://en.wikipedia.org/wiki/XML_external_entity_attack.        URL:https://en.wikipedia.org/wiki/XML_external_entity_attack        codemadness.org        70
          125 i* There is no character-decoding for the input. It is assumed to be UTF-8                codemadness.org        70
          126 i  compatible. The data can be decoded or translated to UTF-8 before parsing                codemadness.org        70
          127 i  it. For example using iconv.                codemadness.org        70
          128 h  https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html.        URL:https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html        codemadness.org        70
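
Since a name like "ns:sometag" arrives in the handlers as one string, a program that cares about the prefix has to split it itself. A possible sketch is shown here. The helper splitname is hypothetical and not part of xml.h, and it only splits the string: it does not resolve the prefix to a namespace URI.

```c
#include <string.h>  /* memchr */
#include <stddef.h>  /* size_t */

/* Hypothetical helper, not part of xml.h: split a tag name such as
 * "ns:sometag" at the first ':'.
 * Returns a pointer to the local name and stores its length in *locallen.
 * *prefixlen is set to the prefix length, or 0 if there is no ':'. */
const char *
splitname(const char *t, size_t tl, size_t *prefixlen, size_t *locallen)
{
	const char *p;

	if ((p = memchr(t, ':', tl)) == NULL) {
		*prefixlen = 0;
		*locallen = tl;
		return t;
	}
	*prefixlen = (size_t)(p - t);
	*locallen = tl - *prefixlen - 1;
	return p + 1;
}
```

It could be called from a tag handler with the tag name and length (t and tl) that the handler receives.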
          129 i                codemadness.org        70
          130 i                codemadness.org        70
          131 i## Trade-offs                codemadness.org        70
          132 i                codemadness.org        70
          133 iThese are trade-offs and can be considered cons:                codemadness.org        70
          134 i                codemadness.org        70
          135 i* Performance: data is buffered even if a handler is not set. To make the                codemadness.org        70
           136 i  parsing faster you can change this code in xml.c if necessary.                codemadness.org        70
           137 i* The XML is not checked for errors, so it will continue parsing XML data. This                codemadness.org        70
           138 i  is by design.                codemadness.org        70
          139 i* Internally fixed-size buffers are used, callbacks like XMLParser.xmldata are                codemadness.org        70
          140 i  called multiple times for the same tag if the data size is bigger than the                codemadness.org        70
          141 i  internal buffer size (sizeof(XMLParser.data)). To differentiate between new                codemadness.org        70
          142 i  calls for data you can use the xml\*start and xml\*end handlers.                codemadness.org        70
          143 i* It does not handle XML white-space rules for tag data. The raw values                codemadness.org        70
           144 i  including white-space are passed. This is useful in some cases, like for                codemadness.org        70
          145 i  HTML <pre> tags.                codemadness.org        70
          146 i* The XML specification has no limits on tag and attribute names. For                codemadness.org        70
          147 i  simplicity/sanity sake this XML parser takes some liberties. Tag and                codemadness.org        70
          148 i  attribute names are truncated if they are excessively long.                codemadness.org        70
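
The point about fixed-size buffers means one piece of text can arrive in several xmldata calls. One way to deal with this, sketched below, is to collect the chunks between the start and end handlers. The handler signatures follow the example program. The buffer text, its size of 256 bytes and the truncation policy are assumptions for illustration. The XMLParser typedef is only a stand-in so the sketch compiles on its own.

```c
#include <string.h>  /* memcpy */
#include <stddef.h>  /* size_t */

/* Stand-in declaration so this sketch compiles on its own;
 * in a real program use #include "xml.h" instead. */
typedef struct xmlparser XMLParser;

static char text[256]; /* collected data, always NUL-terminated */
static size_t textlen;

/* Start of a new run of data: reset the collected text. */
void
xmldatastart(XMLParser *x)
{
	(void)x;
	textlen = 0;
	text[0] = '\0';
}

/* May be called several times per run: append each chunk. */
void
xmldata(XMLParser *x, const char *d, size_t dl)
{
	size_t avail = sizeof(text) - 1 - textlen;

	(void)x;
	if (dl > avail)
		dl = avail; /* truncate instead of overflowing */
	memcpy(text + textlen, d, dl);
	textlen += dl;
	text[textlen] = '\0';
}

/* End of the run: text now holds the complete (or truncated) data. */
void
xmldataend(XMLParser *x)
{
	(void)x;
}
```

This keeps the no-dynamic-allocation property, at the cost of truncating data longer than the chosen buffer.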
          149 i                codemadness.org        70
          150 i                codemadness.org        70
          151 i## Clone                codemadness.org        70
          152 i                codemadness.org        70
          153 i        git clone git://git.codemadness.org/xmlparser                codemadness.org        70
          154 i                codemadness.org        70
          155 i                codemadness.org        70
          156 i## Browse                codemadness.org        70
          157 i                codemadness.org        70
          158 iYou can browse the source-code at:                codemadness.org        70
          159 i                codemadness.org        70
          160 h* https://git.codemadness.org/xmlparser/        URL:https://git.codemadness.org/xmlparser/        codemadness.org        70
          161 1* gopher://codemadness.org/1/git/xmlparser        /git/xmlparser        codemadness.org        70
          162 i                codemadness.org        70
          163 i                codemadness.org        70
          164 i## Example program                codemadness.org        70
          165 i                codemadness.org        70
          166 iThis is from skeleton.c in the repository. It can be used as a template file                codemadness.org        70
          167 ito quickly create a small program that parses XML.                codemadness.org        70
          168 i                codemadness.org        70
          169 1From: »https://git.codemadness.org/xmlparser/file/skeleton.c.html«        /git/xmlparser/file/skeleton.c.gph        codemadness.org        70
          170 i                codemadness.org        70
          171 i        #include <stdio.h>                codemadness.org        70
          172 i                        codemadness.org        70
          173 i        #include "xml.h"                codemadness.org        70
          174 i                        codemadness.org        70
          175 i        void                codemadness.org        70
          176 i        xmlattr(XMLParser *x, const char *t, size_t tl,                codemadness.org        70
          177 i                const char *a, size_t al, const char *v, size_t vl)                codemadness.org        70
          178 i        {                codemadness.org        70
          179 i        }                codemadness.org        70
          180 i                        codemadness.org        70
          181 i        void                codemadness.org        70
          182 i        xmlattrentity(XMLParser *x, const char *t, size_t tl,                codemadness.org        70
          183 i                      const char *a, size_t al, const char *v, size_t vl)                codemadness.org        70
          184 i        {                codemadness.org        70
          185 i                char buf[16];                codemadness.org        70
          186 i                int len;                codemadness.org        70
          187 i                        codemadness.org        70
          188 i                /* try to translate entity, else just pass as data to                codemadness.org        70
          189 i                 * xmlattr handler. */                codemadness.org        70
          190 i                if ((len = xml_entitytostr(v, buf, sizeof(buf))) > 0)                codemadness.org        70
          191 i                        xmlattr(x, t, tl, a, al, buf, (size_t)len);                codemadness.org        70
          192 i                else                codemadness.org        70
          193 i                        xmlattr(x, t, tl, a, al, v, vl);                codemadness.org        70
          194 i        }                codemadness.org        70
          195 i                        codemadness.org        70
          196 i        void                codemadness.org        70
          197 i        xmlattrend(XMLParser *x, const char *t, size_t tl,                codemadness.org        70
          198 i                   const char *a, size_t al)                codemadness.org        70
          199 i        {                codemadness.org        70
          200 i        }                codemadness.org        70
          201 i                        codemadness.org        70
          202 i        void                codemadness.org        70
          203 i        xmlattrstart(XMLParser *x, const char *t, size_t tl,                codemadness.org        70
          204 i                     const char *a, size_t al)                codemadness.org        70
          205 i        {                codemadness.org        70
          206 i        }                codemadness.org        70
          207 i                        codemadness.org        70
          208 i        void                codemadness.org        70
          209 i        xmlcdatastart(XMLParser *x)                codemadness.org        70
          210 i        {                codemadness.org        70
          211 i        }                codemadness.org        70
          212 i                        codemadness.org        70
          213 i        void                codemadness.org        70
          214 i        xmlcdata(XMLParser *x, const char *d, size_t dl)                codemadness.org        70
          215 i        {                codemadness.org        70
          216 i        }                codemadness.org        70
          217 i                        codemadness.org        70
          218 i        void                codemadness.org        70
          219 i        xmlcdataend(XMLParser *x)                codemadness.org        70
          220 i        {                codemadness.org        70
          221 i        }                codemadness.org        70
          222 i                        codemadness.org        70
          223 i        void                codemadness.org        70
          224 i        xmlcommentstart(XMLParser *x)                codemadness.org        70
          225 i        {                codemadness.org        70
          226 i        }                codemadness.org        70
          227 i                        codemadness.org        70
          228 i        void                codemadness.org        70
          229 i        xmlcomment(XMLParser *x, const char *c, size_t cl)                codemadness.org        70
          230 i        {                codemadness.org        70
          231 i        }                codemadness.org        70
          232 i                        codemadness.org        70
          233 i        void                codemadness.org        70
          234 i        xmlcommentend(XMLParser *x)                codemadness.org        70
          235 i        {                codemadness.org        70
          236 i        }                codemadness.org        70
          237 i                        codemadness.org        70
          238 i        void                codemadness.org        70
          239 i        xmldata(XMLParser *x, const char *d, size_t dl)                codemadness.org        70
          240 i        {                codemadness.org        70
          241 i        }                codemadness.org        70
          242 i                        codemadness.org        70
          243 i        void                codemadness.org        70
          244 i        xmldataend(XMLParser *x)                codemadness.org        70
          245 i        {                codemadness.org        70
          246 i        }                codemadness.org        70
          247 i                        codemadness.org        70
          248 i        void                codemadness.org        70
          249 i        xmldataentity(XMLParser *x, const char *d, size_t dl)                codemadness.org        70
          250 i        {                codemadness.org        70
          251 i                char buf[16];                codemadness.org        70
          252 i                int len;                codemadness.org        70
          253 i                        codemadness.org        70
          254 i                /* try to translate entity, else just pass as data to                codemadness.org        70
          255 i                 * xmldata handler. */                codemadness.org        70
          256 i                if ((len = xml_entitytostr(d, buf, sizeof(buf))) > 0)                codemadness.org        70
          257 i                        xmldata(x, buf, (size_t)len);                codemadness.org        70
          258 i                else                codemadness.org        70
          259 i                        xmldata(x, d, dl);                codemadness.org        70
          260 i        }                codemadness.org        70
          261 i                        codemadness.org        70
          262 i        void                codemadness.org        70
          263 i        xmldatastart(XMLParser *x)                codemadness.org        70
          264 i        {                codemadness.org        70
          265 i        }                codemadness.org        70
          266 i                        codemadness.org        70
          267 i        void                codemadness.org        70
          268 i        xmltagend(XMLParser *x, const char *t, size_t tl, int isshort)                codemadness.org        70
          269 i        {                codemadness.org        70
          270 i        }                codemadness.org        70
          271 i                        codemadness.org        70
          272 i        void                codemadness.org        70
          273 i        xmltagstart(XMLParser *x, const char *t, size_t tl)                codemadness.org        70
          274 i        {                codemadness.org        70
          275 i        }                codemadness.org        70
          276 i                        codemadness.org        70
          277 i        void                codemadness.org        70
          278 i        xmltagstartparsed(XMLParser *x, const char *t, size_t tl, int isshort)                codemadness.org        70
          279 i        {                codemadness.org        70
          280 i        }                codemadness.org        70
          281 i                        codemadness.org        70
          282 i        int                codemadness.org        70
          283 i        main(void)                codemadness.org        70
          284 i        {                codemadness.org        70
          285 i                XMLParser x = { 0 };                codemadness.org        70
          286 i                        codemadness.org        70
          287 i                x.xmlattr = xmlattr;                codemadness.org        70
          288 i                x.xmlattrend = xmlattrend;                codemadness.org        70
          289 i                x.xmlattrstart = xmlattrstart;                codemadness.org        70
          290 i                x.xmlattrentity = xmlattrentity;                codemadness.org        70
          291 i                x.xmlcdatastart = xmlcdatastart;                codemadness.org        70
          292 i                x.xmlcdata = xmlcdata;                codemadness.org        70
          293 i                x.xmlcdataend = xmlcdataend;                codemadness.org        70
          294 i                x.xmlcommentstart = xmlcommentstart;                codemadness.org        70
          295 i                x.xmlcomment = xmlcomment;                codemadness.org        70
          296 i                x.xmlcommentend = xmlcommentend;                codemadness.org        70
          297 i                x.xmldata = xmldata;                codemadness.org        70
          298 i                x.xmldataend = xmldataend;                codemadness.org        70
          299 i                x.xmldataentity = xmldataentity;                codemadness.org        70
          300 i                x.xmldatastart = xmldatastart;                codemadness.org        70
          301 i                x.xmltagend = xmltagend;                codemadness.org        70
          302 i                x.xmltagstart = xmltagstart;                codemadness.org        70
          303 i                x.xmltagstartparsed = xmltagstartparsed;                codemadness.org        70
          304 i                        codemadness.org        70
          305 i                x.getnext = getchar;                codemadness.org        70
          306 i                        codemadness.org        70
          307 i                xml_parse(&x);                codemadness.org        70
          308 i                        codemadness.org        70
          309 i                return 0;                codemadness.org        70
          310 i        }                codemadness.org        70
          311 i                codemadness.org        70
          312 iAs you can see the important functions of the parser itself are xml_parse()                codemadness.org        70
          313 iand xml_entitytostr().                codemadness.org        70
          314 i                codemadness.org        70
          315 iXMLParser is a structure of the context and it contains pointers to the                codemadness.org        70
          316 icallback functions.                codemadness.org        70
          317 i                codemadness.org        70
          318 iThis is a verbose example. All the callbacks that are unused could be removed.                codemadness.org        70
          319 iIf the callback is set to NULL then it is unused.                codemadness.org        70
          320 i                codemadness.org        70
          321 i                codemadness.org        70
           322 i## References                codemadness.org        70
          323 i                codemadness.org        70
          324 h* AFL (American fuzzy lop): afl-fuzz: »https://lcamtuf.coredump.cx/afl/«.        URL:https://lcamtuf.coredump.cx/afl/        codemadness.org        70
          325 i* iconv: character-set conversion.                codemadness.org        70
          326 h* sfeed: »https://codemadness.org/sfeed.html«.        URL:https://codemadness.org/sfeed.html        codemadness.org        70
          327 h* libexpat XML parser: »https://github.com/libexpat/libexpat/«.        URL:https://github.com/libexpat/libexpat/        codemadness.org        70
          328 h* libxml2 XML parser: »https://github.com/GNOME/libxml2«.        URL:https://github.com/GNOME/libxml2        codemadness.org        70
          329 i                codemadness.org        70
          330 i                codemadness.org        70
           331 i## End                codemadness.org        70
          332 i                codemadness.org        70
           333 iI hope this write-up is useful or that xml.{c,h} can be useful in your project.                codemadness.org        70
          334 .