<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Language" content="en" />
<meta name="viewport" content="width=device-width" />
<meta name="keywords" content="XML, XMHELL, RSS, Atom, parser" />
<meta name="description" content="xml.c and xml.h: XML parser for some of my projects" />
<meta name="author" content="Hiltjo" />
<meta name="generator" content="Static content generated using saait: https://codemadness.org/saait.html" />
<title>xml.c and xml.h: XML parser - Codemadness</title>
<link rel="stylesheet" href="style.css" type="text/css" media="screen" />
<link rel="stylesheet" href="print.css" type="text/css" media="print" />
<link rel="alternate" href="atom.xml" type="application/atom+xml" title="Codemadness Atom Feed" />
<link rel="alternate" href="atom_content.xml" type="application/atom+xml" title="Codemadness Atom Feed with content" />
<link rel="icon" href="/favicon.png" type="image/png" />
</head>
<body>
<nav id="menuwrap">
<table id="menu" width="100%" border="0">
<tr>
<td id="links" align="left">
<a href="index.html">Blog</a> |
<a href="/git/" title="Git repository with some of my projects">Git</a> |
<a href="/releases/">Releases</a> |
<a href="gopher://codemadness.org">Gopherhole</a>
</td>
<td id="links-contact" align="right">
<span class="hidden"> | </span>
<a href="/donate/">Donate</a> |
<a href="feeds.html">Feeds</a> |
<a href="pgp.asc">PGP</a> |
<a href="mailto:hiltjo@AT@codemadness.DOT.org">Mail</a>
</td>
</tr>
</table>
</nav>
<hr class="hidden" />
<main id="mainwrap">
<div id="main">
<article>
<header>
<h1>xml.c and xml.h: XML parser</h1>
<p>
<strong>Last modification on </strong> <time>2023-11-20</time>
</p>
</header>

<h2>Why</h2>
<p>This XML parser was first developed for use with my RSS/Atom parser
<a href="https://codemadness.org/sfeed.html">sfeed</a>.</p>
<p>The first few versions of sfeed had no real XML parser: they just did a
simple string search for the XML tag names.</p>
<p>Then I changed it to use libexpat. One of the issues I ran into with expat is
that it parses XML in a strict mode, while some RSS/Atom feeds have quirks or
small XML errors. libexpat has also had many security vulnerabilities over
time. Some examples:</p>
<ul>
<li><a href="https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat">https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat</a></li>
<li>Section "security fixes" in the Changes file:<br />
<a href="https://github.com/libexpat/libexpat/blob/master/expat/Changes">https://github.com/libexpat/libexpat/blob/master/expat/Changes</a></li>
</ul>
<p>Other reasons are portability and reducing the number of dependencies. I also
want to fully understand what most parts of my programs are doing and to keep
them relatively simple.</p>
<p>I've used my XML parser in my projects for some time now, and there are also
many written tests. It was fuzzed with
<a href="https://lcamtuf.coredump.cx/afl/">afl-fuzz</a> to test strange
pseudo-random input data.</p>
<p>It was tested on different platforms which have different characteristics.
But of course I'm only human and it will still have bugs: please report them!</p>
<h2>What is it good for?</h2>
<p>... <a href="https://www.youtube.com/embed/hZJRJpbGkG4">absolutely nothing</a>.
Just kidding: as mentioned, some of my projects use it:</p>
<ul>
<li><p><a href="https://codemadness.org/sfeed.html">sfeed</a>:<br />
It is used to parse RSS/Atom for newsfeeds.<br /> </p>
<p>Repository: <a href="https://git.codemadness.org/sfeed/">https://git.codemadness.org/sfeed/</a><br /> </p>
</li>
<li><p>osm-zipcodes:<br />
A project to extract Dutch zipcodes and addresses and their latitude and
longitude from the .osm XML.
The code is quite ugly: it uses mmap() as a reader and ugly hacks
to improve the speed of parsing the XML.<br /> </p>
<p>Repository: <a href="https://git.codemadness.org/osm-zipcodes/">https://git.codemadness.org/osm-zipcodes/</a></p>
</li>
<li><p>webdump:<br />
It is used to parse HTML/XHTML. It has some modifications to handle
HTML and a list of the many HTML named entities was added.<br /> </p>
<p>Repository: <a href="https://git.codemadness.org/webdump/">https://git.codemadness.org/webdump/</a></p>
</li>
<li><p><a href="https://codemadness.org/idiotbox.html">Youtube HTML parser and front-ends</a>:<br />
It is used to parse HTML and extract the relevant JSON meta-data from the
page. The Youtube HTML is (intentionally by Google) crapified auto-generated
HTML. I guess it is a good benchmark for the crappy webworld we live in
today :)<br /> </p>
<p>Repository: <a href="https://git.codemadness.org/frontends/">https://git.codemadness.org/frontends/</a> in the "youtube/"
directory.<br />
Also a link to my JSON parser: <a href="https://codemadness.org/json2tsv.html">https://codemadness.org/json2tsv.html</a></p>
</li>
<li><p>Dutch BAG Kadaster parser (extract/subset):<br />
"Basisregistratie Adressen en Gebouwen (BAG)". Translated from Dutch to English
this means something like: "Base registration of addresses and buildings".</p>
<p>It parses the publicly available Dutch BAG Kadaster XML files. It is used to
extract the summary of BAG information of an address. In particular the address,
allocated purpose of the address and the area size (in square meters).</p>
<p>Repository: <a href="https://git.codemadness.org/bag/">https://git.codemadness.org/bag/</a></p>
<p>BAG Kadaster XML data source: <a href="https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract">https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract</a></p>
</li>
</ul>
<h2>Features</h2>
<ul>
<li>Relatively small parser, not that many lines of code to understand.</li>
<li>Pretty simple and easy to use callback-based API.</li>
<li>Pretty fast.</li>
<li>Portable and with few dependencies: it can be compiled with an ANSI C89
compiler and works on many platforms and compilers.</li>
<li>No dynamic memory allocation.</li>
<li>Suitable for low-resource environments.</li>
</ul>
<h2>Supports</h2>
<ul>
<li>Tags in short-form (&lt;img src="lolcat.jpg" title="Meow" /&gt;).</li>
<li>Tag attributes.</li>
<li>Short attributes without an explicitly set value (&lt;input type="checkbox" checked /&gt;).</li>
<li>Comments.</li>
<li>CDATA sections.</li>
<li>Helper function (xml_entitytostr) to convert XML 1.0 / HTML 2.0 named
entities and numeric entities to UTF-8.</li>
<li>Reading XML from a file descriptor, mmap, a string buffer or a custom
reader: see XMLParser.getnext or the GETNEXT() macro.
The reader function can be easily customized: it is expected to return the
next byte, or EOF on end-of-file or on an error. This way you can read using
getchar/getchar_unlocked, mmap(), a memory buffer or in many other ways.</li>
</ul>
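<p>A custom reader could be sketched roughly as below. It assumes the reader
has the same shape as getchar(), which the example program on this page assigns
to XMLParser.getnext: no arguments, returning the next byte as an int, or EOF.
The names strbuf, strbuf_set and strbuf_getnext are made up for this
illustration and are not part of xml.c.</p>

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory input; these names are not part of xml.c. */
static const unsigned char *strbuf;
static size_t strbuflen, strbufoff;

/* Point the reader at a NUL-terminated string. */
void
strbuf_set(const char *s)
{
	strbuf = (const unsigned char *)s;
	strbuflen = strlen(s);
	strbufoff = 0;
}

/* Same shape as getchar(): return the next byte, or EOF when done. */
int
strbuf_getnext(void)
{
	if (strbufoff >= strbuflen)
		return EOF;
	return strbuf[strbufoff++];
}
```

<p>With the parser this would presumably be wired up by setting
x.getnext = strbuf_getnext; before calling xml_parse(&amp;x).</p>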
<h2>Parser design decisions</h2>
<ul>
<li>It supports parsing a subset of XML:
It is not a fully compliant XML parser.</li>
<li>There is no direct support for namespaces. For example a tag "ns:sometag" is
just parsed as the tag name "ns:sometag".</li>
<li>There is no resolving or loading of external DTDs for parsing the XML data.
This is also for security and simplicity reasons.</li>
<li>Entity expansions are not parsed, and neither are DOCTYPE, ATTLIST etc.
Defining or redefining entities is not possible. This prevents XML entity
bombs such as the "billion laughs attack"
(<a href="https://en.wikipedia.org/wiki/Billion_laughs_attack">https://en.wikipedia.org/wiki/Billion_laughs_attack</a>) and XML external entity attacks
(<a href="https://en.wikipedia.org/wiki/XML_external_entity_attack">https://en.wikipedia.org/wiki/XML_external_entity_attack</a>).</li>
<li>There is no character-decoding for the input. It is assumed to be UTF-8
compatible. The data can be decoded or translated to UTF-8 before parsing
it, for example using iconv:
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html">https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html</a>.</li>
</ul>
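<p>The link above is to the iconv utility. Inside a C program the same
translation could be done with the POSIX iconv(3) function before the data
reaches the parser. This is only a sketch under some assumptions: the input is
assumed to be ISO-8859-1, the function name latin1_to_utf8 is made up, and on
some platforms iconv() takes a const char ** input argument or needs linking
with -liconv.</p>

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes of ISO-8859-1 text to UTF-8 using POSIX iconv(3).
 * Returns the number of bytes written to out, or (size_t)-1 on error
 * (including an output buffer that is too small).
 * The function name latin1_to_utf8 is made up for this example.
 */
size_t
latin1_to_utf8(const char *in, size_t inlen, char *out, size_t outsize)
{
	iconv_t cd;
	char *inp = (char *)in; /* iconv() takes a non-const pointer */
	char *outp = out;
	size_t inleft = inlen, outleft = outsize, r;

	if ((cd = iconv_open("UTF-8", "ISO-8859-1")) == (iconv_t)-1)
		return (size_t)-1;
	r = iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);
	if (r == (size_t)-1)
		return (size_t)-1;
	return outsize - outleft;
}
```

<p>The converted buffer can then be fed to the parser through a reader
function that reads from memory.</p>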
<h2>Trade-offs</h2>
<p>These are trade-offs and can be considered cons:</p>
<ul>
<li>Performance: data is buffered even if a handler is not set. To make
parsing faster you can change this code in xml.c if necessary.</li>
<li>The XML is not checked for errors, so the parser will continue parsing
malformed XML data. This is by design.</li>
<li>Internally fixed-size buffers are used. Callbacks like XMLParser.xmldata are
called multiple times for the same tag if the data size is bigger than the
internal buffer size (sizeof(XMLParser.data)). To differentiate between new
calls for data you can use the xml*start and xml*end handlers.</li>
<li>It does not handle XML white-space rules for tag data. The raw value,
including white-space, is passed. This is useful in some cases, like for
HTML &lt;pre&gt; tags.</li>
<li>The XML specification has no limits on tag and attribute names. For
the sake of simplicity and sanity this XML parser takes some liberties. Tag
and attribute names are truncated if they are excessively long.</li>
</ul>
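<p>Because the data for one tag can arrive in several chunks, a program that
wants the complete text has to collect it itself. A minimal sketch of one way
to do that is below. The names text_reset, text_append and text_get and the
4096 byte buffer are assumptions for this illustration, they are not part of
xml.c.</p>

```c
#include <string.h>

/* Hypothetical accumulator for text delivered in several chunks. */
static char text[4096];
static size_t textlen;

/* Call from the xmldatastart handler: begin a fresh run of data. */
void
text_reset(void)
{
	textlen = 0;
	text[0] = '\0';
}

/* Call from the xmldata handler: append one chunk, truncating
 * silently when the buffer is full. */
void
text_append(const char *d, size_t dl)
{
	size_t avail = sizeof(text) - 1 - textlen;

	if (dl > avail)
		dl = avail;
	memcpy(text + textlen, d, dl);
	textlen += dl;
	text[textlen] = '\0';
}

/* Call from the xmldataend handler: the NUL-terminated text so far. */
const char *
text_get(void)
{
	return text;
}
```

<p>The idea is that the xmldatastart handler calls text_reset(), the xmldata
handler calls text_append(d, dl) and the xmldataend handler processes
text_get().</p>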
<h2>Clone</h2>
<pre><code>git clone git://git.codemadness.org/xmlparser
</code></pre>
<h2>Browse</h2>
<p>You can browse the source-code at:</p>
<ul>
<li><a href="https://git.codemadness.org/xmlparser/">https://git.codemadness.org/xmlparser/</a></li>
<li><a href="gopher://codemadness.org/1/git/xmlparser">gopher://codemadness.org/1/git/xmlparser</a></li>
</ul>
<h2>Example program</h2>
<p>This is from skeleton.c in the repository. It can be used as a template file
to quickly create a small program that parses XML.</p>
<p>From: <a href="https://git.codemadness.org/xmlparser/file/skeleton.c.html">https://git.codemadness.org/xmlparser/file/skeleton.c.html</a></p>
<pre><code>#include &lt;stdio.h&gt;

#include "xml.h"

void
xmlattr(XMLParser *x, const char *t, size_t tl,
	const char *a, size_t al, const char *v, size_t vl)
{
}

void
xmlattrentity(XMLParser *x, const char *t, size_t tl,
	const char *a, size_t al, const char *v, size_t vl)
{
	char buf[16];
	int len;

	/* try to translate entity, else just pass as data to
	 * xmlattr handler. */
	if ((len = xml_entitytostr(v, buf, sizeof(buf))) &gt; 0)
		xmlattr(x, t, tl, a, al, buf, (size_t)len);
	else
		xmlattr(x, t, tl, a, al, v, vl);
}

void
xmlattrend(XMLParser *x, const char *t, size_t tl,
	const char *a, size_t al)
{
}

void
xmlattrstart(XMLParser *x, const char *t, size_t tl,
	const char *a, size_t al)
{
}

void
xmlcdatastart(XMLParser *x)
{
}

void
xmlcdata(XMLParser *x, const char *d, size_t dl)
{
}

void
xmlcdataend(XMLParser *x)
{
}

void
xmlcommentstart(XMLParser *x)
{
}

void
xmlcomment(XMLParser *x, const char *c, size_t cl)
{
}

void
xmlcommentend(XMLParser *x)
{
}

void
xmldata(XMLParser *x, const char *d, size_t dl)
{
}

void
xmldataend(XMLParser *x)
{
}

void
xmldataentity(XMLParser *x, const char *d, size_t dl)
{
	char buf[16];
	int len;

	/* try to translate entity, else just pass as data to
	 * xmldata handler. */
	if ((len = xml_entitytostr(d, buf, sizeof(buf))) &gt; 0)
		xmldata(x, buf, (size_t)len);
	else
		xmldata(x, d, dl);
}

void
xmldatastart(XMLParser *x)
{
}

void
xmltagend(XMLParser *x, const char *t, size_t tl, int isshort)
{
}

void
xmltagstart(XMLParser *x, const char *t, size_t tl)
{
}

void
xmltagstartparsed(XMLParser *x, const char *t, size_t tl, int isshort)
{
}

int
main(void)
{
	XMLParser x = { 0 };

	x.xmlattr = xmlattr;
	x.xmlattrend = xmlattrend;
	x.xmlattrstart = xmlattrstart;
	x.xmlattrentity = xmlattrentity;
	x.xmlcdatastart = xmlcdatastart;
	x.xmlcdata = xmlcdata;
	x.xmlcdataend = xmlcdataend;
	x.xmlcommentstart = xmlcommentstart;
	x.xmlcomment = xmlcomment;
	x.xmlcommentend = xmlcommentend;
	x.xmldata = xmldata;
	x.xmldataend = xmldataend;
	x.xmldataentity = xmldataentity;
	x.xmldatastart = xmldatastart;
	x.xmltagend = xmltagend;
	x.xmltagstart = xmltagstart;
	x.xmltagstartparsed = xmltagstartparsed;

	x.getnext = getchar;

	xml_parse(&amp;x);

	return 0;
}
</code></pre>
<p>As you can see the important functions of the parser itself are xml_parse()
and xml_entitytostr().</p>
<p>XMLParser is the context structure; it contains pointers to the
callback functions.</p>
<p>This is a verbose example. All the callbacks that are unused could be removed:
a callback that is set to NULL is unused.</p>
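<p>The entity handlers in the example hand the entity to xml_entitytostr() and
get UTF-8 bytes back. For a numeric entity that conversion boils down to
encoding a code point as UTF-8. The sketch below shows only that step. It is
not the code from xml.c, and the function name cp_to_utf8 is made up for this
illustration.</p>

```c
#include <stddef.h>

/*
 * Encode the Unicode code point cp as UTF-8 into buf.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not a
 * valid code point or bufsiz is too small. No NUL byte is appended.
 */
size_t
cp_to_utf8(unsigned long cp, char *buf, size_t bufsiz)
{
	if (cp < 0x80) {
		if (bufsiz < 1)
			return 0;
		buf[0] = (char)cp;
		return 1;
	} else if (cp < 0x800) {
		if (bufsiz < 2)
			return 0;
		buf[0] = (char)(0xC0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp < 0x10000) {
		/* surrogate halves are not valid code points */
		if (bufsiz < 3 || (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;
		buf[0] = (char)(0xE0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		if (bufsiz < 4)
			return 0;
		buf[0] = (char)(0xF0 | (cp >> 18));
		buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

<p>For example the numeric entity &amp;#8364; (the euro sign, U+20AC) encodes
to the three bytes 0xE2 0x82 0xAC.</p>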
<h1>References</h1>
<ul>
<li>AFL (American fuzzy lop): afl-fuzz: <a href="https://lcamtuf.coredump.cx/afl/">https://lcamtuf.coredump.cx/afl/</a>.</li>
<li>iconv: character-set conversion: <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html">https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html</a>.</li>
<li>sfeed: <a href="https://codemadness.org/sfeed.html">https://codemadness.org/sfeed.html</a>.</li>
<li>libexpat XML parser: <a href="https://github.com/libexpat/libexpat/">https://github.com/libexpat/libexpat/</a>.</li>
<li>libxml2 XML parser: <a href="https://github.com/GNOME/libxml2">https://github.com/GNOME/libxml2</a>.</li>
</ul>
<h1>End</h1>
<p>I hope this write-up is useful, or that xml.{c,h} can be useful in your project.</p>

</article>
</div>
</main>
</body>
</html>