## Why

This XML parser was first developed for use with my RSS/Atom parser
[sfeed](https://codemadness.org/sfeed.html).

The first few versions of sfeed didn't have a real XML parser: they just
did a simple string search for the XML tag names.

Then I changed it to use libexpat. One of the issues I ran into with expat is
that it parses XML in a strict mode, while some RSS/Atom feeds have quirks or
small XML errors. libexpat has also had many security vulnerabilities over
time. Some examples:

* <https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat>
* Section "security fixes" in the Changes file:
  <https://github.com/libexpat/libexpat/blob/master/expat/Changes>

Other reasons are portability and reducing the number of dependencies. I also
want to fully understand what most of the parts of my program are doing and to
keep it relatively simple.

I've used my XML parser for some time now in my projects, and there are also
many written tests. Fuzzing was used to test strange pseudo-random input data.
For fuzzing the tool [afl-fuzz](https://lcamtuf.coredump.cx/afl/) was used.

It was tested on different platforms which have different characteristics.
But of course I'm only human and it will still have bugs: please report them!


## What is it good for?

... [absolutely nothing](https://www.youtube.com/embed/hZJRJpbGkG4).
Just kidding, as mentioned some of my projects use it:

* [sfeed](https://codemadness.org/sfeed.html):
  It is used to parse RSS/Atom for newsfeeds.

  Repository: <https://git.codemadness.org/sfeed/>

* osm-zipcodes:
  A project to extract Dutch zipcodes and addresses with their latitude and
  longitude from the .osm XML.
  The code is quite ugly: it uses mmap() as a reader and ugly hacks
  to improve the speed of parsing the XML.

  Repository: <https://git.codemadness.org/osm-zipcodes/>

* webdump:
  It is used to parse HTML/XHTML. It has some modifications to handle
  HTML, and a list of the many HTML named entities was added.

  Repository: <https://git.codemadness.org/webdump/>

* [Youtube HTML parser and front-ends](https://codemadness.org/idiotbox.html):
  It is used to parse HTML and extract the relevant JSON meta-data from the
  page. The Youtube HTML is (intentionally by Google) crapified auto-generated
  HTML. I guess it is a good benchmark for the crappy webworld we live in
  today :)

  Repository: <https://git.codemadness.org/frontends/> in the "youtube/"
  directory.
  Also a link to my JSON parser: <https://codemadness.org/json2tsv.html>

* Dutch BAG Kadaster parser (extract/subset):
  "Basisregistratie Adressen en Gebouwen (BAG)". Translated from Dutch to
  English this means something like: "Base registration of addresses and
  buildings".

  It parses the publicly available Dutch BAG Kadaster XML files. It is used to
  extract the summary of BAG information of an address, in particular the
  address, the allocated purpose of the address and the area size (in square
  meters).

  Repository: <https://git.codemadness.org/bag/>

  BAG Kadaster XML data source: <https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract>


## Features

* Relatively small parser, not that many lines of code to understand.
* Pretty simple and easy to use callback-based API.
* Pretty fast.
* Portable and with few dependencies: it can be compiled with an ANSI C89
  compiler and works on many platforms and compilers.
* No dynamic memory allocation.
* Suitable for low-resource environments.


## Supports

* Tags in short-form (<img src="lolcat.jpg" title="Meow" />).
* Tag attributes.
* Short attributes without an explicitly set value (<input type="checkbox" checked />).
* Comments.
* CDATA sections.
* Helper function (xml\_entitytostr) to convert XML 1.0 / HTML 2.0 named
  entities and numeric entities to UTF-8.
* Reading XML from a file descriptor, mmap, string buffer or a custom
  reader: see XMLParser.getnext or the GETNEXT() macro.
  The reader function can be easily customized. It is expected to return
  the next byte, or EOF on end-of-file or on an error. This way you can use
  getchar/getchar\_unlocked, mmap(), a memory buffer or read in many
  other ways.
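
A reader for a string buffer could look roughly like the sketch below. The
names setreadbuf and getnextbuf and the file-scope variables are made up for
this illustration and are not part of xml.{c,h}. The only thing taken from the
parser is the shape of the reader: a function without arguments that returns
the next byte or EOF, like getchar().

```c
#include <stdio.h>  /* EOF */
#include <stddef.h> /* size_t */

/* Hypothetical reader state: the reader function takes no arguments,
 * so the input position has to live in file-scope variables. */
static const unsigned char *readbuf;
static size_t readlen, readoff;

/* Point the reader at a memory buffer of len bytes. */
void
setreadbuf(const char *s, size_t len)
{
    readbuf = (const unsigned char *)s;
    readlen = len;
    readoff = 0;
}

/* Return the next byte as a value in the range 0-255, or EOF when the
 * buffer is exhausted. Same int (*)(void) shape as getchar(). */
int
getnextbuf(void)
{
    if (readoff >= readlen)
        return EOF;
    return readbuf[readoff++];
}
```

In a real program this would be hooked up by calling setreadbuf() on the
input and setting x.getnext = getnextbuf before calling xml\_parse(&x).
Reading through unsigned char keeps bytes of multi-byte UTF-8 sequences from
being returned as negative values that could be confused with EOF.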


## Parser design decisions

* It supports parsing a subset of XML:
  It is not a fully compliant XML parser.
* There is no direct support for namespaces. For example a tag "ns:sometag" is
  just parsed as the tag name "ns:sometag".
* There is no resolving or loading of external DTDs for parsing the XML data.
  This is also for security and simplicity reasons.
* Entity expansions are not parsed, and neither are DOCTYPE, ATTLIST etc.
  It is not allowed to define or redefine entities. This prevents XML entity
  bombs or the "billion laughs attack":
  <https://en.wikipedia.org/wiki/Billion_laughs_attack> and
  <https://en.wikipedia.org/wiki/XML_external_entity_attack>.
* There is no character-decoding for the input. It is assumed to be UTF-8
  compatible. The data can be decoded or translated to UTF-8 before parsing
  it, for example using iconv:
  <https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html>.
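
Since a name like "ns:sometag" arrives in the callbacks as one string, a
program that only cares about the part after the prefix has to split it
itself. The helper below is a minimal sketch of that idea. The name localname
is hypothetical and not part of xml.{c,h}, and this is plain string splitting,
not real namespace resolution (the prefix is not mapped to a namespace URI).

```c
#include <string.h> /* memchr */
#include <stddef.h> /* size_t */

/* Hypothetical helper: given a tag name such as "ns:sometag" of length tl,
 * return a pointer to the part after the first ':' and store its length
 * in *ll. If there is no ':' the whole name is returned. */
const char *
localname(const char *t, size_t tl, size_t *ll)
{
    const char *p;

    if ((p = memchr(t, ':', tl))) {
        p++; /* skip the ':' itself */
        *ll = tl - (size_t)(p - t);
        return p;
    }
    *ll = tl;
    return t;
}
```

It could be called at the top of a tag handler with the t and tl arguments
the handler receives.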


## Trade-offs

These are trade-offs and can be considered cons:

* Performance: data is buffered even if a handler is not set. To make the
  parsing faster you can change this code in xml.c if necessary.
* The XML is not checked for errors, so it will continue parsing XML data. This
  is by design.
* Internally fixed-size buffers are used. Callbacks like XMLParser.xmldata are
  called multiple times for the same tag if the data size is bigger than the
  internal buffer size (sizeof(XMLParser.data)). To differentiate between new
  calls for data you can use the xml\*start and xml\*end handlers.
* It does not handle XML white-space rules for tag data. The raw values
  including white-space are passed. This is useful in some cases, like for
  HTML <pre> tags.
* The XML specification has no limits on tag and attribute names. For
  simplicity/sanity's sake this XML parser takes some liberties. Tag and
  attribute names are truncated if they are excessively long.
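
Because xmldata can be called several times for one piece of text, a handler
that needs the complete value has to collect the chunks between the start and
end handlers. The sketch below shows one way to do that with a fixed-size
buffer. The text buffer, its size of 256 bytes and the choice to silently drop
what does not fit are assumptions made for this illustration. The typedef is
only a stand-in so the fragment compiles on its own.

```c
#include <string.h> /* memcpy */
#include <stddef.h> /* size_t */

/* Stand-in for the type from xml.h so this sketch compiles on its own;
 * in a real program use #include "xml.h" instead. */
typedef struct xmlparser XMLParser;

/* Hypothetical accumulation buffer for the text of one data section. */
static char text[256];
static size_t textlen;

/* Called when a data section begins: start with an empty buffer. */
void
xmldatastart(XMLParser *x)
{
    (void)x;
    textlen = 0;
    text[0] = '\0';
}

/* May be called several times for one section: append each chunk and
 * drop whatever does not fit in the buffer. */
void
xmldata(XMLParser *x, const char *d, size_t dl)
{
    size_t avail = sizeof(text) - 1 - textlen;

    (void)x;
    if (dl > avail)
        dl = avail;
    memcpy(text + textlen, d, dl);
    textlen += dl;
    text[textlen] = '\0';
}

/* Called when the data section ends: text[] now holds the collected value. */
void
xmldataend(XMLParser *x)
{
    (void)x;
    /* process text and textlen here */
}
```

When the xmldataentity handler forwards to xmldata, as it does in skeleton.c,
translated entities end up in the same buffer.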


## Clone

    git clone git://git.codemadness.org/xmlparser


## Browse

You can browse the source-code at:

* <https://git.codemadness.org/xmlparser/>
* <gopher://codemadness.org/1/git/xmlparser>


## Example program

This is from skeleton.c in the repository. It can be used as a template file
to quickly create a small program that parses XML.

From: <https://git.codemadness.org/xmlparser/file/skeleton.c.html>

    #include <stdio.h>

    #include "xml.h"

    void
    xmlattr(XMLParser *x, const char *t, size_t tl,
            const char *a, size_t al, const char *v, size_t vl)
    {
    }

    void
    xmlattrentity(XMLParser *x, const char *t, size_t tl,
            const char *a, size_t al, const char *v, size_t vl)
    {
        char buf[16];
        int len;

        /* try to translate entity, else just pass as data to
         * xmlattr handler. */
        if ((len = xml_entitytostr(v, buf, sizeof(buf))) > 0)
            xmlattr(x, t, tl, a, al, buf, (size_t)len);
        else
            xmlattr(x, t, tl, a, al, v, vl);
    }

    void
    xmlattrend(XMLParser *x, const char *t, size_t tl,
            const char *a, size_t al)
    {
    }

    void
    xmlattrstart(XMLParser *x, const char *t, size_t tl,
            const char *a, size_t al)
    {
    }

    void
    xmlcdatastart(XMLParser *x)
    {
    }

    void
    xmlcdata(XMLParser *x, const char *d, size_t dl)
    {
    }

    void
    xmlcdataend(XMLParser *x)
    {
    }

    void
    xmlcommentstart(XMLParser *x)
    {
    }

    void
    xmlcomment(XMLParser *x, const char *c, size_t cl)
    {
    }

    void
    xmlcommentend(XMLParser *x)
    {
    }

    void
    xmldata(XMLParser *x, const char *d, size_t dl)
    {
    }

    void
    xmldataend(XMLParser *x)
    {
    }

    void
    xmldataentity(XMLParser *x, const char *d, size_t dl)
    {
        char buf[16];
        int len;

        /* try to translate entity, else just pass as data to
         * xmldata handler. */
        if ((len = xml_entitytostr(d, buf, sizeof(buf))) > 0)
            xmldata(x, buf, (size_t)len);
        else
            xmldata(x, d, dl);
    }

    void
    xmldatastart(XMLParser *x)
    {
    }

    void
    xmltagend(XMLParser *x, const char *t, size_t tl, int isshort)
    {
    }

    void
    xmltagstart(XMLParser *x, const char *t, size_t tl)
    {
    }

    void
    xmltagstartparsed(XMLParser *x, const char *t, size_t tl, int isshort)
    {
    }

    int
    main(void)
    {
        XMLParser x = { 0 };

        x.xmlattr = xmlattr;
        x.xmlattrend = xmlattrend;
        x.xmlattrstart = xmlattrstart;
        x.xmlattrentity = xmlattrentity;
        x.xmlcdatastart = xmlcdatastart;
        x.xmlcdata = xmlcdata;
        x.xmlcdataend = xmlcdataend;
        x.xmlcommentstart = xmlcommentstart;
        x.xmlcomment = xmlcomment;
        x.xmlcommentend = xmlcommentend;
        x.xmldata = xmldata;
        x.xmldataend = xmldataend;
        x.xmldataentity = xmldataentity;
        x.xmldatastart = xmldatastart;
        x.xmltagend = xmltagend;
        x.xmltagstart = xmltagstart;
        x.xmltagstartparsed = xmltagstartparsed;

        x.getnext = getchar;

        xml_parse(&x);

        return 0;
    }

As you can see the important functions of the parser itself are xml\_parse()
and xml\_entitytostr().

XMLParser is a structure holding the parser context; it contains pointers to
the callback functions.

This is a verbose example. All the callbacks that are unused could be removed.
If a callback is set to NULL then it is unused.
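
As an illustration of how small a program can get once the unused callbacks
are dropped, the sketch below keeps a single handler that counts tags named
"item". The counter nitems and the tag name are made up for this example, and
the typedef is only a stand-in so the fragment compiles without xml.h.

```c
#include <string.h> /* memcmp */
#include <stddef.h> /* size_t */

/* Stand-in for the type from xml.h so this sketch compiles on its own;
 * in a real program use #include "xml.h" instead. */
typedef struct xmlparser XMLParser;

/* Hypothetical counter: number of <item> start tags seen so far. */
static size_t nitems;

/* The only handler this program needs. The tag name comes with an explicit
 * length, so compare the length first and then the bytes. */
void
xmltagstart(XMLParser *x, const char *t, size_t tl)
{
    (void)x;
    if (tl == 4 && !memcmp(t, "item", 4))
        nitems++;
}
```

The matching main() would be the one from skeleton.c with only the
x.xmltagstart and x.getnext assignments left before the xml\_parse(&x) call,
followed by printing nitems.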


# References

* AFL (American fuzzy lop): afl-fuzz: <https://lcamtuf.coredump.cx/afl/>.
* iconv: character-set conversion:
  <https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html>.
* sfeed: <https://codemadness.org/sfeed.html>.
* libexpat XML parser: <https://github.com/libexpat/libexpat/>.
* libxml2 XML parser: <https://github.com/GNOME/libxml2>.


# End

I hope this write-up is useful or that xml.{c,h} can be useful in your project.