1 1<- Back / codemadness.org 70
2 i codemadness.org 70
3 i codemadness.org 70
4 i# xml.c and xml.h: XML parser codemadness.org 70
5 i codemadness.org 70
6 iLast modification on 2023-11-20 codemadness.org 70
7 i codemadness.org 70
8 i## Why codemadness.org 70
9 i codemadness.org 70
10 iThis XML parser was first developed for use with my RSS/Atom parser codemadness.org 70
11 hsfeed. URL:https://codemadness.org/sfeed.html codemadness.org 70
12 i codemadness.org 70
13 iThe first few versions of sfeed did not have a real XML parser: they just codemadness.org 70
14 idid a simple string search for the XML tag names. codemadness.org 70
15 i codemadness.org 70
16 iThen I changed it to use libexpat. One of the issues I ran into with expat is codemadness.org 70
17 ithat it parses XML in a strict mode, while some RSS/Atom feeds have quirks or codemadness.org 70
18 ismall XML errors. Over time libexpat also had many security vulnerabilities. codemadness.org 70
19 iSome examples: codemadness.org 70
20 i codemadness.org 70
21 h* https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat URL:https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=expat codemadness.org 70
22 i* Section "security fixes" in the Changes file: codemadness.org 70
23 h https://github.com/libexpat/libexpat/blob/master/expat/Changes URL:https://github.com/libexpat/libexpat/blob/master/expat/Changes codemadness.org 70
24 i codemadness.org 70
25 iOther reasons are portability and reducing the number of dependencies. I also codemadness.org 70
26 iwant to fully understand what most parts of my programs are doing and to keep codemadness.org 70
27 ithem relatively simple. codemadness.org 70
28 i codemadness.org 70
29 iI have used this XML parser in my projects for some time now and there are also codemadness.org 70
30 imany written tests. Fuzzing was used to test strange pseudo-random input data. codemadness.org 70
31 hFor fuzzing the tool »afl-fuzz« was used. URL:https://lcamtuf.coredump.cx/afl/ codemadness.org 70
32 i codemadness.org 70
33 iIt was tested on different platforms which have different characteristics. codemadness.org 70
34 iBut of course I'm only human and it will still have bugs: please report them! codemadness.org 70
35 i codemadness.org 70
36 i codemadness.org 70
37 i## What is it good for? codemadness.org 70
38 i codemadness.org 70
39 h... absolutely nothing. URL:https://www.youtube.com/embed/hZJRJpbGkG4 codemadness.org 70
40 iJust kidding, as mentioned some of my projects use it: codemadness.org 70
41 i codemadness.org 70
42 h* sfeed: URL:https://codemadness.org/sfeed.html codemadness.org 70
43 i It is used to parse RSS/Atom for newsfeeds. codemadness.org 70
44 i codemadness.org 70
45 h Repository: »https://git.codemadness.org/sfeed/« URL:https://git.codemadness.org/sfeed/ codemadness.org 70
46 i codemadness.org 70
47 i* osm-zipcodes: codemadness.org 70
48 i A project to extract Dutch zipcodes and addresses and their latitude and codemadness.org 70
49 i longitude from the .osm XML. codemadness.org 70
50 i The code is quite ugly and it uses mmap() as a reader and ugly hacks codemadness.org 70
51 i to improve the speed of parsing the XML. codemadness.org 70
52 i codemadness.org 70
53 h Repository: »https://git.codemadness.org/osm-zipcodes/« URL:https://git.codemadness.org/osm-zipcodes/ codemadness.org 70
54 i codemadness.org 70
55 i* webdump: codemadness.org 70
56 i It is used to parse HTML/XHTML. It has some modifications to handle codemadness.org 70
57 i HTML and a list of the many HTML named entities was added. codemadness.org 70
58 i codemadness.org 70
59 h Repository: »https://git.codemadness.org/webdump/« URL:https://git.codemadness.org/webdump/ codemadness.org 70
60 i codemadness.org 70
61 h* Youtube HTML parser and front-ends: URL:https://codemadness.org/idiotbox.html codemadness.org 70
62 i It is used to parse HTML and extract the relevant JSON meta-data from the codemadness.org 70
63 i page. The Youtube HTML is (intentionally by Google) crapified auto-generated codemadness.org 70
64 i HTML. I guess it is a good benchmark for the crappy webworld we live in codemadness.org 70
65 i today :) codemadness.org 70
66 i codemadness.org 70
67 h Repository: »https://git.codemadness.org/frontends/« in the "youtube/" URL:https://git.codemadness.org/frontends/ codemadness.org 70
68 i directory. codemadness.org 70
69 h Also a link to my JSON parser: »https://codemadness.org/json2tsv.html« URL:https://codemadness.org/json2tsv.html codemadness.org 70
70 i codemadness.org 70
71 i* Dutch BAG Kadaster parser (extract/subset): codemadness.org 70
72 i "Basisregistratie Adressen en Gebouwen (BAG)". Translated from Dutch to English codemadness.org 70
73 i this means something like: "Base registration of addresses and buildings". codemadness.org 70
74 i codemadness.org 70
75 i It parses the publicly available Dutch BAG Kadaster XML files. It is used to codemadness.org 70
76 i extract the summary of BAG information of an address: in particular the address, codemadness.org 70
77 i the allocated purpose of the address and the area size (in square meters). codemadness.org 70
78 i codemadness.org 70
79 h Repository: »https://git.codemadness.org/bag/« URL:https://git.codemadness.org/bag/ codemadness.org 70
80 i codemadness.org 70
81 h BAG Kadaster XML data source: »https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract« URL:https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract codemadness.org 70
82 i codemadness.org 70
83 i codemadness.org 70
84 i## Features codemadness.org 70
85 i codemadness.org 70
86 i* Relatively small parser, not that many lines of code to understand. codemadness.org 70
87 i* Pretty simple and easy to use callback-based API. codemadness.org 70
88 i* Pretty fast. codemadness.org 70
89 i* Portable and with few dependencies: it can be compiled with an ANSI C89 codemadness.org 70
90 i compiler and works on many platforms and compilers. codemadness.org 70
91 i* No dynamic memory allocation. codemadness.org 70
92 i* Suitable for low-resource environments. codemadness.org 70
93 i codemadness.org 70
94 i codemadness.org 70
95 i## Supports codemadness.org 70
96 i codemadness.org 70
97 i* Tags in short-form (<img src="lolcat.jpg" title="Meow" />). codemadness.org 70
98 i* Tag attributes. codemadness.org 70
99 i* Short attributes without an explicitly set value (<input type="checkbox" checked />). codemadness.org 70
100 i* Comments. codemadness.org 70
101 i* CDATA sections. codemadness.org 70
102 i* Helper function (xml_entitytostr) to convert XML 1.0 / HTML 2.0 named codemadness.org 70
103 i entities and numeric entities to UTF-8. codemadness.org 70
104 i* Reading XML from a file descriptor, mmap or string buffer, or implementing a codemadness.org 70
105 i custom reader: see XMLParser.getnext or the GETNEXT() macro. codemadness.org 70
106 i The reader function can be easily customized. It is expected to return the codemadness.org 70
107 i next byte, or EOF at the end of the input or on an error. This way you can codemadness.org 70
108 i read using getchar/getchar_unlocked, mmap(), a memory buffer or in many codemadness.org 70
109 i other ways. codemadness.org 70
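
A custom reader for a string buffer could look roughly like the sketch below. It assumes the reader has the same shape as getchar(): no arguments, returning the next byte or EOF. The names rd_buf, rd_len, rd_off, setreader and readbyte are made up for this illustration and are not part of xml.c or xml.h.

```c
#include <stdio.h>  /* EOF */
#include <stddef.h> /* size_t */

/* Reader state. It lives in file-scope variables because the reader is
 * called without arguments. These names are hypothetical. */
const char *rd_buf;
size_t rd_len, rd_off;

/* Point the reader at a memory buffer of len bytes. */
void
setreader(const char *buf, size_t len)
{
	rd_buf = buf;
	rd_len = len;
	rd_off = 0;
}

/* Return the next byte as an unsigned char value, or EOF when the
 * buffer is exhausted. It keeps returning EOF on further calls. */
int
readbyte(void)
{
	if (rd_off >= rd_len)
		return EOF;
	return (unsigned char)rd_buf[rd_off++];
}
```

With the parser this would be hooked up by calling setreader() first and then assigning readbyte to XMLParser.getnext instead of getchar. The cast to unsigned char matters: without it a byte such as 0xff could be mistaken for EOF on platforms where char is signed.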
110 i codemadness.org 70
111 i codemadness.org 70
112 i## Parser design decisions codemadness.org 70
113 i codemadness.org 70
114 i* It supports parsing a subset of XML: codemadness.org 70
115 i It is not a fully compliant XML parser. codemadness.org 70
116 i* There is no direct support for namespaces. For example a tag "ns:sometag" is codemadness.org 70
117 i just parsed as the tag name "ns:sometag". codemadness.org 70
118 i* External DTDs are not resolved or loaded for parsing the XML data. codemadness.org 70
119 i This is for security and simplicity reasons. codemadness.org 70
120 i* Entity expansions are not parsed, and neither are DOCTYPE, ATTLIST etc. codemadness.org 70
121 i It is not possible to define or redefine entities. This prevents XML entity codemadness.org 70
122 i bombs such as the "billion laughs attack": codemadness.org 70
123 h »https://en.wikipedia.org/wiki/Billion_laughs_attack« and URL:https://en.wikipedia.org/wiki/Billion_laughs_attack codemadness.org 70
124 h https://en.wikipedia.org/wiki/XML_external_entity_attack. URL:https://en.wikipedia.org/wiki/XML_external_entity_attack codemadness.org 70
125 i* There is no character-decoding for the input. It is assumed to be UTF-8 codemadness.org 70
126 i compatible. The data can be decoded or translated to UTF-8 before parsing codemadness.org 70
127 i it, for example using iconv: codemadness.org 70
128 h https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html. URL:https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html codemadness.org 70
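
Translating the input to UTF-8 before parsing can also be done inside a program. The sketch below uses the POSIX iconv(3) library functions (iconv_open, iconv, iconv_close) rather than the iconv utility linked above, and it assumes the input is ISO-8859-1. The function name latin1toutf8 is made up for this illustration.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert len bytes of ISO-8859-1 text in src to UTF-8 in dst.
 * Returns the number of bytes written to dst, or (size_t)-1 on error,
 * for example when dst is too small. Hypothetical helper. */
size_t
latin1toutf8(const char *src, size_t len, char *dst, size_t dstsiz)
{
	iconv_t cd;
	char *in = (char *)src; /* iconv() takes a non-const pointer */
	char *out = dst;
	size_t inleft = len, outleft = dstsiz, r;

	if ((cd = iconv_open("UTF-8", "ISO-8859-1")) == (iconv_t)-1)
		return (size_t)-1;
	r = iconv(cd, &in, &inleft, &out, &outleft);
	iconv_close(cd);
	if (r == (size_t)-1)
		return (size_t)-1;
	return dstsiz - outleft;
}
```

The converted buffer can then be handed to the parser through a reader function. On some systems iconv is a separate library and the program has to be linked with -liconv.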
129 i codemadness.org 70
130 i codemadness.org 70
131 i## Trade-offs codemadness.org 70
132 i codemadness.org 70
133 iThese are trade-offs and can be considered cons: codemadness.org 70
134 i codemadness.org 70
135 i* Performance: data is buffered even if a handler is not set. To make the codemadness.org 70
136 i parsing faster you can change this code in xml.c if necessary. codemadness.org 70
137 i* The XML is not checked for errors, so the parser keeps going on malformed XML codemadness.org 70
138 i data. This is by design. codemadness.org 70
139 i* Internally fixed-size buffers are used. Callbacks like XMLParser.xmldata are codemadness.org 70
140 i called multiple times for the same tag if the data size is bigger than the codemadness.org 70
141 i internal buffer size (sizeof(XMLParser.data)). To differentiate between new codemadness.org 70
142 i calls for data you can use the xml*start and xml*end handlers. codemadness.org 70
143 i* It does not handle XML white-space rules for tag data. The raw values codemadness.org 70
144 i including white-space are passed. This is useful in some cases, like for codemadness.org 70
145 i HTML <pre> tags. codemadness.org 70
146 i* The XML specification has no limits on tag and attribute names. For the codemadness.org 70
147 i sake of simplicity and sanity this XML parser takes some liberties. Tag and codemadness.org 70
148 i attribute names are truncated if they are excessively long. codemadness.org 70
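
Because the data callback can be called several times for one run of data, a handler that needs the whole value has to collect the chunks itself. A minimal sketch, assuming the xmldatastart, xmldata and xmldataend handlers are hooked up as in the example program further down. The forward declaration of XMLParser is only a stand-in so the sketch compiles without xml.h, and databuf and datalen are made-up names.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for #include "xml.h" so this sketch compiles on its own. */
typedef struct xmlparser XMLParser;

char databuf[256]; /* collected data, always NUL-terminated */
size_t datalen;

/* A new run of data begins: reset the buffer. */
void
xmldatastart(XMLParser *x)
{
	(void)x;
	datalen = 0;
	databuf[0] = '\0';
}

/* May be called several times for one run: append each chunk.
 * Data that does not fit in databuf is dropped. */
void
xmldata(XMLParser *x, const char *d, size_t dl)
{
	size_t avail = sizeof(databuf) - 1 - datalen;

	(void)x;
	if (dl > avail)
		dl = avail;
	memcpy(databuf + datalen, d, dl);
	datalen += dl;
	databuf[datalen] = '\0';
}

/* The run is complete: databuf now holds the whole (possibly
 * truncated) value and can be processed here. */
void
xmldataend(XMLParser *x)
{
	(void)x;
}
```

Truncating at a fixed size keeps with the no-allocation approach of the parser. A program that needs arbitrarily long values would have to grow the buffer instead.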
149 i codemadness.org 70
150 i codemadness.org 70
151 i## Clone codemadness.org 70
152 i codemadness.org 70
153 i    git clone git://git.codemadness.org/xmlparser codemadness.org 70
154 i codemadness.org 70
155 i codemadness.org 70
156 i## Browse codemadness.org 70
157 i codemadness.org 70
158 iYou can browse the source-code at: codemadness.org 70
159 i codemadness.org 70
160 h* https://git.codemadness.org/xmlparser/ URL:https://git.codemadness.org/xmlparser/ codemadness.org 70
161 1* gopher://codemadness.org/1/git/xmlparser /git/xmlparser codemadness.org 70
162 i codemadness.org 70
163 i codemadness.org 70
164 i## Example program codemadness.org 70
165 i codemadness.org 70
166 iThis is from skeleton.c in the repository. It can be used as a template file codemadness.org 70
167 ito quickly create a small program that parses XML. codemadness.org 70
168 i codemadness.org 70
169 1From: »https://git.codemadness.org/xmlparser/file/skeleton.c.html« /git/xmlparser/file/skeleton.c.gph codemadness.org 70
170 i codemadness.org 70
171 i    #include <stdio.h> codemadness.org 70
172 i codemadness.org 70
173 i    #include "xml.h" codemadness.org 70
174 i codemadness.org 70
175 i    void codemadness.org 70
176 i    xmlattr(XMLParser *x, const char *t, size_t tl, codemadness.org 70
177 i    	const char *a, size_t al, const char *v, size_t vl) codemadness.org 70
178 i    { codemadness.org 70
179 i    } codemadness.org 70
180 i codemadness.org 70
181 i    void codemadness.org 70
182 i    xmlattrentity(XMLParser *x, const char *t, size_t tl, codemadness.org 70
183 i    	const char *a, size_t al, const char *v, size_t vl) codemadness.org 70
184 i    { codemadness.org 70
185 i    	char buf[16]; codemadness.org 70
186 i    	int len; codemadness.org 70
187 i codemadness.org 70
188 i    	/* try to translate entity, else just pass as data to codemadness.org 70
189 i    	 * xmlattr handler. */ codemadness.org 70
190 i    	if ((len = xml_entitytostr(v, buf, sizeof(buf))) > 0) codemadness.org 70
191 i    		xmlattr(x, t, tl, a, al, buf, (size_t)len); codemadness.org 70
192 i    	else codemadness.org 70
193 i    		xmlattr(x, t, tl, a, al, v, vl); codemadness.org 70
194 i    } codemadness.org 70
195 i codemadness.org 70
196 i    void codemadness.org 70
197 i    xmlattrend(XMLParser *x, const char *t, size_t tl, codemadness.org 70
198 i    	const char *a, size_t al) codemadness.org 70
199 i    { codemadness.org 70
200 i    } codemadness.org 70
201 i codemadness.org 70
202 i    void codemadness.org 70
203 i    xmlattrstart(XMLParser *x, const char *t, size_t tl, codemadness.org 70
204 i    	const char *a, size_t al) codemadness.org 70
205 i    { codemadness.org 70
206 i    } codemadness.org 70
207 i codemadness.org 70
208 i    void codemadness.org 70
209 i    xmlcdatastart(XMLParser *x) codemadness.org 70
210 i    { codemadness.org 70
211 i    } codemadness.org 70
212 i codemadness.org 70
213 i    void codemadness.org 70
214 i    xmlcdata(XMLParser *x, const char *d, size_t dl) codemadness.org 70
215 i    { codemadness.org 70
216 i    } codemadness.org 70
217 i codemadness.org 70
218 i    void codemadness.org 70
219 i    xmlcdataend(XMLParser *x) codemadness.org 70
220 i    { codemadness.org 70
221 i    } codemadness.org 70
222 i codemadness.org 70
223 i    void codemadness.org 70
224 i    xmlcommentstart(XMLParser *x) codemadness.org 70
225 i    { codemadness.org 70
226 i    } codemadness.org 70
227 i codemadness.org 70
228 i    void codemadness.org 70
229 i    xmlcomment(XMLParser *x, const char *c, size_t cl) codemadness.org 70
230 i    { codemadness.org 70
231 i    } codemadness.org 70
232 i codemadness.org 70
233 i    void codemadness.org 70
234 i    xmlcommentend(XMLParser *x) codemadness.org 70
235 i    { codemadness.org 70
236 i    } codemadness.org 70
237 i codemadness.org 70
238 i    void codemadness.org 70
239 i    xmldata(XMLParser *x, const char *d, size_t dl) codemadness.org 70
240 i    { codemadness.org 70
241 i    } codemadness.org 70
242 i codemadness.org 70
243 i    void codemadness.org 70
244 i    xmldataend(XMLParser *x) codemadness.org 70
245 i    { codemadness.org 70
246 i    } codemadness.org 70
247 i codemadness.org 70
248 i    void codemadness.org 70
249 i    xmldataentity(XMLParser *x, const char *d, size_t dl) codemadness.org 70
250 i    { codemadness.org 70
251 i    	char buf[16]; codemadness.org 70
252 i    	int len; codemadness.org 70
253 i codemadness.org 70
254 i    	/* try to translate entity, else just pass as data to codemadness.org 70
255 i    	 * xmldata handler. */ codemadness.org 70
256 i    	if ((len = xml_entitytostr(d, buf, sizeof(buf))) > 0) codemadness.org 70
257 i    		xmldata(x, buf, (size_t)len); codemadness.org 70
258 i    	else codemadness.org 70
259 i    		xmldata(x, d, dl); codemadness.org 70
260 i    } codemadness.org 70
261 i codemadness.org 70
262 i    void codemadness.org 70
263 i    xmldatastart(XMLParser *x) codemadness.org 70
264 i    { codemadness.org 70
265 i    } codemadness.org 70
266 i codemadness.org 70
267 i    void codemadness.org 70
268 i    xmltagend(XMLParser *x, const char *t, size_t tl, int isshort) codemadness.org 70
269 i    { codemadness.org 70
270 i    } codemadness.org 70
271 i codemadness.org 70
272 i    void codemadness.org 70
273 i    xmltagstart(XMLParser *x, const char *t, size_t tl) codemadness.org 70
274 i    { codemadness.org 70
275 i    } codemadness.org 70
276 i codemadness.org 70
277 i    void codemadness.org 70
278 i    xmltagstartparsed(XMLParser *x, const char *t, size_t tl, int isshort) codemadness.org 70
279 i    { codemadness.org 70
280 i    } codemadness.org 70
281 i codemadness.org 70
282 i    int codemadness.org 70
283 i    main(void) codemadness.org 70
284 i    { codemadness.org 70
285 i    	XMLParser x = { 0 }; codemadness.org 70
286 i codemadness.org 70
287 i    	x.xmlattr = xmlattr; codemadness.org 70
288 i    	x.xmlattrend = xmlattrend; codemadness.org 70
289 i    	x.xmlattrstart = xmlattrstart; codemadness.org 70
290 i    	x.xmlattrentity = xmlattrentity; codemadness.org 70
291 i    	x.xmlcdatastart = xmlcdatastart; codemadness.org 70
292 i    	x.xmlcdata = xmlcdata; codemadness.org 70
293 i    	x.xmlcdataend = xmlcdataend; codemadness.org 70
294 i    	x.xmlcommentstart = xmlcommentstart; codemadness.org 70
295 i    	x.xmlcomment = xmlcomment; codemadness.org 70
296 i    	x.xmlcommentend = xmlcommentend; codemadness.org 70
297 i    	x.xmldata = xmldata; codemadness.org 70
298 i    	x.xmldataend = xmldataend; codemadness.org 70
299 i    	x.xmldataentity = xmldataentity; codemadness.org 70
300 i    	x.xmldatastart = xmldatastart; codemadness.org 70
301 i    	x.xmltagend = xmltagend; codemadness.org 70
302 i    	x.xmltagstart = xmltagstart; codemadness.org 70
303 i    	x.xmltagstartparsed = xmltagstartparsed; codemadness.org 70
304 i codemadness.org 70
305 i    	x.getnext = getchar; codemadness.org 70
306 i codemadness.org 70
307 i    	xml_parse(&x); codemadness.org 70
308 i codemadness.org 70
309 i    	return 0; codemadness.org 70
310 i    } codemadness.org 70
311 i codemadness.org 70
312 iAs you can see the important functions of the parser itself are xml_parse() codemadness.org 70
313 iand xml_entitytostr(). codemadness.org 70
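
For a numeric entity such as &#8364; the conversion to UTF-8 comes down to encoding one Unicode code point as 1 to 4 bytes. The sketch below shows only that encoding step. It is an independent illustration and not the code from xml.c, and the name codepointtoutf8 is made up.

```c
#include <stddef.h>

/* Encode Unicode code point cp as UTF-8 into buf, which must have room
 * for at least 4 bytes. Returns the number of bytes written, or 0 if
 * cp is not a valid code point. Hypothetical helper. */
size_t
codepointtoutf8(unsigned long cp, char *buf)
{
	if (cp < 0x80) {
		buf[0] = (char)cp;
		return 1;
	} else if (cp < 0x800) {
		buf[0] = (char)(0xc0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3f));
		return 2;
	} else if (cp < 0x10000) {
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0; /* UTF-16 surrogates are not valid here */
		buf[0] = (char)(0xe0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (char)(0x80 | (cp & 0x3f));
		return 3;
	} else if (cp <= 0x10ffff) {
		buf[0] = (char)(0xf0 | (cp >> 18));
		buf[1] = (char)(0x80 | ((cp >> 12) & 0x3f));
		buf[2] = (char)(0x80 | ((cp >> 6) & 0x3f));
		buf[3] = (char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0; /* beyond the Unicode range */
}
```

The 16-byte buffers in the entity handlers of the example program leave ample room for a single encoded character of at most 4 bytes.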
314 i codemadness.org 70
315 iXMLParser is the context structure and it contains pointers to the codemadness.org 70
316 icallback functions. codemadness.org 70
317 i codemadness.org 70
318 iThis is a verbose example. All the callbacks that are unused could be removed: codemadness.org 70
319 ia callback that is set to NULL is not called. codemadness.org 70
320 i codemadness.org 70
321 i codemadness.org 70
322 i# References codemadness.org 70
323 i codemadness.org 70
324 h* AFL (American fuzzy lop): afl-fuzz: »https://lcamtuf.coredump.cx/afl/«. URL:https://lcamtuf.coredump.cx/afl/ codemadness.org 70
325 i* iconv: character-set conversion. codemadness.org 70
326 h* sfeed: »https://codemadness.org/sfeed.html«. URL:https://codemadness.org/sfeed.html codemadness.org 70
327 h* libexpat XML parser: »https://github.com/libexpat/libexpat/«. URL:https://github.com/libexpat/libexpat/ codemadness.org 70
328 h* libxml2 XML parser: »https://github.com/GNOME/libxml2«. URL:https://github.com/GNOME/libxml2 codemadness.org 70
329 i codemadness.org 70
330 i codemadness.org 70
331 i# End codemadness.org 70
332 i codemadness.org 70
333 iI hope this write-up is useful and that xml.{c,h} can be useful in your project. codemadness.org 70
334 .