XML parser
----------

A small XML parser.


Dependencies
------------

- C compiler (ANSI).


Features
--------

- Relatively small parser.
- Simple API using callback functions.
- Fast.
- Portable.
- No dynamic memory allocation.

Supports
--------

- Tags in short-form (<img src="lolcat.jpg" title="Meow" />).
- Tag attributes.
- Short attributes without an explicitly set value (<input type="checkbox" checked />).
- Comments.
- CDATA sections.
- Helper function (xml_entitytostr) to convert XML 1.0 / HTML 2.0 named entities
  and numeric entities to UTF-8.
- Reading XML from a file descriptor, string buffer or any custom reader:
  see XMLParser.getnext or the GETNEXT() macro.

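For numeric entities, the conversion mentioned above comes down to encoding a
Unicode code point as UTF-8. The following is a minimal sketch of that idea
and not the code behind xml_entitytostr: the names codepoint_to_utf8 and
numeric_entity_to_utf8 are hypothetical, and the parsing is deliberately
lenient.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of xml.h: encode the code point cp as UTF-8
 * into buf (room for at least 5 bytes). Returns the number of bytes written,
 * not counting the terminating NUL, or 0 for surrogates and out-of-range
 * values. */
size_t
codepoint_to_utf8(unsigned long cp, char *buf)
{
	size_t n;

	if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
		return 0;

	if (cp < 0x80UL) {
		buf[0] = (char)cp;
		n = 1;
	} else if (cp < 0x800UL) {
		buf[0] = (char)(0xC0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3F));
		n = 2;
	} else if (cp < 0x10000UL) {
		buf[0] = (char)(0xE0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (char)(0x80 | (cp & 0x3F));
		n = 3;
	} else {
		buf[0] = (char)(0xF0 | (cp >> 18));
		buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (char)(0x80 | (cp & 0x3F));
		n = 4;
	}
	buf[n] = '\0';
	return n;
}

/* Hypothetical helper: convert a numeric entity such as "&#8364;" or
 * "&#x20AC;" to UTF-8. Returns the number of bytes written or 0 if the
 * input is not a valid numeric entity. */
size_t
numeric_entity_to_utf8(const char *e, char *buf)
{
	unsigned long cp;
	char *end;
	int base = 10;

	if (e[0] != '&' || e[1] != '#')
		return 0;
	e += 2;
	if (*e == 'x' || *e == 'X') {
		base = 16;
		e++;
	}
	/* reject signs and white-space that strtoul() would accept */
	if (!isxdigit((unsigned char)*e))
		return 0;
	cp = strtoul(e, &end, base);
	if (end == e || *end != ';')
		return 0;
	return codepoint_to_utf8(cp, buf);
}
```

Named entities (&amp;, &lt; and so on) need a lookup table instead and are
left out of this sketch.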
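A custom reader for a string buffer could look roughly like the sketch below.
It assumes the reader callback takes no arguments and returns the next byte,
or EOF at the end of input, similar to getchar(). That shape is an assumption
and should be checked against xml.h. The names strreader_set and
strreader_getnext are hypothetical.

```c
#include <stddef.h>
#include <stdio.h> /* EOF */

/* State of the hypothetical string reader. It lives at file scope because
 * the assumed reader signature takes no arguments. */
static const char *reader_buf;
static size_t reader_len, reader_pos;

/* Point the reader at a buffer of len bytes and rewind it. */
void
strreader_set(const char *s, size_t len)
{
	reader_buf = s;
	reader_len = len;
	reader_pos = 0;
}

/* Return the next byte as an unsigned char value, or EOF when the buffer
 * is exhausted. Keeps returning EOF on further calls. */
int
strreader_getnext(void)
{
	if (reader_pos >= reader_len)
		return EOF;
	return (unsigned char)reader_buf[reader_pos++];
}
```

With the real parser such a function would presumably be assigned to
XMLParser.getnext, or substituted through the GETNEXT() macro, before parsing
starts.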
Design choices and scope
------------------------

- Compliance: it is not a fully compliant XML parser, but it supports reading
  XML data for many practical use-cases, for example:
  - RSS reader (sfeed).
  - HTML to plain-text converter (webdump).
  - HTML extractor for websites (idiotbox/tscrape/frontends).
- The XML data is not checked for errors, so the parser keeps going on
  malformed input. However, the parser should not crash, hang, etc.
- Performance: data is buffered even if a handler is not set. To make parsing
  faster, change this code in xml.c.
- Internally fixed-size buffers are used. Callbacks like XMLParser.xmldata are
  called multiple times for the same tag if the data size is bigger than the
  internal buffer size (sizeof(XMLParser.data)). To differentiate between new
  calls for data the xml*start and xml*end handlers can be used.
- It does not handle XML white-space rules for tag data. The raw values
  including white-space are passed. This is useful in some cases, like for
  parsing HTML <pre> tags.
- The XML specification has no limits on tag and attribute names. For the
  sake of simplicity and sanity this XML parser takes some liberties: tag and
  attribute names are truncated if they are excessively long.
- Security: entity expansions are not handled (expanding them can enable a
  "billion laughs" attack).
- DOCTYPE, ATTLIST or DTD declarations are ignored.
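
Because data can arrive in several chunks, a consumer usually collects the
chunks between the start and end handlers. The sketch below shows that pattern
in isolation. It does not include xml.h, and the handler signatures are
simplified assumptions (the real handlers are declared in xml.h and may take
additional arguments). The names on_datastart, on_data and on_dataend are
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical accumulator for the text of one element. */
static char text[256];
static size_t textlen;
static int textdone;

/* Start handler: called once before the first chunk, so reset the state. */
void
on_datastart(void)
{
	textlen = 0;
	text[0] = '\0';
	textdone = 0;
}

/* Data handler: may be called several times for the same element. Each
 * call appends one chunk. Input that does not fit is dropped instead of
 * overflowing the buffer. */
void
on_data(const char *d, size_t len)
{
	size_t room = sizeof(text) - 1 - textlen;

	if (len > room)
		len = room;
	memcpy(text + textlen, d, len);
	textlen += len;
	text[textlen] = '\0';
}

/* End handler: the collected text is complete and can be processed. */
void
on_dataend(void)
{
	textdone = 1;
}
```

Processing the text only in the end handler avoids acting on a partial value.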
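Since raw white-space is passed through, a consumer that wants normalized
text has to do that itself. One possible approach, sketched here with the
hypothetical helper collapse_ws, is to collapse each run of white-space into
a single space once the complete text has been collected. This is not part of
the parser.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of xml.h: collapse each run of white-space
 * in s into one space and trim both ends, in place. Returns the new
 * length. */
size_t
collapse_ws(char *s)
{
	size_t r, w = 0;
	int pending = 0; /* white-space seen after some output was written */

	for (r = 0; s[r] != '\0'; r++) {
		if (isspace((unsigned char)s[r])) {
			pending = (w > 0);
		} else {
			if (pending)
				s[w++] = ' ';
			pending = 0;
			s[w++] = s[r];
		}
	}
	s[w] = '\0';
	return w;
}
```

For content such as HTML <pre> blocks the text would be left untouched
instead.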
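The truncation of over-long tag and attribute names can be pictured as a
bounded append into a fixed-size buffer. The sketch below illustrates the
idea only. It is not code from xml.c, and struct namebuf, NAMEBUF_SIZE and
the function names are hypothetical (the size is kept small so the truncation
is easy to see).

```c
#include <stddef.h>

#define NAMEBUF_SIZE 16 /* hypothetical size, chosen for illustration */

/* Fixed-size name buffer, so no dynamic memory allocation is needed. */
struct namebuf {
	char s[NAMEBUF_SIZE];
	size_t len;
};

/* Start a new, empty name. */
void
namebuf_reset(struct namebuf *n)
{
	n->len = 0;
	n->s[0] = '\0';
}

/* Append one byte. When the buffer is full the byte is silently dropped,
 * which truncates the name while keeping it NUL-terminated. */
void
namebuf_add(struct namebuf *n, int c)
{
	if (n->len + 1 < sizeof(n->s)) {
		n->s[n->len++] = (char)c;
		n->s[n->len] = '\0';
	}
}
```

A consumer comparing tag names should keep in mind that two different long
names can compare equal after truncation.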

Files used
----------

xml.c and xml.h


Interface / API
---------------

The API should be trivial to use: see xml.c, xml.h and the examples below.


Examples
--------

See skeleton.c for a base program to start quickly.


License
-------

ISC, see LICENSE file.