CROSSBOW-FORMAT(5) File Formats Manual (urm) CROSSBOW-FORMAT(5)
NAME
crossbow-format - format string reference
DESCRIPTION
The crossbow(1) feed aggregator processes each new entry by means of a
handler, according to the feed settings in the crossbow.conf(5)
configuration file.
The print handler prints a textual representation of an entry on
stdout(3). The optional format setting can define a template to be used
in place of the default. The exec and pipe handlers process individual
entries by invoking an external program. They both require the
definition of a command setting, to expand into the argument vector of
the invoked subprocess.
The values of format and command are interpreted as format strings: they
are allowed to contain placeholders that are replaced with the fields of
the processed entry. All the available placeholders are listed below
(see Supported placeholders).
The placeholder syntax resembles that of the printf(3) function, with
some differences and simplifications:
• The "%" character can only be followed by one or more alphabetic
  characters. There is no support for field width or other modifiers.
• A raw "%" character is represented with "\%" instead of "%%".
• The backslash character "\" escapes the subsequent character:
  - Some escape sequences are recognized as in the C programming
    language: "\n" (new line), "\r" (carriage return), "\t"
    (horizontal tab) and "\\" (literal backslash).
  - If the handler is exec or pipe, the "\ " string is interpreted as
    a literal space character, as opposed to a non-escaped white-space,
    which is interpreted as an argument separator. Escaping
    white-spaces has no effect if the handler is print, since in this
    case they do not have any special meaning.
  - The "\:" string is interpreted as a zero-width break (see
    Zero-width break below).
Argument vector expansion
In the context of the exec or pipe handlers, the argument vector of a
subprocess is constructed from the command setting.
The value is split on white-spaces into an intermediate array of tokens.
When a new entry is processed, each element of the intermediate array is
further evaluated for placeholder expansion. The obtained array is then
used as argument vector for the subprocess. The first of its elements
determines the command to execute.
Details:
• White-spaces are defined as in isspace(3).
• Since the tokenization happens before the placeholder expansion, an
  expanded field can never be part of two arguments, even if it
  contains white-spaces.
• Multiple white-spaces are considered part of the same delimiter. If
  needed, it is still possible to pass an empty argument by using a
  zero-width break (see Zero-width break below).
• The subprocess is invoked via execvp(3). The PATH environment
  variable is honoured.
• It is not possible to violate the convention by which the value of
  'argv[0]' corresponds to the name of the executable.
• For security reasons the command line is not parsed by a shell
  interpreter (see Security below). Special operators, such as input
  redirection, are therefore not available.
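The two-step procedure described above (tokenization first, placeholder
expansion second) can be sketched in Python as follows. This is an
illustrative approximation, not the actual crossbow(1) implementation:
the function names, the sample field values, and the choice of expanding
unknown placeholders to an empty string are assumptions made for the
example.

```python
# Characters matched by isspace(3) in the C locale.
C_SPACES = " \t\n\v\f\r"
C_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}

# Hypothetical entry fields, keyed by placeholder name (illustration only).
FIELDS = {"a": "Jane Doe", "t": "Hello, world"}


def tokenize(command):
    """Split a command setting on runs of unescaped white-space.

    Escape sequences are copied verbatim, backslash included, so that
    the expansion step can interpret them later.
    """
    tokens, current, started, i = [], "", False, 0
    while i < len(command):
        ch = command[i]
        if ch == "\\" and i + 1 < len(command):
            current += command[i:i + 2]
            started = True
            i += 2
        elif ch in C_SPACES:
            if started:
                tokens.append(current)
            current, started = "", False
            i += 1
        else:
            current += ch
            started = True
            i += 1
    if started:
        tokens.append(current)
    return tokens


def expand(token, fields):
    """Expand placeholders and escape sequences within a single token."""
    out, i = "", 0
    while i < len(token):
        ch = token[i]
        if ch == "\\" and i + 1 < len(token):
            nxt = token[i + 1]
            if nxt != ":":  # "\:" is the zero-width break: emits nothing
                out += C_ESCAPES.get(nxt, nxt)
            i += 2
        elif ch == "%":
            j = i + 1
            while j < len(token) and token[j].isascii() and token[j].isalpha():
                j += 1
            # Assumption: an unknown placeholder expands to an empty string.
            out += fields.get(token[i + 1:j], "")
            i = j
        else:
            out += ch
            i += 1
    return out


def build_argv(command, fields):
    """Tokenize first, then expand each token independently."""
    return [expand(token, fields) for token in tokenize(command)]
```

With the sample fields, build_argv(r"printf \%s\%s\%s\\n %a \: %t", FIELDS)
yields five arguments: the author stays a single argument although it
contains a space, and "\:" turns into an empty argument.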
Zero-width break
The language recognized by the output format parser allows the
placeholders to be composed of multiple characters. While this feature
makes it easier to have mnemonic placeholders (such as "%a" for "Author"
and "%am" for "Author eMail"), it introduces some additional edge cases.
The zero-width break sequence ("\:") has been introduced to cover a case
of ambiguity which can be easily explained by means of an example.
Let "%x" and "%xn" be valid placeholders. In such a case, obtaining the
expansion of "%x" followed by a literal "n" would be impossible, as the
sequence "%x\n" would be rendered as the expansion of "%x" followed by a
new-line, while "%xn" would be rendered as the expansion of "%xn". Using
the backslash would have worked if "\n" wasn't a recognized escape
sequence.
The zero-width break can be used to force the termination of a
placeholder, so that whatever follows can be interpreted independently
of it.
The behaviour is summarized by the following table, where expand(x)
expresses the expansion of the "%x" placeholder into the string
representation of the corresponding field, and the "." operator expresses
string concatenation.
Format    Expansion           Notes
%x        expand(x)
%xn       expand(xn)
%x\m      expand(x) . "m"     Works because "\m" is not a recognized
                              escape sequence.
%x\n      expand(x) . "\n"
%x\:n     expand(x) . "n"
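The rows of the table can be reproduced with a small renderer. The
following Python sketch is only a model of the documented behaviour, not
the parser used by crossbow(1): the render function, the sample values
assigned to the hypothetical "%x" and "%xn" placeholders, and the rule
that a placeholder name is the longest run of alphabetic characters
after "%" are assumptions derived from the description above.

```python
# Sample values for the hypothetical "%x" and "%xn" placeholders.
FIELDS = {"x": "<x>", "xn": "<xn>"}
ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}


def render(fmt, fields):
    """Render a format string as the print handler is described to do."""
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == "\\" and i + 1 < len(fmt):
            nxt = fmt[i + 1]
            if nxt != ":":  # "\:" is the zero-width break: emits nothing
                # Unrecognized escapes yield the escaped character itself.
                out.append(ESCAPES.get(nxt, nxt))
            i += 2
        elif fmt[i] == "%":
            # The placeholder name extends over all the following letters.
            j = i + 1
            while j < len(fmt) and fmt[j].isascii() and fmt[j].isalpha():
                j += 1
            out.append(fields.get(fmt[i + 1:j], ""))
            i = j
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)


for row in ("%x", "%xn", r"%x\m", r"%x\n", r"%x\:n"):
    print(repr(row), "->", repr(render(row, FIELDS)))
```

The only way to obtain the value of "%x" immediately followed by a
literal "n" is the last row, where the zero-width break ends the
placeholder name before the "n".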
Additionally, although not originally intended for this purpose, the
zero-width break can be used to pass empty strings as subprocess
arguments. This is demonstrated in the following example, where the
configuration passes three arguments to printf(1): the entry author, an
empty string (produced by the zero-width break), and the entry title:
feed foobar
url https://example.conf/feed.xml
handler exec
command printf \%s\%s\%s\\n %a \: %t
Note: the literal "%s" is intended to be interpreted by the printf(1)
command. The corresponding percent character is escaped, since it is
meant to be a literal percent, and not to be expanded by crossbow(1).
Supported placeholders
The following table shows the supported placeholders and the
corresponding entry properties for the RSS and Atom feed formats.
Depending on the feed format, some placeholders may refer to unavailable
entry properties. In that case they are expanded to an empty string,
with some exceptions (see Notes).
Placeholder   RSS                    Atom
%a            author                 author.name
%am           -                      author.email
%au           -                      author.uri
%c            -                      contributor.name
%cm           -                      contributor.email
%cu           -                      contributor.uri
%co           comments               -
%cr           -                      rights
%ct           -                      rights.type
%d            description            summary / content (1)
%dt           -                      summary.type / content.type (1)
%en           enclosure              -
%el           enclosure.length       -
%et           enclosure.type         -
%eu           enclosure.url          -
%g            guid                   id
%gp           guid_isPermaLink (2)   - (3)
%l            link                   link[-1] (4)
%pd           pubDate                published / updated / issued (5)
%sr           source                 -
%su           source_url             -
%t            title                  title
%tt           -                      title.type
Notes:
(1) 'summary' wins over 'content' if both are present. See BUGS.
(2) Boolean, represented with "0" or "1" (as strings).
(3) Boolean, undefined for Atom, therefore always "0".
(4) The Atom standard allows multiple links. Only the last link
listed in the XML is made available. See BUGS.
(5) Only the last one appearing in the XML is made available. See
BUGS.
Some additional placeholders, not referring to any entry field, are also
available:
%fi   The unique name of the feed, as set in the configuration. See
      crossbow.conf(5).
%ft   The feed title, that is 'channel.title' in RSS and
      'feed.title' in Atom.
%n    A per-feed six-digit incremental number. This value is
      initialized to zero for new feeds, and is incremented for
      every new feed entry.
      This is an important security feature: the value of this
      number is not controlled by the feed content, thus it can be
      used safely as a filename. See Security.
      The value is padded with zeros to make lexicographical order
      trivial. The increment happens even if the execution of a
      subprocess fails, so that the same value is never used
      twice.
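The benefit of the zero padding can be seen with a short Python sketch.
The serial function is a hypothetical stand-in for the way "%n" is
rendered; only the six-digit, zero-padded format is taken from the
description above.

```python
def serial(counter):
    """Render a counter as a six-digit, zero-padded string."""
    return f"{counter:06d}"


counters = [0, 9, 10, 123, 99999]

# Zero-padded names sort lexicographically in numeric order.
padded = [serial(n) for n in counters]
print(sorted(padded) == padded)

# Without padding the two orders diverge: "10" sorts before "9".
unpadded = [str(n) for n in counters]
print(sorted(unpadded) == unpadded)
```

Files named after "%n" are therefore listed by tools such as ls(1) in
the order in which the entries were processed.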
Security
Since the exec and pipe handlers process entries by passing parameters to
a subprocess, it is important to keep security in mind when configuring
the corresponding command setting.
1. Inserting an interpreted code snippet within a command template
is strongly discouraged, as it might constitute an easy target
for code injection.
Consider, for example, the following configuration:
feed foobar
url https://example.conf/feed.xml
handler exec
command sh -c echo\ "%t"\ |\ wc\ -c
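The provided command is dangerous in that the entry title,
expanded in place of "%t", might be exploited by a malicious XML.
For example, an entry identified as "inject" could carry the
following title:
    "; echo pwned; echo "
After the expansion, the shell would run the following commands:
    echo ""; echo pwned; echo "" | wc -c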
The correct way of achieving the same result consists in moving
the code into a shell script, and making the entry properties
available to it by means of safe, uninterpreted parameter
passing:
feed foobar
url https://example.conf/feed.xml
handler exec
command /usr/local/bin/count_bytes %t
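The content of count_bytes is not given here. The text suggests a shell
script; for illustration only, the same idea is sketched below in
Python. The count_bytes function and the way the script reads its
argument are assumptions; the point is that the title arrives as an
uninterpreted argument and is never evaluated as code.

```python
#!/usr/bin/env python3
import os
import sys


def count_bytes(title):
    """Return what `echo "$title" | wc -c` would count.

    The title is measured in bytes, plus one for the new-line character
    that echo appends.
    """
    return len(os.fsencode(title)) + 1


# The title is taken verbatim from the first argument of the script.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(count_bytes(sys.argv[1]))
```

Given the malicious title shown earlier, the script merely reports its
length: the embedded quotes and semicolons are ordinary data.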
2. Extra care should be taken when the expansion of a placeholder
directly determines the name of a file. Consider, for example,
the following configuration:
feed foobar
url https://example.conf/feed.xml
handler pipe
command sed -n w%t
This is an effective (yet dangerous) way of dumping the entry
contents into a file named after the entry title. A specially
crafted XML can exploit a similar configuration to attempt the
replacement of sensitive files, for example by means of entries
like the following:
Identifier       Title            Content
shenanigans      .bashrc          echo pwned
shenanigans_1    ../.bashrc       echo pwned
shenanigans_2    ../../.bashrc    echo pwned
...
The correct way of achieving the same result consists in using
the "%n" placeholder in place of "%t", obtaining a safer
(although admittedly less descriptive) file naming.
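As a complement, the command invoked by the pipe handler can itself
refuse any file name that is not a plain number. The Python sketch below
is a hypothetical helper, not something shipped with crossbow(1): the
target_path function and the idea of invoking the script with a
destination directory followed by "%n" are assumptions made for the
example.

```python
#!/usr/bin/env python3
import pathlib
import shutil
import sys


def target_path(directory, serial):
    """Build the output path, accepting only ASCII digits as file name."""
    if not (serial.isascii() and serial.isdigit()):
        raise ValueError(f"refusing unsafe file name: {serial!r}")
    return pathlib.Path(directory) / serial


# Expected invocation (hypothetical): script DIRECTORY %n
if __name__ == "__main__" and len(sys.argv) == 3:
    path = target_path(sys.argv[1], sys.argv[2])
    # Mode "x" makes the open fail if the file already exists.
    with open(path, "xb") as out:
        shutil.copyfileobj(sys.stdin.buffer, out)
```

A value such as "000042" is accepted, while "../.bashrc" raises an error
before any file is opened.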
EXAMPLES
See crossbow-cookbook(7).
SEE ALSO
crossbow(1), crossbow.conf(5), crossbow-cookbook(7)
AUTHORS
Giovanni Simoni
BUGS
crossbow(1) currently relies on libmrss for the parsing of XML files.
While it is great to have a library to do the heavy lifting, it has also
been found to be buggy and not very reliable. To be fair, the Atom and
RSS formats are arguably more complex than what I deem sensible.
In the long run, it is the author's intention to either ditch or fix
libmrss and libnxml. In the meantime there might be some quirks to keep
in mind while using crossbow(1).
Any report about bad behaviours seen in the wild is welcome.
September 30, 2021