webdump.html - www.codemadness.org - www.codemadness.org saait content files
(HTM) git clone git://git.codemadness.org/www.codemadness.org
---
webdump.html (7375B)
---
1 <!DOCTYPE html>
2 <html dir="ltr" lang="en">
3 <head>
4 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
5 <meta http-equiv="Content-Language" content="en" />
6 <meta name="viewport" content="width=device-width" />
7 <meta name="keywords" content="webdump, HTML to plain-text, converter, formatter" />
8 <meta name="description" content="webdump HTML to plain-text converter" />
9 <meta name="author" content="Hiltjo" />
10 <meta name="generator" content="Static content generated using saait: https://codemadness.org/saait.html" />
11 <title>webdump HTML to plain-text converter - Codemadness</title>
12 <link rel="stylesheet" href="style.css" type="text/css" media="screen" />
13 <link rel="stylesheet" href="print.css" type="text/css" media="print" />
14 <link rel="alternate" href="atom.xml" type="application/atom+xml" title="Codemadness Atom Feed" />
15 <link rel="alternate" href="atom_content.xml" type="application/atom+xml" title="Codemadness Atom Feed with content" />
16 <link rel="icon" href="/favicon.png" type="image/png" />
17 </head>
18 <body>
19 <nav id="menuwrap">
20 <table id="menu" width="100%" border="0">
21 <tr>
22 <td id="links" align="left">
23 <a href="index.html">Blog</a> |
24 <a href="/git/" title="Git repository with some of my projects">Git</a> |
25 <a href="/releases/">Releases</a> |
26 <a href="gopher://codemadness.org">Gopherhole</a>
27 </td>
28 <td id="links-contact" align="right">
29 <span class="hidden"> | </span>
30 <a href="feeds.html">Feeds</a> |
31 <a href="pgp.asc">PGP</a> |
32 <a href="mailto:hiltjo@AT@codemadness.DOT.org">Mail</a>
33 </td>
34 </tr>
35 </table>
36 </nav>
37 <hr class="hidden" />
38 <main id="mainwrap">
39 <div id="main">
40 <article>
41 <header>
42 <h1>webdump HTML to plain-text converter</h1>
43 <p>
44 <strong>Last modification on </strong> <time>2025-04-25</time>
45 </p>
46 </header>
47
48 <p>webdump is (yet another) HTML to plain-text converter tool.</p>
49 <p>It reads HTML in UTF-8 from stdin and writes plain-text to stdout.</p>
50 <h2>Goals and scope</h2>
51 <p>My main goal for this tool is to convert HTML mails and the HTML content in
52 RSS feeds to plain-text.</p>
53 <p>The tool will only convert HTML and write the result to stdout, similar to
54 links -dump or lynx -dump, but simpler and more secure.</p>
55 <ul>
56 <li>HTML and XHTML will be supported.</li>
57 <li>There will be some workarounds and quirks for broken and legacy HTML code.</li>
58 <li>It will be usable and secure for reading HTML from mails and RSS/Atom feeds.</li>
59 <li>No remote resources which are part of the HTML will be downloaded:
60 images, video, audio, etc. But these may be visible as a link reference.</li>
61 <li>Data will be written to stdout. Intended for plain-text or a text terminal.</li>
62 <li>No support for JavaScript, CSS, frame rendering or form processing.</li>
63 <li>No HTTP or network protocol handling: HTML data is read from stdin.</li>
64 <li>Listings of references, with options to extract them as a list that is
65 usable for scripting. References include link anchors, images, audio, video,
66 HTML (i)frames, etc.</li>
67 <li>Security: on OpenBSD it uses pledge("stdio", NULL).</li>
68 <li>Keep the code relatively small, simple and hackable.</li>
69 </ul>
70 <h2>Features</h2>
71 <ul>
72 <li>Support for word-wrapping.</li>
73 <li>A mode to enable basic markup: bold, underline, italic and blink ;)</li>
74 <li>Indentation of headers, paragraphs, pre and list items.</li>
75 <li>Basic support to query elements or hide them.</li>
76 <li>Show link references.</li>
77 <li>Show link references and resources such as img, video, audio, subtitles.</li>
78 <li>Export link references and resources to a TAB-separated format.</li>
79 </ul>
80 <h2>Usage examples</h2>
81 <pre><code>url='https://codemadness.org/sfeed.html'
82
83 curl -s "$url" | webdump -r -b "$url" | less
84
85 curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R
86
87 curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R
88 </code></pre>
89 <p>Yes, all these option flags look ugly; a shell script wrapper could be used :)</p>
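<p>A minimal sketch of such a wrapper, assuming curl and webdump are in PATH. The function name wd is made up for this example and is not part of webdump; the flags are copied from the usage examples above.</p>

```shell
#!/bin/sh
# Hypothetical wrapper: "wd" is an invented name, not shipped with webdump.
# It fetches one URL and converts it with the same flags as the examples.
wd() {
	if [ "$#" -ne 1 ]; then
		echo "usage: wd url" >&2
		return 1
	fi
	# The URL is passed twice: once to curl and once as the -b argument.
	curl -s "$1" | webdump -8 -a -i -l -r -b "$1"
}

# Example: wd 'https://codemadness.org/sfeed.html' | less -R
```

<p>Keeping the flags in one place means they only need to be changed once when preferences change.</p>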
90 <h2>Practical examples</h2>
91 <p>To use webdump as an HTML to text filter, for example in the mutt mail client,
92 set the following in ~/.mailcap:</p>
93 <pre><code>text/html; webdump -i -l -r &lt; %s; needsterminal; copiousoutput
94 </code></pre>
95 <p>Then add the following to the mutt configuration:</p>
96 <pre><code>auto_view text/html
97 </code></pre>
98 <p>Using webdump as an HTML to text filter for sfeed_curses (otherwise the default is lynx):</p>
99 <pre><code>SFEED_HTMLCONV="webdump -d -8 -r -i -l -a" sfeed_curses ~/.sfeed/feeds/*
100 </code></pre>
101 <h2>Query/selector examples</h2>
102 <p>The query syntax using the -s option is a bit inspired by CSS (but much more limited).</p>
103 <p>To get the title from an HTML page:</p>
104 <pre><code>url='https://codemadness.org/sfeed.html'
105
106 title=$(curl -s "$url" | webdump -s 'title')
107 printf '%s\n' "$title"
108 </code></pre>
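<p>The same query can be repeated over several pages. This is only a sketch: the function name titles and the "title TAB url" output layout are assumptions chosen for this example, and it relies on nothing beyond the -s 'title' query shown above.</p>

```shell
#!/bin/sh
# Hypothetical helper: print "title<TAB>url" for every URL argument.
# The name and the output layout are invented for this example.
titles() {
	for url in "$@"; do
		# Command substitution strips the trailing newline from the title.
		title=$(curl -s "$url" | webdump -s 'title')
		printf '%s\t%s\n' "$title" "$url"
	done
}

# Example: titles 'https://codemadness.org/sfeed.html' 'https://codemadness.org/webdump.html'
```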
109 <p>List audio and video-related content from an HTML page and redirect fd 3 to fd 1 (stdout):</p>
110 <pre><code>url="https://media.ccc.de/v/051_Recent_features_to_OpenBSD-ntpd_and_bgpd"
111 curl -s "$url" | webdump -x -s 'audio,video' -b "$url" 3&gt;&amp;1 &gt;/dev/null | cut -f 2
112 </code></pre>
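<p>The order of the redirections in the example above matters. The following sketch shows it without webdump: emit is a stand-in function that writes text to stdout and TAB-separated records to fd 3. The record layout and the example.org URLs are invented for the demonstration and do not describe the real output format.</p>

```shell
#!/bin/sh
# Stand-in for a program that writes text to stdout (fd 1) and
# TAB-separated records to fd 3. The record layout is invented.
emit() {
	printf 'rendered page text\n'
	printf 'video\thttps://example.org/a.webm\n' >&3
	printf 'audio\thttps://example.org/a.ogg\n' >&3
}

# Redirections apply left to right: fd 3 is pointed at the pipe first,
# then fd 1 is sent to /dev/null, so only the fd 3 records reach cut.
emit 3>&1 >/dev/null | cut -f 2
```

<p>This prints only the two URLs. With the redirections swapped (>/dev/null 3>&1), fd 3 would also point to /dev/null and nothing would reach cut.</p>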
113 <h2>Clone</h2>
114 <pre><code>git clone git://git.codemadness.org/webdump
115 </code></pre>
116 <h2>Browse</h2>
117 <p>You can browse the source-code at:</p>
118 <ul>
119 <li><a href="https://git.codemadness.org/webdump/">https://git.codemadness.org/webdump/</a></li>
120 <li><a href="gopher://codemadness.org/1/git/webdump">gopher://codemadness.org/1/git/webdump</a></li>
121 </ul>
122 <h2>Download releases</h2>
123 <p>Releases are available at:</p>
124 <ul>
125 <li><a href="https://codemadness.org/releases/webdump/">https://codemadness.org/releases/webdump/</a></li>
126 <li><a href="gopher://codemadness.org/1/releases/webdump">gopher://codemadness.org/1/releases/webdump</a></li>
127 </ul>
128 <h2>Build and install</h2>
129 <pre><code>$ make
130 # make install
131 </code></pre>
132 <h2>Dependencies</h2>
133 <ul>
134 <li>C compiler.</li>
135 <li>libc + some BSDisms.</li>
136 </ul>
137 <h2>Trade-offs</h2>
138 <p>All software has trade-offs.</p>
139 <p>webdump processes HTML in a single-pass. It does not buffer the full DOM tree.
140 Although due to the nature of HTML/XML some parts like attributes need to be
141 buffered.</p>
142 <p>Rendering tables in webdump is very limited. Twibright Links has really nice
143 table rendering. However implementing a similar feature in the current design of
144 webdump would make the code much more complex. Twibright links
145 processes a full DOM tree and processes the tables in multiple passes (to
146 measure the table cells) etc. Of course tables can be nested also, or HTML tables
147 that are used for creating layouts (these are mostly older webpages).</p>
148 <p>These trade-offs and preferences are chosen for now. It may change in the
149 future. Fortunately there are the usual good suspects for HTML to plain-text
150 conversion, each with their own chosen trade-offs of course:</p>
151 <ul>
152 <li>twibright links: <a href="http://links.twibright.com/">http://links.twibright.com/</a></li>
153 <li>lynx: <a href="https://lynx.invisible-island.net/">https://lynx.invisible-island.net/</a></li>
154 <li>w3m: <a href="https://w3m.sourceforge.net/">https://w3m.sourceforge.net/</a></li>
155 <li>xmllint (part of libxml2): <a href="https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home">https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home</a></li>
156 <li>xmlstarlet: <a href="https://xmlstar.sourceforge.net/">https://xmlstar.sourceforge.net/</a></li>
157 </ul>
158
159 </article>
160 </div>
161 </main>
162 </body>
163 </html>