# webdump HTML to plain-text converter

Last modification on 2025-04-25

webdump is (yet another) HTML to plain-text converter tool.

It reads HTML in UTF-8 from stdin and writes plain-text to stdout.


## Goals and scope

The main goal of this tool for me is to use it for converting HTML mails to
plain-text and to convert HTML content in RSS feeds to plain-text.

The tool will only convert HTML to stdout, similarly to links -dump or lynx
-dump but simpler and more secure.

* HTML and XHTML will be supported.
* There will be some workarounds and quirks for broken and legacy HTML code.
* It will be usable and secure for reading HTML from mails and RSS/Atom feeds.
* No remote resources which are part of the HTML will be downloaded:
  images, video, audio, etc. But these may be visible as a link reference.
* Data will be written to stdout. Intended for plain-text or a text terminal.
* No support for JavaScript, CSS, frame rendering or form processing.
* No HTTP or network protocol handling: HTML data is read from stdin.
* Listings for references and some options to extract them in a list that is
  usable for scripting. Some references are: link anchors, images, audio, video,
  HTML (i)frames, etc.
* Security: on OpenBSD it uses pledge("stdio", NULL).
* Keep the code relatively small, simple and hackable.


## Features

* Support for word-wrapping.
* A mode to enable basic markup: bold, underline, italic and blink ;)
* Indentation of headers, paragraphs, pre and list items.
* Basic support to query elements or hide them.
* Show link references.
* Show link references and resources such as img, video, audio, subtitles.
* Export link references and resources to a TAB-separated format.


## Usage examples

    url='https://codemadness.org/sfeed.html'

    curl -s "$url" | webdump -r -b "$url" | less

    curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R

    curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R

Yes, all these option flags look ugly; a shell script wrapper could be used :)

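Such a wrapper could look roughly like the sketch below. The function name
`wd` and its argument order (URL first, optional selector second) are made up
for illustration; it only combines the flags already shown in the examples
above and assumes curl and webdump are in $PATH.

```shell
#!/bin/sh
# Hypothetical wrapper around the usage examples: wd URL [selector]
# Fetches the page with curl and converts it with webdump, passing the
# URL as the base (-b) and, if given, a selector for the -s option.
wd() {
	if [ "$#" -ge 2 ]; then
		curl -s "$1" | webdump -s "$2" -8 -a -i -l -r -b "$1"
	else
		curl -s "$1" | webdump -8 -a -i -l -r -b "$1"
	fi
}
```

It writes to stdout like webdump itself, so paging stays up to the caller,
for example: `wd 'https://codemadness.org/sfeed.html' main | less -R`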

## Practical examples

To use webdump as an HTML to text filter, for example in the mutt mail client,
change in ~/.mailcap:

    text/html; webdump -i -l -r < %s; needsterminal; copiousoutput

In mutt you should then add:

    auto_view text/html


Using webdump as an HTML to text filter for sfeed_curses (otherwise the default is lynx):

    SFEED_HTMLCONV="webdump -d -8 -r -i -l -a" sfeed_curses ~/.sfeed/feeds/*


## Query/selector examples

The query syntax using the -s option is a bit inspired by CSS (but much more limited).

To get the title from an HTML page:

    url='https://codemadness.org/sfeed.html'

    title=$(curl -s "$url" | webdump -s 'title')
    printf '%s\n' "$title"

List audio and video-related content from an HTML page, redirecting fd 3 to fd 1 (stdout):

    url="https://media.ccc.de/v/051_Recent_features_to_OpenBSD-ntpd_and_bgpd"
    curl -s "$url" | webdump -x -s 'audio,video' -b "$url" 3>&1 >/dev/null | cut -f 2

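The order of the redirections in that last command matters. The sketch below
shows why, using a stand-in shell function instead of webdump so it runs
anywhere. The function `emit` and its two-column records are hypothetical
demo data, not the real webdump output format.

```shell
#!/bin/sh
# Stand-in for a program that writes text to stdout and TAB-separated
# records to file descriptor 3 (made-up data for the demo).
emit() {
	printf 'rendered page text\n'
	printf 'video\thttps://example.org/talk.webm\n' >&3
	printf 'audio\thttps://example.org/talk.opus\n' >&3
}

# Redirections are applied left to right: fd 3 is first pointed at the
# pipe (the current stdout), then stdout itself is sent to /dev/null.
# Only the fd 3 records reach cut, which prints the second field:
#   https://example.org/talk.webm
#   https://example.org/talk.opus
emit 3>&1 >/dev/null | cut -f 2
```

Swapping the two redirections (`>/dev/null 3>&1`) would point fd 3 at
/dev/null as well, and nothing would reach the pipe.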

## Clone

    git clone git://git.codemadness.org/webdump


## Browse

You can browse the source-code at:

* https://git.codemadness.org/webdump/
* gopher://codemadness.org/1/git/webdump


## Download releases

Releases are available at:

* https://codemadness.org/releases/webdump/
* gopher://codemadness.org/1/releases/webdump


## Build and install

    $ make
    # make install


## Dependencies

* C compiler.
* libc + some BSDisms.


## Trade-offs

All software has trade-offs.

webdump processes HTML in a single pass. It does not buffer the full DOM tree,
although due to the nature of HTML/XML some parts, such as attributes, need to
be buffered.

Rendering tables in webdump is very limited. Twibright Links has really nice
table rendering. However, implementing a similar feature in the current design
of webdump would make the code much more complex: Twibright Links processes a
full DOM tree and processes the tables in multiple passes (to measure the table
cells) etc. Of course tables can also be nested, or be used for creating
layouts (mostly on older webpages).

These trade-offs and preferences are chosen for now. They may change in the
future. Fortunately there are the usual good suspects for HTML to plain-text
conversion, each with their own chosen trade-offs of course:

* Twibright Links: http://links.twibright.com/
* lynx: https://lynx.invisible-island.net/
* w3m: https://w3m.sourceforge.net/
* xmllint (part of libxml2): https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home
* xmlstarlet: https://xmlstar.sourceforge.net/