webdump - www.codemadness.org - www.codemadness.org saait content files
 (HTM) git clone git://git.codemadness.org/www.codemadness.org
       ---
       webdump (8291B)
       ---
            1 1<- Back        /        codemadness.org        70
            2 i                codemadness.org        70
            3 i                codemadness.org        70
            4 i# webdump HTML to plain-text converter                codemadness.org        70
            5 i                codemadness.org        70
            6 iLast modification on 2025-04-25                codemadness.org        70
            7 i                codemadness.org        70
            8 iwebdump is (yet another) HTML to plain-text converter tool.                codemadness.org        70
            9 i                codemadness.org        70
           10 iIt reads HTML in UTF-8 from stdin and writes plain-text to stdout.                codemadness.org        70
           11 i                codemadness.org        70
           12 i                codemadness.org        70
           13 i## Goals and scope                codemadness.org        70
           14 i                codemadness.org        70
           15 iThe main goal of this tool for me is to use it for converting HTML mails to                codemadness.org        70
           16 iplain-text and to convert HTML content in RSS feeds to plain-text.                codemadness.org        70
           17 i                codemadness.org        70
           18 iThe tool will only convert HTML to stdout, similarly to links -dump or lynx                codemadness.org        70
           19 i-dump but simpler and more secure.                codemadness.org        70
           20 i                codemadness.org        70
           21 i* HTML and XHTML will be supported.                codemadness.org        70
           22 i* There will be some workarounds and quirks for broken and legacy HTML code.                codemadness.org        70
           23 i* It will be usable and secure for reading HTML from mails and RSS/Atom feeds.                codemadness.org        70
           24 i* No remote resources which are part of the HTML will be downloaded:                codemadness.org        70
           25 i  images, video, audio, etc. But these may be visible as a link reference.                codemadness.org        70
           26 i* Data will be written to stdout. Intended for plain-text or a text terminal.                codemadness.org        70
           27 i* No support for Javascript, CSS, frame rendering or form processing.                codemadness.org        70
           28 i* No HTTP or network protocol handling: HTML data is read from stdin.                codemadness.org        70
           29 i* Listings for references and some options to extract them in a list that is                codemadness.org        70
           30 i  usable for scripting. Some references are: link anchors, images, audio, video,                codemadness.org        70
           31 i  HTML (i)frames, etc.                codemadness.org        70
           32 i* Security: on OpenBSD it uses pledge("stdio", NULL).                codemadness.org        70
           33 i* Keep the code relatively small, simple and hackable.                codemadness.org        70
           34 i                codemadness.org        70
           35 i                codemadness.org        70
           36 i## Features                codemadness.org        70
           37 i                codemadness.org        70
           38 i* Support for word-wrapping.                codemadness.org        70
           39 i* A mode to enable basic markup: bold, underline, italic and blink ;)                codemadness.org        70
           40 i* Indentation of headers, paragraphs, pre and list items.                codemadness.org        70
           41 i* Basic support to query elements or hide them.                codemadness.org        70
           42 i* Show link references.                codemadness.org        70
           43 i* Show link references and resources such as img, video, audio, subtitles.                codemadness.org        70
           44 i* Export link references and resources to a TAB-separated format.                codemadness.org        70
           45 i                codemadness.org        70
           46 i                codemadness.org        70
           47 i## Usage examples                codemadness.org        70
           48 i                codemadness.org        70
           49 i        url='https://codemadness.org/sfeed.html'                codemadness.org        70
           50 i                        codemadness.org        70
           51 i        curl -s "$url" | webdump -r -b "$url" | less                codemadness.org        70
           52 i                        codemadness.org        70
           53 i        curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R                codemadness.org        70
           54 i                        codemadness.org        70
           55 i        curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R                codemadness.org        70
           56 i                codemadness.org        70
           57 iYes, all these option flags look ugly, a shellscript wrapper could be used :)                codemadness.org        70
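
A minimal sketch of such a wrapper, assuming curl, webdump and a pager are installed. The function name `wd` and its argument layout are hypothetical; only the flags already shown in the examples above are used.

```shell
# wd: fetch a page and view it as plain text (hypothetical helper).
# usage: wd url [selector]
wd() {
	url=${1:-}
	sel=${2:-}
	if [ -z "$url" ]; then
		echo "usage: wd url [selector]" >&2
		return 1
	fi
	# Same flags as the usage examples; -s is only passed when a
	# selector was given as the second argument.
	if [ -n "$sel" ]; then
		curl -s "$url" | webdump -s "$sel" -8 -a -i -l -r -b "$url"
	else
		curl -s "$url" | webdump -8 -a -i -l -r -b "$url"
	fi | ${PAGER:-less -R}
}
```

With this in a shell profile, `wd 'https://codemadness.org/sfeed.html' main` would behave like the last example above.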
           58 i                codemadness.org        70
           59 i                codemadness.org        70
           60 i## Practical examples                codemadness.org        70
           61 i                codemadness.org        70
           62 iTo use webdump as an HTML to text filter, for example in the mutt mail client,                codemadness.org        70
           63 iadd or change this entry in ~/.mailcap:                codemadness.org        70
           64 i                codemadness.org        70
           65 i        text/html; webdump -i -l -r < %s; needsterminal; copiousoutput                codemadness.org        70
           66 i                codemadness.org        70
           67 iIn mutt you should then add:                codemadness.org        70
           68 i                codemadness.org        70
           69 i        auto_view text/html                codemadness.org        70
           70 i                codemadness.org        70
           71 i                codemadness.org        70
           72 iUsing webdump as a HTML to text filter for sfeed_curses (otherwise the default is lynx):                codemadness.org        70
           73 i                codemadness.org        70
           74 i        SFEED_HTMLCONV="webdump -d -8 -r -i -l -a" sfeed_curses ~/.sfeed/feeds/*                codemadness.org        70
           75 i                codemadness.org        70
           76 i                codemadness.org        70
           77 i## Query/selector examples                codemadness.org        70
           78 i                codemadness.org        70
           79 iThe query syntax using the -s option is a bit inspired by CSS (but much more limited).                codemadness.org        70
           80 i                codemadness.org        70
           81 iTo get the title from a HTML page:                codemadness.org        70
           82 i                codemadness.org        70
           83 i        url='https://codemadness.org/sfeed.html'                codemadness.org        70
           84 i                        codemadness.org        70
           85 i        title=$(curl -s "$url" | webdump -s 'title')                codemadness.org        70
           86 i        printf '%s\n' "$title"                codemadness.org        70
           87 i                codemadness.org        70
           88 iList audio and video-related content from an HTML page, redirecting fd 3 to fd 1 (stdout):                codemadness.org        70
           89 i                codemadness.org        70
           90 i        url="https://media.ccc.de/v/051_Recent_features_to_OpenBSD-ntpd_and_bgpd"                codemadness.org        70
           91 i        curl -s "$url" | webdump -x -s 'audio,video' -b "$url" 3>&1 >/dev/null | cut -f 2                codemadness.org        70
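
The order of the redirections in `3>&1 >/dev/null` matters: fd 3 is first pointed at the pipe, and only then is stdout discarded. The sketch below shows the same pattern with a hypothetical stand-in function, `emit`, in place of webdump. The two-column TAB-separated line it prints is made up for illustration and is not the actual webdump output format.

```shell
# Hypothetical stand-in: writes text to stdout and one TAB-separated
# line to fd 3.
emit() {
	printf 'page text\n'
	printf 'video\thttps://example.org/a.webm\n' >&3
}

# 3>&1 sends fd 3 into the pipe; >/dev/null then drops the normal
# stdout, so cut only sees what was written to fd 3.
refs=$(emit 3>&1 >/dev/null | cut -f 2)
printf '%s\n' "$refs"
```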
           92 i                codemadness.org        70
           93 i                codemadness.org        70
           94 i## Clone                codemadness.org        70
           95 i                codemadness.org        70
           96 i        git clone git://git.codemadness.org/webdump                codemadness.org        70
           97 i                codemadness.org        70
           98 i                codemadness.org        70
           99 i## Browse                codemadness.org        70
          100 i                codemadness.org        70
          101 iYou can browse the source-code at:                codemadness.org        70
          102 i                codemadness.org        70
          103 h* https://git.codemadness.org/webdump/        URL:https://git.codemadness.org/webdump/        codemadness.org        70
          104 1* gopher://codemadness.org/1/git/webdump        /git/webdump        codemadness.org        70
          105 i                codemadness.org        70
          106 i                codemadness.org        70
          107 i## Download releases                codemadness.org        70
          108 i                codemadness.org        70
          109 iReleases are available at:                codemadness.org        70
          110 i                codemadness.org        70
          111 h* https://codemadness.org/releases/webdump/        URL:https://codemadness.org/releases/webdump/        codemadness.org        70
          112 1* gopher://codemadness.org/1/releases/webdump        /releases/webdump        codemadness.org        70
          113 i                codemadness.org        70
          114 i                codemadness.org        70
          115 i## Build and install                codemadness.org        70
          116 i                codemadness.org        70
          117 i        $ make                codemadness.org        70
          118 i        # make install                codemadness.org        70
          119 i                codemadness.org        70
          120 i                codemadness.org        70
          121 i## Dependencies                codemadness.org        70
          122 i                codemadness.org        70
          123 i* C compiler.                codemadness.org        70
          124 i* libc + some BSDisms.                codemadness.org        70
          125 i                codemadness.org        70
          126 i                codemadness.org        70
          127 i## Trade-offs                codemadness.org        70
          128 i                codemadness.org        70
          129 iAll software has trade-offs.                codemadness.org        70
          130 i                codemadness.org        70
          131 iwebdump processes HTML in a single pass. It does not buffer the full DOM tree,                codemadness.org        70
          132 ialthough due to the nature of HTML/XML some parts, such as attributes, need to                codemadness.org        70
          133 ibe buffered.                codemadness.org        70
          134 i                codemadness.org        70
          135 iRendering tables in webdump is very limited. Twibright Links has really nice                codemadness.org        70
          136 itable rendering. However, implementing a similar feature in the current design                codemadness.org        70
          137 iof webdump would make the code much more complex. Twibright Links                codemadness.org        70
          138 iprocesses a full DOM tree and processes the tables in multiple passes (to                codemadness.org        70
          139 imeasure the table cells) etc. Tables can also be nested, or be used for                codemadness.org        70
          140 icreating layouts (mostly on older webpages).                codemadness.org        70
          141 i                codemadness.org        70
          142 iThese trade-offs and preferences are chosen for now. They may change in the                codemadness.org        70
          143 ifuture. Fortunately, there are the usual good suspects for HTML to plain-text                codemadness.org        70
          144 iconversion, each with their own chosen trade-offs of course:                codemadness.org        70
          145 i                codemadness.org        70
          146 h* twibright links: »http://links.twibright.com/«        URL:http://links.twibright.com/        codemadness.org        70
          147 h* lynx: »https://lynx.invisible-island.net/«        URL:https://lynx.invisible-island.net/        codemadness.org        70
          148 h* w3m: »https://w3m.sourceforge.net/«        URL:https://w3m.sourceforge.net/        codemadness.org        70
          149 h* xmllint (part of libxml2): »https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home«        URL:https://gitlab.gnome.org/GNOME/libxml2/-/wikis/home        codemadness.org        70
          150 h* xmlstarlet: »https://xmlstar.sourceforge.net/«        URL:https://xmlstar.sourceforge.net/        codemadness.org        70
          151 .