codemadness.org

       README - webdump - HTML to plain-text converter for webpages
       git clone git://git.codemadness.org/webdump
            1 webdump
            2 -------
            3 
            4 HTML to plain-text converter tool.
            5 
            6 It reads HTML in UTF-8 from stdin and writes plain-text to stdout.
            7 
            8 
            9 Build and install
           10 -----------------
           11 
           12 $ make
           13 # make install
           14 
           15 
           16 Dependencies
           17 ------------
           18 
           19 - C compiler.
           20 - libc + some BSDisms.
           21 
           22 
           23 Usage
           24 -----
           25 
           26 Example:
           27 
           28         url='https://codemadness.org/sfeed.html'
           29 
           30         curl -s "$url" | webdump -r -b "$url" | less
           31 
           32         curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R
           33 
           34         curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R
           35 
           36 
           37 Yes, all these option flags look ugly; a shell script wrapper could be used :)
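
A minimal sketch of such a wrapper, assuming curl and webdump are in $PATH.
The function name webdump_url is hypothetical and not part of webdump; the
flags are the ones used in the examples above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of webdump): fetch a URL with curl and
# convert it using the flags from the usage examples in this README.
webdump_url() {
	url="${1:-}"
	if [ -z "$url" ]; then
		echo "usage: webdump_url url" >&2
		return 1
	fi
	# The URL is passed to curl and also to webdump -b, as in the examples.
	curl -s "$url" | webdump -8 -a -i -l -r -b "$url"
}
```

It could then be used as: webdump_url 'https://codemadness.org/sfeed.html' | less -R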
           38 
           39 
           40 Goals / scope
           41 -------------
           42 
           43 The main goal is to use it for converting HTML mails to plain-text and to
           44 convert HTML content in RSS feeds to plain-text.
           45 
           46 The tool will only convert HTML and write the result to stdout, similar to
           47 links -dump or lynx -dump, but simpler and more secure.
           48 
           49 - HTML and XHTML will be supported.
           50 - There will be some workarounds and quirks for broken and legacy HTML code.
           51 - It will be usable and secure for reading HTML from mails and RSS/Atom feeds.
           52 - No remote resources which are part of the HTML will be downloaded:
           53   images, video, audio, etc. But these may be visible as a link reference.
           54 - Data will be written to stdout. Intended for plain-text or a text terminal.
           55 - No support for Javascript, CSS, frame rendering or form processing.
           56 - No HTTP or network protocol handling: HTML data is read from stdin.
           57 - Listings for references and some options to extract them in a list that is
           58   usable for scripting. Some references are: link anchors, images, audio, video,
           59   HTML (i)frames, etc.
           60 
           61 
           62 Features
           63 --------
           64 
           65 - Support for word-wrapping.
           66 - A mode to enable basic markup: bold, underline, italic and blink ;)
           67 - Indentation of headers, paragraphs, pre and list items.
           68 - Basic support to query or hide elements.
           69 - Show link references.
           70 - Show link references and resources such as img, video, audio, subtitles.
           71 - Export link references and resources to a TAB-separated format.
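
A TAB-separated export is easy to post-process with standard tools. The
sketch below prints the unique values of one column read from stdin. The
helper name tsv_column is hypothetical, and which column holds the URL is an
assumption: this README does not describe the exact columns or how the export
is enabled, so check the webdump documentation for that.

```shell
#!/bin/sh
# Hypothetical helper (not part of webdump): print each distinct value of
# one column of TAB-separated input, in order of first appearance.
# The column number is an assumption; the real column layout is not
# described in this README.
tsv_column() {
	awk -F '\t' -v col="${1:-1}" 'NF >= col && !seen[$col]++ { print $col }'
}
```

For example, "tsv_column 2 < refs.tsv" would list the distinct values of the
second column of a file named refs.tsv (also a hypothetical name).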
           72 
           73 
           74 Trade-offs
           75 ----------
           76 
           77 All software has trade-offs.
           78 
           79 webdump processes HTML in a single pass. It does not buffer the full DOM tree,
           80 although due to the nature of HTML/XML some parts, such as attributes, need to
           81 be buffered.
           82 
           83 Rendering tables in webdump is very limited. Twibright Links has really nice
           84 table rendering. However, implementing a similar feature in the current design
           85 of webdump would make the code much more complex. Twibright Links processes a
           86 full DOM tree and processes the tables in multiple passes (to measure the
           87 table cells) etc. Of course tables can also be nested, or be used in (older)
           88 web pages that use HTML tables for layout.
           89 
           90 These trade-offs and preferences are chosen for now. They may change in the
           91 future. Fortunately there are the usual good suspects for HTML to plain-text
           92 conversion (each with their own chosen trade-offs of course).
           93 
           94 For example:
           95 
           96 - Twibright Links
           97 - lynx
           98 - w3m
           99 
          100 
          101 Examples
          102 --------
          103 
          104 To use webdump as an HTML to text filter, for example in the mutt mail client,
          105 add the following to ~/.mailcap:
          106 
          107         text/html; webdump -i -l -r < %s; needsterminal; copiousoutput
          108 
          109 In mutt you should then add:
          110 
          111         auto_view text/html
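
The mailcap command above can also be tried outside mutt on a saved HTML
file, which helps to check the output before changing the mail setup. This is
a minimal sketch; the helper name html2text_file is hypothetical and only the
flags come from the mailcap entry.

```shell
#!/bin/sh
# Hypothetical helper (not part of webdump): render one HTML file with the
# same flags as the mailcap entry, reading the file on stdin like "< %s".
html2text_file() {
	file="${1:-}"
	if [ ! -r "$file" ]; then
		echo "html2text_file: cannot read: $file" >&2
		return 1
	fi
	webdump -i -l -r < "$file"
}
```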
          112 
          113 
          114 License
          115 -------
          116 
          117 ISC, see LICENSE file.
          118 
          119 
          120 Author
          121 ------
          122 
          123 Hiltjo Posthuma <hiltjo@codemadness.org>