################################################################################
A script for removing HTML tags - Willow Willis (2024-07-06)
################################################################################

Just a dumb little bash script I wrote to help format a batch of articles from
my website in preparation for transferring them to gopherspace.

Features:
* Replaces header tags with various levels of hash marks
* Removes a lot of the common html special characters
* Converts all titles to uppercase
* Optionally adds extra newlines for every <p> or <br>

I run it on a batch of .html docs at once:

find /posts -name "*.html" -exec ./stripHTML {} \;

Of course, the output still needs a little extra hand-formatting for
consistency, but this saved me a bunch of time regardless.

NOTE: this script does *not* call fold on the output, so the resulting .txt
files will be wider than 80 columns. It's easy to add that, but I wanted to
keep a backup of each .txt file for my records before chopping them to 80
columns.

################################################################################
### SOURCE: ###
--------------------------------------------------------------------------------
#!/bin/bash
# NOTE: `sed -i ""` is the BSD/macOS form; GNU sed takes plain `-i` instead.

filepath=$1
dir="$(dirname "$filepath")"
filename="$(basename "$filepath")"
noext="${filename%.*}"
TXT="$dir/$noext.txt"

# Work on a .txt copy that sits next to the original .html file
cp "$filepath" "$TXT"

# Drop paragraph tags
sed -i "" 's/<p>//g' "$TXT"
sed -i "" 's/<\/p>//g' "$TXT"

# Replace opening header tags with hash marks (h4 gets an asterisk)
sed -i "" 's/<h1>/#### /g' "$TXT"
sed -i "" 's/<h2>/### /g' "$TXT"
sed -i "" 's/<h3>/## /g' "$TXT"
sed -i "" 's/<h4>/* /g' "$TXT"

# Replace closing header tags the same way
sed -i "" 's/<\/h1>/ ####/g' "$TXT"
sed -i "" 's/<\/h2>/ ###/g' "$TXT"
sed -i "" 's/<\/h3>/ ##/g' "$TXT"
sed -i "" 's/<\/h4>/ */g' "$TXT"

#sed -i "" 's/<br>/\n/g' "$TXT" #uncomment to replace <br> tags with newlines

# Swap common special characters for plain ASCII
sed -i "" "s/’/'/g" "$TXT"
sed -i "" "s/‘/'/g" "$TXT"
sed -i "" 's/“/"/g' "$TXT"
sed -i "" 's/”/"/g' "$TXT"
sed -i "" 's/&quot;/"/g' "$TXT"
sed -i "" 's/→/->/g' "$TXT"
# \& is a literal ampersand; a bare & would just re-insert the matched text
sed -i "" 's/&amp;/\&/g' "$TXT"
sed -i "" 's/–/-/g' "$TXT"
sed -i "" 's/é/e/g' "$TXT"

#Remove any remaining html tags that we don't care about
sed -e 's/<[^>]*>//g' "$TXT" > bar.txt

#Capitalize all the titles that we just added
perl -i -pe 's/#(.+)#/#\U$1#/gi' bar.txt

mv bar.txt "$TXT"
--------------------------------------------------------------------------------

### LICENSE: ###
Released under MIT license. Do whatever you want with this.
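
### ADDENDUM: FOLDING TO 80 COLUMNS ###
The script leaves the .txt files unfolded so that a wide copy can be kept
around. A minimal sketch of how the wrapping step might look follows. The
function name fold_with_backup and the ".wide.txt" backup suffix are made up
for illustration and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): keep the wide .txt file
# as a backup, then wrap the working copy to 80 columns.
fold_with_backup() {
    local txt="$1"
    local backup="${txt%.txt}.wide.txt"   # assumed backup naming scheme

    # Save the unfolded text first so nothing is lost
    cp "$txt" "$backup" || return 1

    # -w 80 sets the maximum line width; -s breaks at spaces where possible
    fold -s -w 80 "$backup" > "$txt"
}
```

Something like fold_with_backup "$TXT" could be called as the last line of
the script, or run by hand on each .txt file once the hand-formatting is done.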