# 2025-11-17 - Ecotopian Dungeon Scientist Word Cloud
(IMG) Word Cloud
I've been seeing word clouds for ages. Today i decided to generate
one from my gopher hole. In the process i found and fixed a number
of character encoding errors and typos. That alone made it worth the
price of admission. Below i will outline the steps i took to
generate the above image. I did this on Slackware64 15.0.
Select the content to scrape words from:
$ find public_gopher -type f \( -name '*.txt' -o -name '*.gph' \) \
>lis.txt
$ wc -l lis.txt
477
So i have 477 text files.
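One thing to watch with find: the implicit AND binds tighter than -o, so -type f only applies to the first -name unless the two patterns are grouped, and an explicit action such as -print0 only applies to the last branch. A minimal sketch, using a throwaway directory with made-up file names:

```shell
# Throwaway directory; all names here are made up for illustration.
tmp=$(mktemp -d)
mkdir "$tmp/dir.gph"                  # a directory that matches *.gph
touch "$tmp/a.txt" "$tmp/b.gph" "$tmp/c.jpg"

# Ungrouped: parsed as (-type f -name '*.txt') -o (-name '*.gph'),
# so the directory dir.gph is listed along with the two files.
ungrouped=$(find "$tmp" -type f -name '*.txt' -o -name '*.gph' | wc -l)

# Grouped: -type f now applies to both patterns.
grouped=$(find "$tmp" -type f \( -name '*.txt' -o -name '*.gph' \) | wc -l)

# Ungrouped with an explicit action: -print0 binds to the last
# branch only, so no .txt names are printed at all.
txt_seen=$(find "$tmp" -type f -name '*.txt' -o -name '*.gph' -print0 |
    tr '\0' '\n' | grep -c '\.txt$' || true)

echo "ungrouped=$ungrouped grouped=$grouped txt_seen=$txt_seen"
rm -rf "$tmp"
```

Grouping with \( ... \) keeps the file list the same whether it is printed implicitly or with -print0.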
Parse out individual words:
$ find public_gopher -type f \( -name '*.txt' -o -name '*.gph' \) \
-print0 >0lis.txt
$ xargs -a 0lis.txt -0 cat |\
tr -s '[[:punct:][:space:]]' '\n' |\
tr A-Z a-z |\
sort >words.txt
$ wc -l words.txt
1051670 words.txt
So i have a little over a million words.
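To see what the tr stages do, here is a small sketch of the same idea run on one made-up sentence instead of the whole gopher hole. The sample text and the variable names are assumptions for illustration only:

```shell
# Made-up sample text, for illustration only.
sample='The quick, brown fox. The lazy dog!'

# Turn every run of punctuation or whitespace into one newline,
# fold to lower case, then sort so duplicates end up adjacent.
tokens=$(printf '%s\n' "$sample" |
    tr -s '[:punct:][:space:]' '\n' |
    tr 'A-Z' 'a-z' |
    sort)

printf '%s\n' "$tokens"
```

The -s flag matters here: without it, ", " would become two newlines and leave an empty "word" in the list.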
Count frequency of words:
$ uniq -c <words.txt | sort -n
...
20473 a
26006 to
26590 and
27617 of
52725 the
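uniq -c only merges adjacent identical lines, which is why words.txt is sorted before counting. A minimal sketch on a made-up, already sorted list:

```shell
# Made-up, already sorted word list, for illustration only.
counts=$(printf '%s\n' and and fox the the the |
    uniq -c |
    sort -n |
    awk '{print $1, $2}')    # drop the padding uniq puts before counts

printf '%s\n' "$counts"
```

The final awk stage is only there to normalize whitespace so the result is easy to compare; the counting itself is done by uniq -c and sort -n.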
Filter out empty words, 1 letter words, 2 letter words, words
beginning or ending with a digit, and words that occur fewer
than 10 times.
$ cat words.txt |\
grep -v -e '^$' -e '^.$' -e '^..$' -e '^[0-9]' -e '[0-9]$' |\
uniq -c |\
sort -n |\
awk '$1 > 9 {print $2}' >words2.txt
$ wc -l words2.txt
7448 words2.txt
$ tail -5 words2.txt
for
you
that
and
the
Much better: i have a list of 7448 unique words. Now i want to
filter out boring words such as "and" and "the".
$ cp words2.txt filter.txt
$ ed filter.txt
...
I manually edited filter.txt and deleted lines with interesting
words, leaving behind only the boring words. This took a few
minutes. I saved the edited file.
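A quick way to check an edited list like this, sketched here with made-up data rather than the real files, is to print the words that are not in it. With grep, -f reads the patterns from a file, -F treats them as fixed strings, -x requires a whole-line match, and -v inverts the result:

```shell
# Made-up stand-ins for words2.txt and filter.txt.
tmp=$(mktemp -d)
printf '%s\n' and gopher slackware the >"$tmp/words2.txt"
printf '%s\n' and the >"$tmp/filter.txt"      # the "boring" words

# Print every word that is NOT listed in filter.txt.
interesting=$(grep -vxFf "$tmp/filter.txt" "$tmp/words2.txt")

printf '%s\n' "$interesting"
rm -rf "$tmp"
```

If a boring word still shows up in that output, it is missing from filter.txt.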
Report word count, excluding filtered words:
$ cat >filter.awk <<'__EOF__'
BEGIN {
file = "filter.txt"
while ((getline <file) > 0) {
filter[$0] = 1
}
close(file)
}
{
# skip word if it begins or ends with a digit
if (/^[0-9]/ || /[0-9]$/) {
next
}
# skip word if it's less than 3 characters long
if (length($0) < 3) {
next
}
# skip word if it's in filter.txt
if ($0 in filter) {
next
}
words[$0]++
}
END {
for (word in words) {
count = words[word]
# skip word if it occurred fewer than 10 times
if (count < 10) {
continue
}
printf "%d %s\n", count, word
}
}
__EOF__
$ awk -f filter.awk words.txt | sort -n >words3.txt
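When an awk script is written out with a here-document, quoting the delimiter matters: with an unquoted delimiter the shell expands $0 and other parameters before awk ever sees them. A small sketch with throwaway files (the file names are made up):

```shell
tmp=$(mktemp -d)

# Unquoted delimiter: the shell substitutes $0 (its own name).
cat >"$tmp/unquoted.awk" <<EOF
{ seen[$0] = 1 }
EOF

# Quoted delimiter: the body is written out verbatim.
cat >"$tmp/quoted.awk" <<'EOF'
{ seen[$0] = 1 }
EOF

# Count the lines that still contain a literal $0.
unquoted=$(grep -c -F '$0' "$tmp/unquoted.awk" || true)
quoted=$(grep -c -F '$0' "$tmp/quoted.awk" || true)

echo "unquoted=$unquoted quoted=$quoted"
rm -rf "$tmp"
```

Only the quoted version leaves $0 in the file for awk to interpret as the current input line.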
I found the Python wordcloud generator on the following web pages.
(HTM) Create Fun Word Cloud Images Easily In Linux Terminal
(HTM) WordCloud Only Supported For TrueType Fonts
Install Python wordcloud generator. On Slackware it is necessary to
upgrade pip and Pillow first:
# pip3 install --upgrade pip
# pip3 install --upgrade Pillow
# pip3 install wordcloud
Finally, generate a word cloud:
$ wordcloud_cli --text words3.txt --background white \
--fontfile CaslonAntique.ttf --imagefile word-cloud.png \
--width 800 --height 600
$ pngtopam word-cloud.png |\
cjpeg -optimize -quality 80 >word-cloud.jpg
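A hedged aside, not what was done above: as far as i can tell, wordcloud_cli treats its --text input as running text and counts the words itself, skipping bare numbers by default, so the counts in a "count word" list may not influence the word sizes. If they should, one option is to expand the list back into repeated words first. A minimal sketch on made-up data in the same shape as words3.txt:

```shell
tmp=$(mktemp -d)

# Made-up "count word" lines in the same shape as words3.txt.
printf '%s\n' '2 gopher' '3 slackware' >"$tmp/words3.txt"

# Repeat each word as many times as its count says.
awk '{ for (i = 0; i < $1; i++) print $2 }' "$tmp/words3.txt" \
    >"$tmp/expanded.txt"

total=$(wc -l <"$tmp/expanded.txt")
slack=$(grep -c -x slackware "$tmp/expanded.txt")

echo "total=$total slackware=$slack"
rm -rf "$tmp"
```

The expanded file could then be passed to wordcloud_cli with --text in place of words3.txt. Shuffling it first (for example with shuf) may be worth trying, since runs of identical adjacent words might otherwise be picked up as two-word phrases.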
That's it!
tags: bencollver,technical,unix