tilde.pink

       # 2025-11-17 - Ecotopian Dungeon Scientist Word Cloud
       
 (IMG) Word Cloud
       
       I've been seeing word clouds for ages. Today i decided to generate
       one from my gopher hole.  In the process i found and fixed a number
       of character encoding errors and typos.  That alone made it worth the
       price of admission.  Below i will outline the steps i took to
       generate the above image.  I did this on Slackware64 15.0.
       
       Select the content to scrape words from:
       
           $ find public_gopher -type f \( -name '*.txt' -o -name '*.gph' \) \
               >lis.txt
           
           $ wc -l lis.txt
           477
       
       So i have 477 text files.
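
       In find, the implicit AND between tests binds tighter than -o.
       An ungrouped "-type f -name '*.txt' -o -name '*.gph'" therefore
       applies -type f only to the *.txt branch, and an explicit action
       such as -print0 at the end attaches only to the *.gph branch.
       Below is a minimal sketch on a throwaway tree.  The file and
       directory names are made up for illustration.

```shell
# Build a scratch tree: two regular files and one directory named like a .gph.
tmp=$(mktemp -d)
mkdir -p "$tmp/hole/menu.gph"
: >"$tmp/hole/post.txt"
: >"$tmp/hole/index.gph"

# Ungrouped: -type f only guards *.txt, so the directory menu.gph matches too.
ungrouped=$(find "$tmp/hole" -type f -name '*.txt' -o -name '*.gph' |
    wc -l | tr -d ' ')

# Grouped: -type f applies to both patterns, so only regular files match.
grouped=$(find "$tmp/hole" -type f \( -name '*.txt' -o -name '*.gph' \) |
    wc -l | tr -d ' ')

echo "ungrouped=$ungrouped grouped=$grouped"
rm -rf "$tmp"
```

       The grouped form counts only the two regular files, while the
       ungrouped form also counts the directory.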
       
       Parse out individual words:
       
           $ find public_gopher -type f \( -name '*.txt' -o -name '*.gph' \) \
               -print0 >0lis.txt
           
           $ xargs -a 0lis.txt -0 cat            |\
               tr -s '[[:punct:][:space:]]' '\n' |\
               tr A-Z a-z                        |\
               sort >words.txt
           
           $ wc -l words.txt
           1051670 words.txt
       
       So i have a little over a million words.
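
       To see what the two tr steps do, here is a small sketch on a
       made-up sentence.  The function name split_words is only for
       illustration.  Runs of punctuation and whitespace collapse into a
       single newline, so an apostrophe splits "it's" into "it" and "s",
       and text that starts with punctuation yields one empty first line.

```shell
# split_words: read text on stdin, print one lowercase word per line, sorted.
split_words() {
    tr -s '[[:punct:][:space:]]' '\n' |
        tr 'A-Z' 'a-z' |
        sort
}

printf "It's a Gopher hole, a gopher HOLE.\n" | split_words
```

       Each word comes out once per occurrence, which is what the later
       counting step relies on.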
       
       Count frequency of words:
       
           $ uniq -c <words.txt | sort -n
           ...
             20473 a
             26006 to
             26590 and
             27617 of
             52725 the
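
       uniq -c only merges adjacent duplicate lines, which is why
       words.txt was sorted first.  Here is a sketch of the same counting
       step on a tiny made-up list.  count_words is a hypothetical helper
       name.

```shell
# count_words: read one word per line on stdin, print "count word" lines,
# least frequent first.  The first sort groups duplicates for uniq -c.
count_words() {
    sort | uniq -c | sort -n
}

printf 'the\ngopher\nthe\nhole\nthe\ngopher\n' | count_words
```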
       
       Filter out empty words, 1 letter words, 2 letter words, words
       beginning or ending with a digit, and words that occur fewer
       than 10 times.
       
           $ cat words.txt                                                |\
               grep -v -e '^$' -e '^.$' -e '^..$' -e '^[0-9]' -e '[0-9]$' |\
               uniq -c                                                    |\
               sort -n                                                    |\
               awk '$1 > 9 {print $2}' >words2.txt
           
           $ wc -l words2.txt
           7448 words2.txt
           
           $ tail -5 words2.txt
           for
           you
           that
           and
           the
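
       A quick way to check which words the grep patterns drop is to run
       them on a handful of made-up samples.  drop_noise is a
       hypothetical name wrapping the same grep call.

```shell
# drop_noise: remove empty lines, 1- and 2-character words, and words that
# begin or end with a digit.  Reads and writes one word per line.
drop_noise() {
    grep -v -e '^$' -e '^.$' -e '^..$' -e '^[0-9]' -e '[0-9]$'
}

printf '\na\nto\nthe\n2025\nutf8\ngopher\n' | drop_noise
```

       Three-letter words such as "the" pass this filter, which is why
       they still show up at the end of words2.txt.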
       
       Much better, i have a list of 7448 unique words.  Now i want to
       filter out boring words such as "and" and "the".
       
           $ cp words2.txt filter.txt
           $ ed filter.txt
           ...
       
       I manually edited filter.txt and deleted lines with interesting
       words, leaving behind only the boring words.  This took a few
       minutes.  I saved the edited file.
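
       Once filter.txt holds only the boring words, the interesting ones
       can be listed again as a sanity check.  This is a sketch, assuming
       both files hold one word per line.  interesting_words is a
       hypothetical helper and the sample files are made up.

```shell
# interesting_words CANDIDATES FILTER: print lines of CANDIDATES that do not
# appear in FILTER.  -F: fixed strings, -x: whole-line match, -v: invert.
interesting_words() {
    grep -F -x -v -f "$2" "$1"
}

tmp=$(mktemp -d)
printf 'gopher\nthat\nslackware\nthe\n' >"$tmp/words2.txt"
printf 'that\nthe\n' >"$tmp/filter.txt"
interesting_words "$tmp/words2.txt" "$tmp/filter.txt"
rm -rf "$tmp"
```

       Whole-line matching matters here: without -x, filtering "the"
       would also drop words like "other".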
       
       Report word count, excluding filtered words:
       
           $ cat >filter.awk <<'__EOF__'
       BEGIN {
           file = "filter.txt"
           while ((getline <file) > 0) {
               filter[$0] = 1
           }
           close(file)
       }
       
       {
           # skip word if it begins or ends with a digit
           if (/^[0-9]/ || /[0-9]$/) {
               next
           }
       
           # skip word if it's less than 3 characters long
           if (length($0) < 3) {
               next
           }
       
           # skip word if it's in filter.txt
           if ($0 in filter) {
               next
           }
       
           words[$0]++
       }
       
       END {
           for (word in words) {
               count = words[word]
       
               # skip word if it occurred fewer than 10 times
               if (count < 10) {
                   continue
               }
       
               printf "%d %s\n", count, word
           }
       }
       __EOF__
           
           $ awk -f filter.awk words.txt | sort -n >words3.txt
       
       I found the Python wordcloud generator on the following web pages.
       
 (HTM) Create Fun Word Cloud Images Easily In Linux Terminal
       
 (HTM) WordCloud Only Supported For TrueType Fonts
       
       Install Python wordcloud generator.  On Slackware it is necessary to
       upgrade pip and Pillow first:
       
           # pip3 install --upgrade pip
           # pip3 install --upgrade Pillow
           # pip3 install wordcloud
       
       Finally, generate a word cloud:
       
           $ wordcloud_cli --text words3.txt --background white    \
               --font CaslonAntique.ttf --imagefile word-cloud.png \
               --width 800 --height 600
           
           $ pngtopam word-cloud.png |\
               cjpeg -optimize -quality 80 >word-cloud.jpg
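
       words3.txt holds "count word" lines.  The following is an
       assumption based on the wordcloud documentation, not something
       verified here: wordcloud_cli counts the words in the text it
       reads rather than using a count column, and skips numbers by
       default.  If that holds, one option is to repeat each word by its
       count before feeding it in.  This is only a sketch.  expand_counts
       and words4.txt are hypothetical names.

```shell
# expand_counts: turn "count word" lines into the word repeated count times,
# one per line, so a plain-text word counter sees the original frequencies.
expand_counts() {
    awk '{ for (i = 0; i < $1; i++) print $2 }'
}

printf '1 slackware\n3 gopher\n' | expand_counts

# Hypothetical use with the files from this post:
#   expand_counts <words3.txt >words4.txt
```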
       
       That's it!
       
       tags: bencollver,technical,unix
       