[HN Gopher] The Bluesky Dictionary
       ___________________________________________________________________
        
       The Bluesky Dictionary
        
       Author : gaws
       Score  : 48 points
       Date   : 2025-08-06 20:43 UTC (2 hours ago)
        
 (HTM) web link (www.avibagla.com)
 (TXT) w3m dump (www.avibagla.com)
        
       | neaden wrote:
       | Is this not working, or am I missing something? It just shows 0
       | words seen for me. Firefox on a PC.
        
         | SirFatty wrote:
         | Same... maybe you need a Bluesky account, which I don't have.
        
           | gpm wrote:
           | It doesn't... I can open it in a private browsing window.
        
         | GalaxyNova wrote:
         | It's working fine for me on Firefox
        
         | accrual wrote:
         | You may need to allow scripts from the domain avibagla.com, it
         | shows 0 when the scripts are blocked.
        
           | zem wrote:
           | ugh, it ought to be building the results on the server and
           | serving up static pages.
        
         | AgentME wrote:
         | For me it took a minute to start loading data and switch from
         | just showing 0.
        
       | GalaxyNova wrote:
       | fascinating! I think it's really cool that this is possible, and
       | at the same time kind of sad that the norm is slowly moving
       | towards more locked-down APIs.
        
         | timeon wrote:
         | > slowly moving towards
         | 
         | Depends what we accept as norm.
        
       | 75345d4c wrote:
       | I just saw it indexed "eluvium," but the post was referring to a
       | band with that same name
        
         | Kye wrote:
         | GeologySky will get to it soon enough.
        
         | atlgator wrote:
         | I checked out the author's other projects and this is a common
         | issue. For example, he has a "lean checker" for bluesky that
         | claims it is right-leaning simply because of all the people
         | saying "That's right," "He was right," etc. None of the
         | supposed right-leaning posts were actually conservative in
         | nature. They just used the word "right" to mean correct.
        
           | avibagla1 wrote:
           | one, thank you for checking my website. two, that is the
           | joke, 100% - at the time people kept talking about how "left
           | leaning" bsky was and that idea came to mind
        
       | wantlotsofcurry wrote:
       | I'm very curious as to how this works in the backend. I realize
       | it uses Bluesky's firehose to get the posts, but I'm more curious
       | on how it's checking whether a post contains any of the available
       | words. Any guesses?
        
         | bangaladore wrote:
         | Maybe I'm being naive, but with only ~275k words to check
         | against, this doesn't seem like a particularly hard problem.
         | Ingest post, split by words, check each word via some db,
         | hashmap, etc... and update metadata.
        
         | gpm wrote:
         | Probably just a big hashtable mapping word -> the number of
         | times it's been seen, and another hashset of all the words it
         | hasn't seen. When a post comes in you hash all the words in it
         | and look them up in the hashtable, increment it, and if the old
         | value was 0 remove it from the hash set.
         | 
         | 250k words at a generous 100 bytes per word is only 25MB of
         | memory...
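
The approach gpm describes can be sketched roughly as below. This is a
minimal illustration, not the site's actual code: the class name
WordTracker, the ingest method, and the regex-based tokenizer are all
hypothetical, and the tokenizer only handles plain ASCII letters.

```python
import re

# Assumption: a word is a run of ASCII letters; real posts need better tokenizing.
WORD_RE = re.compile(r"[a-z]+")


class WordTracker:
    """Hypothetical tracker: a count table plus a set of still-unseen words."""

    def __init__(self, dictionary):
        words = list(dictionary)
        self.counts = {w: 0 for w in words}  # word -> times seen
        self.unseen = set(words)             # words with a count of 0

    def ingest(self, post_text):
        """Count dictionary words in one post; return words seen for the first time."""
        newly_seen = []
        for word in WORD_RE.findall(post_text.lower()):
            if word not in self.counts:
                continue  # not a dictionary word
            if self.counts[word] == 0:
                # Old value was 0, so the word leaves the unseen set.
                self.unseen.discard(word)
                newly_seen.append(word)
            self.counts[word] += 1
        return newly_seen
```

Each post costs one hash lookup per token, so the per-post work is
proportional to the post length and independent of the dictionary size.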
        
         | f311a wrote:
         | You can probably fit all words under 10-15MB of memory, but
         | memory optimisations are not even needed for 250k words...
         | 
         | Trie data structures are memory-efficient for storing such
         | dictionaries (2-4x better than hashmaps), although they are
         | not as fast as hashmaps for retrieving items. You can hash the
         | top 1k of the most common words and check the rest using a trie.
         | 
         | The most CPU-intensive task here is text tokenizing, but there
         | are a ton of optimized options developed by orgs that work on
         | LLMs.
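
The hybrid lookup f311a suggests (a hash set for the most common words, a
trie for the rest) might look something like the sketch below. The names
Trie and HybridLexicon are hypothetical. Note that this nested-dict trie
only illustrates the lookup logic; in Python it will not show the memory
savings mentioned above, which depend on a compact trie implementation.

```python
class Trie:
    """Minimal nested-dict trie supporting exact-word membership tests."""

    _END = ""  # end-of-word marker; never collides with a single character

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False  # ran off the trie before the word ended
        return self._END in node  # a prefix alone does not count as a word


class HybridLexicon:
    """Check a small hash set of common words first, then fall back to the trie."""

    def __init__(self, common_words, other_words):
        self.common = frozenset(common_words)
        self.rare = Trie(other_words)

    def __contains__(self, word):
        return word in self.common or word in self.rare
```

Usage is a plain membership test, for example
`"eluvium" in HybridLexicon(["the", "and"], ["eluvium"])`.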
        
         | stwrzn wrote:
         | I very much hope that the backend uses one of the bluesky
         | jetstream endpoints. When you only subscribe to new posts, it
         | provides a stream of around 20mbit/s last time I checked, while
         | the firehose was ~200mbit/s.
        
           | avibagla1 wrote:
           | yes it does!
        
         | avibagla1 wrote:
         | Hey! this is my site - it's not all that complex, i'm just
         | using a sqlite db with two tables - one for stats, the other
         | for all the words that's just word | count | first use | last
         | use | post.
         | 
         | I... did not expect this to be so popular
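
A two-table SQLite layout along the lines avibagla1 describes could be
sketched as follows. Only the column list (word, count, first use, last
use, post) comes from the comment; the column types, the shape of the
stats table, the function names open_db and record_post, and the choice
to store the most recent post in the post column are all assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS words (
    word      TEXT PRIMARY KEY,
    "count"   INTEGER NOT NULL DEFAULT 0,
    first_use TEXT,   -- timestamp of the first sighting (assumed ISO 8601 text)
    last_use  TEXT,   -- timestamp of the latest sighting
    post      TEXT    -- assumption: reference to the most recent matching post
);
CREATE TABLE IF NOT EXISTS stats (
    name  TEXT PRIMARY KEY,
    value INTEGER NOT NULL
);
"""


def open_db(path=":memory:", dictionary=()):
    """Create both tables and seed every dictionary word with a count of 0."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO words (word) VALUES (?)",
        [(w,) for w in dictionary],
    )
    conn.execute(
        "INSERT OR IGNORE INTO stats (name, value) VALUES ('posts_processed', 0)"
    )
    conn.commit()
    return conn


def record_post(conn, words, seen_at, post_ref):
    """Update counts for one post; tokens outside the dictionary match no row."""
    with conn:  # one transaction per post
        for word in words:
            conn.execute(
                'UPDATE words SET "count" = "count" + 1, '
                "first_use = COALESCE(first_use, ?), last_use = ?, post = ? "
                "WHERE word = ?",
                (seen_at, seen_at, post_ref, word),
            )
        conn.execute(
            "UPDATE stats SET value = value + 1 WHERE name = 'posts_processed'"
        )
```

With this layout, the "words we haven't seen" list is simply the rows
where the count is still 0.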
        
       | spullara wrote:
       | I did this against a pretty large tweet archive and got hits on
       | about 125k of the words in the unix dictionary.
        
       | pona-a wrote:
       | For a moment I thought it would be an AT-Proto based Urban
       | Dictionary clone.
        
       | tough wrote:
       | Words We Haven't Seen
       | 
       | - Search unseen words
       | 
       | made me chuckle
        
       ___________________________________________________________________
       (page generated 2025-08-06 23:00 UTC)