[HN Gopher] The Small World of English
       ___________________________________________________________________
        
       The Small World of English
        
       Author : michaeld123
       Score  : 107 points
       Date   : 2025-06-03 15:14 UTC (7 hours ago)
        
 (HTM) web link (www.inotherwords.app)
 (TXT) w3m dump (www.inotherwords.app)
        
       | michaeld123 wrote:
       | We built a 1.5M word semantic network where any two words connect
       | in ~6.43 hops (76% connect in <=7). The hard part wasn't the
       | graph theory--it was getting rich, non-obvious associations.
       | GPT-4's associations were painfully generic: "coffee - beverage,
       | caffeine, morning." But we discovered LLMs excel at validation,
       | not generation. Our solution: Mine Library of Congress
       | classifications (648k of them, representing 125 years of human
       | categorization). "Coffee" appears in 2,542 different book
       | classifications--from "Coffee trade--Labor--Guatemala" to "Coffee
       | rust disease--Hawaii." Each classification became a focused
       | prompt for generating domain-specific associations. Then we
       | inverted the index: which classifications contain both
       | "algorithm" and "fractals"? Turns out: "Mathematics in art" and
       | "Algorithmic composition." This revealed connections like
       | algorithm-Fibonacci-golden ratio that pure co-occurrence or word
       | vectors miss. The "Montreal Effect" nearly tanked the project--
       | geographic contamination where "bagels" spuriously linked to
       | "Expo 67" because Montreal is famous for bagels. We used LLMs to
       | filter true semantic relationships from geographic coincidence.
       | Technical details: 80M API calls, superconnector deprecation
       | (inverse document frequency variant), morphological
       | deduplication. Built for a word game but the dataset has broader
       | applications.
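The inverted-index step described above can be sketched in a few lines. This is a minimal illustration, not the author's pipeline: the classification headings are the ones quoted in the post, but the term lists, the `CLASSIFICATIONS` dict, and both function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical sample: classification heading -> associated terms.
# Headings are quoted from the post; the term lists are invented.
CLASSIFICATIONS = {
    "Mathematics in art": ["algorithm", "fractals", "golden ratio", "Fibonacci"],
    "Algorithmic composition": ["algorithm", "fractals", "music"],
    "Coffee trade--Labor--Guatemala": ["coffee", "labor", "Guatemala"],
    "Coffee rust disease--Hawaii": ["coffee", "fungus", "Hawaii"],
}


def build_inverted_index(classifications):
    """Map each lowercased term to the set of headings that mention it."""
    index = defaultdict(set)
    for heading, terms in classifications.items():
        for term in terms:
            index[term.lower()].add(heading)
    return index


def shared_classifications(index, word_a, word_b):
    """Return the headings containing both words, sorted for stable output."""
    a = index.get(word_a.lower(), set())
    b = index.get(word_b.lower(), set())
    return sorted(a & b)


index = build_inverted_index(CLASSIFICATIONS)
print(shared_classifications(index, "algorithm", "fractals"))
# ['Algorithmic composition', 'Mathematics in art']
```

Set intersection keeps the lookup cheap even when a word such as "coffee" appears under thousands of headings.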
        
         | marviel wrote:
         | Thanks for sharing!
         | 
         | Which embedding types did you try? I'm surprised that
         | embeddings weren't able to take you further with this.
        
           | michaeld123 wrote:
           | Early on, I tried older embeddings (Word2Vec & GloVe), and
           | later newer ones (OpenAI's ada and text-embedding-3-x).
        
         | gagzilla wrote:
         | Very cool and fascinating. I wonder if there are other insights
         | that can be drawn from what you've built. Like which two words
         | (or such pairs) have the longest sequence of hops to connect?
         | Or what are the top "superconnectors"? Or if there is a
         | plausible correlation between how well a word is connected to
         | how old it is?
        
           | michaeld123 wrote:
           | Longest paths: The tail maxes out at 15 hops. These extreme
           | paths are disappointingly mechanical--not the poetic
           | distances you'd hope for:
           | 
           | * Technical jargon - unrelated obscurities: gryllacridid
           |   (cricket family) - microclots
           | * Proper nouns - common words: Trish Stratus (wrestler) -
           |   federating
           | * Numbers - anything: 9451 - shoulds
           | 
           | Mostly hyper-specific terms with few inbound connections,
           | obscure conjugations, or rare idioms.
           | 
           | Superconnectors: We systematically removed generic hubs, but
           | your question prompted us to analyze which words still act as
           | natural bridges. Added it to the article with an interactive
           | explorer! Top survivors:
           | 
           | * polish (0.18% of paths) - verb/nationality homograph
           | * symbiosis (0.14%) - biology - cooperation bridge
           | * treaty (0.13%) - conflict - resolution bridge
           | 
           | Thanks for the curiosity--it led to an interesting addition.
           | 
           | Age correlation: No hard data, but I suspect you're right.
           | Older words have had centuries to accumulate meanings and
           | develop polysemous bridges.
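Hop counts and "share of paths" figures like the ones above can be computed with breadth-first search. The sketch below is only an assumption about how such a metric might be defined: it takes one shortest path per word pair and counts interior words, which may differ from how the author measured it. The toy `GRAPH` and its edges are invented.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy graph (undirected adjacency lists); edges are invented.
GRAPH = {
    "war": ["treaty"],
    "treaty": ["war", "peace"],
    "peace": ["treaty", "calm"],
    "calm": ["peace"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def bridge_share(graph):
    """Fraction of sampled shortest paths passing through each interior word."""
    counts = {word: 0 for word in graph}
    total = 0
    for a, b in combinations(sorted(graph), 2):
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        total += 1
        for word in path[1:-1]:  # skip the two endpoints
            counts[word] += 1
    if total == 0:
        return {word: 0.0 for word in counts}
    return {word: n / total for word, n in counts.items()}


path = shortest_path(GRAPH, "war", "calm")
print(path, len(path) - 1)  # the path and its hop count
print(bridge_share(GRAPH))
```

On a 1.5M-word graph an all-pairs loop like this is far too slow, so sampling random pairs would be the practical variant.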
        
       | o11c wrote:
       | Something about this website makes scrolling lag even with
       | Javascript disabled. Firefox 128 on Linux.
       | 
       | Very interesting topic though.
        
         | michaeld123 wrote:
         | thanks. Let me see if I can diagnose the scrolling lag.
        
           | michaeld123 wrote:
           | Removed an ill-conceived backdrop-filter: blur(10px);
           | downsampled the images, and made some other little fixes.
        
       | trizoza wrote:
       | Any plans on launching on Android, or simply just browser based
       | web version?
        
         | michaeld123 wrote:
         | Thanks for your interest! The data is used in our iOS game and
         | visualizer.
        
       | suddenlybananas wrote:
       | I don't find many of these transitions very appealing. Sweet to
       | harmony? Seems a stretch. Nightjar to chirring to bombylious?
       | Might as well be gobbledygook.
        
         | droopyEyelids wrote:
         | It said the path between "double lock" and "dislodge" was a
         | tortured 10 word chain, it seems like you could get there much
         | faster
         | 
         | "Double lock" > "clasp" > "grab" > "dislodge"
         | 
         | It's just a quick example, but I think it follows their "rough
         | synonym" style connections, and it's not less reasonable than
         | the examples.
         | 
         | To me, it feels like this project is kind of hampered by not
         | having a rigorous definition of what is allowable, and then
         | mixing in the sort of random effects of an LLM
        
           | michaeld123 wrote:
           | Good points. The main limiter is often what words happen to
           | surface as the top 17 connections, or in those random
           | examples when there's a plural or conjugation.
           | 
           | Since this is getting eyeballs here, I will look for some
           | less tortured long-paths to add as examples.
        
           | dleeftink wrote:
           | A smaller trainable set would be a dictionary, and only
           | linking the terms as expressed in the definition, possibly
           | with substitutions. You'd miss more abstract jumps, but the
           | initial walks would be tractable.
           | 
           | (It is a game best played with a grandparent's pre-war
           | dictionary before tea-time)
        
         | michaeld123 wrote:
         | You are right. I replaced it with new paths that are +1 longer.
         | These are actual paths from our game now.
        
       | Jordan-117 wrote:
       | How is a largely text-based app 3.47 GB? Is the
       | dictionary/semantic DB just that large or is there other stuff
       | going on?
        
         | michaeld123 wrote:
         | I wish it were not so! Here's the breakdown: 1.5M headwords x
         | ~2KB average per entry >= 3GB. Each entry contains:
         | 
         | * 40 associations in the core graph
         | * Multiple senses (up to 8) x 17 associations each = up to 136
         |   more
         | * Stems and morphological variants
         | * In-game clue definitions
         | * Longer definition entries with several types of related word
         |   lists
         | 
         | The only good news is it works offline.
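The arithmetic in the breakdown above checks out as a back-of-the-envelope calculation. Treating "~2KB" as exactly 2,000 bytes and using decimal gigabytes are assumptions made here for illustration; the variable names are likewise invented.

```python
HEADWORDS = 1_500_000      # "1.5M headwords", from the comment
BYTES_PER_ENTRY = 2_000    # "~2KB average", taken as 2,000 bytes (assumption)

total_bytes = HEADWORDS * BYTES_PER_ENTRY
total_gb = total_bytes / 1_000_000_000  # decimal gigabytes

CORE_ASSOCIATIONS = 40       # associations in the core graph
MAX_SENSES = 8               # "up to 8" senses
ASSOCIATIONS_PER_SENSE = 17  # associations per sense

extra_associations = MAX_SENSES * ASSOCIATIONS_PER_SENSE
max_associations = CORE_ASSOCIATIONS + extra_associations

print(f"{total_gb:.1f} GB")                  # 3.0 GB
print(extra_associations, max_associations)  # 136 176
```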
        
           | senkora wrote:
           | I do think that as a practical matter it might be better to
           | store the word graph in the cloud and query it from the
           | client.
           | 
           | You could either store the word graph as a partitioned set of
           | S3 buckets, or have a back-end that serves individual words
           | and does rate-limiting. I guess that the back-end might be
           | better to avoid surprise egress charges from anyone trying to
           | download the entire dataset.
           | 
           | I want to try out the game but I'm discouraged by the
           | download size.
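The back-end option suggested above (serve individual words, rate-limit each client) could look roughly like the token-bucket sketch below. Nothing here comes from the app itself: `TokenBucket`, `lookup`, the `WORDS` store, and the status codes are all hypothetical, and a real service would keep one bucket per client behind an HTTP framework.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/sec."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Refill based on elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical word store; real entries would hold senses, clues, and so on.
WORDS = {"coffee": ["caffeine", "roast"]}


def lookup(word, bucket):
    """Return (status, payload) in the spirit of an HTTP handler."""
    if not bucket.allow():
        return 429, None  # too many requests
    if word not in WORDS:
        return 404, None
    return 200, WORDS[word]


bucket = TokenBucket(capacity=2, rate=1.0)
print(lookup("coffee", bucket))
```

Serving one word per request also means nobody can pull the whole dataset in a single transfer, which is the egress concern raised in the comment.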
        
         | hx8 wrote:
         | The first really large text data I ever encountered was Google
         | Ngram[0], the total size of which is about 3TB. I would have
         | guessed it was closer to 3GB before I started downloading it.
         | 
         | [0]
         | https://storage.googleapis.com/books/ngrams/books/datasetsv3...
        
           | michaeld123 wrote:
           | Yes. I love Google Ngrams.
           | 
           | We use the top google Ngrams in 2 ways. (a) we share it in
           | the reference mode of our app, i.e. common words before or
           | after; (b) we use longer N-grams, where possible, like a
           | 4-gram, to choose literary examples that also show a common
           | pattern.
        
       | hftf wrote:
       | I really enjoyed the article, reading it more from the
       | perspective of what 21st-century lexicography could be, less as a
       | customer of a word game however thoughtfully designed. As a
       | Wiktionary editor (and Android user who's also grown out of bare
       | word-relationship puzzle games) though, it's sad that there seems
       | to be no way to just use the end-product network as a reference,
       | which I would love to do, but I suppose they did spend a million
       | bucks on it.
       | 
       | I'll also use this post to wish that more people would edit
       | Wiktionary. It has such a good mission (information on all words)
       | and yet there are only like 80 people editing on any given day or
       | whatever. In some languages, it's even the best or most updated
       | dictionary available. The barriers to entry and bureaucracy are
       | really not high for HN audience types.
        
         | michaeld123 wrote:
         | I second that! I have edited a few Wiktionary pages myself, and
         | find it's a better overall environment than Wikipedia, if you
         | can find something meaningful to add.
        
         | Suppafly wrote:
         | >I'll also use this post to wish that more people would edit
         | Wiktionary.
         | 
         | If it's anything like wikipedia, there is probably a reason
         | more people aren't working on it, and it's because the existing
         | people discourage it.
        
           | hftf wrote:
           | I get the impulse to assume they'd be alike, but I've found
           | that Wiktionary really isn't much like Wikipedia.
        
             | card_zero wrote:
             | The Wiktionary equivalent of citing sources confuses me.
             | 
             | https://en.wiktionary.org/wiki/Wiktionary:Criteria_for_incl
             | u...
             | 
             | Which words should be attested? Presumably only uncommon
             | ones? And how is it done, is the "quotes" section the
             | attestation? Is there vandalism to clean up, like people
             | adding their own names to define themselves as awesome?
             | Wiktionary seems to "just work", and I don't really
             | understand what holds it together.
        
         | mmooss wrote:
         | > it's sad that there seems to be no way to just use the end-
         | product network as a reference, which I would love to do, but I
         | suppose they did spend a million bucks on it.
         | 
         | From the OP: "This research and computational scale was made
         | possible by $295k NSF SBIR seed funding (#2329817) and $150k
         | Microsoft Azure compute resources." Does that NSF funding mean
         | it's open source? Also, I'm not 100% sure that the quote
         | applies to all the research rather than just one component of
         | it.
         | 
         | > I'll also use this post to wish that more people would edit
         | Wiktionary. It has such a good mission (information on all
         | words) ...
         | 
         | I support open source, contribute to it, and love the spirit of
         | Wiktionary, but I don't understand the practical reality of
         | applying 'wisdom of the crowds' to a dictionary, especially the
         | English edition, for two reasons:
         | 
         | Definitions are highly accurate (complete, correct,
         | consistent), highly precise things - otherwise, what is their
         | value? Assuming Wiktionary is _descriptive_ - reporting the
         | words' actual usage - it takes quite a bit of scholarship,
         | skill, and editorial resources not to mislead people. I can't
         | just write what I think it means - the meaning to me might not
         | match the meaning to the person at the next desk. It takes
         | quite a bit of research, using powerful (and sometimes
         | expensive) tools, and understanding of lexicography to be
         | complete and also precisely correct, including usages in places
         | and times that are mostly unknown to any particular author.
         | Also, writing definitions is tricky: You are using words -
         | which have those aforementioned problems with meaning - to
         | define words. Also, any writing anywhere can be easily
         | misinterpreted - skill and editors are needed to avoid
         | misunderstanding. How is the accuracy and precision problem
         | solved?
         | 
         | Also, in English there are already many authoritative sources,
         | many with a century of professional lexicography behind them by
         | the best in the business. Some are free. There are also meta-
         | lookup engines such as Wordnik and OneLook. Why use Wiktionary?
         | The few times I've compared definitions or etymologies, the
         | authoritative sources almost always exceed or equal Wiktionary
         | (though online copies of older print editions suffer from the
         | minimalism caused by the constraint of printing costs).
         | Arguably, there is nothing else both unabridged and free:
         | Oxford unabridged costs $, so does Merriam-Webster (the free
         | edition is abridged); American Heritage is free, but has the
         | minimalism issue I mentioned above.
        
           | genewitch wrote:
           | I'm one of those people who says, unironically, "words have
           | meanings." I readily argue with people who present "language
           | is living and evolves" - sure, but in order to communicate we
           | have to agree on a decent subset of overall definitions.
           | 
           | I enjoy etymology, maybe too much. It's like magic, finding
           | out what a barrow was, or how filibuster has a direct lineage
           | to pirates (freebooters... In Dutch.)
           | 
           | I can't afford, really, the nicer Old English, Scandi,
           | Frisian, Norse, etc. etymology dictionaries. I have incomplete
           | scans that were printed and bound of some of them. I still
           | have 6 etymology dictionaries, so I can be about as quick
           | getting a dictionary as getting on the computer and going to
           | !eo.
        
             | PaulDavisThe1st wrote:
             | > in order to communicate we have to agree on a decent
             | subset of overall definitions.
             | 
             | sociologically speaking, however, it is precisely that
             | agreement that is what evolves alongside changes in
             | spelling, pronunciation (and occasionally "new" words).
        
           | hftf wrote:
           | I don't think definitions "are" highly accurate precise
           | things. Sometimes yes. The same scholarship, skill, and need
           | to not mislead also applies for so many other things:
           | encyclopedic articles, taxonomies, news, maps, operating
           | systems. Do people still question the value of Wikipedia,
           | OpenStreetMap? Yeah, there are problems with them, and with
           | peer review. Using fuzzy words (or fuzzy phonetic symbols,
           | fuzzy categories, fuzzy semantic links...) to define words is
           | a problem (if at all) of literally any dictionary. I don't
           | see any of these as particularly unique obstacles for
           | Wiktionary.
           | 
           | Unabridged dictionaries take decades to release new editions
           | and are still navigating transition into the exploding
           | digital age. They are so expansive in scope, while often so
           | limited in resources, and barely accept any crowd
           | contributions. Such deliberately slow-going is often a good
           | thing, but words also change quite quickly and these sources
           | are now playing a very long game of catch-up. (Yesterday I
           | tried to verify the latter English senses of "fandango" on
           | Wiktionary with other dictionaries; OED's entry has not been
           | touched for 131 years! What am I going to do with that, I
           | need to use / understand the word now!)
           | 
           | Wiktionary is the big web-native word-resource (and is not
           | cluttered with commercial junk) - allowing links, expandable
           | quotes, images, diagrams, etc. that print's minimalism
           | suffers from as you mention. When someone in 2025 wants
           | information on a word, they'll likely use a search engine and
           | click a link to Wiktionary (where Google blurbs steal some
           | data from). Maybe they are a student wanting to confirm their
           | nonstandard pronunciation with the IPA (still rarely used in
           | mainstream English dictionaries) or if it's recognized in
           | their own dialect (mainstream dictionaries rarely provide
           | more than UK and US pronunciations) - if enough people have
           | the same question, Wiktionary seems like the best place to
           | put the answer - or see an accessible etymology tree. While
           | you probably know this, it's also worth reminding that
           | English Wiktionary isn't just for English words, it is a
           | dictionary of all languages' words, which is written in
           | English. It has metadata and links connecting languages'
           | words that you can't find elsewhere.
           | 
           | Yes, I indeed do want people to just write what they think a
           | word means - as a starting point in a collaborative refining
           | process. I believe the number of word-users in the world with
           | valuable potential contributions is a lot closer to a billion
           | than the thousand gatekeepers working hard on classical
           | dictionaries. The barrier to entry is really low, but the
           | tooling could still be much better. This is one reason I'm
           | putting my appeal under this article - because I think
           | (professional) lexicography can stand to evolve more in the
           | 21st century. (And are people today really buying enough
           | dictionaries to sustain a professional version of Wiktionary,
           | or even a professional dictionary offered in structured data
           | form?) If we don't contribute to a crowdsourced dictionary,
           | then we won't have any such thing.
           | 
           | (Meta-lookup sites are link/search engines, not dictionaries
           | and IME really don't do a good job synthesizing their
           | information or conventions.)
        
           | bloak wrote:
           | "Why use Wiktionary?"
           | 
           | I can answer that one. I have free access to the Oxford
           | English Dictionary (OED), which is brilliant and generally
           | more detailed and reliable than Wiktionary when it has the
           | word I'm looking for, but their login page is so awful that I
           | sometimes use en.wiktionary.org instead just to save my time
           | and temper. Also, en.wiktionary.org has proper nouns, other
           | languages, and occasionally it has some recent or technical
           | English word that OED does not have. So if I'm doing some
           | serious amateur research: OED. But if I'm doing a crossword
           | and want to check that a word exists and is spelt how I think
           | it is: Wiktionary.
        
         | 0cf8612b2e1e wrote:
         | Could I make a plea to make a Wiktionary export easier to
         | find/use? Assuming I can even find the magical page which hosts
         | them, Wikipedia dumps are terribly documented and seem to
         | incorporate shorthand which I do not recognize.
        
           | michaeld123 wrote:
           | And they are full of wiki markup, templates, and inconsistent
           | formatting. A human brain can easily understand it, but
           | automated parsing is impossible (pre LLM).
        
       | dhashe wrote:
       | This is very cool. In puzzlehunts, we often use tools to assist
       | with solving and writing puzzles (the classic example is
       | https://nutrimatic.org ).
       | 
       | Years ago, I wrote a puzzlehunt puzzle that involved navigating
       | through words where an edge existed if the two words formed a
       | common 2-gram (that is, they often appeared one after another in
       | a text dump of Wikipedia).
       | 
       | For example, a fragment of the graph from the puzzle is: mit ->
       | press -> office <- post <- blog.
       | 
       | This work is obviously much more advanced, and it's very cool to
       | see that they managed to make it work with semantic connections.
       | I was able to get away with a much simpler approach since I only
       | cared about 2-grams over a set of about 1000 words (I literally
       | used a grep command over the entire text of the English
       | wikipedia; it took about a day to run).
       | 
       | But the core idea is shared: 1) wanting to build a graph
       | representation of word connections for a puzzle, 2) it being way
       | too much work to do that manually, 3) you would miss a bunch of
       | edges if you did do it manually, so 4) use programming tools to
       | construct a dataset, and then 5) the end result is surprisingly
       | fun for the user because the dataset is comprehensive and it
       | feels really natural.
       | 
       | If anyone is curious, the puzzlehunt puzzle is here:
       | https://dhashe.com/files/puzzles/word-wide-web.pdf
       | 
       | And the solution is here:
       | https://dhashe.com/files/puzzles/word-wide-web-sol.pdf
       | 
       | And a fair warning to anyone unfamiliar with puzzlehunt puzzles:
       | they do not come with instructions and it is very common to get
       | stuck when solving them, especially when solving them alone. You
       | have not completely solved a puzzlehunt puzzle until you extract
       | an answer word or phrase from the puzzle. This one has an extra
       | layer after filling in the words in the graph. Peeking at the
       | solution is encouraged if you get stuck.
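The 2-gram graph described in this comment (an edge wherever two vocabulary words commonly appear one after another) can be sketched as follows. The commenter used a grep over a Wikipedia text dump; this version swaps in a small in-memory counter, and `SAMPLE`, `VOCAB`, `bigram_edges`, and the `min_count` threshold are all invented for illustration.

```python
import re
from collections import Counter


def bigram_edges(text, vocabulary, min_count=2):
    """Return directed edges (first, second) for adjacent in-vocabulary
    word pairs that occur at least `min_count` times in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        (a, b)
        for a, b in zip(words, words[1:])
        if a in vocabulary and b in vocabulary
    )
    return {pair for pair, n in counts.items() if n >= min_count}


# Invented stand-in for a Wikipedia text dump.
SAMPLE = (
    "The MIT Press office moved. The post office closed. "
    "A blog post about MIT Press. The press office and the post office. "
    "Another blog post."
)
VOCAB = {"mit", "press", "office", "post", "blog"}

edges = bigram_edges(SAMPLE, VOCAB)
print(sorted(edges))
# [('blog', 'post'), ('mit', 'press'), ('post', 'office'), ('press', 'office')]
```

The resulting edges reproduce the fragment quoted in the comment: mit -> press -> office <- post <- blog.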
        
         | michaeld123 wrote:
         | That's an interesting puzzle. I hope more types of word puzzles
         | continue to be created.
        
       | totaldude87 wrote:
       | I was looking for a similar app for my upcoming book! At times
       | it's very hard to get the word that we are looking for and hope
       | this solves it!
       | 
       | I know this is not related to the app but still wanted to
       | appreciate the thought
        
       | us-merul wrote:
       | I really liked this article and these types of analyses always
       | capture me. I just had to try out the game then. I nailed the
       | link to "moon" from "rise" on my first try. Then I was a bit let-
       | down for my first real task to get to "chill" starting from
       | "chain." I went first to conglomerate, then corporation, then
       | management... thinking I would at some point encounter "cold,"
       | and then "chill". Unfortunately not. Then I tried from chain to
       | something like (my memory is imperfect here), necklace, jewelry,
       | brilliance, glow, tranquil, calm-- and on a couple of other
       | tries, appease, mollify, relax-- but could never get to "chill."
       | I was able to win eventually by appealing to temperature which
       | led me to chill.
       | 
       | Is there anything the user could do to modify the next steps,
       | other than picking a word? Perhaps selecting some sort of valence
       | related to metaphor or meaning? "I want to pick 'pacify', but in
       | the sense of calming down, not to utterly destroy."
        
         | michaeld123 wrote:
         | Thanks for reporting on your experience! Those are good
         | questions, and I will think about your valence idea for the
         | future.
         | 
         | On a shorter horizon, I can tune the probability that on-path
         | terms appear in the cloud. We store a larger pool of words than
         | are displayed, and calculate lookaheads (and lookbacks from the
         | target).
        
           | mmooss wrote:
           | Maybe the user could type in their own words, and the app
           | could approve/disapprove based on the 40 word list.
           | 
           | But maybe that adds an entirely new normalization function -
           | user types 'runs' or 'ran', the app has to normalize to
           | 'run'.
           | 
           | The app could just have a 'more words' button, loading the
           | next 17.
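The typed-word idea above needs the normalization step the comment mentions. A minimal sketch, assuming a hand-written variant table: `LEMMAS`, `ASSOCIATIONS`, `normalize`, and `approve` are hypothetical names, and a real app would presumably draw on its stored stems and morphological variants instead.

```python
# Hypothetical variant table mapping inflected forms to a headword.
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}

# Hypothetical association list for the current word
# (the thread mentions 40 associations per word; three are shown here).
ASSOCIATIONS = {"sprint": ["run", "dash", "race"]}


def normalize(word):
    """Trim, lowercase, and map known inflected forms to their headword."""
    cleaned = word.strip().lower()
    return LEMMAS.get(cleaned, cleaned)


def approve(current_word, typed):
    """True if the normalized guess is among the current word's associations."""
    return normalize(typed) in ASSOCIATIONS.get(current_word, [])


print(approve("sprint", " Ran "))  # True
print(approve("sprint", "walk"))   # False
```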
        
           | us-merul wrote:
           | Thanks for your response. Getting feedback like "hot or cold"
           | in the algorithm's mind is exactly what I'm thinking of. It's
           | a tricky issue and reminds me a lot of this:
           | https://www.datcreativity.com/
           | 
           | I had tried hard to pick a set of fairly simple words,
           | thinking I had an intricately unique association in my head,
           | only to find out that the reported connections were nothing
           | more than average. My partner obviously landed in an
           | extremely high percentile by instantly picking the first
           | words that came to her without much thought.
        
             | michaeld123 wrote:
             | For good or bad, Semantle is able to report hot/cold
             | because it's vector-based. We tried a few types of vectors,
             | but I thought they were consistently unintuitive. So the
             | best (and most relevant) proxy is remaining shortest
             | distance-to-target, but often the player is only two hops
             | away (spanning 17^2=289 options), and when they go astray
             | and are much further, it's computationally too slow to look
             | out more than 5 hops with brute force.
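The "remaining shortest distance-to-target" proxy with a hop cutoff can be sketched as a depth-limited breadth-first search. This is an illustration under assumptions, not the game's code: the toy `GRAPH`, `distance_within`, and `hint` are invented, and the 5-hop limit simply mirrors the figure in the comment.

```python
from collections import deque

# Hypothetical toy graph; the edges are invented for the example.
GRAPH = {
    "chain": ["ice", "necklace"],
    "ice": ["cold"],
    "cold": ["chill"],
    "necklace": ["jewelry"],
    "jewelry": ["brilliance"],
}


def distance_within(graph, start, target, max_hops=5):
    """Breadth-first search that gives up past max_hops; returns hops or None."""
    if start == target:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the search limit
        for neighbor in graph.get(node, []):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return None


def hint(before, after):
    """Compare remaining distances; None means 'beyond the search limit'."""
    far = float("inf")
    before = far if before is None else before
    after = far if after is None else after
    if after < before:
        return "warmer"
    if after > before:
        return "colder"
    return "same"


start = distance_within(GRAPH, "chain", "chill")
print(hint(start, distance_within(GRAPH, "ice", "chill")))       # warmer
print(hint(start, distance_within(GRAPH, "necklace", "chill")))  # colder
print(17 ** 2, 17 ** 5)  # 289 1419857
```

With 17 options per step, the candidate set grows from 289 at two hops to roughly 1.4 million at five, which is consistent with the brute-force limit described above.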
        
             | michaeld123 wrote:
             | Thanks for that datcreativity.com link. My score was 94.11,
             | higher than 99.88% of the people who have completed this
             | task! I should hope so after working on relations for
             | years. ;)
             | 
             | My words were: apple, shotgun, stardust, anger, hygiene,
             | etymology, proctology, slant, dictator, and displacement.
        
       | rafram wrote:
       | I wanted to try out your app, but I cancelled the download after
       | noticing that it's 3.5 gigabytes. How?! That's by far the biggest
       | iOS app I've ever seen.
        
         | michaeld123 wrote:
         | Sorry! The problem is ~2kb of data per 1.5M headwords. We
         | already use indices and brotli compression internally. I doubt
         | we could smush below 3GB.
        
           | rafram wrote:
           | Could you... have fewer headwords? That's like 5x the number
           | of headwords actually used in modern English. Or at least
           | download some of the data on demand?
        
           | neuroelectron wrote:
           | Wow only 3gb? Finally I have something to use the 128gb this
           | iThing came with besides Firefox and Kindle.
        
       | 6stringmerc wrote:
       | Dissociating English terms from their context and focusing on the
       | ease of relationship is a hilariously bad habit that people
       | actively are trained AWAY from using. The nuance of English is
       | absolutely going to break AI because even the example of "strong"
       | relationships are suspect in utility.
       | 
       | Seriously, when is the last time a casual speaker, writer, or
       | translator used "domicile" in place of "house" in your world?
       | It's an archaic term appropriated into legal jargon. Flattening
       | out language and drawing lines between terms is funny to me.
       | 
       | The only issue is normalizing "Thesaurus bashing" type
       | mentalities - like this - to degrade the value of coherent,
       | purposeful, meaningful use of English. It's an amalgamation
       | language with extremely difficult fluency. It's rife with idioms
       | and contradictory emotional context.
       | 
       | Oh well, I can grasp that I tend to yell at clouds when it comes
       | to this sort of thing. It doesn't change my opinion this is a
       | harmful exercise and probably should not exist. There are few
       | instances where playing a game will actually make one more
       | stupid, but here we are.
        
         | genewitch wrote:
         | I used domicile about 45 minutes ago in casual conversation
         | about fire ants in my abode. Habitation. Flat.
        
       | cadamsdotcom wrote:
       | Such an amazing data set with the amount of curation you've done
       | and the care with which it's been put together.
       | 
       | It'd be highly valuable as a thesaurus API.
        
         | michaeld123 wrote:
         | Thanks!.... Does anyone pay for thesaurus APIs anymore?
        
       | jcmeyrignac wrote:
       | Nice work! Here is a similar idea:
       | https://wordassociations.net/en
       | 
       | In French, there is a game to build relations with words (they
       | provide a word, and you have to type the most related words):
       | https://www.jeuxdemots.org They reached 677 million relations
       | in 2024!
        
       | akudha wrote:
       | What other word games do people enjoy? My favorites on iOS
       | 
       | Alpha Omega
       | 
       | Sticky Terms (I struggle with this)
       | 
       | Typeshift
       | 
       | Blackbar (old, not maintained, but we can still play. Not a game
       | in strict sense, very enjoyable)
        
         | michaeld123 wrote:
         | And where and how do people discover new word games?
        
           | akudha wrote:
           | I found the above from iOS search. I also ask around, but not
           | many people I know are interested in word games
           | unfortunately.
           | 
           | I suppose other languages have way less word games than
           | English?
        
       | slantaclaus wrote:
       | I remember in college I got all stoned in the library and
       | determined that you could find a semantic pathway using synonyms
       | to relate completely opposite terms with only a few nodes.
       | Completely blew my mind and I still think about it sometimes.
        
       ___________________________________________________________________
       (page generated 2025-06-03 23:00 UTC)