[HN Gopher] The Small World of English
___________________________________________________________________
The Small World of English
Author : michaeld123
Score : 107 points
Date : 2025-06-03 15:14 UTC (7 hours ago)
(HTM) web link (www.inotherwords.app)
(TXT) w3m dump (www.inotherwords.app)
| michaeld123 wrote:
| We built a 1.5M word semantic network where any two words connect
| in ~6.43 hops (76% connect in <=7). The hard part wasn't the
| graph theory--it was getting rich, non-obvious associations.
| GPT-4's associations were painfully generic: "coffee - beverage,
| caffeine, morning." But we discovered LLMs excel at validation,
| not generation.
|
| Our solution: Mine Library of Congress classifications (648k of
| them, representing 125 years of human categorization). "Coffee"
| appears in 2,542 different book classifications--from "Coffee
| trade--Labor--Guatemala" to "Coffee rust disease--Hawaii." Each
| classification became a focused prompt for generating
| domain-specific associations.
|
| Then we inverted the index: which classifications contain both
| "algorithm" and "fractals"? Turns out: "Mathematics in art" and
| "Algorithmic composition." This revealed connections like
| algorithm-Fibonacci-golden ratio that pure co-occurrence or word
| vectors miss.
|
| The "Montreal Effect" nearly tanked the project--geographic
| contamination where "bagels" spuriously linked to "Expo 67"
| because Montreal is famous for bagels. We used LLMs to filter
| true semantic relationships from geographic coincidence.
|
| Technical details: 80M API calls, superconnector deprecation
| (inverse document frequency variant), morphological
| deduplication. Built for a word game but the dataset has broader
| applications.
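A minimal sketch of the "inverted the index" step from the comment above, assuming each classification heading has already been reduced to a set of associated words. The two matching headings come from the comment; the word sets are toy data invented for illustration, and `build_inverted_index` and `shared_classifications` are hypothetical names, not the author's code.

```python
from collections import defaultdict

# Toy data: classification heading -> words associated with it.
# Headings are quoted in the thread; the word sets are made up.
CLASSIFICATIONS = {
    "Mathematics in art": {"algorithm", "fractals", "golden ratio", "fibonacci"},
    "Algorithmic composition": {"algorithm", "fractals", "music"},
    "Coffee trade--Labor--Guatemala": {"coffee", "labor", "export"},
}


def build_inverted_index(classifications):
    """Map each word to the set of classification headings that mention it."""
    index = defaultdict(set)
    for heading, words in classifications.items():
        for word in words:
            index[word].add(heading)
    return index


def shared_classifications(index, a, b):
    """Return the headings that contain both words, sorted for stable output."""
    return sorted(index.get(a, set()) & index.get(b, set()))


index = build_inverted_index(CLASSIFICATIONS)
print(shared_classifications(index, "algorithm", "fractals"))
```

Words that share many headings become candidate associations; the thread says an LLM pass then filtered out coincidental links, which this sketch does not attempt.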
| marviel wrote:
| Thanks for sharing!
|
| Which embedding types did you try? I'm surprised that
| embeddings weren't able to take you further with this.
| michaeld123 wrote:
| Early on, I tried both older (Word2Vec & GloVe), and later
| newer (OpenAI's ada + the text-embedding-3-x).
| gagzilla wrote:
| Very cool and fascinating. I wonder if there are other insights
| that can be drawn from what you've built. Like which two words
| (or such pairs) have the longest sequence of hops to connect?
| Or what are the top "superconnectors"? Or if there is a
| plausible correlation between how well a word is connected to
| how old it is?
| michaeld123 wrote:
| Longest paths: The tail maxes out at 15 hops. These extreme
| paths are disappointingly mechanical--not the poetic
| distances you'd hope for:
|
| * Technical jargon - unrelated obscurities: gryllacridid
|   (cricket family) - microclots
| * Proper nouns - common words: Trish Stratus (wrestler) -
|   federating
| * Numbers - anything: 9451 - shoulds
|
| Mostly hyper-specific terms with few inbound connections,
| obscure conjugations, or rare idioms.
|
| Superconnectors: We systematically removed generic hubs, but
| your question prompted us to analyze which words still act as
| natural bridges. Added it to the article with an interactive
| explorer! Top survivors:
|
| * polish (0.18% of paths) - verb/nationality homograph
| * symbiosis (0.14%) - biology - cooperation bridge
| * treaty (0.13%) - conflict - resolution bridge
|
| Thanks for the curiosity--it led to an interesting addition.
|
| Age correlation: No hard data, but I suspect you're right.
| Older words have had centuries to accumulate meanings and
| develop polysemous bridges.
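A rough sketch of how hop counts and a "share of paths" figure like the ones above could be computed, assuming an unweighted, undirected graph and one sampled shortest path per word pair. The toy edges are invented, and `bridge_share` is a hypothetical name; the thread does not say how the published percentages were actually measured.

```python
from collections import Counter, deque
from itertools import combinations

# Toy undirected graph; the edges are invented for illustration.
GRAPH = {
    "shine": {"polish"},
    "polish": {"shine", "warsaw", "wax"},
    "warsaw": {"polish"},
    "wax": {"polish", "candle"},
    "candle": {"wax"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def bridge_share(graph):
    """Fraction of connected pairs whose sampled path passes through each word."""
    counts = Counter()
    pairs = 0
    for a, b in combinations(sorted(graph), 2):
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        pairs += 1
        counts.update(path[1:-1])  # interior words only, not the endpoints
    return {word: n / pairs for word, n in counts.items()}


print(len(shortest_path(GRAPH, "shine", "candle")) - 1)  # hop count
print(bridge_share(GRAPH))
```

On a 1.5M-word graph, all-pairs search is impractical, so a real measurement would presumably sample pairs instead.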
| o11c wrote:
| Something about this website makes scrolling lag even with
| Javascript disabled. Firefox 128 on Linux.
|
| Very interesting topic though.
| michaeld123 wrote:
| thanks. Let me see if I can diagnose the scrolling lag.
| michaeld123 wrote:
| Removed an ill-conceived backdrop-filter: blur(10px),
| downsampled the images, and made some other little fixes.
| trizoza wrote:
| Any plans to launch on Android, or simply as a browser-based
| web version?
| michaeld123 wrote:
| Thanks for your interest! The data is used in our iOS game and
| visualizer.
| suddenlybananas wrote:
| I don't find many of these transitions very appealing. Sweet to
| harmony? Seems a stretch. Nightjar to chirring to bombylious?
| Might as well be gobbledygook.
| droopyEyelids wrote:
| It said the path between "double lock" and "dislodge" was a
| tortured 10 word chain, it seems like you could get there much
| faster
|
| "Double lock" > "clasp" > "grab" > "dislodge"
|
| It's just a quick example, but I think it follows their "rough
| synonym" style connections, and it's not less reasonable than
| the examples.
|
| To me, it feels like this project is kind of hampered by not
| having a rigorous definition of what is allowable, and then
| mixing in the sort of random effects of an LLM
| michaeld123 wrote:
| Good points. The main limiter is often what words happen to
| surface as the top 17 connections, or in those random
| examples when there's a plural or conjugation.
|
| Since this is getting eyeballs here, I will look for some
| less tortured long-paths to add as examples.
| dleeftink wrote:
| A smaller trainable set would be a dictionary, and only
| linking the terms as expressed in the definition, possibly
| with substitutions. You'd miss more abstract jumps, but the
| initial walks would be tractable.
|
| (It is a game best played with a grandparent's pre-war
| dictionary before tea-time)
| michaeld123 wrote:
| You are right. I replaced it with new paths that are +1 longer.
| These are actual paths from our game now.
| Jordan-117 wrote:
| How is a largely text-based app 3.47 GB? Is the
| dictionary/semantic DB just that large or is there other stuff
| going on?
| michaeld123 wrote:
| I wish it were not so! Here's the breakdown: 1.5M headwords x
| ~2KB average per entry >= 3GB. Each entry contains:
|
| * 40 associations in the core graph
| * Multiple senses (up to 8) x 17 associations each = up to 136
|   more
| * Stems and morphological variants
| * In-game clue definitions
| * Longer definition entries with several types of related word
|   lists
|
| The only good news is it works offline.
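The arithmetic in the breakdown above can be checked directly. This is only a back-of-the-envelope sketch: it takes "~2KB" as exactly 2,000 bytes, which is an assumption, and the constant names are made up for illustration.

```python
HEADWORDS = 1_500_000
BYTES_PER_ENTRY = 2_000  # "~2KB average per entry", assumed to be 2,000 bytes

CORE_ASSOCIATIONS = 40
MAX_SENSES = 8
ASSOCIATIONS_PER_SENSE = 17

total_bytes = HEADWORDS * BYTES_PER_ENTRY
sense_associations = MAX_SENSES * ASSOCIATIONS_PER_SENSE
max_associations = CORE_ASSOCIATIONS + sense_associations

print(f"{total_bytes / 1e9:.1f} GB")  # estimated data size
print(sense_associations)             # the "up to 136 more"
print(max_associations)               # upper bound per headword
```

So a headword with all eight senses carries up to 176 associations before stems, clues, and definitions are counted.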
| senkora wrote:
| I do think that as a practical matter it might be better to
| store the word graph in the cloud and query it from the
| client.
|
| You could either store the word graph as a partitioned set of
| S3 buckets, or have a back-end that serves individual words
| and does rate-limiting. I guess that the back-end might be
| better to avoid surprise egress charges from anyone trying to
| download the entire dataset.
|
| I want to try out the game but I'm discouraged by the
| download size.
| hx8 wrote:
| The first really large text data I ever encountered was Google
| Ngram[0], the total size of which is about 3TB. I would have
| guessed it was closer to 3GB before I started downloading it.
|
| [0]
| https://storage.googleapis.com/books/ngrams/books/datasetsv3...
| michaeld123 wrote:
| Yes. I love Google Ngrams.
|
| We use the top Google Ngrams in 2 ways. (a) we share it in
| the reference mode of our app, i.e. common words before or
| after; (b) we use longer N-grams, where possible, like a
| 4-gram, to choose literary examples that also show a common
| pattern.
| hftf wrote:
| I really enjoyed the article, reading it more from the
| perspective of what 21st-century lexicography could be, less as a
| customer of a word game however thoughtfully designed. As a
| Wiktionary editor (and Android user who's also grown out of bare
| word-relationship puzzle games) though, it's sad that there seems
| to be no way to just use the end-product network as a reference,
| which I would love to do, but I suppose they did spend a million
| bucks on it.
|
| I'll also use this post to wish that more people would edit
| Wiktionary. It has such a good mission (information on all words)
| and yet there are only like 80 people editing on any given day or
| whatever. In some languages, it's even the best or most updated
| dictionary available. The barriers to entry and bureaucracy are
| really not high for HN audience types.
| michaeld123 wrote:
| I second that! I have edited a few Wiktionary pages myself, and
| find it's a better overall environment than Wikipedia, if you
| can find something meaningful to add.
| Suppafly wrote:
| >I'll also use this post to wish that more people would edit
| Wiktionary.
|
| If it's anything like wikipedia, there is probably a reason
| more people aren't working on it, and it's because the existing
| people discourage it.
| hftf wrote:
| I get the impulse to assume they'd be alike, but I've found
| that Wiktionary really isn't much like Wikipedia.
| card_zero wrote:
| The Wiktionary equivalent of citing sources confuses me.
|
| https://en.wiktionary.org/wiki/Wiktionary:Criteria_for_inclu...
|
| Which words should be attested? Presumably only uncommon
| ones? And how is it done, is the "quotes" section the
| attestation? Is there vandalism to clean up, like people
| adding their own names to define themselves as awesome?
| Wiktionary seems to "just work", and I don't really
| understand what holds it together.
| mmooss wrote:
| > it's sad that there seems to be no way to just use the end-
| product network as a reference, which I would love to do, but I
| suppose they did spend a million bucks on it.
|
| From the OP: "This research and computational scale was made
| possible by $295k NSF SBIR seed funding (#2329817) and $150k
| Microsoft Azure compute resources." Does that NSF funding mean
| it's open source? Also, I'm not 100% sure that the quote
| applies to all the research rather than just one component of
| it.
|
| > I'll also use this post to wish that more people would edit
| Wiktionary. It has such a good mission (information on all
| words) ...
|
| I support open source, contribute to it, and love the spirit of
| Wiktionary, but I don't understand the practical reality of
| applying 'wisdom of the crowds' to a dictionary, especially the
| English edition, for two reasons:
|
| Definitions are highly accurate (complete, correct,
| consistent), highly precise things - otherwise, what is their
| value? Assuming Wiktionary is _descriptive_ - reporting the
| words' actual usage - it takes quite a bit of scholarship,
| skill, and editorial resources not to mislead people. I can't
| just write what I think it means - the meaning to me might not
| match the meaning to the person at the next desk. It takes
| quite a bit of research, using powerful (and sometimes
| expensive) tools, and understanding of lexicography to be
| complete and also precisely correct, including usages in places
| and times that are mostly unknown to any particular author.
| Also, writing definitions is tricky: You are using words -
| which have those aforementioned problems with meaning - to
| define words. Also, any writing anywhere can be easily
| misinterpreted - skill and editors are needed to avoid
| misunderstanding. How is the accuracy and precision problem
| solved?
|
| Also, in English there are already many authoritative sources,
| many with a century of professional lexicography behind them by
| the best in the business. Some are free. There are also meta-
| lookup engines such as Wordnik and OneLook. Why use Wiktionary?
| The few times I've compared definitions or etymologies, the
| authoritative sources almost always exceed or equal Wiktionary
| (though online copies of older print editions suffer from the
| minimalism caused by the constraint of printing costs).
| Arguably, there is nothing else both unabridged and free:
| Oxford unabridged costs $, so does Merriam-Webster (the free
| edition is abridged); American Heritage is free, but has the
| minimalism issue I mentioned above.
| genewitch wrote:
| I'm one of those people who says, unironically, "words have
| meanings." I readily argue with people who present "language
| is living and evolves" - sure, but in order to communicate we
| have to agree on a decent subset of overall definitions.
|
| I enjoy etymology, maybe too much. It's like magic, finding
| out what a barrow was, or how filibuster has a direct lineage
| to pirates (freebooters... In Dutch.)
|
| I can't afford, really, the nicer Old English, Scandi,
| Frisian, Norse, etc. etymology dictionaries. I have incomplete
| scans that were printed and bound of some of them. I still
| have 6 etymology dictionaries, so I can be about as quick
| getting a dictionary as getting on the computer and going to
| !eo.
| PaulDavisThe1st wrote:
| > in order to communicate we have to agree on a decent
| subset of overall definitions.
|
| sociologically speaking, however, it is precisely that
| agreement that is what evolves alongside changes in
| spelling, pronunciation (and occasionally "new" words).
| hftf wrote:
| I don't think definitions "are" highly accurate precise
| things. Sometimes yes. The same scholarship, skill, and need
| to not mislead also applies for so many other things:
| encyclopedic articles, taxonomies, news, maps, operating
| systems. Do people still question the value of Wikipedia,
| OpenStreetMap? Yeah, there are problems with them, and with
| peer review. Using fuzzy words (or fuzzy phonetic symbols,
| fuzzy categories, fuzzy semantic links...) to define words is
| a problem (if at all) of literally any dictionary. I don't
| see any of these as particularly unique obstacles for
| Wiktionary.
|
| Unabridged dictionaries take decades to release new editions
| and are still navigating transition into the exploding
| digital age. They are so expansive in scope, while often so
| limited in resources, and barely accept any crowd
| contributions. Such deliberately slow-going is often a good
| thing, but words also change quite quickly and these sources
| are now playing a very long game of catch-up. (Yesterday I
| tried to verify the latter English senses of "fandango" on
| Wiktionary with other dictionaries; OED's entry has not been
| touched for 131 years! What am I going to do with that, I
| need to use / understand the word now!)
|
| Wiktionary is the big web-native word-resource (and is not
| cluttered with commercial junk) - allowing links, expandable
| quotes, images, diagrams, etc. that print's minimalism
| suffers from as you mention. When someone in 2025 wants
| information on a word, they'll likely use a search engine and
| click a link to Wiktionary (where Google blurbs steal some
| data from). Maybe they are a student wanting to confirm their
| nonstandard pronunciation with the IPA (still rarely used in
| mainstream English dictionaries) or if it's recognized in
| their own dialect (mainstream dictionaries rarely provide
| more than UK and US pronunciations) - if enough people have
| the same question, Wiktionary seems like the best place to
| put the answer - or see an accessible etymology tree. While
| you probably know this, it's also worth reminding that
| English Wiktionary isn't just for English words, it is a
| dictionary of all languages' words, which is written in
| English. It has metadata and links connecting languages'
| words that you can't find elsewhere.
|
| Yes, I indeed do want people to just write what they think a
| word means - as a starting point in a collaborative refining
| process. I believe the number of word-users in the world with
| valuable potential contributions is a lot closer to a billion
| than the thousand gatekeepers working hard on classical
| dictionaries. The barrier to entry is really low, but the
| tooling could still be much better. This is one reason I'm
| putting my appeal under this article - because I think
| (professional) lexicography can stand to evolve more in the
| 21st century. (And are people today really buying enough
| dictionaries to sustain a professional version of Wiktionary,
| or even a professional dictionary offered in structured data
| form?) If we don't contribute to a crowdsourced dictionary,
| then we won't have any such thing.
|
| (Meta-lookup sites are link/search engines, not dictionaries
| and IME really don't do a good job synthesizing their
| information or conventions.)
| bloak wrote:
| "Why use Wiktionary?"
|
| I can answer that one. I have free access to the Oxford
| English Dictionary (OED), which is brilliant and generally
| more detailed and reliable than Wiktionary when it has the
| word I'm looking for, but their login page is so awful that I
| sometimes use en.wiktionary.org instead just to save my time
| and temper. Also, en.wiktionary.org has proper nouns, other
| languages, and occasionally it has some recent or technical
| English word that OED does not have. So if I'm doing some
| serious amateur research: OED. But if I'm doing a crossword
| and want to check that a word exists and is spelt how I think
| it is: Wiktionary.
| 0cf8612b2e1e wrote:
| Could I make a plea to make a Wiktionary export easier to
| find/use? Assuming I can even find the magical page which hosts
| them, Wikipedia dumps are terribly documented and seem to
| incorporate shorthand which I do not recognize.
| michaeld123 wrote:
| And they are full of wiki markup, templates, and inconsistent
| formatting. A human brain can easily understand it, but
| automated parsing is impossible (pre LLM).
| dhashe wrote:
| This is very cool. In puzzlehunts, we often use tools to assist
| with solving and writing puzzles (the classic example is
| https://nutrimatic.org ).
|
| Years ago, I wrote a puzzlehunt puzzle that involved navigating
| through words where an edge existed if the two words formed a
| common 2-gram (that is, they often appeared one after another in
| a text dump of Wikipedia).
|
| For example, a fragment of the graph from the puzzle is: mit ->
| press -> office <- post <- blog.
|
| This work is obviously much more advanced, and it's very cool to
| see that they managed to make it work with semantic connections.
| I was able to get away with a much simpler approach since I only
| cared about 2-grams over a set of about 1000 words (I literally
| used a grep command over the entire text of the English
| Wikipedia; it took about a day to run).
|
| But the core idea is shared: 1) wanting to build a graph
| representation of word connections for a puzzle, 2) it being way
| too much work to do that manually, 3) you would miss a bunch of
| edges if you did do it manually, so 4) use programming tools to
| construct a dataset, and then 5) the end result is surprisingly
| fun for the user because the dataset is comprehensive and it
| feels really natural.
|
| If anyone is curious, the puzzlehunt puzzle is here:
| https://dhashe.com/files/puzzles/word-wide-web.pdf
|
| And the solution is here:
| https://dhashe.com/files/puzzles/word-wide-web-sol.pdf
|
| And a fair warning to anyone unfamiliar with puzzlehunt puzzles:
| they do not come with instructions and it is very common to get
| stuck when solving them, especially when solving them alone. You
| have not completely solved a puzzlehunt puzzle until you extract
| an answer word or phrase from the puzzle. This one has an extra
| layer after filling in the words in the graph. Peeking at the
| solution is encouraged if you get stuck.
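A small sketch of the 2-gram graph idea described in the comment above, assuming a fixed vocabulary and a plain-text corpus. The original used grep over a Wikipedia dump; this swaps in a regex tokenizer and a counter, and the sample text, vocabulary, and `bigram_edges` name are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical vocabulary and corpus; the real puzzle used about 1000 words
# and the full text of English Wikipedia.
VOCAB = {"mit", "press", "office", "post", "blog"}
TEXT = (
    "The MIT Press office moved. "
    "A blog post about the post office and the MIT Press."
)


def bigram_edges(text, vocab, min_count=1):
    """Return {(first, second): count} for adjacent words that are both in vocab."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if a in vocab and b in vocab
    )
    # Keep only pairs seen often enough to count as a "common" 2-gram.
    return {edge: n for edge, n in counts.items() if n >= min_count}


for (first, second), n in sorted(bigram_edges(TEXT, VOCAB).items()):
    print(f"{first} -> {second} ({n})")
```

The edges are directed, matching the fragment in the comment (mit -> press -> office <- post <- blog). This tokenizer ignores sentence boundaries, so a real run might want to split on them first.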
| michaeld123 wrote:
| That's an interesting puzzle. I hope more types of word puzzles
| continue to be created.
| totaldude87 wrote:
| I was looking for a similar app for my upcoming book! At times
| it's very hard to get the word that we are looking for and hope
| this solves it!
|
| I know this is not related to the app but still wanted to
| appreciate the thought
| us-merul wrote:
| I really liked this article and these types of analyses always
| capture me. I just had to try out the game then. I nailed the
| link to "moon" from "rise" on my first try. Then I was a bit
| let down for my first real task to get to "chill" starting from
| "chain." I went first to conglomerate, then corporation, then
| management... thinking I would at some point encounter "cold,"
| and then "chill". Unfortunately not. Then I tried from chain to
| something like (my memory is imperfect here), necklace, jewelry,
| brilliance, glow, tranquil, calm-- and on a couple of other
| tries, appease, mollify, relax-- but could never get to "chill."
| I was able to win eventually by appealing to temperature which
| led me to chill.
|
| Is there anything the user could do to modify the next steps,
| other than picking a word? Perhaps selecting some sort of valence
| related to metaphor or meaning? "I want to pick 'pacify', but in
| the sense of calming down, not to utterly destroy."
| michaeld123 wrote:
| Thanks for reporting on your experience! Those are good
| questions, and I will think about your valence idea for the
| future.
|
| On a shorter horizon, I can tune the probability that on-path
| terms appear in the cloud. We store a larger pool of words than
| are displayed, and calculate lookaheads (and lookbacks from the
| target).
| mmooss wrote:
| Maybe the user could type in their own words, and the app
| could approve/disapprove based on the 40 word list.
|
| But maybe that adds an entirely new normalization function -
| user types 'runs' or 'ran', the app has to normalize to
| 'run'.
|
| The app could just have a 'more words' button, loading the
| next 17.
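A naive sketch of the normalization step suggested above, assuming the app can check guesses against its headword list. The irregular-form table, the suffix rules, and every function name here are hypothetical; a real app would more likely use its stored stems and morphological variants than hand-written rules.

```python
# Hypothetical irregular forms; a real table would be far larger.
IRREGULAR = {"ran": "run", "went": "go", "mice": "mouse"}


def candidates(word):
    """Yield the word itself, then suffix-stripped guesses at its base form."""
    yield word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            yield stem
            # Undo a doubled final consonant: "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                yield stem[:-1]


def normalize(word, headwords):
    """Return the first candidate that is a known headword, else None."""
    cleaned = word.strip().lower()
    cleaned = IRREGULAR.get(cleaned, cleaned)
    for guess in candidates(cleaned):
        if guess in headwords:
            return guess
    return None


def is_allowed(typed, associations, headwords):
    """Approve a typed word if its normalized form is in the association list."""
    base = normalize(typed, headwords)
    return base is not None and base in associations


HEADWORDS = {"run", "jog", "sprint", "chill"}
ASSOCIATIONS = {"run", "jog"}  # stand-in for one word's 40-word list

for typed in ("runs", "ran", "Running", "chill", "xyzzy"):
    print(typed, normalize(typed, HEADWORDS), is_allowed(typed, ASSOCIATIONS, HEADWORDS))
```

Checking each guess against the headword list keeps the crude suffix rules from inventing stems that are not words.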
| us-merul wrote:
| Thanks for your response. Getting feedback like "hot or cold"
| in the algorithm's mind is exactly what I'm thinking of. It's
| a tricky issue and reminds me a lot of this:
| https://www.datcreativity.com/
|
| I had tried hard to pick a set of fairly simple words,
| thinking I had an intricately unique association in my head,
| only to find out that the reported connections were nothing
| more than average. My partner obviously landed in an
| extremely high percentile by instantly picking the first
| words that came to her without much thought.
| michaeld123 wrote:
| For good or bad, Semantle is able to report hot/cold
| because it's vector-based. We tried a few types of vectors,
| but I thought they were consistently unintuitive. So the
| best (and most relevant) proxy is remaining shortest
| distance-to-target, but often the player is only two hops
| away (spanning 17^2=289 options), and when they go astray
| and are much further, it's computationally too slow to look
| out more than 5 hops with brute force.
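To illustrate why brute-force lookahead gets expensive, here is a sketch that prints the 17^k growth mentioned above and runs a breadth-first search capped at a maximum hop count. The toy graph and the `hops_to_target` name are assumptions made for this example; the app's actual lookahead code is not shown in the thread.

```python
from collections import deque

BRANCHING = 17  # options shown per step, per the comment above

# Upper bound on distinct words reachable in exactly k hops (no repeats).
for hops in range(1, 6):
    print(hops, BRANCHING ** hops)

# Toy directed graph, invented for illustration.
GRAPH = {
    "chain": ["necklace", "link"],
    "necklace": ["jewelry"],
    "jewelry": ["ice"],
    "ice": ["cold"],
    "cold": ["chill"],
}


def hops_to_target(graph, start, target, max_hops):
    """Breadth-first search capped at max_hops; returns the hop count or None."""
    if start == target:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        word, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand past the cap
        for nxt in graph.get(word, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None


print(hops_to_target(GRAPH, "chain", "chill", max_hops=5))
print(hops_to_target(GRAPH, "chain", "chill", max_hops=4))
```

At five hops the bound is already about 1.4 million candidate words, which is consistent with the comment's point that looking further ahead is too slow for an interactive hint.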
| michaeld123 wrote:
| Thanks for that datcreativity.com link. My score was 94.11,
| higher than 99.88% of the people who have completed this
| task! I should hope so after working on relations for
| years. ;)
|
| My words were: apple, shotgun, stardust, anger, hygiene,
| etymology, proctology, slant, dictator, and displacement.
| rafram wrote:
| I wanted to try out your app, but I cancelled the download after
| noticing that it's 3.5 gigabytes. How?! That's by far the biggest
| iOS app I've ever seen.
| michaeld123 wrote:
| Sorry! The problem is ~2KB of data for each of 1.5M headwords.
| We already use indices and Brotli compression internally. I
| doubt we could smush below 3GB.
| rafram wrote:
| Could you... have fewer headwords? That's like 5x the number
| of headwords actually used in modern English. Or at least
| download some of the data on demand?
| neuroelectron wrote:
| Wow, only 3GB? Finally I have something to use the 128GB this
| iThing came with besides Firefox and Kindle.
| 6stringmerc wrote:
| Dissociating English terms from their context and focusing on the
| ease of relationship is a hilariously bad habit that people
| actively are trained AWAY from using. The nuance of English is
| absolutely going to break AI because even the example of "strong"
| relationships are suspect in utility.
|
| Seriously, when is the last time a casual speaker, writer, or
| translator used "domicile" in place of "house" in your world?
| It's an archaic term appropriated into legal jargon. Flattening
| out language and drawing lines between terms is funny to me.
|
| The only issue is normalizing "Thesaurus bashing" type
| mentalities - like this - to degrade the value of coherent,
| purposeful, meaningful use of English. It's an amalgamation
| language with extremely difficult fluency. It's rife with idioms
| and contradictory emotional context.
|
| Oh well, I can grasp that I tend to yell at clouds when it comes
| to this sort of thing. It doesn't change my opinion this is a
| harmful exercise and probably should not exist. There are few
| instances where playing a game will actually make one more
| stupid, but here we are.
| genewitch wrote:
| I used domicile about 45 minutes ago in casual conversation
| about fire ants in my abode. Habitation. Flat.
| cadamsdotcom wrote:
| Such an amazing data set with the amount of curation you've done
| and the care with which it's been put together.
|
| It'd be highly valuable as a thesaurus API.
| michaeld123 wrote:
| Thanks!.... Does anyone pay for thesaurus APIs anymore?
| jcmeyrignac wrote:
| Nice work! Here is a similar idea:
| https://wordassociations.net/en
|
| In French, there is a game to build relations with words (they
| provide a word, and you have to type the most related words):
| https://www.jeuxdemots.org They reached 677 million relations
| in 2024!
| akudha wrote:
| What other word games do people enjoy? My favorites on iOS
|
| Alpha Omega
|
| Sticky Terms (I struggle with this)
|
| Typeshift
|
| Blackbar (old, not maintained, but we can still play. Not a game
| in strict sense, very enjoyable)
| michaeld123 wrote:
| And where and how do people discover new word games?
| akudha wrote:
| I found the above from iOS search. I also ask around, but not
| many people I know are interested in word games
| unfortunately.
|
| I suppose other languages have way fewer word games than
| English?
| slantaclaus wrote:
| I remember in college I got all stoned in the library and
| determined that you could find a semantic pathway using synonyms
| to relate completely opposite terms with only a few nodes.
| Completely blew my mind and I still think about it sometimes.
___________________________________________________________________
(page generated 2025-06-03 23:00 UTC)