The Small World of English

Building a word game forced us to solve a measurement problem: how do you rank 40+ ways to associate any given word down to exactly 17 playable choices? We discovered that combining human-curated thesauri, book cataloging systems, and carefully constrained LLM queries creates a navigable network where 76% of random word pairs connect in 7 or fewer hops--but only when you deprecate superconnectors and balance multiple ranking signals.

The resulting network of 1.5 million English terms reveals that nearly any two common words connect in 6-7 hops through chains of meaningful associations. The mean path length of 6.43 hops held true across a million random word pairs--shorter than we'd guessed, and remarkably stable.

1.5M headwords * 100M relationships * 7 or fewer degrees of separation for 76% of word pairs

This is consistent with the small-world structure and near-universal connectivity seen in lexical network research on smaller datasets.^1,2 The network's structure makes intuitive semantic navigation possible--players can feel their way through meaningful transitions: a crown's gemstones lead to emerald's foliage and finally to a forest canopy, or a flame becomes an ember, then a glowing memory, a mental recall, and finally the action to cancel.

Batman - vigilante - watchful - circumspect - inspect

The Mathematics of Semantic Distance

English exhibits network effects remarkably similar to social networks--nearly any random pair of words can reach each other in just a few hops through chains of meaningful associations. This "small world" phenomenon was first measured in word co-occurrence networks,^3 and it persists even after we deprioritize superconnector words that might otherwise dominate many paths. To probe this, we randomly sampled 1 million word pairs (4 days of processing on 32 cores) to get a strong statistical sample of the connected core of English (a minimal sketch of this measurement appears at the end of this overview).

How do you connect any two random words? Hop distance between words, as a share of random pairs:
* 1 hop: 0.01%
* 2 hops: 0.15%
* 3 hops: 2.07%
* 4 hops: 9.97%
* 5 hops: 21.58%
* 6 hops: 24.15%
* 7 hops: 18.25%
* 8 hops: 11.18%
* 9 hops: 6.19%
* 10+ hops: 6.45%

This bell curve centered at 5-6 hops creates ideal puzzle parameters. Here are some random examples at three distances (conjugations lead to longer paths):
* 2 hops: rhyme - beat - percussion
* 5 hops: outbreak - strife - contenders - finalists - runners-up - grand prizes
* 8 hops: grounding - anchoring - berth - sleeping - nocturnal - nightjar - chirring - bombylious - dronelike

[Screenshot: visual thesaurus mode reveals multiple senses and weighted connections for any word.]

Network Construction and Coverage

When we started this project, we tried the obvious approach: combining existing resources like WordNet with early-generation AI tools, including LDA topic modeling and static word vectors. WordNet gave us clean synonym sets but lacked the associative richness players expect ("coffee" - "morning," not just "beverage"). LDA found topical clusters but mixed unrelated terms that happened to co-occur. Word vectors collapsed all senses into single points, making "bank" (river) indistinguishable from "bank" (financial). These produced fragmented, overly generic relationships that lacked the nuance our game needed.

We capture 40 associations per term (enough for algorithmic flexibility) and display 17 in our interfaces (what users can reasonably process). This depth provides flexibility for both puzzle generation and reference use.
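As noted above, the random-pair measurement can be reproduced in miniature with breadth-first search over each word's ranked association list. Everything below is illustrative only: a tiny hypothetical graph stands in for the real 1.5-million-term network, and the code is a sketch rather than the project's production pipeline.

```python
import random
from collections import deque

# Tiny hypothetical stand-in for the real network: word -> ranked association list.
GRAPH = {
    "crown":    ["gemstone", "royalty", "throne"],
    "gemstone": ["emerald", "jewel", "crown"],
    "emerald":  ["green", "foliage", "gemstone"],
    "foliage":  ["canopy", "leaves", "forest"],
    "canopy":   ["forest", "shade", "treetop"],
    "forest":   ["trees", "trail", "canopy"],
    "royalty":  ["throne", "queen", "crown"],
    "throne":   ["crown", "royalty", "queen"],
    "jewel":    ["gemstone", "treasure", "crown"],
    "green":    ["emerald", "leaves", "foliage"],
    "leaves":   ["foliage", "green", "trees"],
    "trees":    ["forest", "leaves", "trail"],
    "trail":    ["forest", "hike", "trees"],
    "queen":    ["royalty", "crown", "throne"],
}

def hop_distance(source, target, graph, max_hops=10):
    """Breadth-first search: hops from source to target, or None if unreachable within max_hops."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        word, dist = frontier.popleft()
        if dist >= max_hops:
            continue
        for neighbor in graph.get(word, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None

def sample_distance_distribution(graph, n_pairs=1000, seed=42):
    """Estimate the hop-distance distribution from random word pairs."""
    rng = random.Random(seed)
    words = sorted(graph)
    counts = {}
    for _ in range(n_pairs):
        a, b = rng.sample(words, 2)
        d = hop_distance(a, b, graph)
        counts[d] = counts.get(d, 0) + 1
    return counts

print(hop_distance("crown", "forest", GRAPH))            # 4 hops in this toy graph
print(sample_distance_distribution(GRAPH, n_pairs=200))
```

At the scale described above--a million pairs over a 1.5M-word graph--such a sampling run would presumably be parallelized across cores, which is where days of multi-core processing go.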
1,525,522 headwords

We built a semantic network of 1.5 million English terms by casting a wider net than traditional resources. Where academic dictionaries drew sharp boundaries--excluding slang, technical jargon, compound phrases, and proper nouns--we included what people actually say and write: from "ice cream" to "thermodispersion," from "ghosting" to "Khao-I-Dang." This scale would have cost tens of millions of dollars to achieve manually. Consider the monumental pre-LLM efforts:

* WordNet (1985-2010) - Princeton's 25-year project produced 155,000 words in synonym groups. It became the NLP standard despite missing everyday compounds.
* OED (1857-1928, ongoing) - The definitive historical dictionary with 500,000+ entries. It took 70 years and thousands of contributors.
* Webster's Third (1961) - America's unabridged dictionary with 476,000 entries. It required 757 editor-years and $3.5 million ($50M+ today).
* Roget's Thesaurus (1852) - The original meaning-based reference with 15,000 words in 1,000 conceptual categories.

Word counts become arbitrary at this scale. Include every technical term, place name, and slang variant, and the count explodes. Whether we have 1.5 million or 2 million depends entirely on where you draw the line.

Atlas of Connected Meaning

Our inclusion criteria cast a wide net: all terms that volunteer lexicographers at Wiktionary have included (slightly more liberal than typical unabridged dictionaries), plus high-importance Wikipedia topics that are 1-3 words long (measured by PageRank), plus frequently produced compound terms generated by LLMs when analyzing 648,460 Library of Congress book classifications. Compound terms like "local governance" (appearing in 44,507 classifications) and "literary criticism" (19,417) made the cut, while "wild equids" (5 occurrences) did not.

What kinds of words? We include all of the kinds below, and to illustrate that there's no clear red line between inclusion and exclusion, here's a gradation of common and obscure examples of each...
Compounds & Phrases
* "health care" (standard compound)
* "cut the rug" (dated slang for dancing)
* "blow one's nose" (phrasal verb)
* "hatch, match, and dispatch" (British newspaper jargon)
* "make the welkin ring" (archaic for loud noise)

Slang & Neologisms
* "ghosting" (suddenly ending communication)
* "panda huggers" (political slang)
* "Devil's buttermilk" (euphemism for alcohol)
* "drungry" (drunk + hungry)
* "brass neck" (British for audacity)

Technical Jargon
* "antibiotics" (medical term)
* "barber-chaired" (logging accident)
* "lead plane" (wildfire aviation)
* "hemicorpectomy" (surgical removal)
* "photonephograph" (kidney imaging)

Species & Taxonomy
* "German shepherd" (dog breed)
* "dwarf sirens" (salamander family)
* "northern raccoons" (regional variant)
* "Angoumois moths" (Sitotroga cerealella)
* "grass crab spider" (specific arachnid)

Historical Language
* "thou" (archaic second person)
* "oftimes" (Middle English)
* "mean'st" (archaic conjugation)
* "crurifragium" (Roman execution)
* "naumachies" (staged naval battles)

Word Variations
* "running" (present participle)
* "rotavates" (tills with rotary blades)
* "masculises" (British spelling)
* "disappoynts" (16th-17th century)
* "mattifies" (makes matte)

Acronyms
* "GPS" (Global Positioning System)
* "CICUs" (Coronary Intensive Care Units)
* "HKPF" (Hong Kong Police Force)
* "MIMO-OFDM" (telecom standard)
* "3DTDS" (3-D structural term)

Places & Culture
* "Broadway" (NYC theater district)
* "Harsimus" (Jersey City district)
* "Altai kray" (Russian federal subject)
* "Khao-I-Dang" (refugee camp)
* "ballybethagh" (Irish land measurement)

Rare & Nonce
* "selfie" (once nonce, now standard)
* "greppable" (programmer slang)
* "kiteboating" (water sport)
* "quattrocentists" (1400s scholars)
* "noitamrofni" (information backwards)

Our analysis revealed a fundamental division in the network:
* Reachable terms (56.8%): 870,522 words that appear in the top-40 associations of at least one other word
* Unreachable terms (43.2%): 662,903 words that never appear in any other word's top-40 list

The unreachable terms include rare compounds ("stewing in one's own grease"), technical terminology ("thermodispersion"), proper nouns ("Besisahar"), and alternative capitalizations. While these terms can point to other words, no words point back to them strongly enough to rank in any top-40 list. This doesn't affect puzzles--which start from common words--but it reveals an interesting property of the semantic network.

Beyond Traditional Thesauri

Traditional thesauri focus on synonyms for abstract concepts and largely exclude concrete objects, a constraint of limited print space. Our visual thesaurus presents up to 8 contextual senses per term, each showing its own 17-word neighborhood. Just as our headword inclusion is necessarily arbitrary, so too is our sense distinction. LLMs identified these senses by querying with various prompts for different meanings and contextual flavors, then merging similar results. We capped each term at 8 senses, as more became unwieldy in the user interface. Whether "bank" gets 2 senses or 5, whether "coffee" as beverage differs from "coffee" as social ritual--these are judgment calls. Beyond homographs (words with identical spelling but different meanings, like "bass" for sound versus fish), we capture what we call "contextual flavors" within single senses.
'Coffee' connects to 'cafe' (location), 'beverage' (category), and 'espresso' (variety)--same core meaning, different facets. Our design philosophy centered on how people think of word associations--pools of related meanings that don't necessarily align with how dictionaries split formal senses or decide when meanings relate. This approach yields an average of 70 semantically connected words per headword across multiple senses, compared to 10-20 in traditional resources. Examples of our relationship types include:

* Similar meanings: house - domicile, lodge
* Category members: house - bungalow, villa
* Functional relationships: horse - saddle, bridle
* Cultural associations: breakfast - coffee, pastries
* Taxonomic connections: quark - boson, fermion
* Domain crossings: quark - Feynman (physics) or quark - cheese (food)
* Thematic groupings: hike, nature, trail

Altogether, this yielded approximately 100 million directed edges connecting our 1.5 million terms.

Try it yourself: What relates to "music"? Pick the 10 words you think best relate to "music." There's no perfect answer--that's the point.

Multiple Meanings as Network Bridges

English words often carry multiple meanings, creating natural bridges in the network:
* Double meanings - words with entirely different definitions: "bass" (sound/fish), "tear" (eye/rip)
* Related meanings - connected definitions: "head" as body part, leadership role, or ship's bow
* Contextual flavors - "hiking" as nature experience vs. physical exercise

These multi-sense words create semantic bridges between seemingly unrelated concepts. Words like "ground" can connect earth, coffee, and electrical circuits in a single conceptual leap. You'd think words with multiple meanings would connect distant parts of the network faster. It turns out they don't--they just give you more creative ways to navigate the same distance. Our analysis of 100k homograph-containing paths shows they average 6.57 hops versus the 6.43 random baseline. Instead of creating shortcuts, homographs sit in densely connected regions, offering creative routing options rather than efficiency gains.

The Bridges That Remain

To prevent too many paths from routing through generic hubs like "general" or "study," we systematically penalized superconnectors throughout our workflow. But which words still emerge as natural bridges after this filtering?

Try it yourself: After filtering out generic connectors, explore which words still bridge English's network.

These survivors represent genuine conceptual bridges--words that naturally connect different domains through polysemy ("polish" as verb/nationality), historical significance ("Renaissance"), or conceptual richness ("jazz" connecting musical techniques, cultural movements, and time periods). Their average position of ~2.2 hops from path origins shows they typically serve as the critical pivot point between disparate concepts.

So where did we get our data?

Five Data Sources

[Figure: the five complementary knowledge sources combining into the unified Linguabase semantic network.]
The Linguabase integrates five complementary knowledge sources, each contributing unique strengths to our amalgam scoring system, which combines multiple ranking signals--from word frequency and co-occurrence patterns to manually curated relationship scores:

1. In-House Lexicographic Work

Our lexicographer and a team of freelance grad students manually created specialized word lists for 5,000 varied topics, plus associations for polysemous terms and for word types, like interjections, that traditional lexicography treats as "stopwords." These lists cover many of the most important common terms with multiple meanings.

LLM Generation vs. Recognition
* Generation mode: "What relates to schizophrenia?" - hallucinations, delusions, antipsychotics, psychiatry [safe, clinical terms only]
* Recognition mode: "Is 'shamanism' related?" - Yes, through cultural interpretations of hearing voices and historical contexts [nuanced connection validated]

2. Mining 125 Years of Library Wisdom

We discovered that LLMs are much better at recognizing valid semantic relationships than generating them from scratch. Ask an LLM "What relates to coffee?" and you'll get predictable answers: beverage, caffeine, morning. But the Library of Congress classification system revealed that 'coffee' appears in 2,542 different book classifications--linking to 'fair trade certification' in economic texts, 'coffee berry borer' in Hawaiian agriculture books, and 'import-export tariffs' in 487 trade policy publications. These connections capture how coffee actually intersects with global commerce, agriculture, and regulation.

Coffee's 2,542 Library Contexts
* 487 Economics: fair trade, tariffs, commodity markets
* 312 Agriculture: berry borer, arabica, soil
* 208 Culture: cafe society, coffeehouse politics
* 89 Chemistry: caffeine extraction, roasting
* +1,446 more classifications across history, law, art, medicine...

Since 1897, LOC catalogers have encoded the intellectual connections among 17 million books, creating what is essentially a 125-year collaborative knowledge graph built by thousands of subject experts. Each classification represents a moment when a human expert decided "these concepts belong together"--and unlike web text, these decisions were expensive and permanent, made before SEO or engagement metrics existed.

Expert Curation vs. Crowd Wisdom
* Web text: "Coffee is life!" (1 million tweets) - coffee - morning, coffee - tired, coffee - addiction
* LOC classifications: "Coffee industry--Labor--Guatemala" (47 scholarly books) - coffee - fair trade, coffee - cooperatives, coffee - child labor

We gave an LLM a focused task: generate word lists for each of LOC's 648,460 classifications. A classification like "Hawaiian coffee trade" triggered specific, expert-like outputs--"kona coffee, arabica beans, coffee tariffs, pacific trade routes, coffee auctions"--far richer than asking generically about coffee. Each classification acted as a pre-engineered prompt that specified exactly which semantic neighborhood we wanted. "Schizophrenia--medical aspects" surfaced "atypical antipsychotic, dopamine antagonist," while "Schizophrenia--fiction" yielded "asylum writings, trauma memoirs, neurodivergent voices," capturing the full dimensionality of concepts.

Context Shapes Connections: Schizophrenia
* Medical context: dopamine antagonist, atypical antipsychotic, serotonin antagonist, bipolar disorder, clinical trials
* Fiction context: asylum writings, trauma memoirs, neurodivergent voices, madness in literature, unreliable narrator

The real magic came from inverting the index.
When we asked "Which classifications contain 'algorithm'?" we found it appearing not just in computer science but in "aleatory electronic music" (alongside John Cage and stochastic processes), "mathematics in arts" (with fractals and Fibonacci sequences), and "investment mathematics" (with portfolio optimization). The system surfaced connections that require domain expertise: 'Las Vegas' linking to 'Colorado River water rights' through 12 books about Nevada's water crisis, or 'origami' connecting to 'shell structures' and 'stress analysis' through engineering texts on deployable structures.

The Double Inversion Process
* Step 1 - Classification to terms: "Hawaiian coffee trade" - kona, arabica, tariffs, pacific routes...
* Step 2 - Which classifications' term lists co-occur with "algorithm"? Found with: John Cage, fractals, portfolio optimization...
* Step 3 - Build the co-occurrence network: algorithm - stochastic music (8.4), Fibonacci (7.2), fractals (6.8)

This approach gave us 3.1 million unique terms weighted by intellectual effort--a monograph on 'bank equipment' that mentions 'pneumatic tubes' (still used in 15 classifications!) counts more than casual blog mentions. Terms like "cultural heritage," appearing in 53,833 classifications, became superconnectors we could appropriately down-rank, while we preserved the "boring but essential" connections found in specialized journals, like "sewer pipe periodicals" that link urban infrastructure to public health.

Superconnector Term Penalties
* cultural heritage: 53,833 classifications, multiplier x0.15
* local governance: 44,507 classifications, multiplier x0.26
* tourism: 8,520 classifications, multiplier x0.72
* pneumatic tubes: 15 classifications, multiplier x0.95
Higher frequency means a lower multiplier, pushing the term down in rankings.

The process also revealed what we call the "Montreal effect"--where 'bagels' incorrectly associates with 'Expo 67,' 'McGill University,' and 'French-speaking' simply because Montreal is famous for its bagels. Our initial algorithm strengthened these geographic contaminations throughout the data. We resolved these spurious connections through subsequent LLM reviews that could distinguish true semantic relationships ("bagels - boiled dough - chewy texture") from coincidental geographic co-occurrence ("bagels - Montreal Canadiens").

The Montreal Effect: Geographic Contamination
* Geographic co-occurrence: bagels - Expo 67, bagels - McGill University, bagels - Montreal Canadiens, bagels - French-speaking
* True semantic relations: bagels - boiled dough, bagels - Jewish cuisine, bagels - sesame seeds, bagels - chewy texture

3. Human-Curated Resources

Over 70 existing references contributed--dictionaries, thesauri, and encyclopedias from Wiktionary and WordNet to specialized resources like NASA's thesaurus and the National Library of Medicine's UMLS. Relationships appearing across multiple sources received higher weights.

General Sources
* Wiktionary
* WordNet, ConceptNet, FrameNet
* Roget's Thesaurus
* SWOW-EN18

Specialized Sources
* Getty Art & Architecture
* NASA Thesaurus
* UMLS Metathesaurus
* AGROVOC Thesaurus

4. Pre-LLM Topic Extraction

Before the rise of modern LLMs, we applied Latent Dirichlet Allocation (LDA) in 2013-2014 to discover eight context clusters for every headword in an in-house corpus of notable literary works. The algorithm scans large text collections and groups words that appear in similar contexts. Running it took 200,000 supercomputer hours on the NSF's Extreme Science and Engineering Discovery Environment (XSEDE)--decades of work on a single machine.
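For readers unfamiliar with LDA, the toy sketch below shows the kind of topic extraction involved, using scikit-learn. The corpus, topic count, and parameters here are purely illustrative; the 2013-2014 run used a far larger corpus and a different, cluster-scale pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus; the real run processed an in-house corpus of notable literary works.
docs = [
    "the coffee trade shaped colonial agriculture and plantation economies",
    "espresso and caffeine rituals define the modern coffeehouse",
    "the orchestra tuned while the conductor studied the score",
    "jazz improvisation builds on rhythm harmony and swing",
    "river banks eroded as the flood waters rose",
    "the bank approved the loan after reviewing the mortgage",
]

vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(docs)     # document-term count matrix

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(word_counts)

# Show the top words in each discovered topic.
vocab = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_idx}: {', '.join(top_words)}")
```

The same grouping-by-context idea, applied to millions of documents with eight clusters per headword, is what consumed the supercomputer time.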
The real run's results were noisy: a delightful mix of intuitive associations and oddities (often caused by treating compound terms as separate words). Still, it surfaced relationships that pure frequency analysis and today's LLMs miss.

We skipped early word-embedding vectors--numeric coordinates that place context-similar words near one another but collapse all senses into a single point. As games like Semantle show, embedding distances rarely match human intuition, and the one-vector-per-word approach couldn't handle our need for multiple senses--a fundamental limitation that various researchers tried to patch.^5

5. Large Language Model Enhancement

Starting in 2023, frontier models finally provided the semantic understanding we needed--they could distinguish "bank" (river) from "bank" (money) and generate contextually appropriate associations for each. These models could handle:
* Everyday compound terms ("apple pie", "department store")
* Morphological variations across parts of speech
* Contextual dimensions of common words
* Capitalization distinctions ("China" vs. "china")

Still, left to their own devices, LLMs are banal and formulaic, wallowing in cliche and latching onto what they think prompts intend. We ran over 80 million API calls (~$200k in Azure API costs, plus minor xAI costs) across dozens of workflows to combat this tendency. Beyond the LOC classifications, we applied focused-prompt strategies across our entire corpus: extracting distinct senses for each headword, generating contextual word lists per sense, and prompting for cultural variations and regional differences. Each workflow fed into the next--outputs from sense detection became inputs for association generation, which informed cultural expansion passes. The key was always the same: constrained, specific prompts yielded far better results than open-ended queries.

Even with careful prompting, the Montreal effect persisted. Geographic contamination appeared throughout: 'Broadway' linked to 'taxis' through New York; 'grits' to 'jazz' through the American South. We resolved these spurious connections through iterative LLM reviews that learned to distinguish true semantic relationships from coincidental geographic co-occurrence.

This research and computational scale were made possible by $295k in NSF SBIR seed funding (#2329817) and $150k in Microsoft Azure compute resources.

Understanding Our Biases

Every semantic network encodes particular worldviews about which words relate to each other and how strongly they connect. Here are six key sources of bias that shape our network's rankings and inclusions:

Editorial Choices
Our lexicographer and team manually crafted relationships for common polysemous terms, inevitably encoding their linguistic backgrounds, cultural contexts, and conceptual frameworks about how meaning connects. Examples: "market" includes "variety" and "retail" but omits "souk" and "bazaar" * "breakfast" includes "cereal" and "toast" but omits "congee" and "idli" * "music" includes "jazz" and "consonance" but omits "gamelan" and "qawwali"

AI Training Data
GPT-4o's training data shapes its semantic associations, while its guardrails suppress certain connections. We supplemented with Grok-3 specifically for vulgar and offensive terms that GPT-4o wouldn't adequately cover. Examples: "sex" - clinical terms favored over colloquial language * "death" - euphemisms like "passing" prioritized over direct terms like "corpse" and "decay"

Superconnector Deprecation
No matter how a large thesaurus is constructed, certain terms seem to be ubiquitous. This is partially author bias, partially natural language structure, and worse with repetitive LLMs. We down-rank ubiquitous words like "heritage" and "surname"--a low-key version of inverse-frequency normalization. Our graduated penalty system scores 59,112 terms with an inverse document frequency (IDF) variant that down-ranks common terms (penalties 1-18). Surprisingly, penalty correlates with conceptual breadth, not raw frequency: "heritage" (penalty 18) appears only 201 times, while "tourism" (penalty 14) appears 8,520 times. Examples: at one processing stage, only 2 words get the maximum penalty of 18, "surname" and "heritage" * 46,445 words get the minimal penalty of 1 * "heritage" can connect to almost anything cultural, historical, or traditional

Prompting Cascades
Our multi-pass LLM workflow (listing senses - expanding culturally - reprocessing) introduces systematic preferences that affect both what gets included and how highly it ranks. Examples: geographic diversity is emphasized, so "dance" includes global forms equally * cultural foods are given rankings comparable to Western staples

Frequency ≠ Importance
Frequency is a useful but flawed proxy for word importance. It captures actual usage but creates artifacts: 'pandas' outranking 'panda,' 'cheesecake' outranking 'cheesecakes,' literary corpora overweighting 'thee,' technical terms underrepresented. Different corpora (books vs. screenplays) produce subtly different hierarchies, and none capture a word's actual utility for learners or gameplay. Examples: plural "pandas" outranks "panda" * "thee" elevated by Shakespeare * singular "cheesecake" is more common than "cheesecakes" * literary bias from Google N-grams

Morphological and Similarity Filters
Nobody wants word clouds full of plurals and variants. Our filter pushes 'baguettes' down 30 positions if 'baguette' already appears, and 'rolls' down 23 if 'roll' exists (>90% string similarity gets a +12 penalty, plural/singular differences +17, reordered compounds +17). Length penalties also apply progressively. Examples: in the word cloud for "bagels," "baguette" drops 30 positions because "baguettes" appears earlier * "roll" drops 23 positions when "rolls" is present * singular forms consistently cascade downward when plurals exist

These biases shape which connections appear, how strongly they're weighted, and where they rank in each word cloud. We've made deliberate choices to create a semantic network optimized for engaging gameplay--favoring conceptual diversity over raw frequency, meaningful connections over statistical noise. The Linguabase represents one coherent mapping of English's semantic landscape, designed to reveal the surprising paths that connect all words.

How Network Properties Enable Gameplay

The mathematical properties of our semantic network create natural game parameters:
* 17 word choices per hop (curated from the top 40)
* 7 maximum path length (hops)
* 3 minimum puzzle distance (hops)
* 27 Genius solutions per puzzle (3^3 optimal paths)

We tested various difficulties and settled on 3-7 hops. Below 3 felt trivial; above 7, players gave up.
The 3-hop puzzles naturally yield 27 solutions when we maintain 3 strong choices per step. For virtually any common word selected as a puzzle origin, there are ~370 million outward paths within 7 hops (about 10% less than the 17^7 = 410 million theoretical maximum, due to natural graph loops). Within those paths, only 200k-1 million reach the target--a random success rate of 0.05-0.27%. Players succeed at much higher rates because they navigate semantically rather than randomly. Our puzzles are engineered to ensure at least 3 good choices per hop, creating exactly 27 optimal three-hop "Genius" solutions (3^3 paths).

Theoretical Maximum (if no word overlap)
* 3 hops: 17^3 = 4,913 paths
* 4 hops: 17^4 = 83,521 paths
* 5 hops: 17^5 = 1,419,857 paths
* 6 hops: 17^6 = 24,137,569 paths
* 7 hops: 17^7 = 410,338,673 paths

Measured Reality (with semantic overlap)
* Total paths: ~370 million (90% of theoretical)
* Winning paths: 200k-1 million
* Beyond the game limit: ~94% require 8+ hops

Curious how we transformed this linguistic database into a daily word game? Read Making the Game to discover how we found the perfect game mechanic and balanced the difficulty.

Further Explorations
* Making the Game - How play mechanics evolved from mapping language to finding optimal 3-hop paths between distant concepts
* Other Semantic Games - A curated showcase of meaning-based word games, from 1960s synonym chains to NYT's Connections

---------------------------------------------------------------------

References
1. WordNet. Princeton University. Hand-built database of 155k words in synonym groups with typed relationships. The NLP standard for decades, but missing compounds, rankings, and everyday associations.
2. Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41-78. Showed that semantic networks have small-world structure; our 6.43 mean path length matches their findings.
3. Ferrer i Cancho, R., & Solé, R. V. (2001). The small world of human language. Proceedings of the Royal Society B, 268(1482), 2261-2265. Any two words connect in 2-3 hops via co-occurrence, showing that language is naturally navigable.
4. Roget's Thesaurus (1852). 15,000 words in 1,000 concepts. The original meaning-based organization.
5. Word embeddings research (2010s). Various attempts to handle polysemy in vector models; none solved the fundamental one-vector-per-word limitation.
6. Small World of Words (SWOW-EN18, 2019). 12,000 words with crowd-sourced associations. Validates that our connections match human intuitions.

---------------------------------------------------------------------

By Michael Douma, Greg Ligierko, Li Mei, and Orin Hargraves
A product of the Institute for Dynamic Educational Advancement (IDEA.org)