The Small World of English

Building a word game forced us to solve a measurement problem: how do you rank 40+ ways to associate any given word down to exactly 17 playable choices? We discovered that combining human-curated thesauri, book cataloging systems, and carefully constrained LLM queries creates a navigable network where 76% of random word pairs connect in 7 or fewer hops--but only when you deprecate superconnectors and balance multiple ranking signals.

The resulting network of 1.5 million English terms reveals that nearly any two common words connect in 6-7 hops through chains of meaningful associations. The mean path length of 6.43 hops held true across a million random word pairs--shorter than we'd guessed, and remarkably stable.

1.5M headwords * 100M relationships * 7 or fewer degrees of separation for 76% of word pairs

This is consistent with the small-world structure and near-universal connectivity seen in lexical network research on smaller datasets.^1,2 The network's structure makes intuitive semantic navigation possible--players can feel their way through meaningful transitions: a crown's gemstones lead to emerald's foliage and finally to a forest canopy, or a flame becomes an ember, then a glowing memory, a mental recall, and finally the action to cancel.

Batman - vigilante - watchful - circumspect - inspect

The Mathematics of Semantic Distance

English exhibits network effects remarkably similar to social networks--nearly any random pair of words can reach each other in just a few hops through chains of meaningful associations. This "small world" phenomenon was first measured in word co-occurrence networks,^3 and it persists even after we deprioritize superconnector words that might otherwise dominate many paths. To probe this, we randomly sampled 1 million word pairs (4 days of processing on 32 cores) to get a strong statistical sample of the connected core of English (a minimal sketch of this measurement appears at the end of this overview).

How do you connect any two random words? Hop distance between words, as a share of random pairs:
* 1 hop: 0.01%
* 2 hops: 0.15%
* 3 hops: 2.07%
* 4 hops: 9.97%
* 5 hops: 21.58%
* 6 hops: 24.15%
* 7 hops: 18.25%
* 8 hops: 11.18%
* 9 hops: 6.19%
* 10+ hops: 6.45%

This bell curve centered at 5-6 hops creates ideal puzzle parameters. Here are some random examples at three distances (conjugations lead to longer paths):
* 2 hops: rhyme - beat - percussion
* 5 hops: outbreak - strife - contenders - finalists - runners-up - grand prizes
* 8 hops: grounding - anchoring - berth - sleeping - nocturnal - nightjar - chirring - bombylious - dronelike

[Screenshot: visual thesaurus mode reveals multiple senses and weighted connections for any word.]

Network Construction and Coverage

When we started this project, we tried the obvious approach: combining existing resources like WordNet with early-generation AI tools, including LDA topic modeling and static word vectors. WordNet gave us clean synonym sets but lacked the associative richness players expect ("coffee" - "morning," not just "beverage"). LDA found topical clusters but mixed unrelated terms that happened to co-occur. Word vectors collapsed all senses into single points, making "bank" (river) indistinguishable from "bank" (financial). These produced fragmented, overly generic relationships that lacked the nuance our game needed.

We capture 40 associations per term (enough for algorithmic flexibility) and display 17 in our interfaces (what users can reasonably process). This depth provides flexibility for both puzzle generation and reference use.
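As noted above, the random-pair measurement can be reproduced in miniature with breadth-first search over each word's ranked association list. Everything below is illustrative only: a tiny hypothetical graph stands in for the real 1.5-million-term network, and the code is a sketch rather than the project's production pipeline.

```python
import random
from collections import deque

# Tiny hypothetical stand-in for the real network: word -> ranked association list.
GRAPH = {
    "crown":    ["gemstone", "royalty", "throne"],
    "gemstone": ["emerald", "jewel", "crown"],
    "emerald":  ["green", "foliage", "gemstone"],
    "foliage":  ["canopy", "leaves", "forest"],
    "canopy":   ["forest", "shade", "treetop"],
    "forest":   ["trees", "trail", "canopy"],
    "royalty":  ["throne", "queen", "crown"],
    "throne":   ["crown", "royalty", "queen"],
    "jewel":    ["gemstone", "treasure", "crown"],
    "green":    ["emerald", "leaves", "foliage"],
    "leaves":   ["foliage", "green", "trees"],
    "trees":    ["forest", "leaves", "trail"],
    "trail":    ["forest", "hike", "trees"],
    "queen":    ["royalty", "crown", "throne"],
}

def hop_distance(source, target, graph, max_hops=10):
    """Breadth-first search: hops from source to target, or None if unreachable within max_hops."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        word, dist = frontier.popleft()
        if dist >= max_hops:
            continue
        for neighbor in graph.get(word, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None

def sample_distance_distribution(graph, n_pairs=1000, seed=42):
    """Estimate the hop-distance distribution from random word pairs."""
    rng = random.Random(seed)
    words = sorted(graph)
    counts = {}
    for _ in range(n_pairs):
        a, b = rng.sample(words, 2)
        d = hop_distance(a, b, graph)
        counts[d] = counts.get(d, 0) + 1
    return counts

print(hop_distance("crown", "forest", GRAPH))            # 4 hops in this toy graph
print(sample_distance_distribution(GRAPH, n_pairs=200))
```

At the scale described above--a million pairs over a 1.5M-word graph--such a sampling run would presumably be parallelized across cores, which is where days of multi-core processing go.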
1,525,522 headwords

We built a semantic network of 1.5 million English terms by casting a wider net than traditional resources. Where academic dictionaries drew sharp boundaries--excluding slang, technical jargon, compound phrases, and proper nouns--we included what people actually say and write: from "ice cream" to "thermodispersion," from "ghosting" to "Khao-I-Dang." This scale would have cost tens of millions of dollars to achieve manually. Consider the monumental pre-LLM efforts:

* WordNet (1985-2010) - Princeton's 25-year project produced 155,000 words in synonym groups. It became the NLP standard despite missing everyday compounds.
* OED (1857-1928, ongoing) - The definitive historical dictionary with 500,000+ entries. It took 70 years and thousands of contributors.
* Webster's Third (1961) - America's unabridged dictionary with 476,000 entries. It required 757 editor-years and $3.5 million ($50M+ today).
* Roget's Thesaurus (1852) - The original meaning-based reference with 15,000 words in 1,000 conceptual categories.

Word counts become arbitrary at this scale. Include every technical term, place name, and slang variant, and the count explodes. Whether we have 1.5 million or 2 million depends entirely on where you draw the line.

Atlas of Connected Meaning

Our inclusion criteria cast a wide net: all terms that volunteer lexicographers at Wiktionary have included (slightly more liberal than typical unabridged dictionaries), plus high-importance Wikipedia topics that are 1-3 words long (measured by PageRank), plus frequently produced compound terms generated by LLMs when analyzing 648,460 Library of Congress book classifications. Compound terms like "local governance" (appearing in 44,507 classifications) and "literary criticism" (19,417) made the cut, while "wild equids" (5 occurrences) did not.

What kinds of words? We include all of the kinds below, and to illustrate that there's no clear red line between inclusion and exclusion, here's a gradation of common and obscure examples of each...
Compounds & Phrases
* "health care" (standard compound)
* "cut the rug" (dated slang for dancing)
* "blow one's nose" (phrasal verb)
* "hatch, match, and dispatch" (British newspaper jargon)
* "make the welkin ring" (archaic for loud noise)

Slang & Neologisms
* "ghosting" (suddenly ending communication)
* "panda huggers" (political slang)
* "Devil's buttermilk" (euphemism for alcohol)
* "drungry" (drunk + hungry)
* "brass neck" (British for audacity)

Technical Jargon
* "antibiotics" (medical term)
* "barber-chaired" (logging accident)
* "lead plane" (wildfire aviation)
* "hemicorpectomy" (surgical removal)
* "photonephograph" (kidney imaging)

Species & Taxonomy
* "German shepherd" (dog breed)
* "dwarf sirens" (salamander family)
* "northern raccoons" (regional variant)
* "Angoumois moths" (Sitotroga cerealella)
* "grass crab spider" (specific arachnid)

Historical Language
* "thou" (archaic second person)
* "oftimes" (Middle English)
* "mean'st" (archaic conjugation)
* "crurifragium" (Roman execution)
* "naumachies" (staged naval battles)

Word Variations
* "running" (present participle)
* "rotavates" (tills with rotary blades)
* "masculises" (British spelling)
* "disappoynts" (16th-17th century)
* "mattifies" (makes matte)

Acronyms
* "GPS" (Global Positioning System)
* "CICUs" (Coronary Intensive Care Units)
* "HKPF" (Hong Kong Police Force)
* "MIMO-OFDM" (telecom standard)
* "3DTDS" (3-D structural term)

Places & Culture
* "Broadway" (NYC theater district)
* "Harsimus" (Jersey City district)
* "Altai kray" (Russian federal subject)
* "Khao-I-Dang" (refugee camp)
* "ballybethagh" (Irish land measurement)

Rare & Nonce
* "selfie" (once nonce, now standard)
* "greppable" (programmer slang)
* "kiteboating" (water sport)
* "quattrocentists" (1400s scholars)
* "noitamrofni" (information backwards)

Our analysis revealed a fundamental division in the network:
* Reachable terms (56.8%): 870,522 words that appear in the top-40 associations of at least one other word
* Unreachable terms (43.2%): 662,903 words that never appear in any other word's top-40 list

The unreachable terms include rare compounds ("stewing in one's own grease"), technical terminology ("thermodispersion"), proper nouns ("Besisahar"), and alternative capitalizations. While these terms can point to other words, no words point back to them strongly enough to rank in any top-40 list. This doesn't affect puzzles--which start from common words--but it reveals an interesting property of the semantic network.

Beyond Traditional Thesauri

Traditional thesauri focus on synonyms for abstract concepts and largely exclude concrete objects, a constraint of limited print space. Our visual thesaurus presents up to 8 contextual senses per term, each showing its own 17-word neighborhood. Just as our headword inclusion is necessarily arbitrary, so too is our sense distinction. LLMs identified these senses by querying with various prompts for different meanings and contextual flavors, then merging similar results. We capped each term at 8 senses, as more became unwieldy in the user interface. Whether "bank" gets 2 senses or 5, whether "coffee" as beverage differs from "coffee" as social ritual--these are judgment calls. Beyond homographs (words with identical spelling but different meanings, like "bass" for sound versus fish), we capture what we call "contextual flavors" within single senses.
'Coffee' connects to 'cafe' (location), 'beverage' (category), and 'espresso' (variety)--same core meaning, different facets. Our design philosophy centered on how people think of word associations--pools of related meanings that don't necessarily align with how dictionaries split formal senses or decide when meanings relate. This approach yields an average of 70 semantically connected words per headword across multiple senses, compared to 10-20 in traditional resources. Examples of our relationship types include:

* Similar meanings: house - domicile, lodge
* Category members: house - bungalow, villa
* Functional relationships: horse - saddle, bridle
* Cultural associations: breakfast - coffee, pastries
* Taxonomic connections: quark - boson, fermion
* Domain crossings: quark - Feynman (physics) or quark - cheese (food)
* Thematic groupings: hike, nature, trail

Altogether, this yielded approximately 100 million directed edges connecting our 1.5 million terms.

Try it yourself: What relates to "music"? Pick the 10 words you think best relate to "music." There's no perfect answer--that's the point.

Multiple Meanings as Network Bridges

English words often carry multiple meanings, creating natural bridges in the network:
* Double meanings - words with entirely different definitions: "bass" (sound/fish), "tear" (eye/rip)
* Related meanings - connected definitions: "head" as body part, leadership role, or ship's bow
* Contextual flavors - "hiking" as nature experience vs. physical exercise

These multi-sense words create semantic bridges between seemingly unrelated concepts. Words like "ground" can connect earth, coffee, and electrical circuits in a single conceptual leap. You'd think words with multiple meanings would connect distant parts of the network faster. It turns out they don't--they just give you more creative ways to navigate the same distance. Our analysis of 100k homograph-containing paths shows they average 6.57 hops versus the 6.43 random baseline. Instead of creating shortcuts, homographs sit in densely connected regions, offering creative routing options rather than efficiency gains.

The Bridges That Remain

To prevent too many paths from routing through generic hubs like "general" or "study," we systematically penalized superconnectors throughout our workflow. But which words still emerge as natural bridges after this filtering?

Try it yourself: After filtering out generic connectors, explore which words still bridge English's network.

These survivors represent genuine conceptual bridges--words that naturally connect different domains through polysemy ("polish" as verb/nationality), historical significance ("Renaissance"), or conceptual richness ("jazz" connecting musical techniques, cultural movements, and time periods). Their average position of ~2.2 hops from path origins shows they typically serve as the critical pivot point between disparate concepts.

So where did we get our data?

Five Data Sources

[Figure: the five complementary knowledge sources combining into the unified Linguabase semantic network.]
The Linguabase integrates five complementary knowledge sources, each contributing unique strengths to our amalgam scoring system, which combines multiple ranking signals--from word frequency and co-occurrence patterns to manually curated relationship scores:

1. In-House Lexicographic Work

Our lexicographer and a team of freelance grad students manually created specialized word lists for 5,000 varied topics, plus associations for polysemous terms and for word types, like interjections, that traditional lexicography treats as "stopwords." These lists cover many of the most important common terms with multiple meanings.

LLM Generation vs. Recognition
* Generation mode: "What relates to schizophrenia?" - hallucinations, delusions, antipsychotics, psychiatry [safe, clinical terms only]
* Recognition mode: "Is 'shamanism' related?" - Yes, through cultural interpretations of hearing voices and historical contexts [nuanced connection validated]

2. Mining 125 Years of Library Wisdom

We discovered that LLMs are much better at recognizing valid semantic relationships than generating them from scratch. Ask an LLM "What relates to coffee?" and you'll get predictable answers: beverage, caffeine, morning. But the Library of Congress classification system revealed that 'coffee' appears in 2,542 different book classifications--linking to 'fair trade certification' in economic texts, 'coffee berry borer' in Hawaiian agriculture books, and 'import-export tariffs' in 487 trade policy publications. These connections capture how coffee actually intersects with global commerce, agriculture, and regulation.

Coffee's 2,542 Library Contexts
* 487 Economics: fair trade, tariffs, commodity markets
* 312 Agriculture: berry borer, arabica, soil
* 208 Culture: cafe society, coffeehouse politics
* 89 Chemistry: caffeine extraction, roasting
* +1,446 more classifications across history, law, art, medicine...

Since 1897, LOC catalogers have encoded the intellectual connections among 17 million books, creating what is essentially a 125-year collaborative knowledge graph built by thousands of subject experts. Each classification represents a moment when a human expert decided "these concepts belong together"--and unlike web text, these decisions were expensive and permanent, made before SEO or engagement metrics existed.

Expert Curation vs. Crowd Wisdom
* Web text: "Coffee is life!" (1 million tweets) - coffee - morning, coffee - tired, coffee - addiction
* LOC classifications: "Coffee industry--Labor--Guatemala" (47 scholarly books) - coffee - fair trade, coffee - cooperatives, coffee - child labor

We gave an LLM a focused task: generate word lists for each of LOC's 648,460 classifications. A classification like "Hawaiian coffee trade" triggered specific, expert-like outputs--"kona coffee, arabica beans, coffee tariffs, pacific trade routes, coffee auctions"--far richer than asking generically about coffee. Each classification acted as a pre-engineered prompt that specified exactly which semantic neighborhood we wanted. "Schizophrenia--medical aspects" surfaced "atypical antipsychotic, dopamine antagonist," while "Schizophrenia--fiction" yielded "asylum writings, trauma memoirs, neurodivergent voices," capturing the full dimensionality of concepts.

Context Shapes Connections: Schizophrenia
* Medical context: dopamine antagonist, atypical antipsychotic, serotonin antagonist, bipolar disorder, clinical trials
* Fiction context: asylum writings, trauma memoirs, neurodivergent voices, madness in literature, unreliable narrator

The real magic came from inverting the index.
When we asked "Which classifications contain 'algorithm'?" we found it appearing not just in computer science but in "aleatory electronic music" (alongside John Cage and stochastic processes), "mathematics in arts" (with fractals and Fibonacci sequences), and "investment mathematics" (with portfolio optimization). The system surfaced connections that require domain expertise: 'Las Vegas' linking to 'Colorado River water rights' through 12 books about Nevada's water crisis, or 'origami' connecting to 'shell structures' and 'stress analysis' through engineering texts on deployable structures.

The Double Inversion Process
* Step 1 - Classification to terms: "Hawaiian coffee trade" - kona, arabica, tariffs, pacific routes...
* Step 2 - Which classifications' term lists co-occur with "algorithm"? Found with: John Cage, fractals, portfolio optimization...
* Step 3 - Build the co-occurrence network: algorithm - stochastic music (8.4), Fibonacci (7.2), fractals (6.8)

This approach gave us 3.1 million unique terms weighted by intellectual effort--a monograph on 'bank equipment' that mentions 'pneumatic tubes' (still used in 15 classifications!) counts more than casual blog mentions. Terms like "cultural heritage," appearing in 53,833 classifications, became superconnectors we could appropriately down-rank, while we preserved the "boring but essential" connections found in specialized journals, like "sewer pipe periodicals" that link urban infrastructure to public health.

Superconnector Term Penalties
* cultural heritage: 53,833 classifications, multiplier x0.15
* local governance: 44,507 classifications, multiplier x0.26
* tourism: 8,520 classifications, multiplier x0.72
* pneumatic tubes: 15 classifications, multiplier x0.95
Higher frequency means a lower multiplier, pushing the term down in rankings.

The process also revealed what we call the "Montreal effect"--where 'bagels' incorrectly associates with 'Expo 67,' 'McGill University,' and 'French-speaking' simply because Montreal is famous for its bagels. Our initial algorithm strengthened these geographic contaminations throughout the data. We resolved these spurious connections through subsequent LLM reviews that could distinguish true semantic relationships ("bagels - boiled dough - chewy texture") from coincidental geographic co-occurrence ("bagels - Montreal Canadiens").

The Montreal Effect: Geographic Contamination
* Geographic co-occurrence: bagels - Expo 67, bagels - McGill University, bagels - Montreal Canadiens, bagels - French-speaking
* True semantic relations: bagels - boiled dough, bagels - Jewish cuisine, bagels - sesame seeds, bagels - chewy texture

3. Human-Curated Resources

Over 70 existing references contributed--dictionaries, thesauri, and encyclopedias from Wiktionary and WordNet to specialized resources like NASA's thesaurus and the National Library of Medicine's UMLS. Relationships appearing across multiple sources received higher weights.

General Sources
* Wiktionary
* WordNet, ConceptNet, FrameNet
* Roget's Thesaurus
* SWOW-EN18

Specialized Sources
* Getty Art & Architecture
* NASA Thesaurus
* UMLS Metathesaurus
* AGROVOC Thesaurus

4. Pre-LLM Topic Extraction

Before the rise of modern LLMs, we applied Latent Dirichlet Allocation (LDA) in 2013-2014 to discover eight context clusters for every headword in an in-house corpus of notable literary works. The algorithm scans large text collections and groups words that appear in similar contexts. Running it took 200,000 supercomputer hours on the NSF's Extreme Science and Engineering Discovery Environment (XSEDE)--decades of work on a single machine.
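For readers unfamiliar with LDA, the toy sketch below shows the kind of topic extraction involved, using scikit-learn. The corpus, topic count, and parameters here are purely illustrative; the 2013-2014 run used a far larger corpus and a different, cluster-scale pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus; the real run processed an in-house corpus of notable literary works.
docs = [
    "the coffee trade shaped colonial agriculture and plantation economies",
    "espresso and caffeine rituals define the modern coffeehouse",
    "the orchestra tuned while the conductor studied the score",
    "jazz improvisation builds on rhythm harmony and swing",
    "river banks eroded as the flood waters rose",
    "the bank approved the loan after reviewing the mortgage",
]

vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(docs)     # document-term count matrix

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(word_counts)

# Show the top words in each discovered topic.
vocab = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_idx}: {', '.join(top_words)}")
```

The same grouping-by-context idea, applied to millions of documents with eight clusters per headword, is what consumed the supercomputer time.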
The real run's results were noisy: a delightful mix of intuitive associations and oddities (often caused by treating compound terms as separate words). Still, it surfaced relationships that pure frequency analysis and today's LLMs miss.

We skipped early word-embedding vectors--numeric coordinates that place context-similar words near one another but collapse all senses into a single point. As games like Semantle show, embedding distances rarely match human intuition, and the one-vector-per-word approach couldn't handle our need for multiple senses--a fundamental limitation that various researchers tried to patch.^5

5. Large Language Model Enhancement

Starting in 2023, frontier models finally provided the semantic understanding we needed--they could distinguish "bank" (river) from "bank" (money) and generate contextually appropriate associations for each. These models could handle:
* Everyday compound terms ("apple pie", "department store")
* Morphological variations across parts of speech
* Contextual dimensions of common words
* Capitalization distinctions ("China" vs. "china")

Still, left to their own devices, LLMs are banal and formulaic, wallowing in cliche and latching onto what they think prompts intend. We ran over 80 million API calls (~$200k in Azure API costs, plus minor xAI costs) across dozens of workflows to combat this tendency. Beyond the LOC classifications, we applied focused-prompt strategies across our entire corpus: extracting distinct senses for each headword, generating contextual word lists per sense, and prompting for cultural variations and regional differences. Each workflow fed into the next--outputs from sense detection became inputs for association generation, which informed cultural expansion passes. The key was always the same: constrained, specific prompts yielded far better results than open-ended queries.

Even with careful prompting, the Montreal effect persisted. Geographic contamination appeared throughout: 'Broadway' linked to 'taxis' through New York; 'grits' to 'jazz' through the American South. We resolved these spurious connections through iterative LLM reviews that learned to distinguish true semantic relationships from coincidental geographic co-occurrence.

This research and computational scale were made possible by $295k in NSF SBIR seed funding (#2329817) and $150k in Microsoft Azure compute resources.

Understanding Our Biases

Every semantic network encodes particular worldviews about which words relate to each other and how strongly they connect. Here are six key sources of bias that shape our network's rankings and inclusions:

Editorial Choices
Our lexicographer and team manually crafted relationships for common polysemous terms, inevitably encoding their linguistic backgrounds, cultural contexts, and conceptual frameworks about how meaning connects. Examples: "market" includes "variety" and "retail" but omits "souk" and "bazaar" * "breakfast" includes "cereal" and "toast" but omits "congee" and "idli" * "music" includes "jazz" and "consonance" but omits "gamelan" and "qawwali"

AI Training Data
GPT-4o's training data shapes its semantic associations, while its guardrails suppress certain connections. We supplemented with Grok-3 specifically for vulgar and offensive terms that GPT-4o wouldn't adequately cover. Examples: "sex" - clinical terms favored over colloquial language * "death" - euphemisms like "passing" prioritized over direct terms like "corpse" and "decay"

Superconnector Deprecation
No matter how a large thesaurus is constructed, certain terms seem to be ubiquitous. This is partially author bias, partially natural language structure, and worse with repetitive LLMs. We down-rank ubiquitous words like "heritage" and "surname"--a low-key version of inverse-frequency normalization. Our graduated penalty system scores 59,112 terms with an inverse document frequency (IDF) variant that down-ranks common terms (penalties 1-18). Surprisingly, penalty correlates with conceptual breadth, not raw frequency: "heritage" (penalty 18) appears only 201 times, while "tourism" (penalty 14) appears 8,520 times. Examples: at one processing stage, only 2 words get the maximum penalty of 18, "surname" and "heritage" * 46,445 words get the minimal penalty of 1 * "heritage" can connect to almost anything cultural, historical, or traditional

Prompting Cascades
Our multi-pass LLM workflow (listing senses - expanding culturally - reprocessing) introduces systematic preferences that affect both what gets included and how highly it ranks. Examples: geographic diversity is emphasized, so "dance" includes global forms equally * cultural foods are given rankings comparable to Western staples

Frequency ≠ Importance
Frequency is a useful but flawed proxy for word importance. It captures actual usage but creates artifacts: 'pandas' outranking 'panda,' 'cheesecake' outranking 'cheesecakes,' literary corpora overweighting 'thee,' technical terms underrepresented. Different corpora (books vs. screenplays) produce subtly different hierarchies, and none capture a word's actual utility for learners or gameplay. Examples: plural "pandas" outranks "panda" * "thee" elevated by Shakespeare * singular "cheesecake" is more common than "cheesecakes" * literary bias from Google N-grams

Morphological and Similarity Filters
Nobody wants word clouds full of plurals and variants. Our filter pushes 'baguettes' down 30 positions if 'baguette' already appears, and 'rolls' down 23 if 'roll' exists (>90% string similarity gets a +12 penalty, plural/singular differences +17, reordered compounds +17). Length penalties also apply progressively. Examples: in the word cloud for "bagels," "baguette" drops 30 positions because "baguettes" appears earlier * "roll" drops 23 positions when "rolls" is present * singular forms consistently cascade downward when plurals exist

These biases shape which connections appear, how strongly they're weighted, and where they rank in each word cloud. We've made deliberate choices to create a semantic network optimized for engaging gameplay--favoring conceptual diversity over raw frequency, meaningful connections over statistical noise. The Linguabase represents one coherent mapping of English's semantic landscape, designed to reveal the surprising paths that connect all words.

How Network Properties Enable Gameplay

The mathematical properties of our semantic network create natural game parameters:
* 17 word choices per hop (curated from the top 40)
* 7 maximum path length (hops)
* 3 minimum puzzle distance (hops)
* 27 Genius solutions per puzzle (3^3 optimal paths)

We tested various difficulties and settled on 3-7 hops. Below 3 felt trivial; above 7, players gave up.
The 3-hop puzzles naturally yield 27 solutions when we maintain 3 strong choices per step. For virtually any common word selected as a puzzle origin, there are ~370 million outward paths within 7 hops (about 10% less than the 17^7 = 410 million theoretical maximum, due to natural graph loops). Within those paths, only 200k-1 million reach the target--a random success rate of 0.05-0.27%. Players succeed at much higher rates because they navigate semantically rather than randomly. Our puzzles are engineered to ensure at least 3 good choices per hop, creating exactly 27 optimal three-hop "Genius" solutions (3^3 paths).

Theoretical Maximum (if no word overlap)
* 3 hops: 17^3 = 4,913 paths
* 4 hops: 17^4 = 83,521 paths
* 5 hops: 17^5 = 1,419,857 paths
* 6 hops: 17^6 = 24,137,569 paths
* 7 hops: 17^7 = 410,338,673 paths

Measured Reality (with semantic overlap)
* Total paths: ~370 million (90% of theoretical)
* Winning paths: 200k-1 million
* Beyond the game limit: ~94% require 8+ hops

Curious how we transformed this linguistic database into a daily word game? Read Making the Game to discover how we found the perfect game mechanic and balanced the difficulty.

Further Explorations
* Making the Game - How play mechanics evolved from mapping language to finding optimal 3-hop paths between distant concepts
* Other Semantic Games - A curated showcase of meaning-based word games, from 1960s synonym chains to NYT's Connections

---------------------------------------------------------------------

References
1. WordNet. Princeton University. Hand-built database of 155k words in synonym groups with typed relationships. The NLP standard for decades, but missing compounds, rankings, and everyday associations.
2. Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41-78. Showed that semantic networks have small-world structure; our 6.43 mean path length matches their findings.
3. Ferrer i Cancho, R., & Solé, R. V. (2001). The small world of human language. Proceedings of the Royal Society B, 268(1482), 2261-2265. Any two words connect in 2-3 hops via co-occurrence, showing that language is naturally navigable.
4. Roget's Thesaurus (1852). 15,000 words in 1,000 concepts. The original meaning-based organization.
5. Word embeddings research (2010s). Various attempts to handle polysemy in vector models; none solved the fundamental one-vector-per-word limitation.
6. Small World of Words (SWOW-EN18, 2019). 12,000 words with crowd-sourced associations. Validates that our connections match human intuitions.

---------------------------------------------------------------------

By Michael Douma, Greg Ligierko, Li Mei, and Orin Hargraves
A product of the Institute for Dynamic Educational Advancement (IDEA.org)