[HN Gopher] Compressing Icelandic name declension patterns into ...
___________________________________________________________________
Compressing Icelandic name declension patterns into a 3.27 kB trie
Author : alexharri
Score : 189 points
Date : 2025-08-02 11:28 UTC (11 hours ago)
(HTM) web link (alexharri.com)
(TXT) w3m dump (alexharri.com)
| jedimastert wrote:
| It's like an interview question from hell. Reversing a trie is
| one of those things that I might only ever use once in my life,
| but that one time I will look like an absolute wizard.
| treetalker wrote:
| I remember that when I was first learning Spanish in high school,
| I found a piece of (Windows) software that pelted you with a
| series of pairs of an infinitive and a tense, and you had to
| conjugate the infinitive accordingly. (Spanish conjugation
| typically changes the end of the word; irregular verbs tend to
| involve stem changes). It was fantastic practice and really
| ingrained the rules; I became a whiz at it.
|
| When I started learning Russian, the declensions (like the ones
| mentioned in the article) really threw me for a loop. I looked
| all over for a similar app to explain the patterns and drill rote
| practice, but never found one.
|
| While slightly off-topic, does anyone know of such an app (web-
| based or macOS/iOS)?
| leobg wrote:
| https://memrussian.com/?
| netsharc wrote:
| Grandfather talks about classical Windows software. On the
| Play Store this app says "Contains ads - In-app purchases".
|
| Ah, as a cheap bastard, I hate how software was pay once back
| then, and for this one I'm just going to ask you what's the
| monthly subscription price?
| mpascale00 wrote:
| This comes up in so many threads here... How can we change
| the culture of subscriptions back to pay once???
| necovek wrote:
| It's not really about the culture anymore. Software that
| requires maintenance -- and most does -- has a continuous
| development cost. As such, subscription is the most
| natural way to cover it.
|
| On the other hand, we have software which has low
| maintenance cost, but sold for peanuts ($0-$10) in small
| quantities, so authors try to introduce alternative
| revenue streams.
|
| As in, it's fair to pay continuously (subscription) for
| continuous work (maintenance), so I don't expect that to
| go away. Ads, though, yuck...
| sneak wrote:
| Software sold today does not require maintenance.
| Software to work in the future requires maintenance. I am
| not buying future software. I am buying today software.
|
| Increasingly I am not buying software at all.
| perching_aix wrote:
| This is a good argument in favor of subscriptions not
| being mandatory, but not in favor of the abolishment of
| subscriptions overall, which is what they were talking
| about.
| 3036e4 wrote:
| That is the old way. You bought some application and it
| came with upgrades until next major version release or
| similar. Then when that release came out you could decide
| to pay again or just keep using the old (now unsupported)
| version you already paid for.
|
| That solved all the issues with paying for maintenance,
| but sadly someone must have figured out a mandatory
| subscription was a better way to make more money.
| sgarland wrote:
| On the contrary, software today is so absurdly buggy that
| it often does require maintenance to work.
| charcircuit wrote:
| Even ignoring security, bug fixes, new features, etc it
| is also not fair that you can get value from the app
| every month, but the developer doesn't get to capture a
| reward for any of this value. Having people pay monthly
| for value they get monthly seems reasonable.
| BenjiWiebe wrote:
| Does that mean you'd be in favor of subscriptions for
| owning a vehicle, rather than paying outright? Or a
| house?
|
| The manufacturer/builder gets paid once, and you get
| value monthly.
| charcircuit wrote:
| Leasing cars and renting houses is already a common
| practice. So yes I believe these make sense to exist.
|
| The existence of purchasing cars and houses with no
| ongoing cost to the builder is due to competition.
| mpascale00 wrote:
| I disagree. You can read a book or listen to a record,
| watch a dvd, unlimited times, having fairly paid upfront
| a price for the item. A computer is general purpose and
| lets you check your email every day, hell even lets you
| create new value in the form of new software, without the
| manufacturer receiving a royalty.
|
| The idea of _capturing reward_ post-receipt is
| feudalistic.
| charcircuit wrote:
| The existence of products in competitive markets is not a
| counter example to what my point was. I recommend looking
| at the terms bottom up pricing and top down pricing. The
| former is about creating a price based off of how much it
| costs to do business and then adding a profit margin. The
| latter is creating price in line with how much value it
| offers customers. The existence of products using bottom
| up pricing doesn't mean top down pricing does not exist.
| necovek wrote:
| That's not how markets work (and I disagree that it would
| be reasonable).
|
| Price is usually established based on how much something
| cost to make (materials, effort, profit), combined with
| market conditions (abundance/shortage of products,
| surplus cash/tough economy...).
|
| If you want to continuously extract profit from
| consistent use of a hammer or vacuum cleaner, somebody
| else will trivially make a competing product at a lower
| price with no subscription.
| sgarland wrote:
| Given how profitable it is, I doubt it'll be changed.
|
| That said, I very much like Codeweavers' approach [0],
| which IMO is the modern equivalent to purchasing software
| on a physical medium: you buy it, you can re-download it
| as many times as you'd like, install it on as many
| machines as you'd like (single-user usage only), and you
| get 1 year of updates and support. After that, you can
| still keep using it indefinitely, but you don't get
| updates or paid support. You get a discount if you renew
| before expiry. They also have a lifetime option which, so
| far, they've not indicated they're going to change.
|
| I have no affiliation with them, I just think it's a good
| product, and a good licensing / sales model.
|
| [0]: https://www.codeweavers.com/store
| mpascale00 wrote:
| Profitable for sure, but I'm often half surprised by the
| lack of competition against subscription-based everything
| these days.
| nsksl wrote:
| Find a pirate version if possible...
| GuB-42 wrote:
| I don't know about this app but many of the "Contains ads -
| In-app purchases" apps offer to remove the ads for a one-
| time payment.
| yorwba wrote:
| You might be able to build something similar yourself using
| declension data extracted from Wiktionary using wiktextract:
| https://github.com/tatuylonen/wiktextract#pre-extracted-data
| jeffwass wrote:
| When I was learning Spanish (on my own) 25 years ago I had a
| Spanish/English dictionary. It only translated verbs to Spanish
| infinitive, but each had a numerical index mapping it to a
| class of verbs with the same conjugation pattern.
|
| There was a section at the front of the dictionary with full
| conjugation patterns over all tenses for one sample verb in
| each class.
|
| Eg, each type of stem-changing verb fell into one index, full
| irregulars were singletons in their own class, some irregulars
| that behave similarly (iirc tener and detener) shared one
| class.
|
| So all verbs in Spanish fell neatly into a few dozen unique
| patterns, and the indexing was already done.
|
| I was going to build a quiz software just like you mentioned to
| conjugate any verb in any tense, but "never got around to it".
|
| I wonder how the reverse-string trie pattern in the article
| would be for reconstructing the class mapping.
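
A rough sketch of how a reversed-string trie could map verb endings to a conjugation class, as wondered about above. The rule table and class labels are made up for illustration (they are not the dictionary's actual index), and this is a plain longest-suffix lookup, not the compression scheme from the article.

```javascript
// Build a trie keyed on reversed strings, so that walking it from the
// last letter of a word follows that word's suffix.
function buildSuffixTrie(rules) {
  const root = { children: new Map(), value: undefined };
  for (const [suffix, value] of rules) {
    let node = root;
    for (const ch of [...suffix].reverse()) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), value: undefined });
      }
      node = node.children.get(ch);
    }
    node.value = value;
  }
  return root;
}

// Return the value of the longest stored suffix of `word`, or undefined.
function lookupBySuffix(root, word) {
  let node = root;
  let best = root.value;
  for (const ch of [...word].reverse()) {
    node = node.children.get(ch);
    if (!node) break;
    if (node.value !== undefined) best = node.value;
  }
  return best;
}

// Hypothetical rules: three regular endings plus one irregular family.
const verbClasses = buildSuffixTrie([
  ["ar", "regular -ar"],
  ["er", "regular -er"],
  ["ir", "regular -ir"],
  ["tener", "like tener"],
]);

console.log(lookupBySuffix(verbClasses, "hablar"));  // regular -ar
console.log(lookupBySuffix(verbClasses, "detener")); // like tener
```

The longer "tener" rule wins over the generic "er" rule because the walk keeps the last value it passes, which is how "detener" ends up in the same class as "tener".
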
| kashunstva wrote:
| > ... learning Russian... explain the patterns... such an app
|
| Non-native Russian speaker here. In the past, I cobbled
| together some scripts that use the spaCy Python module with the
| larger of the two Russian modules to provide context-aware
| lemmatization and grammatical tag extraction.
|
| On the whole, though, my biggest gains in Russian were in
| letting go of the need to analytically deconstruct the
| inflections and instead build up a mental library of patterns
| (and exceptions) in my head through use.
|
| EDIT: I mean context within a sentence, not a broader meaning.
| Rendello wrote:
| There's some Anki (flashcard) decks that use the "KOFI" method:
|
| > KOFI (Konjugation First) is the name I've given to a
| provocative language-learning approach I've created: to learn
| all the forms of a language's conjugation before even starting
| to formally study the language
|
| I used the French one, years after I learned French, because my
| conjugation was abysmal. You can get by using basic tenses or
| wrong tenses, and people will understand you, but it's not what
| you want. The KOFI method is supposed to teach you all the
| conjugation patterns in a matter of months _before_ learning
| the language. I'd like to give it a try in earnest some day
| for a new language. My interest in French has waned so I didn't
| stick with it.
|
| https://ankiweb.net/shared/info/1131659186
| gametorch wrote:
| I used Clozemaster effectively to learn Russian. It's not
| exactly what you describe, but you can fly through many
| "clozes" to ingrain the patterns into your brain.
| jdcarr wrote:
| I use ConjuGato on iOS for practicing Spanish conjugations.
| There's a game mode where you're given an
| infinitive/tense/person and think of the conjugation and you
| can filter it down to solely irregular verbs to learn the
| exceptions.
| LoneGeek wrote:
| If you can read Russian, there is a Python app for
| morphological analysis called pymorphy3. Documentation:
| https://pymorphy2.readthedocs.io/en/stable/.
|
| It is based on an OpenCorpora dictionary:
| https://opencorpora.org/dict.php
|
| This dictionary is based on a Zaliznyak dictionary, which is
| always referenced in Wiktionary's articles.
| lifthrasiir wrote:
| A possible alternative, especially for beygla/strict, would be
| perfect hashing.
| Scaevolus wrote:
| You can compress even better than standard perfect hashing
| because not all values are unique, so collisions might allow
| you to store multiple name -> suffix combos in the same bucket.
|
| Of course, that would mean you lose the ability to say "name
| not handled".
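
A minimal sketch of the collision-tolerant idea, assuming a brute-force seed search rather than a real perfect-hashing construction. `hash`, `buildTable` and `lookup` are hypothetical names, and the three name/pattern pairs are based on examples mentioned elsewhere in this thread. Because no keys are stored, a name that was never in the input still gets some pattern back, which is the trade-off noted above.

```javascript
// 32-bit FNV-1a style string hash mixed with a seed.
function hash(str, seed) {
  let h = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 16777619);
  }
  // Extra mixing so the low bits are not a simple function of the seed.
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Try seeds until every name lands in a slot that holds its own pattern.
// Two names may share a slot as long as they share the pattern.
function buildTable(nameToPattern, slotCount, maxSeed = 100000) {
  const entries = Object.entries(nameToPattern);
  for (let seed = 0; seed < maxSeed; seed++) {
    const slots = new Array(slotCount).fill(null);
    let ok = true;
    for (const [name, pattern] of entries) {
      const i = hash(name, seed) % slotCount;
      if (slots[i] === null) {
        slots[i] = pattern;
      } else if (slots[i] !== pattern) {
        ok = false;
        break;
      }
    }
    if (ok) return { seed, slots };
  }
  return null; // no suitable seed found within maxSeed attempts
}

// No keys are stored, so this cannot report "name not handled".
function lookup(table, name) {
  return table.slots[hash(name, table.seed) % table.slots.length];
}

const names = { Baldur: "ur,ur,ri,urs", Astvaldur: "ur,,i,ar", Alex: ",,," };
const table = buildTable(names, 3);
if (table !== null) {
  console.log(lookup(table, "Baldur")); // the pattern stored for Baldur
  console.log(lookup(table, "Arthur")); // some pattern, although Arthur was never added
}
```
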
| kmmbvnr_ wrote:
| Doesn't that look like an interesting approach for highly
| optimized embeddings?
| robin_reala wrote:
| No idea if Rails copes with this automatically, but it feels like
| the sort of magic it's historically been really good at. I
| remember reading the source code for `pluralise` and finding that
| someone had encoded the pluralisation rules including irregular
| cases for Welsh.
| Alifatisk wrote:
| Love Rails, there is a method for everything
| dmurray wrote:
| For the 800 names that were missing declension data in the
| database, it seems like the most straightforward thing to do
| would be to assign their declensions by hand. It shouldn't take a
| native speaker more than a couple of hours (if some name they
| haven't seen before is ambiguous, then whatever they guess at
| least won't sound obviously wrong to other native speakers).
| Alternatively, very cheap to ask an LLM to do it.
|
| Encoding them into a trie like this would still be a good way to
| distribute the result, but you don't have to rely on the trie
| also being a good way to guess the declensions.
| perching_aix wrote:
| Yeah, that'd be a good idea. That said, it still wouldn't
| resolve the issue for names that are in-use despite not being
| approved (or foreign names).
|
| I also live in a country with a centrally governed personal
| name list, but you can request exceptions, and there are people
| who were born before the list existed, so their names won't
| necessarily be on the list either. Immigrants can also retain
| their names during naturalization I believe, and there can be
| lots of other complications still. So the ability to sorta-
| kinda predict the proper declension is still useful.
| thaumasiotes wrote:
| Related:
| https://en.wikipedia.org/wiki/Naming_laws_in_China#Ma_Cheng
| wizzwizz4 wrote:
| I see no reason that an LLM should be better at guessing than a
| trie (unless the actual example was in its training data, in
| which case a web search would be more appropriate).
| dmurray wrote:
| I agree. I just like having the guessing done at compile time
| on principle. It allows you to change a guess, if you find
| that it's wrong, and convince yourself that you haven't
| broken any of the other cases where you were previously
| accidentally right.
| esafak wrote:
| I wonder if existing LLMs already know these patterns?
| jer0me wrote:
| The Icelandic government has been proactive about helping
| OpenAI train its models on the language to stave off
| extinction: https://openai.com/index/government-of-iceland/
| xigoi wrote:
| If only they'd rather support open-source models so the future
| of the language is not in the hands of a single foreign
| corporation...
| thaumasiotes wrote:
| Yes, this is an example of a problem that an LLM is ideally
| suited to solve.
| alexharri wrote:
| It would be good to cover more names for sure -- that's an
| ongoing process at DIM. Names are frequently added to the
| approved list of Icelandic names, so there's always going to be
| some lag.
|
| I would not be confident enough to add the data myself
| since I'd probably be wrong a lot of the time. When reviewing
| the results for the top 100 unknown names I frequently got
| results that I thought _might_ be wrong, but I wasn't sure. For
| those, I looked up similar names in DIM to verify, and often
| thought "huh, I would not have declined those names like this".
| For that reason, I rely on the DIM data as the source of truth
| since it's maintained by experts on the language.
| alucardo wrote:
| Hmm, is this lib GDPR compliant?
| detaro wrote:
| Why wouldn't it be?
| bot403 wrote:
| If this isn't compliant then neither are name day calendars or
| baby name websites.
|
| It's not a privacy issue if it's just "someone's" name.
| kiicia wrote:
| GDPR is about accountability for handling identifiers like full
| name of actual person. Using parts of names, where each part
| does not identify any particular person, in generalized list
| like described here does not fall under GDPR.
| shagie wrote:
| There are a relatively finite number of Icelandic names.
| https://en.wikipedia.org/wiki/Icelandic_Naming_Committee
|
| > A name not already on the official list of approved names
| must be submitted to the naming committee for approval. A new
| name is considered for its compatibility with Icelandic
| tradition and for the likelihood that it might cause the bearer
| embarrassment. Under Article 5 of the Personal Names Act, names
| must be compatible with Icelandic grammar (in which all nouns,
| including proper names, have grammatical gender and change
| their forms in an orderly fashion according to the language's
| case system).
|
| A database of those names is no more interesting or personal
| than a dictionary or list of names (
| https://www.insee.fr/en/statistiques/6536067 ) in another
| language... which is where they got the data.
|
| > Iceland has a publicly run institution, Arnastofnun, that
| manages the Database of Icelandic Morphology (DIM). The
| database was created, amongst other reasons, to support
| Icelandic language technology.
|
| https://bin.arnastofnun.is/DMII/aboutDMII/
|
| There is no more personal information being presented than
| saying John or providing
| https://en.wikipedia.org/wiki/John_(given_name) or
| https://www.wolframalpha.com/input?i=John
|
| John may be _your_ given name, but that data isn't personal
| data. One of the numbers 1969, 1978, 1987, 1996 might be your
| birth year... but https://oeis.org/A101039 isn't personal
| information either. Combining John with Smith and 1978 as the
| year of someone's birth... now you've got personal information
| that would be covered by the GDPR.
| ralferoo wrote:
| That's not quite what qualifies it as PII.
|
| > John may be your given name, but that data isn't personal
| data. One of the numbers 1969, 1978, 1987, 1996 might be your
| birth year... but https://oeis.org/A101039 isn't personal
| information either. Combining John with Smith and 1978 as the
| year of someone's birth... now you've got personal
| information that would be covered by the GDPR.
|
| Just the facts "John" or "Smith" or "1978" aren't PII, but
| any single one attached to some other data is, because then
| that provides partial identification of that other data. So,
| for instance an attribution of a forum post to "John" is PII,
| even if there are thousands of other Johns using the system.
|
| Actually, even that's not necessarily true. The mere fact
| that you are acknowledging a user exists with that name may
| make it PII. It's not a big deal to say our usernames include
| "John", "Mark", etc if there are literally thousands of them,
| but it's a big deal if one of the usernames is an incredibly
| rare name or spelling. In this case, the list presented in
| the article isn't PII, because the list is just a list of
| names downloaded from a government site that represent
| possible acceptable names. Just having that list provides no
| information about whether anyone with any of those names is
| using your service.
| radpanda wrote:
| > There are, in fact, 88 approved Icelandic names with this exact
| pattern of declension, and they all end with "dur", "tur" or
| "ður".
|
| ...
|
| > But that quickly breaks down. There are other names ending with
| "ður" or "dur" that follow a different pattern of declension
|
| My "everything should be completely orderly" comp-sci brain is
| always triggered by these almost trivial problems that end up
| being much more interesting.
|
| Is the suffix pattern based on the pronunciation of the
| syllable(s) before the suffix? If one wanted to improve upon your
| work for unknown names, rather than consider the letters used,
| would you have to do some NLP on the name to get a representation
| of the pronunciation and look that up (in a trie or otherwise)?
| dmit wrote:
| > Is the suffix pattern based on the pronunciation of the
| syllable(s) before the suffix?
|
| Careful, this is how you fall down the Are Dependent Types The
| Answer?? hole.
| perching_aix wrote:
| Not sure what that's supposed to mean, but if Icelandic is
| anything like my native language in this, then it _is_ indeed
| a pronunciation based thing. Which should make sense, since
| languages are (historically) spoken first, written second.
| dmit wrote:
| Heheh, it was mostly a reference to my [and mostly
| others'!] experiments with encoding human languages in a
| programming language. There are some pretty neat ideas
| there to explore, like the difference between Subject-
| Object-Verb (SOV) and Object-Subject-Verb. Or postfix
| languages (e.g. Forth) mapping to some human languages.
|
| In this particular example, having a subsequent part of an
| expression rely on prior parts would usually be
| accomplished at runtime in most languages. But some (like
| Idris) might allow you to encode the rules in the type
| system. Thus the rabbit hole.
| perching_aix wrote:
| Ah okay. That's a journey I'm currently also preparing to
| embark on, though from the other direction: I'm trying to
| generate "natural" language from program code. I already
| know it's pretty hopeless, but increasingly I feel like
| it's not really a choice anyhow, so I may as well finally
| have a go at it. Let's see :)
| dmit wrote:
| Godspeed!
| alexharri wrote:
| Hmm, good idea. There are names that have the exact same
| pronunciation yet have different patterns of declension, for
| example:
|
| - Astvaldur -> ur,,i,ar
| - Baldur -> ur,ur,ri,urs
|
| The "aldur" ending is pronounced in the exact same manner, but
| applying the declension pattern of "Astvaldur" to "Baldur"
| would yield:
|
| - Baldur
| - Bald
| - Baldi
| - Baldar
|
| The three last forms feel very wrong (I asked my partner to
| verify and she cringed).
|
| Spoken Icelandic is surprisingly close to its written form. I
| wouldn't expect very different results for the trie if a
| "phonetic" version of names and their endings were used instead
| of their written forms.
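
The mismatch described above can be reproduced with a small sketch. It assumes a pattern lists the suffixes for the nominative, accusative, dative and genitive in that order, and that the first entry is the suffix stripped from the name. `applyPattern` is a hypothetical helper, not the library's API.

```javascript
// Apply a comma-separated suffix pattern to a name given in the nominative.
function applyPattern(name, pattern) {
  const suffixes = pattern.split(",");
  const nominativeSuffix = suffixes[0];
  if (!name.endsWith(nominativeSuffix)) {
    throw new Error(`"${name}" does not end with "${nominativeSuffix}"`);
  }
  // Strip the nominative suffix to get the stem, then attach each case suffix.
  const stem = name.slice(0, name.length - nominativeSuffix.length);
  return suffixes.map((suffix) => stem + suffix);
}

console.log(applyPattern("Baldur", "ur,,i,ar"));
// [ 'Baldur', 'Bald', 'Baldi', 'Baldar' ]
console.log(applyPattern("Baldur", "ur,ur,ri,urs"));
// [ 'Baldur', 'Baldur', 'Baldri', 'Baldurs' ]
```

The first call shows the wrong forms quoted in the comment; the second uses the pattern listed for "Baldur".
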
| sneak wrote:
| This seems complicated.
|
| Why not just reuse the existing standard and change everyone's
| last names to Kim, Lee, or Park?
| dmit wrote:
| > everyone's last names
|
| *surnames. Not last in that case, whatever the case is you're
| trying to make.
| yujzgzc wrote:
| Valiant effort at old-school engineering applied to a niche
| problem. (Iceland has a population of only around 400,000
| people!) As much as I love the geekery of this stuff though,
| isn't it already a better ROI to get an LLM to generate the
| strings you need? It has its own other problems (not claiming
| it'll be perfect) but for something so language related, it makes
| a lot of sense. Would also work for other languages that have the
| same problem with declension of proper nouns like Russian or
| Finnish.
| tomsmeding wrote:
| The article describes that a government body is using this
| library to generate indictments. In that situation, you do
| _not_ want something that is mostly usually correct. Indeed,
| they asked the author for a strict version that does not try to
| guess the declension of unknown names based on their suffix,
| presumably so that they can just not decline them, which is
| better than picking the wrong declension 0.05% of the time.
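
A sketch of the difference between a strict lookup and one that guesses. Everything here is an assumption made for illustration: `KNOWN_PATTERNS`, `SUFFIX_GUESSES`, `findPattern` and `declineName` are hypothetical names and do not reflect the library's actual API or data.

```javascript
// Names with a known pattern (nominative, accusative, dative, genitive).
const KNOWN_PATTERNS = new Map([
  ["Baldur", "ur,ur,ri,urs"],
  ["Alex", ",,,"],
]);

// Hypothetical fallback rules: name ending -> guessed pattern.
const SUFFIX_GUESSES = [["ur", "ur,,i,ar"]];

function findPattern(name, strict) {
  if (KNOWN_PATTERNS.has(name)) return KNOWN_PATTERNS.get(name);
  if (strict) return null; // strict mode never guesses
  let best = null;
  for (const [suffix, pattern] of SUFFIX_GUESSES) {
    if (name.endsWith(suffix) && (best === null || suffix.length > best.suffix.length)) {
      best = { suffix, pattern };
    }
  }
  return best === null ? null : best.pattern;
}

// caseIndex: 0 = nominative, 1 = accusative, 2 = dative, 3 = genitive.
function declineName(name, caseIndex, strict) {
  const pattern = findPattern(name, strict);
  if (pattern === null) return name; // unknown name: leave it undeclined
  const suffixes = pattern.split(",");
  const stem = name.slice(0, name.length - suffixes[0].length);
  return stem + suffixes[caseIndex];
}

console.log(declineName("Baldur", 2, true));  // Baldri
console.log(declineName("Arthur", 3, true));  // Arthur
console.log(declineName("Arthur", 3, false)); // Arthar
```

In strict mode the unknown name comes back unchanged, which is the "just not decline them" behaviour described above.
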
| silvestrov wrote:
| One more optimization idea: instead of the trie mapping to the
| suffix string directly, make an array of unique suffixes and let
| the trie map to the index into the array, e.g.
|
|     const suffixes = [",,,", "a,u,u,u", ",,i,s", ",,,s", "i,a,a,a", ...];
|
| and then use the index into this list in
|
|     var serializedInput = "{e:{n:{ein:0_r: ...
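
A small sketch of the interning step suggested above, assuming the trie leaves are available as a flat key-to-pattern object. `internSuffixes` and the `leaf1`..`leaf4` keys are placeholders, not the library's real serialization format.

```javascript
// Replace each repeated suffix pattern with an index into a shared array.
function internSuffixes(leafToPattern) {
  const suffixes = [];          // unique patterns, in first-seen order
  const indexOf = new Map();    // pattern -> position in `suffixes`
  const leafToIndex = {};
  for (const [leaf, pattern] of Object.entries(leafToPattern)) {
    if (!indexOf.has(pattern)) {
      indexOf.set(pattern, suffixes.length);
      suffixes.push(pattern);
    }
    leafToIndex[leaf] = indexOf.get(pattern);
  }
  return { suffixes, leafToIndex };
}

const result = internSuffixes({
  leaf1: ",,,",
  leaf2: "a,u,u,u",
  leaf3: ",,,",
  leaf4: ",,i,s",
});
console.log(result.suffixes);    // [ ',,,', 'a,u,u,u', ',,i,s' ]
console.log(result.leafToIndex); // { leaf1: 0, leaf2: 1, leaf3: 0, leaf4: 2 }
```

Whether this shrinks the shipped file depends on the compressor applied afterwards, so it is worth measuring both the raw and the gzipped size.
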
| KTibow wrote:
| I (Claude Code) tried this and it actually increased the
| gzipped size by 100b (3456 -> 3556), only reducing the non-
| compressed size by 20%, likely because gzip is really good at
| interning repeated patterns already.
| contravariant wrote:
| You could go a step further by putting the suffixes themselves
| into the trie and then identifying identical subtrees.
|
| If you can use gzip there's bound to be a clever way of using a
| suffix array as well, that might end up being better unless you
| can use an optimised binary format for the tree.
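
One way to identify identical subtrees is to canonicalize the trie bottom-up and reuse a node whenever the same shape has been seen before. The sketch below assumes nodes are plain objects keyed by letter with pattern strings as leaves; `shareSubtrees` and the placeholder data are hypothetical.

```javascript
// Return { id, node } for `node`, reusing an earlier node when an identical
// subtree (same letters, same children, same leaf patterns) already exists.
function shareSubtrees(node, cache = new Map()) {
  let key;
  let canonical;
  if (typeof node === "string") {
    key = "leaf:" + node;
    canonical = node;
  } else {
    canonical = {};
    const parts = [];
    for (const letter of Object.keys(node).sort()) {
      const child = shareSubtrees(node[letter], cache);
      canonical[letter] = child.node;
      parts.push(letter + "=" + child.id);
    }
    key = "node:" + parts.join("|");
  }
  if (!cache.has(key)) {
    cache.set(key, { id: cache.size, node: canonical });
  }
  return cache.get(key);
}

// Placeholder trie: branches "a" and "b" have identical subtrees.
const trie = {
  a: { x: "p1", y: "p2" },
  b: { x: "p1", y: "p2" },
};
const cache = new Map();
const shared = shareSubtrees(trie, cache);
console.log(shared.node.a === shared.node.b); // true
console.log(cache.size);                      // 4
```

After sharing, the structure is a directed acyclic graph rather than a tree, so a serializer would need to emit each distinct node once and refer to it by id.
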
| ryanjshaw wrote:
| An interesting article but I was surprised there was no
| discussion about what humans do to address this problem?
| Zanfa wrote:
| They stick with the nominative case. That's the only safe way
| not to butcher somebody's name in a language like Estonian that
| has 14 cases. It's infinitely easier to update copy to use only
| nominative than try to apply the cases automatically.
| alexharri wrote:
| As a native Icelandic speaker, I have an intuition for how to
| decline names -- I don't really think about it consciously. I'd
| assume that for most people it's just pattern matching.
|
| Native speakers very frequently decline names in ways that are
| not technically perfect but sound correct enough. For example,
| my name (Alex) should not be declined, but people frequently
| use the declension pattern (Alex, Alex, Alexi, Alexar).
|
| There's some parallel to be drawn with how the compressed trie
| applies patterns that it's learned to names. That's at least
| how I thought about it when designing the library.
| mikepurvis wrote:
| I'm surprised there'd be a benefit to doing this in the JS vs
| having your database just return all the cases with the name and
| then you select which one you need at display time -- basically
| in the same layer that's populating your localized language
| templates.
|
| That said I'm curious how this manifests with cross-language
| situations. I guess the Icelandic UI displaying French names
| would just always use the nominative case, and likewise for the
| English UI displaying Icelandic names? I assume this all mostly
| matters where the user is directly being addressed, or perhaps in
| an admin panel ("user x responded to user y").
| tempodox wrote:
| Is Icelandic name declension deterministic enough that this
| method reliably works? That would be a lucky break. Language is
| typically quite messy.
| nkrisc wrote:
| It probably helps that Iceland has a relatively small
| population and the language is actively managed by the
| government.
| ralferoo wrote:
| I mean, it's an interesting problem for Icelandic sites, but
| because he's explaining the basic concepts of how declensions
| work, it seems like he's aiming this at non-Icelandic developers.
| If they were to use this, no doubt it'll end up butchering names
| in some other language and lead to all manner of hard to track
| down bugs.
|
| For example, if an English person called Arthur uses the site in
| Icelandic, I'm not sure they'd expect their name to be changed to
| presumably "Arth", "Arthi" or "Arthar" even if they were a keen
| learner of Icelandic. Their name is their name. So, as well as
| storing someone's name, you also have to ask them what language
| their name is, or guess and get it wrong. At that point, you
| might as well just ask them for all the different forms for the
| name as well, and then you don't have to worry about whether
| their name is on an approved list or not.
|
| And if the website isn't localised into Icelandic, I've also got
| to wonder if Icelandic visitors would have an expectation of
| Icelandic grammar rules being applied to English (or whatever)
| text. Most Icelandic people I've spoken to before have an
| excellent command of English anyway, and I'm sure they'd
| understand why their name isn't changing form in English.
| pelorat wrote:
| Not sure how it is nowadays, but Iceland used to force anyone
| immigrating to officially change or "icelandify" their names.
|
| So if your name was Arthur, and you wanted to emigrate to
| Iceland, you would have to change your name.
|
| Might still be like this.
| SonOfLilit wrote:
| My brain is screaming that there has to be a solution in <1kb
| uncompressed (for the non-strict version).
|
| Maybe generating a minimal list of regexes that classifies 100%
| of names correctly? Maybe a big enough bloom filter? Maybe like a
| bloom filter but instead of hashes we use engineered features?
___________________________________________________________________
(page generated 2025-08-02 23:00 UTC)