hngopher.com

       [HN Gopher] Compressing Icelandic name declension patterns into ...
       ___________________________________________________________________
        
       Compressing Icelandic name declension patterns into a 3.27 kB trie
        
       Author : alexharri
       Score  : 189 points
       Date   : 2025-08-02 11:28 UTC (11 hours ago)
        
 (HTM) web link (alexharri.com)
 (TXT) w3m dump (alexharri.com)
        
       | jedimastert wrote:
       | It's like an interview question from hell. Reversing a trie is
       | one of those things that I might only ever use once in my life,
       | but that one time I will look like an absolute wizard.
        
       | treetalker wrote:
       | I remember that when I was first learning Spanish in high school,
       | I found a piece of (Windows) software that pelted you with a
       | series of pairs of an infinitive and a tense, and you had to
       | conjugate the infinitive accordingly. (Spanish conjugation
       | typically changes the end of the word; irregular verbs tend to
       | involve stem changes). It was fantastic practice and really
       | ingrained the rules; I became a whiz at it.
       | 
       | When I started learning Russian, the declensions (like the ones
       | mentioned in the article) really threw me for a loop. I looked
       | all over for a similar app to explain the patterns and drill rote
       | practice, but never found one.
       | 
       | While slightly off-topic, does anyone know of such an app (web-
       | based or macOS/iOS)?
        
         | leobg wrote:
         | https://memrussian.com/?
        
           | netsharc wrote:
           | Grandfather talks about classical Windows software. On the
           | Play Store this app says "Contains ads - In-app purchases".
           | 
           | Ah, as a cheap bastard, I hate how software was pay once back
           | then, and for this one I'm just going to ask you what's the
           | monthly subscription price?
        
             | mpascale00 wrote:
             | This comes up in so many threads here... How can we change
             | the culture of subscriptions back to pay once???
        
               | necovek wrote:
               | It's not really about the culture anymore. Software that
               | requires maintenance -- and most does -- has a continuous
               | development cost. As such, subscription is the most
               | natural way to cover it.
               | 
               | On the other hand, we have software which has low
               | maintenance cost, but sold for peanuts ($0-$10) in small
               | quantities, so authors try to introduce alternative
               | revenue streams.
               | 
               | As in, it's fair to pay continuously (subscription) for
               | continuous work (maintenance), so I don't expect that to
               | go away. Ads, though, yuck...
        
               | sneak wrote:
               | Software sold today does not require maintenance.
               | Software to work in the future requires maintenance. I am
               | not buying future software. I am buying today software.
               | 
               | Increasingly I am not buying software at all.
        
               | perching_aix wrote:
               | This is a good argument in favor of subscriptions not
               | being mandatory, but not in favor of the abolishment of
               | subscriptions overall, which is what they were talking
               | about.
        
               | 3036e4 wrote:
               | That is the old way. You bought some application and it
               | came with upgrades until next major version release or
               | similar. Then when that release came out you could decide
               | to pay again or just keep using the old (now unsupported)
               | version you already paid for.
               | 
               | That solved all the issues with paying for maintenance,
               | but sadly someone must have figured out a mandatory
               | subscription was a better way to make more money.
        
               | sgarland wrote:
               | On the contrary, software today is so absurdly buggy that
               | it often does require maintenance to work.
        
               | charcircuit wrote:
               | Even ignoring security, bug fixes, new features, etc it
               | is also not fair that you can get value from the app
               | every month, but the developer doesn't get to capture a
               | reward for any of this value. Having people pay monthly
               | for value they get monthly seems reasonable.
        
               | BenjiWiebe wrote:
               | Does that mean you'd be in favor of subscriptions for
               | owning a vehicle, rather than paying outright? Or a
               | house?
               | 
               | The manufacturer/builder gets paid once, and you get
               | value monthly.
        
               | charcircuit wrote:
               | Leasing cars and renting houses is already a common
               | practice. So yes I believe these make sense to exist.
               | 
               | The existence of purchasing cars and houses with no
               | ongoing cost to the builder is due to competition.
        
               | mpascale00 wrote:
               | I disagree. You can read a book or listen to a record,
               | watch a dvd, unlimited times, having fairly paid upfront
               | a price for the item. A computer is general purpose and
               | lets you check your email every day, hell even lets you
               | create new value in the form of new software, without the
               | manufacturer receiving a royalty.
               | 
               | The idea of _capturing reward_ post-receipt is
               | feudalistic.
        
               | charcircuit wrote:
               | The existence of products in competitive markets is not a
               | counter example to what my point was. I recommend looking
               | at the terms bottom up pricing and top down pricing. The
               | former is about creating a price based off of how much it
               | costs to do business and then adding a profit margin. The
               | latter is creating price in line with how much value it
               | offers customers. The existence of products using bottom
               | up pricing doesn't mean top down pricing does not exist.
        
               | necovek wrote:
               | That's not how markets work (and I disagree that it would
               | be reasonable).
               | 
               | Price is usually established based on how much something
               | cost to make (materials, effort, profit), combined with
               | market conditions (abundance/shortage of products,
               | surplus cash/tough economy...).
               | 
               | If you want to continuously extract profit from
               | consistent use of a hammer or vacuum cleaner, somebody
               | else will trivially make a competing product at a lower
               | price with no subscription.
        
               | sgarland wrote:
               | Given how profitable it is, I doubt it'll be changed.
               | 
               | That said, I very much like Codeweavers' approach [0],
               | which IMO is the modern equivalent to purchasing software
               | on a physical medium: you buy it, you can re-download it
               | as many times as you'd like, install it on as many
               | machines as you'd like (single-user usage only), and you
               | get 1 year of updates and support. After that, you can
               | still keep using it indefinitely, but you don't get
               | updates or paid support. You get a discount if you renew
               | before expiry. They also have a lifetime option which, so
               | far, they've not indicated they're going to change.
               | 
               | I have no affiliation with them, I just think it's a good
               | product, and a good licensing / sales model.
               | 
               | [0]: https://www.codeweavers.com/store
        
               | mpascale00 wrote:
               | Profitable for sure, but I'm often half surprised by the
               | lack of competition against subscription-based everything
               | these days.
        
             | nsksl wrote:
             | Find a pirate version if possible...
        
             | GuB-42 wrote:
             | I don't know about this app but many of the "Contains ads -
             | In-app purchases" apps offer to remove the ads for a one-
             | time payment.
        
         | yorwba wrote:
         | You might be able to build something similar yourself using
         | declension data extracted from Wiktionary using wiktextract:
         | https://github.com/tatuylonen/wiktextract#pre-extracted-data
        
         | jeffwass wrote:
         | When I was learning Spanish (on my own) 25 years ago I had a
         | Spanish/English dictionary. It only translated verbs to Spanish
         | infinitive, but each had a numerical index mapping it to a
         | class of verbs with the same conjugation pattern.
         | 
         | There was a section at the front of the dictionary with full
         | conjugation patterns over all tenses for one sample verb in
         | each class.
         | 
         | Eg, each type of stem-changing verb fell into one index, full
         | irregulars were singletons in their own class, some irregulars
         | that behave similarly (iirc tener and detener) shared one
         | class.
         | 
         | So all verbs in Spanish fell neatly into a few dozen unique
         | patterns, and the indexing was already done.
         | 
         | I was going to build a quiz software just like you mentioned to
         | conjugate any verb in any tense, but "never got around to it".
         | 
         | I wonder how the reverse-string trie pattern in the article
         | would be for reconstructing the class mapping.
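
A minimal sketch of how a reversed-string trie could recover a class
mapping like the one described above. The class labels ("tener-like",
"regular -ar", "regular -er") and the helper names are hypothetical, not
taken from any dictionary or from the article's library. Words are inserted
back to front, each node records whether everything below it agrees on one
class, and an unknown word is guessed from the longest ending it shares
with known words.

```javascript
// Build a trie keyed on reversed words, so shared endings share a path.
function buildSuffixTrie(entries) {
  const root = { children: new Map(), value: undefined, guess: null };
  for (const [word, value] of entries) {
    let node = root;
    for (const ch of [...word.toLowerCase()].reverse()) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), value: undefined, guess: null });
      }
      node = node.children.get(ch);
    }
    node.value = value;
  }
  annotate(root);
  return root;
}

// Set node.guess to the single class shared by every word at or below the
// node, or null when the words below disagree.
function annotate(node) {
  const seen = new Set();
  if (node.value !== undefined) seen.add(node.value);
  for (const child of node.children.values()) {
    for (const v of annotate(child)) seen.add(v);
  }
  node.guess = seen.size === 1 ? [...seen][0] : null;
  return seen;
}

// Walk the word from its last letter. Known words return their class.
// Unknown words return the guess at the deepest node reached, or null in
// strict mode (never guess).
function lookup(root, word, options = {}) {
  const chars = [...word.toLowerCase()].reverse();
  let node = root;
  let consumed = 0;
  for (const ch of chars) {
    const next = node.children.get(ch);
    if (!next) break;
    node = next;
    consumed++;
  }
  if (consumed === chars.length && node.value !== undefined) return node.value;
  return options.strict === true ? null : node.guess;
}

// Hypothetical class labels, for illustration only.
const trie = buildSuffixTrie([
  ["tener", "tener-like"],
  ["detener", "tener-like"],
  ["hablar", "regular -ar"],
  ["cantar", "regular -ar"],
  ["comer", "regular -er"],
]);
```

With this toy data, `lookup(trie, "obtener")` falls back to the class
shared by the "tener" ending, while `lookup(trie, "beber")` returns null
because the words ending in "er" disagree. The guess is only as good as
the data: with too few examples, a short shared ending can look more
uniform than it really is.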
        
         | kashunstva wrote:
         | > ... learning Russian... explain the patterns... such an app
         | 
         | Non-native Russian speaker here. In the past, I cobbled
         | together some scripts that use the spaCy Python module with the
         | larger of the two Russian modules to provide context-aware
         | lemmatization and grammatical tag extraction.
         | 
         | On the whole, though, my biggest gains in Russian were in
         | letting go of the need to analytically deconstruct the
         | inflections and instead build up a mental library of patterns
         | (and exceptions) in my head through use.
         | 
         | EDIT: I mean context within a sentence, not a broader meaning.
        
         | Rendello wrote:
         | There's some Anki (flashcard) decks that use the "KOFI" method:
         | 
         | > KOFI (Konjugation First) is the name I've given to a
         | provocative language-learning approach I've created: to learn
         | all the forms of a language's conjugation before even starting
         | to formally study the language
         | 
         | I used the French one, years after I learned French, because my
         | conjugation was abysmal. You can get by using basic tenses or
         | wrong tenses, and people will understand you, but it's not what
         | you want. The KOFI method is supposed to teach you all the
         | conjugation patterns in a matter of months _before_ learning
         | the language. I'd like to give it a try in earnest some day
         | for a new language. My interest in French has waned so I didn't
         | stick with it.
         | 
         | https://ankiweb.net/shared/info/1131659186
        
         | gametorch wrote:
         | I used Clozemaster effectively to learn Russian. It's not
         | exactly what you describe, but you can fly through many
         | "clozes" to ingrain the patterns into your brain.
        
         | jdcarr wrote:
         | I use ConjuGato on iOS for practicing Spanish conjugations.
         | There's a game mode where you're given an
         | infinitive/tense/person and think of the conjugation and you
         | can filter it down to solely irregular verbs to learn the
         | exceptions
        
         | LoneGeek wrote:
         | If you can read Russian, there is a Python app for
         | morphological analysis called pymorphy3. Documentation:
         | https://pymorphy2.readthedocs.io/en/stable/.
         | 
         | It is based on an OpenCorpora dictionary:
         | https://opencorpora.org/dict.php
         | 
         | This dictionary is based on a Zaliznyak dictionary, which is
         | always referenced in Wiktionary's articles.
        
       | lifthrasiir wrote:
       | A possible alternative, especially for beygla/strict, would be
       | perfect hashing.
        
         | Scaevolus wrote:
         | You can compress even better than standard perfect hashing
         | because not all values are unique, so collisions might allow
         | you to store multiple name -> suffix combos in the same bucket.
         | 
         | Of course, that would mean you lose the ability to say "name
         | not handled".
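
A rough sketch of the collision-tolerant idea, under stated assumptions.
This is not a real perfect-hash construction: it brute-forces a seed for a
seeded FNV-1a hash until every bucket holds names that agree on one
pattern. The function names are hypothetical. "Sigurdur" (ASCII
transliteration) and its pattern are an added example, and ",,," for Alex
is assumed to mean "not declined".

```javascript
// Seeded 32-bit FNV-1a string hash.
function hash(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Search for a seed where names sharing a bucket also share a pattern.
// The table stores patterns only, never the names themselves.
function buildTable(entries, size, maxSeed = 10000) {
  for (let seed = 0; seed < maxSeed; seed++) {
    const table = new Array(size).fill(null);
    let ok = true;
    for (const [name, pattern] of entries) {
      const slot = hash(name, seed) % size;
      if (table[slot] !== null && table[slot] !== pattern) {
        ok = false; // conflicting patterns in one bucket, try the next seed
        break;
      }
      table[slot] = pattern;
    }
    if (ok) return { seed, size, table };
  }
  return null; // no suitable seed found within maxSeed attempts
}

function lookupPattern(t, name) {
  return t.table[hash(name, t.seed) % t.size];
}

// Four names, three distinct patterns, three buckets.
const entries = [
  ["Astvaldur", "ur,,i,ar"],
  ["Sigurdur", "ur,,i,ar"], // assumed to share Astvaldur's pattern
  ["Baldur", "ur,ur,ri,urs"],
  ["Alex", ",,,"], // assumed: not declined
];
const table = buildTable(entries, 3);
```

Because the names are not stored, `lookupPattern(table, "AnythingElse")`
still returns some bucket's pattern, which is the loss of "name not
handled" mentioned above.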
        
       | kmmbvnr_ wrote:
       | Doesn't that look like an interesting approach for highly
       | optimized embeddings?
        
       | robin_reala wrote:
       | No idea if Rails copes with this automatically, but it feels like
       | the sort of magic it's historically been really good at. I
       | remember reading the source code for `pluralise` and finding that
       | someone had encoded the pluralisation rules including irregular
       | cases for Welsh.
        
         | Alifatisk wrote:
         | Love Rails, there is a method for everything
        
       | dmurray wrote:
       | For the 800 names that were missing declension data in the
       | database, it seems like the most straightforward thing to do
       | would be to assign their declensions by hand. It shouldn't take a
       | native speaker more than a couple of hours (if some name they
       | haven't seen before is ambiguous, then whatever they guess at
       | least won't sound obviously wrong to other native speakers).
       | Alternatively, it would be very cheap to ask an LLM to do it.
       | 
       | Encoding them into a trie like this would still be a good way to
       | distribute the result, but you don't have to rely on the trie
       | also being a good way to guess the declensions.
        
         | perching_aix wrote:
         | Yeah, that'd be a good idea. That said, it still wouldn't
         | resolve the issue for names that are in-use despite not being
         | approved (or foreign names).
         | 
         | I also live in a country with a centrally governed personal
         | name list, but you can request exceptions, and there are people
         | who were born before the list existed, so their names won't
         | necessarily be on the list either. Immigrants can also retain
         | their names during naturalization I believe, and there can be
         | lots of other complications still. So the ability to sorta-
         | kinda predict the proper declension is still useful.
        
           | thaumasiotes wrote:
           | Related:
           | https://en.wikipedia.org/wiki/Naming_laws_in_China#Ma_Cheng
        
         | wizzwizz4 wrote:
         | I see no reason that an LLM should be better at guessing than a
         | trie (unless the actual example was in its training data, in
         | which case a web search would be more appropriate).
        
           | dmurray wrote:
           | I agree. I just like having the guessing done at compile time
           | on principle. It allows you to change a guess, if you find
           | that it's wrong, and convince yourself that you haven't
           | broken any of the other cases where you were previously
           | accidentally right.
        
         | esafak wrote:
         | I wonder if existing LLMs already know these patterns?
        
           | jer0me wrote:
           | The Icelandic government has been proactive about helping
           | OpenAI train its models on the language to stave off
           | extinction: https://openai.com/index/government-of-iceland/
        
             | xigoi wrote:
             | If they'd rather support open-source models so the future
             | of the language is not in the hands of a single foreign
             | corporation...
        
           | thaumasiotes wrote:
           | Yes, this is an example of a problem that an LLM is ideally
           | suited to solve.
        
         | alexharri wrote:
         | It would be good to cover more names for sure -- that's an
         | ongoing process at DIM. Names are frequently added to the
         | approved list of Icelandic names, so there's always going to be
         | some lag.
         | 
         | I would not be confident enough to add the data myself
         | since I'd probably be wrong a lot of the time. When reviewing
         | the results for the top 100 unknown names I frequently got
         | results that I thought _might_ be wrong, but I wasn't sure. For
         | those, I looked up similar names in DIM to verify, and often
         | thought "huh, I would not have declined those names like this".
         | For that reason, I rely on the DIM data as the source of truth
         | since it's maintained by experts on the language.
        
       | alucardo wrote:
       | Hmm, is this lib GDPR compliant?
        
         | detaro wrote:
         | Why wouldn't it be?
        
         | bot403 wrote:
         | If this isn't compliant then neither are name day calendars or
         | baby name websites.
         | 
         | It's not a privacy issue if it's just "someone's" name.
        
         | kiicia wrote:
         | GDPR is about accountability for handling identifiers like full
         | name of actual person. Using parts of names, where each part
         | does not identify any particular person, in generalized list
         | like described here does not fall under GDPR.
        
         | shagie wrote:
         | There are a relatively finite number of Icelandic names.
         | https://en.wikipedia.org/wiki/Icelandic_Naming_Committee
         | 
         | > A name not already on the official list of approved names
         | must be submitted to the naming committee for approval. A new
         | name is considered for its compatibility with Icelandic
         | tradition and for the likelihood that it might cause the bearer
         | embarrassment. Under Article 5 of the Personal Names Act, names
         | must be compatible with Icelandic grammar (in which all nouns,
         | including proper names, have grammatical gender and change
         | their forms in an orderly fashion according to the language's
         | case system).
         | 
         | A database of those names is no more interesting or personal
         | than a dictionary or list of names (
         | https://www.insee.fr/en/statistiques/6536067 ) in another
         | language... which is where they got the data.
         | 
         | > Iceland has a publicly run institution, Arnastofnun, that
         | manages the Database of Icelandic Morphology (DIM). The
         | database was created, amongst other reasons, to support
         | Icelandic language technology.
         | 
         | https://bin.arnastofnun.is/DMII/aboutDMII/
         | 
         | There is no more personal information being presented than
         | saying John or providing
         | https://en.wikipedia.org/wiki/John_(given_name) or
         | https://www.wolframalpha.com/input?i=John
         | 
         | John may be _your_ given name, but that data isn't personal
         | data. One of the numbers 1969, 1978, 1987, 1996 might be your
         | birth year... but https://oeis.org/A101039 isn't personal
         | information either. Combining John with Smith and 1978 as the
         | year of someone's birth... now you've got personal information
         | that would be covered by the GDPR.
        
           | ralferoo wrote:
           | That's not quite what qualifies it as PII.
           | 
           | > John may be your given name, but that data isn't personal
           | data. One of the numbers 1969, 1978, 1987, 1996 might be your
           | birth year... but https://oeis.org/A101039 isn't personal
           | information either. Combining John with Smith and 1978 as the
           | year of someone's birth... now you've got personal
           | information that would be covered by the GDPR.
           | 
           | Just the facts "John" or "Smith" or "1978" aren't PII, but
           | any single one attached to some other data is, because then
           | that provides partial identification of that other data. So,
           | for instance an attribution of a forum post to "John" is PII,
           | even if there are thousands of other Johns using the system.
           | 
           | Actually, even that's not necessarily true. The mere fact
           | that you are acknowledging a user exists with that name may
           | make it PII. It's not a big deal to say our usernames include
           | "John", "Mark", etc if there are literally thousands of them,
           | but it's a big deal if one of the usernames is an incredibly
           | rare name or spelling. In this case, the list presented in
           | the article isn't PII, because the list is just a list of
           | names downloaded from a government site that represent
           | possible acceptable names. Just having that list provides no
           | information about whether anyone with any of those names is
           | using your service.
        
       | radpanda wrote:
       | > There are, in fact, 88 approved Icelandic names with this exact
       | pattern of declension, and they all end with "ður", "tur" or
       | "dur".
       | 
       | ...
       | 
       | > But that quickly breaks down. There are other names ending with
       | "ður" or "dur" that follow a different pattern of declension
       | 
       | My "everything should be completely orderly" comp-sci brain is
       | always triggered by these almost trivial problems that end up
       | being much more interesting.
       | 
       | Is the suffix pattern based on the pronunciation of the
       | syllable(s) before the suffix? If one wanted to improve upon your
       | work for unknown names, rather than consider the letters used,
       | would you have to do some NLP on the name to get a representation
       | of the pronunciation and look that up (in a trie or otherwise)?
        
         | dmit wrote:
         | > Is the suffix pattern based on the pronunciation of the
         | syllable(s) before the suffix?
         | 
         | Careful, this is how you fall down the Are Dependent Types The
         | Answer?? hole.
        
           | perching_aix wrote:
           | Not sure what that's supposed to mean, but if Icelandic is
           | anything like my native language in this, then it _is_ indeed
           | a pronunciation based thing. Which should make sense, since
           | languages are (historically) spoken first, written second.
        
             | dmit wrote:
             | Heheh, it was mostly a reference to my [and mostly
             | others'!] experiments with encoding human languages in a
             | programming language. There are some pretty neat ideas
             | there to explore, like the difference between Subject-
             | Object-Verb (SOV) and Object-Subject-Verb. Or postfix
             | languages (e.g. Forth) mapping to some human languages.
             | 
             | In this particular example, having a subsequent part of an
             | expression rely on prior parts would usually be
             | accomplished at runtime in most languages. But some (like
             | Idris) might allow you to encode the rules in the type
             | system. Thus the rabbit hole.
        
               | perching_aix wrote:
               | Ah okay. That's a journey I'm currently also preparing to
               | embark on, though from the other direction: I'm trying to
               | generate "natural" language from program code. I already
               | know it's pretty hopeless, but increasingly I feel like
               | it's not really a choice anyhow, so I may as well finally
               | have a go at it. Let's see :)
        
               | dmit wrote:
               | Godspeed!
        
         | alexharri wrote:
         | Hmm, good idea. There are names that have the exact same
         | pronunciation yet have different patterns of declension, for
         | example:
         | 
         | - Astvaldur -> ur,,i,ar
         | - Baldur -> ur,ur,ri,urs
         | 
         | The "aldur" ending is pronounced in the exact same manner, but
         | applying the declension pattern of "Astvaldur" to "Baldur"
         | would yield:
         | 
         | - Baldur
         | - Bald
         | - Baldi
         | - Baldar
         | 
         | The three last forms feel very wrong (I asked my partner to
         | verify and she cringed).
         | 
         | Spoken Icelandic is surprisingly close to its written form. I
         | wouldn't expect very different results for the trie if a
         | "phonetic" version of names and their endings were used instead
         | of their written forms
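
The pattern notation in this comment can be worked through in a few lines.
This is a sketch that assumes the first entry of a pattern is the
nominative ending to strip and that the four entries are the endings for
the four cases in order. `applyPattern` is a hypothetical helper, not the
library's API.

```javascript
// Apply a comma-separated declension pattern such as "ur,,i,ar" to a name.
// Assumption: the first entry is the nominative ending, which is removed
// to get the root; each entry is then appended to that root in turn.
function applyPattern(name, pattern) {
  const suffixes = pattern.split(",");
  const nominativeSuffix = suffixes[0];
  if (!name.endsWith(nominativeSuffix)) {
    throw new Error(`"${name}" does not end with "${nominativeSuffix}"`);
  }
  const root = name.slice(0, name.length - nominativeSuffix.length);
  return suffixes.map((suffix) => root + suffix);
}

// Astvaldur's pattern applied to Baldur: the forms described as wrong.
const wrongForms = applyPattern("Baldur", "ur,,i,ar");
// Baldur's own pattern.
const rightForms = applyPattern("Baldur", "ur,ur,ri,urs");
```

Here `wrongForms` is Baldur, Bald, Baldi, Baldar, matching the list above,
and `rightForms` is Baldur, Baldur, Baldri, Baldurs.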
        
       | sneak wrote:
       | This seems complicated.
       | 
       | Why not just reuse the existing standard and change everyone's
       | last names to Kim, Lee, or Park?
        
         | dmit wrote:
         | > everyone's last names
         | 
         | *surnames. Not last in that case, whatever the case is you're
         | trying to make.
        
       | yujzgzc wrote:
       | Valiant effort at old-school engineering applied to a niche
       | problem. (Iceland has a population of only around 400,000
       | people!) As much as I love the geekery of this stuff though,
       | isn't it already a better ROI to get an LLM to generate the
       | strings you need? It has its own other problems (not claiming
       | it'll be perfect) but for something so language related, it makes
       | a lot of sense. Would also work for other languages that have the
       | same problem with declension of proper nouns like Russian or
       | Finnish.
        
         | tomsmeding wrote:
         | The article describes that a government body is using this
         | library to generate indictments. In that situation, you do
         | _not_ want something that is mostly usually correct. Indeed,
         | they asked the author for a strict version that does not try to
         | guess the declension of unknown names based on their suffix,
         | presumably so that they can just not decline them, which is
         | better than picking the wrong declension 0.05% of the time.
        
       | silvestrov wrote:
       | One more optimization idea: instead of the trie mapping to the
       | suffix string directly, make an array of unique suffixes and
       | let the trie map to the index into the array, e.g.
       | 
       |     const suffixes = [",,,", "a,u,u,u", ",,i,s", ",,,s", "i,a,a,a", ...];
       | 
       | and then use the index of this list in the
       | 
       |     var serializedInput = "{e:{n:{ein:0_r: ...
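
A small sketch of the interning step on its own, assuming a flat
name -> suffix map rather than the article's trie. `internSuffixes` and
the sample data are hypothetical: "Sigurdur" and its pattern are an added
example, and ",,," is assumed to mean "not declined".

```javascript
// Replace each suffix string with an index into a deduplicated array.
function internSuffixes(nameToSuffix) {
  const suffixes = []; // unique suffix strings, in first-seen order
  const indexOf = new Map(); // suffix string -> position in `suffixes`
  const nameToIndex = {};
  for (const [name, suffix] of Object.entries(nameToSuffix)) {
    if (!indexOf.has(suffix)) {
      indexOf.set(suffix, suffixes.length);
      suffixes.push(suffix);
    }
    nameToIndex[name] = indexOf.get(suffix);
  }
  return { suffixes, nameToIndex };
}

const interned = internSuffixes({
  Astvaldur: "ur,,i,ar",
  Sigurdur: "ur,,i,ar", // assumed to share Astvaldur's pattern
  Baldur: "ur,ur,ri,urs",
  Alex: ",,,", // assumed: not declined
});

// Resolve a name back to its suffix string through the index.
const baldurSuffix = interned.suffixes[interned.nameToIndex.Baldur];
```

The saving comes from writing a repeated suffix once. A general-purpose
compressor already exploits that kind of repetition, so the uncompressed
and compressed sizes may not move in the same direction.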
        
         | KTibow wrote:
         | I (Claude Code) tried this and it actually increased the
         | gzipped size by 100b (3456 -> 3556), only reducing the non-
         | compressed size by 20%, likely because gzip is really good at
         | interning repeated patterns already.
        
         | contravariant wrote:
         | You could go a step further by putting the suffixes themselves
         | into the trie and then identifying identical subtrees.
         | 
         | If you can use gzip there's bound to be a clever way of using a
         | suffix array as well, that might end up being better unless you
         | can use an optimised binary format for the tree.
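
A sketch of the "identify identical subtrees" step only, assuming the trie
is a plain nested object whose leaves are primitive values. `dedupe` is a
hypothetical helper. It shares structurally equal subtrees in memory, and
a serializer that can write references would be needed to turn that
sharing into smaller output.

```javascript
// Return a copy of the trie in which structurally identical subtrees are
// the same object, turning the tree into a directed acyclic graph.
function dedupe(node, cache = new Map()) {
  if (typeof node !== "object" || node === null) return node; // leaf value
  const out = {};
  for (const key of Object.keys(node).sort()) {
    out[key] = dedupe(node[key], cache);
  }
  // Canonical signature: children are built in sorted key order above.
  const signature = JSON.stringify(out);
  if (cache.has(signature)) return cache.get(signature);
  cache.set(signature, out);
  return out;
}

// Toy trie: the subtrees under "a" and "b" are identical.
const shared = dedupe({
  a: { x: 1, y: 2 },
  b: { y: 2, x: 1 },
  c: { x: 1 },
});
```

After the call, `shared.a` and `shared.b` refer to one object while
`shared.c` stays separate. Re-serializing every node makes this quadratic
in the worst case, which is acceptable for a sketch but not for a large
trie.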
        
       | ryanjshaw wrote:
       | An interesting article but I was surprised there was no
       | discussion about what humans do to address this problem?
        
         | Zanfa wrote:
         | They stick with the nominative case. That's the only safe way
         | not to butcher somebody's name in a language like Estonian that
         | has 14 cases. It's infinitely easier to update copy to use only
         | nominative than try to apply the cases automatically.
        
         | alexharri wrote:
         | As a native Icelandic speaker, I have an intuition for how to
         | decline names -- I don't really think about it consciously. I'd
         | assume that for most people it's just pattern matching.
         | 
         | Native speakers very frequently decline names in ways that are
         | not technically perfect but sound correct enough. For example,
         | my name (Alex) should not be declined, but people frequently
         | use the declension pattern (Alex, Alex, Alexi, Alexar).
         | 
         | There's some parallel to be drawn with how the compressed trie
         | applies patterns that it's learned to names. That's at least
         | how I thought about it when designing the library.
        
       | mikepurvis wrote:
       | I'm surprised there'd be a benefit to doing this in the JS vs
       | having your database just return all the cases with the name and
       | then you select which one you need at display time -- basically
       | in the same layer that's populating your localized language
       | templates.
       | 
       | That said I'm curious how this manifests with cross-language
       | situations. I guess the Icelandic UI displaying French names
       | would just always use the nominative case, and likewise for the
       | English UI displaying Icelandic names? I assume this all mostly
       | matters where the user is directly being addressed, or perhaps in
       | an admin panel ("user x responded to user y").
        
       | tempodox wrote:
       | Is Icelandic name declension deterministic enough that this
       | method reliably works? That would be a lucky break. Language is
       | typically quite messy.
        
         | nkrisc wrote:
         | It probably helps that Iceland has a relatively small
         | population and the language is actively managed by the
         | government.
        
       | ralferoo wrote:
       | I mean, it's an interesting problem for Icelandic sites, but
       | because he's explaining the basic concepts of how declensions
       | work, it seems like he's aiming this at non-Icelandic developers.
       | If they were to use this, no doubt it'll end up butchering names
       | in some other language and lead to all manner of hard to track
       | down bugs.
       | 
       | For example, if an English person called Arthur uses the site in
       | Icelandic, I'm not sure they'd expect their name to be changed to
       | presumably "Arth", "Arthi" or "Arthar" even if they were a keen
       | learner of Icelandic. Their name is their name. So, as well as
       | storing someone's name, you also have to ask them what language
       | their name is, or guess and get it wrong. At that point, you
       | might as well just ask them for all the different forms for the
       | name as well, and then you don't have to worry about whether
       | their name is on an approved list or not.
       | 
       | And if the website isn't localised into Icelandic, I've also got
       | to wonder if Icelandic visitors would have an expectation of
       | Icelandic grammar rules being applied to English (or whatever)
       | text. Most Icelandic people I've spoken to before have an
       | excellent command of English anyway, and I'm sure they'd
       | understand why their name isn't changing form in English.
        
         | pelorat wrote:
         | Not sure how it is nowadays, but Iceland used to force anyone
         | immigrating to officially change or "icelandify" their names.
         | 
         | So if your name was Arthur, and you wanted to emigrate to
         | Iceland, you would have to change your name.
         | 
         | Might still be like this.
        
       | SonOfLilit wrote:
       | My brain is screaming that there has to be a solution in <1kb
       | uncompressed (for the non-strict version).
       | 
       | Maybe generating a minimal list of regexes that classifies 100%
       | of names correctly? Maybe a big enough bloom filter? Maybe like a
       | bloom filter but instead of hashes we use engineered features?
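
For reference, a minimal Bloom filter sketch, sized at 8192 bits (1 kB) to
match the budget mentioned above. The class, its parameters and the sample
names are assumptions for illustration. It only answers "possibly known"
or "definitely unknown", so mapping names to declension patterns would
need something more, for example one filter per pattern.

```javascript
// Minimal Bloom filter: a bit array plus k seeded hashes.
// False positives are possible, false negatives are not.
class BloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // Seeded 32-bit FNV-1a string hash, reduced to a bit position.
  position(str, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h % this.bitCount;
  }

  add(str) {
    for (let seed = 0; seed < this.hashCount; seed++) {
      const pos = this.position(str, seed);
      this.bits[pos >> 3] |= 1 << (pos & 7);
    }
  }

  // True means "possibly added", false means "definitely not added".
  has(str) {
    for (let seed = 0; seed < this.hashCount; seed++) {
      const pos = this.position(str, seed);
      if ((this.bits[pos >> 3] & (1 << (pos & 7))) === 0) return false;
    }
    return true;
  }
}

const knownNames = new BloomFilter(8192, 3);
for (const name of ["Astvaldur", "Baldur", "Alex"]) knownNames.add(name);
```

Every added name is reported as present. An unknown name is usually
rejected, but it can slip through as a false positive, so this would not
suit a strict mode that must never guess.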
        
       ___________________________________________________________________
       (page generated 2025-08-02 23:00 UTC)