[HN Gopher] Learning how dictionaries work
       ___________________________________________________________________
        
       Learning how dictionaries work
        
       Author : dadt
       Score  : 29 points
       Date   : 2021-11-10 20:12 UTC (1 days ago)
        
 (HTM) web link (www.arrantpedantry.com)
 (TXT) w3m dump (www.arrantpedantry.com)
        
       | kazinator wrote:
       | How dictionaries work is this:
       | 
       | - words are created, or existing words bent, mainly by non-users
       | of dictionaries.
       | 
       | - dictionaries cautiously track this process for the benefit of
       | dictionary users.
        
         | kazinator wrote:
         | If you wish to be conservative in the area of words, the
         | easiest thing to do is to keep a dictionary that is some 30-40
         | years old, and use nothing that isn't in it, unless it's a
         | genuinely new concept for which the old dictionary offers no
         | word.
         | 
         | The nice thing about dictionaries is that nobody is forcing you
         | to throw out your old ones.
        
       | beervirus wrote:
       | The article maybe just hints at another issue that's at least as
       | important. Dictionaries are mostly descriptivist rather than
       | prescriptivist.
       | 
       | Adding "deplatform" or whatever to the dictionary isn't a
       | statement about how proper it is as a word. It's a statement that
       | it's a word in use, and so if you come across it, here's what it
       | means.
        
         | solarmist wrote:
         | Yes, which leads to the other end of the cycle. They are
         | removing obsolete words for the same reason.
        
       | boffinAudio wrote:
       | I've always wanted a program or set of tools that I can use to
       | generate my own dictionaries - not just a glossary, but real
       | dictionaries with multiple definitions for words, derivations,
       | example sentences, and so on.
       | 
       | And one other very important feature I'd need would be to be
       | able to run the dictionary _on itself_, to find words that are
       | used in the descriptions for which there isn't yet a definition.
       | 
       | This would be amazing, for example, to run on a large corpus,
       | generate the dictionary, and then run it again to find words that
       | are used but not defined - not just in the original corpus but in
       | the definitions too.
       | 
       | I've yet to find anything like this and have managed, over the
       | years, to do this with some cobbled-together sed and awk hacks ..
       | but I still think this is something that would be quite a viable
       | commercial product - especially useful for international
       | translations and creating properly-defined glossaries for
       | documentation, etc.
       | 
       | Anyone know of such tools? I'd love to have my own Dictionary
       | builder, proper, and stop fantasizing about turning sed and awk
       | scripts into a proper app ..
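
The "run the dictionary on itself" check described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the dictionary is assumed to be a plain mapping from lowercase headword to a list of definition strings, and `toy_dictionary`, `tokenize`, and `undefined_words` are hypothetical names, not part of any tool mentioned in this thread.

```python
import re

# Hypothetical toy dictionary: lowercase headword -> list of definitions.
toy_dictionary = {
    "dictionary": ["a book that lists words and explains their meaning"],
    "word": ["a unit of language that carries meaning"],
    "book": ["a set of printed pages bound together"],
}

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def undefined_words(dictionary, corpus=""):
    """Return words used in the corpus or in any definition that have no entry."""
    used = set(tokenize(corpus))
    for definitions in dictionary.values():
        for definition in definitions:
            used.update(tokenize(definition))
    return sorted(used - set(dictionary))

missing = undefined_words(toy_dictionary)
print(missing)
```

Because there is no stemming or lemmatization here, inflected forms such as "words" are reported as undefined even though "word" has an entry; a real tool would probably need to normalize forms first.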
        
         | knadh wrote:
         | https://alar.ink (Kannada-English dictionary) was built using
         | https://github.com/knadh/dictmaker
         | 
         | It addresses some of the things you've mentioned.
        
         | gajomi wrote:
         | There are a couple starting points you could take. I spent a
         | weekend hacking out a program that generates fake
         | word/definition pairs with a transformer model set against a
         | dictionary: https://youtu.be/XnJ2TKAn-Vk?t=1547. If you
         | substitute fake words for real words and have a sufficiently
         | accurate model you could quickly generate reasonable and novel
         | definitions.
         | 
         | There are more complete versions of this kind of thing publicly
         | available:
         | https://github.com/turtlesoupy/this-word-does-not-exist
         | 
         | > This would be amazing, for example, to run on a large corpus,
         | generate the dictionary, and then run it again to find words
         | that are used but not defined - not just in the original corpus
         | but in the definitions too.
         | 
         | I think this would be how you would gauge success of the model.
         | That is to say, you would evaluate model accuracy on a set of
         | held-out words with definitions that never appeared in your
         | dictionary training set but appeared in context in your corpus.
         | You would have to manually annotate whether or not the
         | generated definition of these held-out words was acceptable.
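
The held-out evaluation described above might look roughly like this in Python. It is a hedged sketch, not the commenter's actual setup: `split_held_out` and `acceptance_rate` are hypothetical helpers, the dictionary is assumed to map headword to definition, and the manual annotations are assumed to be a mapping from held-out word to True/False.

```python
import random

def split_held_out(dictionary, corpus_words, fraction=0.2, seed=0):
    """Hold out a fraction of the headwords that also occur in the corpus."""
    candidates = sorted(w for w in dictionary if w in corpus_words)
    rng = random.Random(seed)
    rng.shuffle(candidates)
    n_held = max(1, int(len(candidates) * fraction)) if candidates else 0
    held_out = set(candidates[:n_held])
    # Training entries are everything that was not held out.
    train = {w: d for w, d in dictionary.items() if w not in held_out}
    return train, held_out

def acceptance_rate(annotations):
    """Fraction of generated definitions a human annotator marked acceptable."""
    if not annotations:
        return 0.0
    return sum(annotations.values()) / len(annotations)

# Toy data, made up for illustration.
entries = {"cat": "a small pet", "dog": "a loyal pet", "bird": "it flies",
           "fish": "it swims", "tree": "a tall plant"}
corpus_vocab = {"cat", "dog", "bird", "fish"}
train, held_out = split_held_out(entries, corpus_vocab, fraction=0.5)
```

The model would be trained on `train` only, asked to define each word in `held_out` from its corpus contexts, and scored with `acceptance_rate` once a person has judged each generated definition.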
        
         | solarmist wrote:
         | The closest thing linguists have come up with, to my knowledge,
         | is wordnets.
         | 
         | They build definitions from the words directly and indirectly
         | associated with them. You could then use those clusters to
         | create individual definitions in some way.
         | 
         | Have you looked at them?
         | 
         | I think the biggest problem is that definitions are semantic,
         | which computers are terrible at. We're in the infancy of being
         | able to do that with transformers and large language models.
         | So I'd start looking around for the proto-tools that will lead
         | to what you're thinking about.
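
As a rough illustration of the "directly and indirectly associated" idea behind wordnets, here is a toy Python sketch. The `relations` graph is hand-written and hypothetical, not data from any real wordnet, and `associated_words` is an assumed helper name; it simply walks the graph breadth-first and records how many links away each related word is.

```python
from collections import deque

# Hypothetical, hand-written relation graph; not data from a real wordnet.
relations = {
    "dog": {"canine", "pet"},
    "canine": {"mammal"},
    "pet": {"animal"},
    "mammal": {"animal"},
    "animal": set(),
}

def associated_words(word, graph, max_depth=2):
    """Map each word reachable within max_depth links to its distance from `word`."""
    distances = {word: 0}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        if distances[current] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in sorted(graph.get(current, ())):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    del distances[word]  # the starting word is not its own associate
    return distances

cluster = associated_words("dog", relations)
print(cluster)
```

Words at distance 1 are the direct associations and larger distances are the indirect ones; turning such a cluster into a readable definition is the hard, unsolved part.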
        
       ___________________________________________________________________
       (page generated 2021-11-11 23:03 UTC)