[HN Gopher] Learning how dictionaries work
___________________________________________________________________
Learning how dictionaries work
Author : dadt
Score : 29 points
Date : 2021-11-10 20:12 UTC (1 days ago)
(HTM) web link (www.arrantpedantry.com)
(TXT) w3m dump (www.arrantpedantry.com)
| kazinator wrote:
| How dictionaries work is this:
|
| - words are created, or existing words bent, mainly by non-users
| of dictionaries.
|
| - dictionaries cautiously track this process for the benefit of
| dictionary users.
| kazinator wrote:
| If you wish to be conservative in the area of words, the
| easiest thing to do is to keep a dictionary that is some 30-40
| years old, and use nothing that isn't in it, unless it's a
| genuinely new concept for which the old dictionary offers no
| word.
|
| The nice thing about dictionaries is that nobody is forcing you
| to throw out your old ones.
| beervirus wrote:
| The article maybe just hints at another issue that's at least as
| important. Dictionaries are mostly descriptivist rather than
| prescriptivist.
|
| Adding "deplatform" or whatever to the dictionary isn't a
| statement about how proper it is as a word. It's a statement that
| it's a word in use, and so if you come across it, here's what it
| means.
| solarmist wrote:
| Yes, which leads to the other end of the cycle. They are
| removing obsolete words for the same reason.
| boffinAudio wrote:
| I've always wanted a program or set of tools that I can use to
| generate my own dictionaries - not just a glossary, but real
| dictionaries with multiple definitions for words, derivations,
| example sentences, and so on.
|
| One other very important feature I'd need is the ability to run
| the dictionary _on itself_, to find words that are used in the
| definitions but don't yet have a definition of their own.
|
| This would be amazing, for example, to run on a large corpus,
| generate the dictionary, and then run it again to find words that
| are used but not defined - not just in the original corpus but in
| the definitions too.
|
| I've yet to find anything like this and have managed, over the
| years, to do this with some cobbled-together sed and awk hacks ..
| but I still think this is something that would be quite a viable
| commercial product - especially useful for international
| translations and creating properly-defined glossaries for
| documentation, etc.
|
| Anyone know of such tools? I'd love to have my own Dictionary
| builder, proper, and stop fantasizing about turning sed and awk
| scripts into a proper app ..
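The self-check boffinAudio describes above (find words used in the definitions, or in the corpus, that have no entry of their own) can be sketched in a few lines. This is a minimal illustration, not a real dictionary builder: `TOY_DICTIONARY`, `tokenize`, and `undefined_words` are hypothetical names, the tokenizer only handles plain ASCII letters, and there is no lemmatization.

```python
import re

# Hypothetical toy dictionary: headword -> list of definitions.
TOY_DICTIONARY = {
    "dictionary": ["a book that lists words and explains each meaning"],
    "word": ["a unit of language"],
    "book": ["a set of printed pages"],
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def undefined_words(dictionary, corpus="", stopwords=frozenset()):
    """Return words used in the definitions or the corpus that lack an entry."""
    headwords = {w.lower() for w in dictionary}
    used = set(tokenize(corpus))
    for definitions in dictionary.values():
        for definition in definitions:
            used.update(tokenize(definition))
    # Anything used but neither defined nor a stopword still needs an entry.
    return sorted(used - headwords - set(stopwords))


if __name__ == "__main__":
    stop = {"a", "of", "and", "that", "each"}
    for word in undefined_words(TOY_DICTIONARY, stopwords=stop):
        print(word)
```

Running this repeatedly while adding entries for each reported word gives the "run it again" loop. Because the sketch does no lemmatization, "words" is reported as undefined even though "word" has an entry; a real tool would need to normalize inflected forms first.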
| knadh wrote:
| https://alar.ink (Kannada-English dictionary) was built using
| https://github.com/knadh/dictmaker
|
| It addresses some of the things you've mentioned.
| gajomi wrote:
| There are a couple starting points you could take. I spent a
| weekend hacking out a program that generates fake
| word/definition pairs with a transformer model set against a
| dictionary: https://youtu.be/XnJ2TKAn-Vk?t=1547. If you
| substitute fake words for real words and have a sufficiently
| accurate model you could quickly generate reasonable and novel
| definitions.
|
| There are more complete versions of this kind of thing publicly
| available:
| https://github.com/turtlesoupy/this-word-does-not-exist
|
| > This would be amazing, for example, to run on a large corpus,
| generate the dictionary, and then run it again to find words
| that are used but not defined - not just in the original corpus
| but in the definitions too.
|
| I think this would be how you would gauge success of the model.
| That is to say, you would evaluate model accuracy on a set of
| held-out words with definitions that never appeared in your
| dictionary training set but appeared in context in your corpus.
| You would have to manually annotate whether or not the
| generated definition of these held out words was acceptable.
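The held-out evaluation gajomi outlines above could start from a split like the following. It is only a sketch under stated assumptions: `split_held_out` is a hypothetical helper, the dictionary is a plain mapping of headword to definitions, and the corpus is already tokenized. The model and the manual annotation step are not shown.

```python
import random


def split_held_out(dictionary, corpus_tokens, fraction=0.2, seed=0):
    """Split entries into a training set and a held-out set.

    Only headwords that occur in the corpus are eligible to be held out,
    so the model still sees every held-out word in context.
    """
    corpus_vocab = set(corpus_tokens)
    eligible = sorted(w for w in dictionary if w in corpus_vocab)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(eligible)
    n_held = int(len(eligible) * fraction)
    held_out_words = set(eligible[:n_held])
    train = {w: d for w, d in dictionary.items() if w not in held_out_words}
    held_out = {w: dictionary[w] for w in held_out_words}
    return train, held_out
```

The model would be trained on `train` only, asked to define each word in `held_out`, and a human would then judge each generated definition against the reference one.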
| solarmist wrote:
| The closest thing linguists have come up with, to my knowledge,
| is wordnets.
|
| They define each word through the words directly and indirectly
| associated with it. You could then use those clusters to
| create individual definitions in some way.
|
| Have you looked at them?
|
| I think the biggest problem is that definitions are semantic,
| which computers are terrible at. We're only in the infancy of
| being able to handle that, with transformers and large language
| models. So I'd start looking around for the proto-tools that
| will lead to what you're thinking about.
___________________________________________________________________
(page generated 2021-11-11 23:03 UTC)