Newsgroups: comp.text
Path: utzoo!utgpu!watserv1!watdragon!drraymond
From: drraymond@watdragon.waterloo.edu (Darrell Raymond)
Subject: Re: Public Domain Dictionary
Message-ID: <1991Jun12.163331.26129@watdragon.waterloo.edu>
Keywords: Dictionary SGML
Organization: University of Waterloo
References: <1991Jun9.025529.21722@sq.sq.com> <811@tivoli.UUCP>
Date: Wed, 12 Jun 1991 16:33:31 GMT
Lines: 112

>It would be interesting to see what the guys at U. of Waterloo did with the 
>online O.E.D. project.  I understand it is very SGML-like.
 
  The online OED is marked up with tags that are reminiscent of SGML.  
However, there is no DTD for the OED, or for many of the other markup 
projects that Oxford University Press has undertaken. Many existing
dictionaries have too much variance in their structure to be completely
captured by SGML.  Even deciding what sort of information you want to
capture in your markup is a subject of some controversy.

----------------

  Maybe you guys could stand a few comments on your project in general.
Basically, you've got three things to worry about:

  (i)   coverage
  (ii)  correctness
  (iii) finishing

  Coverage means you have to find a source of words that gives us some 
confidence that your dictionary is comprehensive enough for whatever 
purpose you have in mind.  This means more than just finding instances of
every word; it means finding instances of most of the senses of the words.
The strength of any dictionary is its underlying corpus, the collection
of language from which the examples are drawn.  In the case of the OED,
this means 8 to 10 million quotations sent in by volunteer readers.  In 
the case of the Collins COBUILD dictionary it's a special online corpus 
of about 40 million words.

  Correctness means that unless you put in place some mechanism that'll
give us confidence in how you obtained your results, no one will use 
(or at least depend on) your dictionary.  One such mechanism is that
old scholarly tradition, accountability.  For example, the OED provides 
you with the quotes used to define the entry, as well as bibliographic 
information, so you can go and check the quote in the original source if
you like.  Thus you can hold the OED and its editors accountable for
the decisions they made, because you can look at the same evidence.

  Finishing means that you ought to be aware that many a dictionary 
project takes decades longer than the original editors forecast.  
Dictionary-writing is not a part-time activity. 

  Some comments on statements made in various postings:

>There seems to be a serious need for a public domain dictionary.

  My first question is, what for? I admit I didn't see the first posting 
in this thread.  Is it really the definitions you want, or just a word list 
with correct spellings and parts of speech (which would be fine for a lot 
of automatic uses)?  If you actually want to write a dictionary from 
scratch, good luck, you'll be at it a long time.  If it's only a word 
list that you want, you stand a better chance of completing it.

>That means we need about 100 volunteers who each undertake to come up with
>the definitions (in their own unique words) of 5 words per day. Or
>50 who can do 10 per day.

  Goodness gracious.  10 words per day?  Just sit down and write me up
a definition of the word "good".  Make sure to cover as many senses and 
usages as you can think of.  Go check a couple of dictionaries and see 
how many senses you missed.  If it takes you less than an hour to do 
a good job I'd be surprised.  Now multiply that by 10.
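
  To put rough numbers on that, here is a back-of-envelope sketch in
Python.  The quotas (100 volunteers at 5 words a day, or 50 at 10) come
from the quoted posting, and the hour per word is the lower bound I
suggested above.  The function names are made up for illustration, and
no time for checking or revision is included.

```python
# Back-of-envelope workload estimate.  Only the quotas and the
# "an hour per word" figure come from the thread; everything else
# (names, structure) is hypothetical.
HOURS_PER_WORD = 1.0  # assumed lower bound for one careful definition


def daily_hours(words_per_day, hours_per_word=HOURS_PER_WORD):
    """Hours one volunteer spends per day writing definitions."""
    return words_per_day * hours_per_word


def project_words_per_day(volunteers, words_per_day):
    """Definitions the whole project produces per day."""
    return volunteers * words_per_day


for volunteers, quota in [(100, 5), (50, 10)]:
    print(f"{volunteers} volunteers x {quota} words/day: "
          f"{daily_hours(quota):.0f} h/day each, "
          f"{project_words_per_day(volunteers, quota)} words/day overall")
```

  Either way the project sees 500 definitions a day only if every
volunteer puts in 5 to 10 hours daily, which is not a part-time job.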

  Just as you cannot get twice the software production by doubling the
number of programmers, you cannot get twice the dictionary by doubling
the number of volunteers who write definitions.

>Of course, quality control and checks for regional variations are very
>important.  

  Whoops, add to that hour per word all the checks you're going to do for 
regional variations and quality control.  Who has the final word on the 
quality of a definition, anyway?  

>Dictionary definitions are extraordinarily hard to write well.

  But you plan to do 5 to 10 a day?

>* a writer is sent n randomly-chosen words (for example, 30 words taken from
>  random usenet articles and other sources, subject to other checking)

  Usenet is not exactly what I would call a broadly based source of 
words (especially if you want them spelled correctly).

>- - This sounds good to me. It might even be a good idea make a newsgroup
>and let the author post their definition and let anyone who wishes
>reply do so. The author can then after a suitable period revise their
>definition. 

  When there are disputes, who is the final authority?  Since the author
is basically chosen at random, he or she probably has no more claim to
being the final authority than anyone else...

>I think it would be nice to allow accompanying articles
>(kind of like encyclopedia entries). For those who are ambitious.

  It would be nice - who's going to check them for correctness? What if 
some of them are sexist or racist?  Who decides what is permitted and what 
isn't?  Are the people who decide such things then exposing themselves to 
liability for lawsuits?
 
>- - I can understand the reasons for issuing words randomly but I would
>enjoy the project much more if I could pick some of the words I were to
>write entries for. 

  No doubt.  Who decides who gets the most popular words?  

----------------
  
  I'm not trying to throw a wet blanket on this project.  But imagine
if a bunch of lexicographers got together to rewrite Unix on a part-time
basis ('cause we need a public domain one, don'tcha know)....
