--Metaphone DLL written in Delphi--

This is a DLL written in Delphi 2.0 that implements the Metaphone algorithm.
It's kind of like Soundex, but a little more specific. I'm using this in a
system I wrote in Paradox for a District Attorney's office so they can
look up "similar" names.

I could have done this in the Paradox scripting language, but a DLL is
probably faster and can be called from any scripting language or program.

If you are going to translate this into a Delphi component, please send me 
the source.

snail mail:

Tom White
503 W 2nd
Tahlequah, OK 74464

email

wcs@intellex.com
or
whitet@iname.com


Bug Fixes:

v1.1	Checks the length of the string before removing trailing S (must be >1)

	'PH' used to translate to 'H' (incorrectly). It now translates to 'F'.



***************************************************************************
Here are the original comments to the C version of this algorithm I found 
on IIUGS.

The Delphi version is just a straight translation of the C source code.
***************************************************************************

From: Michael Kuhn <rhlab!mkuhn@uunet.uu.net>
Subject: Metaphone searches
Date: Fri, 24 Nov 1995 11:26:50 -0500 (EST)

I have tested this "Metaphone" routine to the best of my current
time/ability. You can put it in the archive. I talked to Sadru Fidai
about his routine that is in the archive. I have included the notes
from that conversation in the comments.

My opinion is that this algorithm does group names together that are
more closely related than Soundex does. However, in my particular
situation it caused the # of matches to increase significantly. I am
planning a trial period in the next couple of weeks to see if it
is manageable.

Robert Minter sent me a 4GL program that is a "hacked" version of
this algorithm, i.e. it does not include all of the transformation
rules originally described. This was not much help for what I was
trying to do.

If anything I got the important stuff from the original article
written down.

I am sure somebody can have as much fun with this as I did. :-))

-- 
Michael J. Kuhn  Computer Systems Consultant  phone:410-254-7060
Email: mkuhn@csd.clark.net
       mkuhn@rhlab.com     or mkuhn%rhlab@uunet.uu.net or uunet!rhlab!mkuhn
       c/o Baltimore Rh Typing Laboratory, Inc.  phone:410-225-9595

******************************

/*  Metaphone Conversion Notes

  When I found this Algorithm in the article, there were discrepancies between
  the BASIC code and the verbal description. The discrepancies look like they
  could have been caused by typing errors in the article.

  I have included the BASIC code from this article for the specific purpose
  of presenting the Algorithm the way it was originally described.
  I have tried to reproduce the BASIC EXACTLY the way it appeared. So when
  you see "ENAM" with nothing behind it, that is how it was presented.

  Lawrence Philips has no doubt spent a lot of time in the development of
  this algorithm. I am trusting that the algorithm described has been
  thoroughly tested to the best of his ability.

  It was my intention to reproduce it using his rules as best as I could
  discern them.

  It looks like it works better than Soundex. Thank You Lawrence.

  To anyone passing this along: please include all of the notes. They
  are part of the documentation and credits. Thanks.

  Mike Kuhn (mkuhn@rhlab.com)

  Michael J. Kuhn        Computer Systems Consultant
                         5916 Glenoak Ave.
                         Baltimore, MD 21214-2009
                         410-254-7060

  P.S.
  A version of this routine in the Informix Archive was done by:

               Sadru Fidai   Munics Information Systems
                             50 Mount Prospect Ave
                             Clifton NJ 07013   (201)778-7753
               aol.com!SFidai

  Sadru called me to discuss this and said the following:

    His routine was NOT done from the article published in "Computer Language".
    He started with a working version from a PICK system that was using this.
    He had 2,000+ names with metaphone from the PICK system that he used
    to test the C code with.

    You might want to check this routine out.

    I did not use his routine at the time because there was no verbal
    explanation of the transformations. Also my intent was to be able
    to easily modify the transformation rules with some of my own.

    I did a mod 100 of my 20,726 test names and got 221 scattered names.
    I then computed Metaphone for Sadru's version and mine. There were 14
    differences, excluding the trailing S's in his, which I eliminated.
    I also changed his code so that O was a ZERO.  The differences account
    for changes I MADE and interpretation of transformation rules.

    At this point I have no need to do a more comprehensive analysis.

         lastname         Mike Kuhn  Sadru Fidai

         ANASTHA            ANS0       ANSX
         DAVIS-CARTER       TFSKRTR    TFXKRTR
         ESCARMANT          ESKRMNT    EXKRMNT
         MCCALL             MCL        MKKL
         MCCROREY           MCRR       MKKRR
         MERSEAL            MRSL       MRXL
         PIEURISSAINT       PRSNT      PRXNT
         ROTMAN             RTMN       RXMN
         SCHEVEL            SXFL       SKFL
         SCHROM             SXRM       SKRM
         SEAL               SL         XL
         SPARR              SPR        XPR
         STARLEPER          STRLPR     XTRLPR
         THRASH             TRX        0RX
*/


/***************************************************************

 Metaphone Algorithm

   Created by Lawrence Philips (location unknown). Metaphone was presented
   in an article in the December 1990 issue of "Computer Language".

   Converted from Pick BASIC, as demonstrated in article, to C by
   Michael J. Kuhn (Baltimore, Maryland)

   My original intention was to replace SOUNDEX with METAPHONE in
   order to get lists of similar sounding names that were more precise.
   SOUNDEX maps "William" and "Williams" to the same values. METAPHONE
   as it turns out DOES THE SAME.  There are going to be problems
   that you need to resolve with your own set of data.

   Basically, for my problem with S's I think a rule like this would help:

      IF metaphone[strlen(metaphone) - 1] == 'S'
                                  AND strlen(metaphone) >= 4  THEN
           metaphone[strlen(metaphone) - 1] = '\0'

   You can add your own rules as required.

   Also, Lawrence Philips suggests that for practical reasons only the
   first 4 characters of the metaphone be used. This happens to be the
   number of characters that Soundex produces. This is indeed practical
   if you already have reserved exactly 4 characters in your database.

   In addition an analysis of your data may show that names are split
   into undesirable "metaphone groups" as the number of metaphone characters
   increases.

             *********** BEGIN METAPHONE RULES ***********

 Lawrence Philips' RULES follow:

 The 16 consonant sounds:
                                             |--- ZERO represents "th"
                                             |
      B  X  S  K  J  T  F  H  L  M  N  P  R  0  W  Y

 Exceptions:

   Beginning of word: "ae-", "gn-", "kn-", "pn-", "wr-" ----> drop first letter
                      "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"

   Beginning of word: "x"                                ----> change to "s"
                                      as in "Deng Xiaoping"

   Beginning of word: "wh-"                              ----> change to "w"
                                      as in "Whalen"

 Transformations:

   B ----> B      unless at the end of word after "m", as in "dumb", "McComb"

   C ----> X      (sh) if "-cia-" or "-ch-"
           S      if "-ci-", "-ce-", or "-cy-"
                  SILENT if "-sci-", "-sce-", or "-scy-"
           K      otherwise, including in "-sch-"

   D ----> J      if in "-dge-", "-dgy-", or "-dgi-"
           T      otherwise

   F ----> F

   G ---->        SILENT if in "-gh-" and not at end or before a vowel
                            in "-gn" or "-gned"
                            in "-dge-" etc., as in above rule
           J      if before "i", "e", or "y", unless in a double "gg"
           K      otherwise

   H ---->        SILENT if after vowel and no vowel follows
                         or after "-ch-", "-sh-", "-ph-", "-th-", "-gh-"
           H      otherwise

   J ----> J

   K ---->        SILENT if after "c"
           K      otherwise

   L ----> L

   M ----> M

   N ----> N

   P ----> F      if before "h"
           P      otherwise

   Q ----> K

   R ----> R

   S ----> X      (sh) if before "h" or in "-sio-" or "-sia-"
           S      otherwise

   T ----> X      (sh) if "-tia-" or "-tio-"
           0      (th) if before "h"
                  silent if in "-tch-"
           T      otherwise

   V ----> F

   W ---->        SILENT if not followed by a vowel
           W      if followed by a vowel

   X ----> KS

   Y ---->        SILENT if not followed by a vowel
           Y      if followed by a vowel

   Z ----> S

 **************************************************************/
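
As an illustration only, the beginning-of-word exceptions listed in the
rules above could be sketched in C roughly as follows. The function name
apply_initial_exceptions is made up for this example and does not come from
the DLL or from Mike Kuhn's C source. It assumes a writable ASCII string,
folds it to upper case first, and reads "wh-" ----> "w" as "keep the W, drop
the H".

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not from the DLL or the original C routine):
   applies the beginning-of-word exceptions in place.
   Assumes `word` is a writable, NUL-terminated ASCII string. */
void apply_initial_exceptions(char *word)
{
    assert(word != NULL);

    /* Fold to upper case so the comparisons below ignore case. */
    for (char *p = word; *p != '\0'; ++p)
        *p = (char)toupper((unsigned char)*p);

    /* "ae-", "gn-", "kn-", "pn-", "wr-": drop the first letter. */
    static const char *const drop_first[] = { "AE", "GN", "KN", "PN", "WR" };
    for (size_t i = 0; i < sizeof drop_first / sizeof drop_first[0]; ++i) {
        if (strncmp(word, drop_first[i], 2) == 0) {
            /* strlen(word) bytes covers the remaining letters plus the NUL. */
            memmove(word, word + 1, strlen(word));
            return;
        }
    }

    if (word[0] == 'X') {
        /* Leading "x" is changed to "s". */
        word[0] = 'S';
    } else if (word[0] == 'W' && word[1] == 'H') {
        /* Leading "wh-" is changed to "w": keep the W, drop the H. */
        memmove(word + 1, word + 2, strlen(word + 2) + 1);
    }
}
```

For example, a buffer holding "Knuth" is left as "NUTH" and "Whalen" becomes
"WALEN". The transformation rules would then be applied to the result.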

/*

  NOTE: This list turned out to be various issues that I passed over
        while trying to discern this algorithm. The final outcome
        of these items may or may not be reflected in the code.

  There were some discrepancies between the Pick BASIC code in the
  original article and the verbal description of the transformations:

     1. CASE SYMB = "G"

              AND ENAME[N +3] = "D" AND (N + 3) = L)) and ENAM
                                                             ^
                  this was cut off in the magazine listing   |

         I used the verbal description in the transformation list
         to add the appropriate code.

     2.  H ---->        SILENT if after vowel and no vowel follows
                 H      otherwise

         This is the transformation description, however, the BASIC
         routine HAS code to do this:

                      SILENT if after "ch-", "sh-", "ph-", "th", "gh"

         which is the correct behaviour if you look at c,s,p,t,g

         It did not, however, have "after vowel" coded even though this
         was in the description. I added it.

    3.   The BASIC code appears to skip double letters except "C" yet
         the transformation code for "G" looks at previous letter to
         see if we have "GG". This is inconsistent.

         I am making the assumption that "C" was a typo in the BASIC
         code. It should have been "G".

     4.  Transformation notation. "-..-" where .. are letters; means that
         the letters indicated are bounded by other letters. So "-gned"
         means at the end and "ch-" means at the beginning. I have noticed
         that the latter is not explicitly stated in the verbal description
         but it is coded in the BASIC.

     5.  case 'C'    K otherwise, including in "-sch-"
         this implies that "sch" be bounded by other letters. The BASIC
         code, however, has: N > 1
         It should have N > 2 for this to be correct.
               SCH-
               123    greater than 1 means that C can be 2nd letter

         I coded it as per the verbal description and not what was in
         the code.

     6.  as of 11-20-95 I am still trying to understand "H". The BASIC
         code seems to indicate that if "-.h" is at the end it is not
         silent. But if it is at the end there is no way a vowel could
         follow the "h". I am looking for examples.

      7. ok now I am really confused. Case "T". There is code in BASIC
         that says if next = "H" and previous != "T" . There is no
         way that a double T goes through the code. Double letters
         are dumped in the beginning.

              MATTHEW, MATTHIES, etc

         The first T goes through and the second is skipped, so the
         "th" is never detected.

         Modified routine to allow "G,T" duplicates through the switch.

       8. case "D"  -dge- is indicated in transformation
                    -dge- or -dge is coded.

            STEMBRIDGE should have "j" on end and not "t"

           I am leaving the code as is, verbal must be wrong.

       9. Regarding duplicate letters. "C" must be allowed through
          as in all of the McC... names.

          The way to handle "GG" and "TT" I think is to pass over the
          first duplicate. The transformation rules would then handle
          duplicates of themselves by looking at the PREVIOUS letter.

          This solves the problems of "TTH" where you want the "th"
          sound.

       10. Change "CC" so that the metaphone character is "C". The
           way it is now, for McComb and such you get "MKK", which
           unnecessarily eats up an extra metaphone character.

       11. "TH" at the beginning as in Thomas. The verbal was not
           clear about this. I think it should be "T" and not "0"
           so I am changing code.

           After the first test I think that "THvowel" should be
           "0" and "TH(!vowel)" should be "T"

       12. I think throwing away 1 "S" at the end would be good,
           since I am doing this anyway after the fact. If I
           do it before, then names like
                   BURROUGHS & BURROUGH would be the same
           because the GH would map to the same value in
           both cases.

       13. Case "Y", Brian and Bryan give different codes
           Don't know how to handle this yet.

       14. Comments on metaphone groups. Metaphone actually
           makes groups bigger. Names like:

                 C...R...  G...R...  K...R...  Q...R...

           will map to "KR". Soundex would have produced for example

           C600,C620,G600,G620,K600,K620,Q600,Q620

           the names from these 8 groups would have been collapsed into 1.

           Another way to look at this is for a more exact initial
           guess of a name Soundex would give you a smaller list of
           possibilities. If you don't know how to spell it at all
           however, your success at finding the right match with
           Metaphone is much greater than with Soundex.

      15. After some tests I decided to leave S's at the end of the
          Metaphone. #12 takes care of my problems with plurals, and
          then S gets used to help make distinct metaphones.

    Lawrence Philips is no longer at the company indicated in the
    article. So I was unable to verify these items.
*/
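
A minimal C sketch of the trailing "S" idea from note 12, combined with the
length check listed under Bug Fixes for v1.1 (the string must be longer than
one character). The name strip_trailing_s and its return value are
assumptions made for illustration. This is not the code from the DLL or the
original C routine, and it assumes the name has already been converted to
upper case.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the DLL or the original C routine):
   removes ONE trailing 'S' from an upper-case name before it is encoded.
   The name must be longer than one character, so a name that is just "S"
   is left alone. Returns 1 if a letter was removed, 0 otherwise. */
int strip_trailing_s(char *name)
{
    assert(name != NULL);

    size_t len = strlen(name);
    if (len > 1 && name[len - 1] == 'S') {
        name[len - 1] = '\0';   /* drop only the last letter */
        return 1;
    }
    return 0;
}
```

With this, BURROUGHS and BURROUGH are both encoded from "BURROUGH", while a
one-letter name such as "S" is left untouched. Only one "S" is removed, so
"ROSS" becomes "ROS".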
