[HN Gopher] Phonetic Matching
       ___________________________________________________________________
        
       Phonetic Matching
        
       Author : raybb
       Score  : 63 points
       Date   : 2024-11-18 03:02 UTC (19 hours ago)
        
 (HTM) web link (smoores.dev)
 (TXT) w3m dump (smoores.dev)
        
       | cess11 wrote:
       | It's about someone who was using Levenshtein distance for phonetic
       | fitting against text and then learned about Soundex.
       | 
       | One way to start playing around with it is to put some stuff in a
       | database: https://dev.mysql.com/doc/refman/8.4/en/string-
       | functions.htm...
       | 
       | (or this module,
       | https://www.postgresql.org/docs/current/fuzzystrmatch.html if
       | you're stuck with PG)
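To make the contrast between the two techniques concrete, here is a minimal Python sketch, assuming the classic American Soundex rules (database implementations such as the MySQL and PostgreSQL functions linked above may differ in details). The function names `soundex` and `levenshtein` are chosen for illustration and are not taken from the article.

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""

    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets the "previous code"
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    for left, right in (("shore", "sure"), ("Robert", "Rupert")):
        print(left, right,
              "soundex equal:", soundex(left) == soundex(right),
              "spelling distance:", levenshtein(left, right))
```

The point of the comparison: Soundex reduces matching to an equality test on a short code, while Levenshtein distance on the raw spelling gives a number that still needs a threshold.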
        
       | asveikau wrote:
       | The idea that "shore" and "sure" are pronounced "almost
       | identically" would depend pretty heavily on your accent. The
       | vowel is pretty different to me.
       | 
       | Also, the matches for "sorI" and "sorY" would seem to me to
       | misinterpret the words as having a vowel at the end, rather than
       | a silent vowel. If you're using data meant for foreign surnames,
       | the rules of which may differ from English and which might have
       | silent vowels be very rare depending on the original language, of
       | course you may mispronounce English words like this, saying both
       | shore and sure as "sore-ee".
       | 
       | I'm sure there are much better ways to transcribe orthography to
       | phonetics, probably people have published libraries that do it.
       | From some googling, it seems like some people call this type of
       | library a phonemic transcriber or IPA transcriber.
        
         | xdennis wrote:
         | > The idea that "shore" and "sure" are pronounced "almost
         | identically" would depend pretty heavily on your accent.
         | 
         | For an idea of how badly various accents can complicate
         | recognition, see how Baltimoreans pronounce "Aaron Earned An
         | Iron Urn": https://www.youtube.com/watch?v=Oj7a-p4psRA
        
         | woodrowbarlow wrote:
         | IPA is the most-used tool by linguistic researchers for
         | encoding pronunciation in a standardized way. IPA is criticized
         | for being a little bit anglo-centric and falls short for some
         | languages and edge cases, but overall it performs pretty well.
         | (learned from an ex who studies linguistics.)
        
           | asveikau wrote:
           | I've always found IPA to be deeply confusing for English,
           | because different accents have different historical vowel
           | mergers, so I am never sure about vowels. And I think
           | linguists aren't always sure about them either. IIRC, I saw a
           | video by Geoff Lindsey suggesting Americans don't really have
           | a /ʌ/ phoneme. Most people who have written about this write
           | as if we do. (By the way, Dr. Lindsey's YouTube videos are
           | some of the more interesting content I've found about English
           | phonetics)
           | 
           | For other languages I have exposure to, IPA seems to make
           | more sense. Possibly I have a bias in that they're not my
           | native language, so I can analyze them instead of
           | internalizing them. But also, they have cleaner phonetics,
           | cleaner orthography, and less regional variation of phonemes.
        
             | thaumasiotes wrote:
             | > For other languages I have exposure to, IPA seems to make
             | more sense.
             | 
             | > But also, they have cleaner phonetics, cleaner
             | orthography, and less regional variation of phonemes.
             | 
             | The first and last of those are essentially guaranteed to
             | be false.
             | 
             | > Possibly I have a bias in that they're not my native
             | language
             | 
             | The more likely bias is that you just don't know very much
             | about those other languages.
        
               | asveikau wrote:
               | You assume too much. I'm talking about languages I'm
               | fluent in, can read and write, etc.
               | 
               | You'd have to be insane to think that, for example, IPA
               | for Spanish isn't easier than IPA for English vowels. In
               | contrast to English, most Spanish regional pronunciations
               | are about consonants. And the orthography is very
               | regular. If you give me the correct spelling of any word
               | and a short description of where the speaker is from I
               | can give you the IPA with remarkable accuracy, a task
               | that would be very difficult for English.
        
           | tokinonagare wrote:
           | The issue is not really the IPA itself but how to use it. If
           | you stay at the phonemic level, it makes more words comparable
           | but hides distinctions that occur only in dialects. Also, for
           | a lot of languages, there are multiple competing models of
           | the set of phonemes involved. If you go down the phonetic
           | rabbit hole, the notation quickly becomes really hard to read.
           | If you have to handle multiple variations, there are also
           | diaphonemes, but then it's even less standardized.
        
           | lupire wrote:
           | Yes, but stay aware that IPA is for pronunciations.
           | 
           | A word doesn't have a unique pronunciation. A (Speaker, Word)
           | pair has a pronunciation, and even those are not unique.
           | A (Speaker, Word, Utterance) triple has a pronunciation.
        
             | jjtheblunt wrote:
             | even a speaker with a specified word in a specified
             | utterance will vary pronunciation for the context of who is
             | listening (imitation of local accent).
             | 
             | (we worked on all this in Motorola in 2001
             | extensively....then they dropped it)
        
           | bane wrote:
           | This is sort of the inverse of the problem IPA is trying to
           | solve. You're correct in that IPA is used to try to encode
           | pronunciation. But phonetic matching is trying to encode
           | those areas where different people, in different accents
           | (maybe languages), say or write semantically the same thing,
           | but differently -- but you need to find all the others using
           | only one of the different versions _without_ finding things
           | that are unrelated or irrelevant.
           | 
           | Basically it's trying to smush all the different versions
           | together into a single sort of cluster, where the identity of
           | the cluster is any of the versions.
           | 
           | I used to work in this field about 30 years ago, specifically
           | how names can end up being latinized when coming from non-
           | latin languages. We were very focused on trying to collapse
           | variants into a complex ruleset that could be used both to
           | recognize the cluster of names as being the same "thing", and
           | then that ruleset could also produce all the valid variants.
           | It was very much a kind of applied "expert systems" approach
           | that predated ML.
           | 
           | The rulesets were more or less context free grammars and
           | regular expressions that we could use to "decompile" a name
           | token into a kind of limited regular expression (no infinite
           | closures) and then recompile the expression back into a list
           | of other variants. Each variant in turn was supposed to
           | "decompile" back into the same expression so a name could be
           | part of a kind of closed algebra of names all with the same
           | semantic meaning.
           | 
           | For example:
           | 
           | A Korean name like "Park" might turn into a {rule} that would
           | also generate "Pak", "Paek", "Baek", etc.
           | 
           | Any one of those would also generate the same {rule}.
           | 
           | In practice it worked surprisingly well, and the struggle was
           | mostly in identifying the areas where the precision/recall in
           | this scheme caused the names to not form a closed algebra.
           | 
           | Building the rules was an ungodly amount of human labor
           | though, with expert linguists involved at every step.
           | 
           | These days I'm sure the problem would be approached in an
           | entirely different way.
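The "decompile a name into a rule, recompile the rule into variants" idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the slot-based rule format, `PARK_RULE`, and the function names are hypothetical and are not the system the commenter worked on.

```python
import itertools
import re

# Hypothetical rule: each slot lists the spellings allowed at that position.
# An empty string means the slot may be omitted. No infinite closures.
PARK_RULE = (("P", "B"), ("a", "ae"), ("r", ""), ("k",))


def expand(rule):
    """'Recompile' a rule into the set of every spelling it allows."""
    return {"".join(parts) for parts in itertools.product(*rule)}


def to_regex(rule):
    """Turn a rule into a regular expression with one group per slot."""
    groups = ("(?:" + "|".join(re.escape(alt) for alt in slot) + ")"
              for slot in rule)
    return re.compile("".join(groups))


def decompile(name, rules):
    """Return the first rule that matches the whole name, or None."""
    for rule in rules:
        if to_regex(rule).fullmatch(name):
            return rule
    return None


if __name__ == "__main__":
    variants = expand(PARK_RULE)
    print(sorted(variants))
    # Closure check: every generated variant maps back to the same rule.
    print(all(decompile(v, [PARK_RULE]) is PARK_RULE for v in variants))
```

Note that this toy rule also generates spellings such as "Bark", which is the kind of precision/recall tension the comment mentions.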
        
         | thaumasiotes wrote:
         | > The idea that "shore" and "sure" are pronounced "almost
         | identically" would depend pretty heavily on your accent. The
         | vowel is pretty different to me.
         | 
         | That's not the similarity the author is trying to point out.
         | The idea is that the spelling is a lot more different than the
         | pronunciation is, and that's true. The pronunciations are as
         | similar as it's possible to be, measured by substitution count,
         | without actually being identical. (You could use a measure of
         | phonetic similarity, in which case e.g. _fought_ and _thought_
         | would be much more similar than _fought_ and _caught_, but
         | he's not doing that either.)
         | 
         | The pronunciation of _sure_ comes from (1) the old, dead idea
         | that the letter _u_ should be pronounced /ju/ rather than /u/
         | (compare _cure_); and (2) the still vital English reduction of
         | /sj/ to /ʃ/. _Shore_ has to indicate the same sound in a
         | radically different way, since it doesn't have and never had a
         | medial /j/ to transform a bare _s_.
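The "measured by substitution count" point can be illustrated with a small Python sketch that runs the same edit distance over letters and over phoneme symbols. The phoneme lists below are hand-written assumptions (one possible rhotic transcription), not output from any transcriber, and `edit_distance` is a name chosen here for illustration.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance over any sequences (letters or phoneme symbols)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


# Assumed transcriptions for one accent; other accents will differ.
SHORE = ["ʃ", "ɔ", "r"]
SURE = ["ʃ", "ʊ", "r"]

spelling_distance = edit_distance("shore", "sure")
phoneme_distance = edit_distance(SHORE, SURE)

if __name__ == "__main__":
    print("spelling:", spelling_distance, "phonemes:", phoneme_distance)
```

Under these assumed transcriptions the pronunciations differ by a single substitution, while the spellings differ by more.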
        
           | asveikau wrote:
           | > the old, dead idea that the letter u should be pronounced
           | /ju/ rather than /u/ (compare cure)
           | 
           | Tell me your accent has yod dropping without telling me.
        
       | ajuc wrote:
       | This is one of those cases where inheriting a hacked-together
       | piece of crap (English spelling) makes a lot of additional work
       | higher up.
       | 
       | Another example is poetry. A regex can find rhymes in Polish.
       | Same postfix == it rhymes.
       | 
       | In English it's a feat of engineering.
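A minimal Python sketch of the "same suffix means it rhymes" idea for Polish. It assumes a simplified rule: compare everything from the second-to-last vowel letter onward (or from the only vowel in one-syllable words). The vowel list and function names are illustrative assumptions, and the sketch ignores details such as "i" acting as a softening marker before another vowel.

```python
import re

VOWELS = "aąeęioóuy"

# From the second-to-last vowel letter to the end of the word;
# for words with a single vowel, from that vowel to the end.
_TAIL = re.compile(rf"[{VOWELS}][^{VOWELS}]*[{VOWELS}]?[^{VOWELS}]*$")


def rhyme_tail(word: str) -> str:
    """Return the spelling-based rhyming part of a word."""
    word = word.lower()
    match = _TAIL.search(word)
    return match.group(0) if match else word


def rhymes(a: str, b: str) -> bool:
    """Two different words rhyme if their tails are spelled the same."""
    return a.lower() != b.lower() and rhyme_tail(a) == rhyme_tail(b)


if __name__ == "__main__":
    for pair in (("noga", "droga"), ("kot", "płot"), ("kot", "noga")):
        print(pair, rhymes(*pair))
```

The same approach applied to English spelling would pair words like "cough" and "though", which is the contrast the comment is drawing.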
        
         | wavemode wrote:
         | It's really just a feat of data collection (e.g.
         | rhymezone.com). You just compile all English words and record
         | which ones rhyme with which.
         | 
         | (Yeah it's labor-intensive, but probably not more so than, say,
         | writing a dictionary.)
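The data-collection approach can be sketched as a lookup table keyed on the rhyming part of each pronunciation. The Python below is a minimal sketch: the `PRONUNCIATIONS` table is a tiny hand-written sample in ARPAbet-style notation for a single assumed accent, and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-written sample; digits on vowels mark stress (1 = primary).
# One pronunciation per word, which is itself a simplification.
PRONUNCIATIONS = {
    "shore": ["SH", "AO1", "R"],
    "more": ["M", "AO1", "R"],
    "sure": ["SH", "UH1", "R"],
    "cure": ["K", "Y", "UH1", "R"],
    "theme": ["TH", "IY1", "M"],
    "sixteen": ["S", "IH0", "K", "S", "T", "IY1", "N"],
}


def rhyme_key(phones):
    """Phonemes from the last primary-stressed vowel to the end."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return tuple(phones[i:])
    return tuple(phones)


def build_rhyme_index(pronunciations):
    """Group words by rhyme key."""
    index = defaultdict(set)
    for word, phones in pronunciations.items():
        index[rhyme_key(phones)].add(word)
    return index


def rhymes_for(word, pronunciations, index):
    """Look up the other words that share this word's rhyme key."""
    return index.get(rhyme_key(pronunciations[word]), set()) - {word}


if __name__ == "__main__":
    index = build_rhyme_index(PRONUNCIATIONS)
    for w in ("shore", "sure", "theme"):
        print(w, sorted(rhymes_for(w, PRONUNCIATIONS, index)))
```

Supporting more than one accent, or words with more than one pronunciation, would mean storing several entries per word and deciding which pairs count as rhymes.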
        
           | williamdclt wrote:
           | > You just compile all English words and record which ones
           | rhyme with which
           | 
           | I suppose, if we ignore accents and heteronyms... both of
           | which English is famous for, unfortunately!
        
             | nyrikki wrote:
             | Shakespeare in RP loses most of the raunchy jokes as an
             | example of the above.
             | 
             | My high school English teacher was horrified when she
             | figured out why us boys were laughing when reading her copy
             | of the First Folio; our hick accent meant we were getting
             | some of the jokes she didn't even notice.
             | 
             | Theme rhyming with sixteen in the Cranberries' song Zombie
             | is another.
        
         | thechao wrote:
         | English orthography isn't really hacked together. Most of the
         | "examples" I see people bandy about are because you're reading
         | the wrong English: try Old English, instead. For example,
         | knight: it was pronounced "k-ng-ee-h-tuh" (my IPA is too rusty
         | to use). That's, like, precisely how it's spelled? What's gone
         | _wrong_ is that our modern pronunciation is poor.
         | 
         | Other languages have this even worse. Try comparing Egyptian
         | Colloquial Arabic vs literary Arabic. I mean... these are
         | different languages. Or, for instance, American Sign Language
         | (ASL) vs. written English: the former is more like _Chinese_
         | than English.
        
       | WarOnPrivacy wrote:
       | This short epilogue struck me.
       | 
       | > This past Yom Kippur, my wife and I drove two hours to spend the
       | afternoon at my aunt's house, with my cousins. As the night drew
       | on, conversation roamed from television shows and books to
       | politics and philosophy. The circle grew as we touched on
       | increasingly sensitive and challenging topics, drawing us in.
       | 
       | > We didn't agree, per se. We were engaging in debate as often as
       | we were engaging in conversation. But we all love each other
       | deeply, and the amount of care and restraint that went into how
       | each person expressed their disagreement was palpable.
        
       | willwade wrote:
       | I'm intrigued. Is this not done just with a phonemizer?
       | 
       |     from phonemizer.phonemize import phonemize
       | 
       |     text = "hello world"
       |     variations = [
       |         phonemize(text, backend="espeak", language="en-us", strip=True),
       |         phonemize(text, backend="espeak", language="en-gb", strip=True),
       |         phonemize(text, backend="espeak", language="en-au", strip=True),
       |     ]
       |     for i, variation in enumerate(variations, 1):
       |         print(f"Variation {i}: {variation}")
       | 
       | I mean, espeak isn't the best, but a lot of folks in the ASR/speech
       | world are still using this, right?
       | 
       | (NB: If you are on iOS, check out the built-in one: Settings ->
       | Accessibility -> Spoken Content -> Pronunciations. When you add
       | one, it can phonemize your spoken message to IPA. If someone can
       | tell me which SDK/API they use for that, I'd love to know.)
        
         | rahimnathwani wrote:
         | It seems like Beider-Morse outputs more variations of each
         | word, which I guess means fewer false negatives, and using only
         | equality tests?
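The trade-off guessed at here (several codes per word, matching by plain equality) can be sketched as an inverted index from code to words. In the Python below, the `CODES` table is hand-written for illustration and is NOT actual Beider-Morse output; the function names are hypothetical as well.

```python
from collections import defaultdict

# Hypothetical code sets, invented for this example only.
CODES = {
    "shore": {"sor", "sori"},
    "sure": {"sur", "sor", "sori"},
    "chore": {"tsor", "kor"},
}


def build_index(codes):
    """Map each code to the set of words that produce it."""
    index = defaultdict(set)
    for word, word_codes in codes.items():
        for code in word_codes:
            index[code].add(word)
    return index


def matches(word, codes, index):
    """Words sharing at least one code with `word` (equality tests only)."""
    found = set()
    for code in codes[word]:
        found |= index.get(code, set())
    return found - {word}


if __name__ == "__main__":
    index = build_index(CODES)
    for w in CODES:
        print(w, sorted(matches(w, CODES, index)))
```

More codes per word gives more chances for two words to share one, which lowers false negatives but can raise false positives.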
        
       | Der_Einzige wrote:
       | Highly related to my paper on why tokenization in LLMs is the
       | devil: https://paperswithcode.com/paper/most-language-models-can-
       | be...
        
       | msgerbush wrote:
       | I'm using a library, stable-ts, for a similar issue with short
       | audio clips and it works well:
       | https://github.com/jianfch/stable-ts/tree/main
       | 
       | Not sure how it will perform on something long like an audiobook.
        
       ___________________________________________________________________
       (page generated 2024-11-18 23:01 UTC)