[HN Gopher] Phonetic Matching
___________________________________________________________________
Phonetic Matching
Author : raybb
Score : 63 points
Date : 2024-11-18 03:02 UTC (19 hours ago)
(HTM) web link (smoores.dev)
(TXT) w3m dump (smoores.dev)
| cess11 wrote:
| It's about someone who was using Levenshtein distance for phonetic
| matching against text and then learned about Soundex.
|
| One way to start playing around with it is to put some stuff in a
| database: https://dev.mysql.com/doc/refman/8.4/en/string-
| functions.htm...
|
| (or this module,
| https://www.postgresql.org/docs/current/fuzzystrmatch.html if
| you're stuck with PG)
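|
| For playing around without a database, here is a minimal Python
| sketch of the two techniques named above: classic American Soundex
| and Levenshtein edit distance. It is hand-rolled for illustration;
| the SOUNDEX() and levenshtein() functions in MySQL and fuzzystrmatch
| may differ in details such as code length.

```python
# Classic American Soundex letter groups; vowels, y, h and w get no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of a word ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return ("".join(out) + "000")[:4]


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


print(soundex("shore"), soundex("sure"), levenshtein("shore", "sure"))
```

| The two spellings are two edits apart as strings but collapse to the
| same Soundex code.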
| asveikau wrote:
| The idea that "shore" and "sure" are pronounced "almost
| identically" would depend pretty heavily on your accent. The
| vowel is pretty different to me.
|
| Also, the matches for "sorI" and "sorY" would seem to me to
| misinterpret the words as having a vowel at the end, rather than
| a silent vowel. If you're using data meant for foreign surnames,
| the rules of which may differ from English and which might have
| silent vowels be very rare depending on the original language, of
| course you may mispronounce English words like this, saying both
| shore and sure as "sore-ee".
|
| I'm sure there are much better ways to transcribe orthography to
| phonetics, probably people have published libraries that do it.
| From some googling, it seems like some people call this type of
| library a phonemic transcriber or IPA transcriber.
| xdennis wrote:
| > The idea that "shore" and "sure" are pronounced "almost
| identically" would depend pretty heavily on your accent.
|
| For an idea of how badly various accents can complicate
| recognition see how Baltimoreans pronounce "Aaron Earned An
| Iron Urn": https://www.youtube.com/watch?v=Oj7a-p4psRA
| woodrowbarlow wrote:
| IPA is the most-used tool by linguistic researchers for
| encoding pronunciation in a standardized way. IPA is criticized
| for being a little bit anglo-centric and falls short for some
| languages and edge cases, but overall it performs pretty well.
| (learned from an ex who studies linguistics.)
| asveikau wrote:
| I've always found IPA to be deeply confusing for English,
| because different accents have different historical vowel
| mergers, so I am never sure about vowels. And I think
| linguists aren't always sure about them either. IIRC, I saw a
| video by Geoff Lindsey suggesting Americans don't really have
| a /ʌ/ phoneme. Most people who have written about this write
| as if we do. (By the way, Dr. Lindsey's YouTube videos are
| some of the more interesting content I've found about English
| phonetics)
|
| For other languages I have exposure to, IPA seems to make
| more sense. Possibly I have a bias in that they're not my
| native language, so I can analyze them instead of
| internalizing them. But also, they have cleaner phonetics,
| cleaner orthography, and less regional variation of phonemes.
| thaumasiotes wrote:
| > For other languages I have exposure to, IPA seems to make
| more sense.
|
| > But also, they have cleaner phonetics, cleaner
| orthography, and less regional variation of phonemes.
|
| The first and last of those are essentially guaranteed to
| be false.
|
| > Possibly I have a bias in that they're not my native
| language
|
| The more likely bias is that you just don't know very much
| about those other languages.
| asveikau wrote:
| You assume too much. I'm talking about languages I'm
| fluent in, can read and write, etc.
|
| You'd have to be insane to think that, for example, IPA
| for Spanish isn't easier than IPA for English vowels. In
| contrast to English, most Spanish regional pronunciations
| are about consonants. And the orthography is very
| regular. If you give me the correct spelling of any word
| and a short description of where the speaker is from I
| can give you the IPA with remarkable accuracy, a task
| that would be very difficult for English.
| tokinonagare wrote:
| The issue is not really in the IPA but how to use it. If you
| stay at the phonemic level, it makes more words comparable
| but hides distinctions that occur only in dialects. Also, for
| a lot of languages, there are multiple competing analyses of
| the set of phonemes involved. If you go down the phonetic
| rabbit hole, the notation quickly becomes really hard to read.
| If you have to handle multiple variations, there are also
| diaphonemes, but then it's even less standardized.
| lupire wrote:
| Yes, but stay aware that IPA is for pronunciations.
|
| A word doesn't have a unique pronunciation. A (Speaker, Word)
| pair has a pronunciation, and even those are not unique.
| A (Speaker, Word, Utterance) triple has a pronunciation.
| jjtheblunt wrote:
| even a speaker with a specified word in a specified
| utterance will vary pronunciation for the context of who is
| listening (imitation of local accent).
|
| (we worked on all this in Motorola in 2001
| extensively....then they dropped it)
| bane wrote:
| This is sort of the inverse of the problem IPA is trying to
| solve. You're correct in that IPA is used to try to encode
| pronunciation. But phonetic matching is trying to encode
| those areas where different people, in different accents
| (maybe languages), say or write semantically the same thing,
| but differently -- but you need to find all the others using
| only one of the different versions _without_ finding things
| that are not the same thing or are irrelevant.
|
| Basically it's trying to smush all the different versions
| together into a single sort of cluster, where the identity of
| the cluster is any of the versions.
|
| I used to work in this field about 30 years ago, specifically
| how names can end up being latinized when coming from non-
| latin languages. We were very focused on trying to collapse
| variants into a complex ruleset that could be used both to
| recognize the cluster of names as being the same "thing", and
| then that ruleset could also produce all the valid variants.
| It was very much a kind of applied "expert systems" approach
| that predated ML.
|
| The rulesets were more or less context free grammars and
| regular expressions that we could use to "decompile" a name
| token into a kind of limited regular expression (no infinite
| closures) and then recompile the expression back into a list
| of other variants. Each variant in turn was supposed to
| "decompile" back into the same expression so a name could be
| part of a kind of closed algebra of names all with the same
| semantic meaning.
|
| For example:
|
| A Korean name like "Park" might turn into a {rule} that would
| also generate "Pak", "Paek", "Baek", etc.
|
| Any one of those would also generate the same {rule}.
|
| In practice it worked surprisingly well, and the struggle was
| mostly in identifying the areas where the precision/recall in
| this scheme caused the names to not form a closed algebra.
|
| Building the rules was an ungodly amount of human labor
| though, with expert linguists involved at every step.
|
| These days I'm sure the problem would be approached in an
| entirely different way.
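|
| The "decompile to a rule, recompile to variants" idea described above
| can be sketched roughly as follows. This is only an illustration under
| stated assumptions: the rule format (a sequence of slots, each listing
| its allowed alternatives, with no closures) and the PARK rule itself
| are hypothetical, not the actual rulesets from that project.

```python
from itertools import product

# Hypothetical rule: one tuple of alternatives per position, no repetition operators.
PARK_RULE = (("P", "B"), ("a", "ae"), ("r", ""), ("k",))
RULES = {"PARK": PARK_RULE}


def expand(rule):
    """Recompile a rule into the full set of spellings it generates."""
    return {"".join(parts) for parts in product(*rule)}


def decompile(name, rules=RULES):
    """Return the ids of every rule that generates this spelling."""
    return sorted(rule_id for rule_id, rule in rules.items() if name in expand(rule))


def is_closed(rule_id, rules=RULES):
    """True if every variant of the rule decompiles back to that rule alone."""
    return all(decompile(v, rules) == [rule_id] for v in expand(rules[rule_id]))


print(sorted(expand(PARK_RULE)))
print(decompile("Baek"), is_closed("PARK"))

# A second, overlapping (also hypothetical) rule breaks the closure property.
overlapping = dict(RULES, BARK=(("B",), ("ar",), ("k",)))
print(is_closed("PARK", overlapping))
```

| The toy rule also generates "Bark", which hints at the precision
| problem: once another rule claims that spelling, the cluster is no
| longer closed.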
| thaumasiotes wrote:
| > The idea that "shore" and "sure" are pronounced "almost
| identically" would depend pretty heavily on your accent. The
| vowel is pretty different to me.
|
| That's not the similarity the author is trying to point out.
| The idea is that the spelling is a lot more different than the
| pronunciation is, and that's true. The pronunciations are as
| similar as it's possible to be, measured by substitution count,
| without actually being identical. (You could use a measure of
| phonetic similarity, in which case e.g. _fought_ and _thought_
| would be much more similar than _fought_ and _caught_, but
| he's not doing that either.)
|
| The pronunciation of _sure_ comes from (1) the old, dead idea
| that the letter _u_ should be pronounced /ju/ rather than /u/
| (compare _cure_ ); and (2) the still vital English reduction of
| /sj/ to /ʃ/. _Shore_ has to indicate the same sound in a
| radically different way, since it doesn't have and never had a
| medial /j/ to transform a bare _s_.
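|
| To make the two measures mentioned above concrete, here is a small
| sketch comparing a plain substitution count with a feature-based
| phonetic distance. The transcriptions and the feature table are
| assumptions chosen for illustration (one possible accent, three
| features per consonant), not data from the article.

```python
# Assumed broad transcriptions for a single accent; other accents will differ.
PRON = {
    "shore": ["ʃ", "ɔ", "r"],
    "sure": ["ʃ", "ʊ", "r"],
    "fought": ["f", "ɔ", "t"],
    "thought": ["θ", "ɔ", "t"],
    "caught": ["k", "ɔ", "t"],
}

# Hypothetical feature table: (place, manner, voicing) for the consonants compared.
FEATURES = {
    "f": ("labiodental", "fricative", "voiceless"),
    "θ": ("dental", "fricative", "voiceless"),
    "k": ("velar", "plosive", "voiceless"),
}


def substitution_count(a, b):
    """Number of positions at which two equal-length phoneme lists differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def segment_cost(x, y):
    """0 for identical segments, differing-feature count if both are known, else 3."""
    if x == y:
        return 0
    if x in FEATURES and y in FEATURES:
        return sum(p != q for p, q in zip(FEATURES[x], FEATURES[y]))
    return 3


def feature_distance(a, b):
    """Sum of per-segment feature costs over two equal-length phoneme lists."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(segment_cost(x, y) for x, y in zip(a, b))


for other in ("thought", "caught"):
    print(other,
          substitution_count(PRON["fought"], PRON[other]),
          feature_distance(PRON["fought"], PRON[other]))
```

| Under these assumptions the substitution count cannot tell the two
| pairs apart, while the feature-based distance ranks _thought_ closer
| to _fought_ than _caught_ is.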
| asveikau wrote:
| > the old, dead idea that the letter u should be pronounced
| /ju/ rather than /u/ (compare cure)
|
| Tell me your accent has yod dropping without telling me.
| ajuc wrote:
| This is one of those cases where inheriting a hacked-together piece
| of crap (English spelling) makes a lot of additional work higher
| up.
|
| Another example is poetry. A regex can find rhymes in Polish.
| Same postfix == it rhymes.
|
| In English it's a feat of engineering.
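|
| A rough sketch of the "same postfix == it rhymes" idea, assuming a
| deliberately naive rule: the rhyme key is everything from the
| second-to-last vowel letter onward (or from the only vowel in a
| one-syllable word). It ignores digraphs and the softening "i", so
| treat it as an illustration rather than a real Polish rhyme detector.

```python
import re
from collections import defaultdict

VOWELS = "aąeęioóuy"
# Optional penultimate vowel group, then the last vowel and any trailing consonants.
RHYME_TAIL = re.compile(rf"(?:[{VOWELS}][^{VOWELS}]*)?[{VOWELS}][^{VOWELS}]*$")


def rhyme_key(word):
    """Return the spelling suffix used as a rhyme key (whole word if no vowel)."""
    word = word.lower()
    match = RHYME_TAIL.search(word)
    return match.group(0) if match else word


def group_rhymes(words):
    """Group words by rhyme key, keeping only keys shared by two or more words."""
    groups = defaultdict(list)
    for word in words:
        groups[rhyme_key(word)].append(word)
    return {key: group for key, group in groups.items() if len(group) > 1}


print(group_rhymes(["kot", "płot", "noga", "droga", "dom"]))
print(rhyme_key("though"), rhyme_key("tough"))
```

| The last line shows why the same trick fails for English: "though"
| and "tough" get the same key but do not rhyme.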
| wavemode wrote:
| It's really just a feat of data collection (e.g.
| rhymezone.com). You just compile all English words and record
| which ones rhyme with which.
|
| (Yeah it's labor-intensive, but probably not more so than, say,
| writing a dictionary.)
| williamdclt wrote:
| > You just compile all English words and record which ones
| rhyme with which
|
| I suppose, if we ignore accents and heteronyms... both of
| which English is famous for, unfortunately!
| nyrikki wrote:
| As an example of the above, Shakespeare in RP loses most of the
| raunchy jokes.
|
| My high school English teacher was horrified when she
| figured out why us boys were laughing when reading her copy
| of the First Folio; our hick accent meant we were getting
| some of the jokes she didn't even notice.
|
| "Theme" rhyming with "sixteen" in the Cranberries' song Zombie
| is another.
| thechao wrote:
| English orthography isn't really hacked together. Most of the
| "examples" I see people bandy about are because you're reading
| the wrong English: try Old English, instead. For example,
| knight: it was pronounced "k-ng-ee-h-tuh" (my IPA is too rusty
| to use). That's, like, precisely how it's spelled? What's gone
| _wrong_ is that our modern pronunciation is poor.
|
| Other languages have this even worse. Try comparing Egyptian
| Colloquial Arabic vs literary Arabic. I mean... these are
| different languages. Or, for instance, American Sign Language
| (ASL) vs. written English: the former is more like _Chinese_
| than English.
| WarOnPrivacy wrote:
| This short epilogue struck me.
|
| > This past Yom Kippur, my wife and I drove two hours to spend the
| afternoon at my aunt's house, with my cousins. As the night drew on,
| conversation roamed from television shows and books to politics
| and philosophy. The circle grew as we touched on increasingly
| sensitive and challenging topics, drawing us in.
|
| > We didn't agree, per se. We were engaging in debate as often as we
| were engaging in conversation. But we all love each other deeply,
| and the amount of care and restraint that went into how each
| person expressed their disagreement was palpable.
| willwade wrote:
| I'm intrigued. Is this not done just with a phonemizer?
|
|     from phonemizer.phonemize import phonemize
|
|     text = "hello world"
|     # Phonemize the same text for three English variants via espeak.
|     variations = [
|         phonemize(text, backend="espeak", language="en-us", strip=True),
|         phonemize(text, backend="espeak", language="en-gb", strip=True),
|         phonemize(text, backend="espeak", language="en-au", strip=True),
|     ]
|
|     for i, variation in enumerate(variations, 1):
|         print(f"Variation {i}: {variation}")
|
| I mean, espeak isn't the best, but a lot of folks in the ASR/speech
| world are still using this, right?
|
| (NB: If you are on iOS, check out the built-in one: Settings ->
| Accessibility -> Spoken Content -> Pronunciations. When adding one, it
| can phonemize your spoken message to IPA. If someone can tell me
| where the SDK/API they use for that is, I'd love to know.)
| rahimnathwani wrote:
| It seems like Beider-Morse outputs more variations of each
| word, which I guess means fewer false negatives, and using only
| equality tests?
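|
| A sketch of what "more variations plus only equality tests" could
| look like in practice: each name is indexed under every code it
| produces, and a query matches anything that shares at least one code.
| The toy_codes function below is a made-up stand-in that treats "c" as
| either "k" or "s"; it is NOT the Beider-Morse algorithm.

```python
from collections import defaultdict


def toy_codes(name):
    """Hypothetical multi-output encoder: 'c' may sound like 'k' or like 's'."""
    base = name.lower()
    return {base.replace("c", "k"), base.replace("c", "s")}


def build_index(names):
    """Map every code to the set of names that can produce it."""
    index = defaultdict(set)
    for name in names:
        for code in toy_codes(name):
            index[code].add(name)
    return index


def lookup(index, query):
    """Return all indexed names sharing at least one code with the query."""
    hits = set()
    for code in toy_codes(query):
        hits |= index.get(code, set())  # exact-match lookups only, no distance metric
    return hits


index = build_index(["Carl", "Karl", "Sara"])
print(sorted(lookup(index, "karl")))
```

| Emitting more codes per name raises recall at the cost of a larger
| index and more chances for unrelated names to collide.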
| Der_Einzige wrote:
| Highly related to my paper on why tokenization in LLMs is the
| devil: https://paperswithcode.com/paper/most-language-models-can-
| be...
| msgerbush wrote:
| I'm using a library, stable-ts, for a similar issue with short
| audio clips and it works well:
| https://github.com/jianfch/stable-ts/tree/main
|
| Not sure how it will perform on something long like an audiobook.
___________________________________________________________________
(page generated 2024-11-18 23:01 UTC)