Subj : Re: Determining Syllables
To   : comp.lang.c,comp.programming
From : Arthur J. O'Dwyer
Date : Thu Aug 04 2005 02:12 pm

[followups set to c.p, since this is not a C question]

On Thu, 4 Aug 2005, pemo wrote:
>
> Does anyone know of an algorithm that can accurately determine the number of
> syllables in a given English word - esp. if that word isn't already 'known'
> by such an algorithm?
>
> FYI, there are two approaches I'm currently considering.
>
> One is to reduce/convert (somehow) the word into its IPA equivalent, e.g.,
> 'parliament' becomes 'plm()nt' (apologies if that doesn't come out too
> right!) - parsing that is, I believe, straightforward. However, I can't
> find a program to convert English words into their IPA equivalents,

Not surprising, since that would be equivalent to counting the
syllables in English words, and that's not an algorithmic problem.
English doesn't follow strictly algorithmic rules, because it's not
strictly phonetic. I could come along tomorrow and make up a word, like
"Worcestershire," and make up a pronunciation for it, like "wooster,"
and no computer program in the world would be able to figure that out
from the spelling. Heck, most /humans/ don't know how every English
word is pronounced, and we've had many, many man-years to study the
problem!

[...]
> Another approach might be to modify a good hyphenating algorithm; as I'm
> led to believe that these usually insert a hyphen at a syllable boundary.
> However, how they do that (determine the point), and whether it's even true,
> I just don't know.

Yes, a good hyphenation algorithm can be /very/ good. The basic
rule of good hyphenation is to come up with sets of English words that
all have a hyphenation point in the same general context, and then
remember the context. For example, if you see a word ending in -ible,
you can hyphenate it there, unless it ends in c-ible or g-ible, in
which case you can't. You can generally hyphenate before -str, or
after hy-. And so on.

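
To make the "remember the context" idea concrete, here is a minimal
sketch in Python that hard-codes only the three example rules above.
The function names (break_points, hyphenate) are made up for
illustration, and this is a toy, not Liang's actual pattern scheme;
a real pattern set has thousands of entries and its own exceptions.

```python
import re

def break_points(word):
    """Return sorted indices where a hyphen may go, using three toy rules."""
    w = word.lower()
    points = set()
    # Rule 1: break before a final "-ible", unless it is "c-ible" or "g-ible".
    if w.endswith("ible") and len(w) > 4 and w[-5] not in "cg":
        points.add(len(w) - 4)
    # Rule 2: break before "str", but never at the very start of the word.
    for m in re.finditer("str", w):
        if m.start() > 0:
            points.add(m.start())
    # Rule 3: break after a leading "hy".
    if w.startswith("hy") and len(w) > 2:
        points.add(2)
    return sorted(points)

def hyphenate(word):
    """Insert a hyphen at every break point found by break_points()."""
    pieces, last = [], 0
    for p in break_points(word):
        pieces.append(word[last:p])
        last = p
    pieces.append(word[last:])
    return "-".join(pieces)

print(hyphenate("flexible"))   # rule 1 applies
print(hyphenate("forcible"))   # rule 1 blocked by the preceding "c"
print(hyphenate("abstract"))   # rule 2 applies
print(hyphenate("hybrid"))     # rule 3 applies
```

Each rule is just a context plus a position, which is why the rule
set can grow by adding entries rather than by rewriting the logic.
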
The basic research for hyphenation patterns in English has already
been done several times, e.g. by Frank Liang for TeX, but I don't know
of anywhere you could get patterns for syllable counting. Still, I'd
start by downloading the TeX hyphenation patterns, and using them to
find every single hyphenation point in your word. Then it would
probably be a good idea to discard any segments that don't contain any
vowels (but I'm sure there are exceptions, and not just "nth" and
"ssh").

> I've also had a look at the Flesch readability stuff - but it's probably not
> going to be accurate enough for what I need it for.

Really? One of the inputs to the Flesch readability formula /is/
the number of syllables in the text. So if you can find a program that
claims to accurately compute Flesch scores, go with it! (I doubt such
programs exist, though. A Google search turned up Flesh,
http://jack.gravco.com/flesh.html, but it thinks "birthday" has one
syllable, so I didn't bother investigating any further.)

Actually, given the application to Flesch readability computations,
I might be interested in the syllable-counting problem. If you get
anything working, would you let me know? And I'll post here if I find
anything clever --- but don't hold your breath.

-Arthur
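
A minimal sketch of the segment-counting step suggested above, again
in Python. It assumes the word has already been split at every
hyphenation point by some other tool (the list of pieces is a
hypothetical input, not the output of any particular program), and
it assumes "y" counts as a vowel. Hyphenation points need not cover
every syllable boundary, so the result may undercount. The second
function is the standard Flesch Reading Ease formula, included only
to show where the syllable total goes.

```python
VOWELS = set("aeiouy")  # assumption: treat "y" as a vowel

def count_syllable_segments(pieces):
    """Count the hyphenation segments that contain at least one vowel letter."""
    return sum(1 for piece in pieces if VOWELS & set(piece.lower()))

def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease score from word, sentence, and syllable totals."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# Segments with no vowel, such as "nth", are discarded from the count.
print(count_syllable_segments(["birth", "day"]))
print(count_syllable_segments(["nth"]))
print(flesch_reading_ease(100, 5, 150))
```

Since the syllable total is multiplied by 84.6 divided by the word
count, even modest miscounting shifts the score noticeably.
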