[HN Gopher] Accents in latent spaces: How AI hears accent streng...
___________________________________________________________________
Accents in latent spaces: How AI hears accent strength in English
Author : ilyausorov
Score : 164 points
Date : 2025-05-06 14:07 UTC (8 hours ago)
(HTM) web link (accent-strength.boldvoice.com)
(TXT) w3m dump (accent-strength.boldvoice.com)
| treetalker wrote:
| This is cool and one of the applications of LLMs that I'm
| actually looking forward to: accent training when acquiring a new
| language, particularly hearing what you would sound like without
| an accent!
|
| That said, I found the recording of Victor's speech after
| practicing with the recording of his own unaccented voice to be
| far less intelligible than his original recording.
|
| Looking forward to seeing the developments in this particular
| application.
| ilyausorov wrote:
| Fair point! When Victor tried to speed up to speak as fast as
| Coach Eliza, while it sounded somewhat less accented, a few
| parts of the phrase did get less intelligible. 10 minutes of
| practice is only a start after all.
|
| Interesting to note that we're also developing a separate
| measure of intelligibility that will give a separate sense of
| how intelligible versus accented something is.
| georgewsinger wrote:
| This is so cool. Real-time accent feedback is something language
| learners have never had throughout all of human history, until
| now.
|
| Along similar lines, it would be useful to map a speaker's vowels
| in vowel-space (and likewise for consonants?) to compare native
| to non-native speakers.
|
| I can't wait until something like this is available for Japanese.
| pjc50 wrote:
| > something language learners have never had throughout all of
| human history
|
| .. unless they had access to a native speaker and/or vocal
| coach? While an automated Henry Higgins is nifty, it's not
| something humans haven't been able to do themselves.
| anadalakra wrote:
| Native speakers are less helpful at this than you might
| think. Speech coaches are absolutely the way to go, but
| they're outside the price range for most people ($200+/hr for
| a good coach). BoldVoice gives coach-level feedback and
| instruction at a price point that everyone can access, on
| demand.
| ilyausorov wrote:
| That's a fascinating idea! Definitely something to try out for
| our team. We actively and continuously do all sorts of
| experiments with our machine learning models to be able to
| extract the most useful insights. We will definitely share if
| we find something useful here.
| mckirk wrote:
| Is it just me, or did the sound files get hugged-to-death?
| pjc50 wrote:
| What the vector-space data gets right, and what the human
| commentary tends not to, is the idea that accents are a complex
| statistical distribution. You should be careful about the concept
| of a "default" or "neutral" accent. Telecommunications has spent
| the 20th century flattening accents together, as has accent
| discrimination. There's always the tendency for people to say
| "_my_ accent is the neutral standard against which all others
| should be measured".
| lurk2 wrote:
| > There's always the tendency for people to say "my accent is
| the neutral standard against which all others should be
| measured".
|
| You can measure this by mutual intelligibility with other
| accent groupings.
| ilyausorov wrote:
| For sure, and I don't think we ever use the term default or
| neutral. The "American English accent of our expert accent
| coach Eliza" is just that -- it's one accent.
|
| As a learning platform that provides instruction to our users,
| we do need to set some kind of direction in our pedagogy, but
| we 100% recognize that there isn't just 1 American English
| accent, and there's lots of variance.
| fxtentacle wrote:
| What a great AI use-case! At first, I felt excited ...
|
| But then I read their privacy policy. They want permission to
| save all of my audio interactions for all eternity. It's so sad
| that I will never try out their (admittedly super cool) AI tech.
| anadalakra wrote:
| You can reach out and request your data to be deleted at any
| time.
| fxtentacle wrote:
| "if you wish to opt out of future collection of voice
| samples, you may do so by disabling voice-related features in
| the BoldVoice app. Please note that this may limit the
| functionality of certain services."
|
| Yeah, I can opt out. By not using any voice-related feature
| in their voice training app.
| anadalakra wrote:
| If you're still actively using the app, the voice will be
| retained and processed so that you can receive instant
| feedback, and also so that you receive additional
| personalized practice items and video lessons based on your
| speech needs. If you don't want the samples saved "in
| perpetuity", you can request them to be deleted once you
| decide that you're done with the application. Hope this
| helps!
| joshjhargreaves wrote:
| Damn, this is really cool.
| oscar120 wrote:
| thanks!
| vessenes wrote:
| This is super cool.
|
| A suggestion and some surprise: I'm surprised by your assertion
| that there's no clustering. I see the representation shows no
| clustering, and believe you that there is therefore no broad
| high-dimensional clustering. I also agree that the demo where
| Victor's voice moves closer to Eliza's sounds more native.
|
| But, how can it be that you can show directionality toward
| "native" without clustering? I would read this as a _problem_
| with my embedding, not a feature. Perhaps there are some smaller-
| dimensional sub-axes that do encode what sort of accent someone
| has?
|
| Suggestion for the BoldVoice team: if you'd like to go viral, I
| suggest you dig into American idiolects -- two that are hard not
| to talk about / opine on / retweet are AAVE and Gay male speech
| (not sure if there's a more formal name for this, it's what
| Wikipedia uses).
|
| I'm in a mixed race family, and we spent a lot of time playing
| with ChatGPT's AAVE abilities which have, I think sadly, been
| completely nerfed over the releases. Chat seems to have no sense
| of shame when it says speaking like one of my kids is harmful; I
| imagine the well intentioned OpenAI folks were sort of thinking
| the opposite when they cut it out. It seems to have a list of
| "okay" and "bad" idiolects baked in - for instance, it will give
| you a thick Irish accent, a Boston accent, a NY/Bronx accent, but
| no Asian/SE Asian accents.
|
| I like the idea of an idiolect-manager, something that could help
| me move my speech more or less toward a given idiolect. Similarly
| England is a rich minefield of idiolects, from scouse to highly
| posh.
|
| I'm guessing you guys are aimed at the call center market based
| on your demo, but there could be a lot more applications! Voice
| coaches in Hollywood (the good ones) charge hundreds of dollars
| per hour, so there's a valuable if small market out there for
| much of this. Thanks for the demo and write up. Very cool.
| BalinKing wrote:
| (Minor nitpick, but I think "dialect" is a more appropriate
| word than "idiolect" here--at least according to Wikipedia,
| "idiolect" refers to a _single_ person's way of speaking,
| whereas AAVE et al. are shared and are therefore considered
| dialects.)
| vessenes wrote:
| OK, good read for me here. Based on your feedback and some
| research, I think I should have used 'sociolect' for both in
| that I was less complaining about ChatGPT's unwillingness to
| use, say, finna, in a sentence, and more complaining about
| the vocalized accents. Anyway good catch, thanks!
| retrac wrote:
| Sociolect is the right term for a dialect used by a
| particular social group. A related idea is "register" when
| multiple related and mutually understandable standards
| exist, and are used in different contexts.
| pjc50 wrote:
| > It seems to have a list of "okay" and "bad" idiolects baked
| in
|
| We're back to "AI safety actually means brand safety": inept
| pushback against being made into an automated racism factory
| with their name on it.
| vessenes wrote:
| 100%
| adhsu01 wrote:
| Super cool work, congrats BoldVoice team! I've always thought
| that one of the non-obvious applications of voice
| cloning/matching is the ability to show a language learner what
| they would sound like with a more native accent.
| oscar120 wrote:
| this^
| ilyausorov wrote:
| This and more exciting features are coming to the BoldVoice app
| soon!
| asveikau wrote:
| Victor's problem isn't really the vowels or pacing. The final
| consonants are soft or not really audible. I am not hearing the
| /NG/ of "long" as the most marked example. It sounds closer to
| "law". In his "improved" recording he hasn't fixed this.
|
| I sometimes see content on social media encouraging people to
| sound more native or improve their accent. But IMO it's perfectly
| ok to have an accent, as long as the speech meets some baseline
| of intelligibility. (So Victor needs to work on "long" but not
| "days".) I've even come across people who are trying to mimic a
| native accent but lose intelligibility, where they'd sound better
| with their foreign accent. (An example I've seen is a native
| Spanish speaker trying to imitate the American accent's
| intervocalic T and D, and I don't understand them. A Spanish /t/
| or /d/ would be different from most English language accents, but
| be way more understandable.)
| anadalakra wrote:
| "If Victor wanted to move beyond this point, the sound-by-sound
| phonetic analysis available in the BoldVoice app would allow
| him to understand the patterns in pronunciation and stress that
| contribute to Eliza's accent and teach him how to apply them in
| his own speech."
|
| Indeed Victor would likely receive a personalized lesson and
| practice on the NG sound on the app.
| JoshTko wrote:
| Thank you for pinpointing my confusion/disconnect about the lack
| of improvement that I was sensing. There was an improvement in
| pacing and cadence, yes, but that was not the main challenge
| with Victor's accent. Visually I'd say Victor improved by at
| most 5% and not 50% as indicated by the visualization. In some
| regards it was even harder to understand than the original due
| to speed and cadence without improvement in core pronunciation.
| gxs wrote:
| Yeah, as long as it's intelligible an accent is perfectly fine
|
| It's also perfectly fine to want to sound like a native speaker
| - whether it be because they are self conscious, think it will
| benefit them in some way, or simply want to feel like they are
| speaking "correctly"
|
| Sorry to pick on you, it's just amazing to me how sensitive we
| are to "inclusivity" to the point where we almost discourage
| people wanting to fit in
| orbital-decay wrote:
| Intelligibility heavily depends on what you expect to hear,
| and that depends on your native language or even locality.
| Even a tiny amount of French accent in English makes it sound
| like gibberish to me (but not others, and I don't have this
| issue with other thick accents). I'm sure my native accent is
| also incompatible with someone else's ears. That's the reason
| people pay accent coaches.
| gxs wrote:
| Yes, it should go without saying that an accent is perfectly
| fine provided it's intelligible in whatever context you're
| in
| matsemann wrote:
| Being legible also means to cater to your audience. I work in
| an English-speaking company in a country where English isn't
| the native language, with loads of non-native speakers from
| around the world. Sometimes the native/best English speakers
| are the ones being misunderstood, because they use idioms or
| advanced words. None of us are bad at English, and I don't
| mean that I need to "dumb it down" (if anything, verbally I'm
| one of the worser ones), but I don't feel like I'm missing
| out on speaking simple with an accent.
| dhosek wrote:
| Generalizing from my own experience, it's easier for me to
| understand a non-native Spanish speaker than a native
| Spanish speaker and I would guess that the same applies
| with ESL speakers. One thing I found really fascinating is
| that even though I'd never studied French [1], I actually had
| an easier time understanding a conversation between my ex-
| wife and her aunt in French than when they spoke Spanish in
| which I was functional (my skill in the language has gone
| up a great deal since then so that I now read fluently, and
| speak and listen reasonably well, albeit less well than I
| would like).
|
| [1] Thanks to my kids studying French on Duolingo and my
| joining them, I can no longer say that I've never studied
| it.
| wbroo wrote:
| Very interesting! Have you tested for other factors like speaking
| speed, emotional tone, or microphone quality to see what else is
| (or isn't) influencing model perception?
| ilyausorov wrote:
| For sure we did! The training data we used for this was
| purposely highly varied to account for these various factors so
| they don't cause too much bias in the model. But there's also
| an error rate regardless of how good you make it. We keep
| improving!
| ccppurcell wrote:
| Oh pssh. There's no such thing as accent strength. There's only
| accent distance. Accent strength is just an artefact of distance
| from the accent of a socially dominant group.
| semiquaver wrote:
| What a silly nitpick. You're just using different words to say
| the same thing.
| ilyausorov wrote:
| Sure, that's fair. We apply labels that have a connotation of
| strength based on the distance, but the underlying calculation
| is indeed based on distance.
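
A minimal sketch of the distance-to-label idea described above, not
BoldVoice's actual method. It assumes each recording is already reduced
to an embedding vector and that strength is read off the cosine distance
to one reference speaker's embedding. The cut-off values, label names,
and toy vectors are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical cut-offs: (maximum distance, label), checked in order.
THRESHOLDS = [(0.05, "near-reference"), (0.20, "light"), (0.50, "moderate")]


def strength_label(embedding, reference):
    """Map the distance from the reference embedding to a strength label."""
    distance = cosine_distance(embedding, reference)
    for cutoff, label in THRESHOLDS:
        if distance <= cutoff:
            return label
    return "strong"


# Toy 3-D embeddings; real speech embeddings have far more dimensions.
reference = [1.0, 0.0, 0.0]
speaker = [0.6, 0.8, 0.0]
print(strength_label(speaker, reference))
```

The label is only a presentation layer: changing the reference vector
changes every label, which is the point both commenters are making.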
| dmurray wrote:
| The article defines accent strength in precisely this way, as
| the difference "relative to native speakers of English".
|
| That group has a vast range of accents, but it's believable
| that that range occupies an identifiable part of the multi-
| dimensional accent space, and has very little overlap with, for
| example, beginner ESL students from China.
|
| Even between native speakers, I bet you could come up with some
| measure of centrality and measure accent strength as a distance
| from that. And if language families exist upon a continuum -
| there must be some point on that continuum where you are no
| longer speaking English, but say Scots or Frisian or Nigerian
| Creole instead. Accents close to those points are objectively
| stronger.
|
| But there is a lot of freedom in how you measure centrality -
| if you weight by number of speakers, you might expect to get
| some mid-American or mid-Atlantic accent, but wind up with the
| dialect of semi-literate Hyderabad call centre workers.
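
A small sketch of why the choice of centrality matters, under invented
numbers. It assumes each accent group is summarised by a single 2-D
point and compares an unweighted centroid with one weighted by speaker
count. The group names, coordinates, and counts are hypothetical and
not taken from the article.

```python
import math


def centroid(points, weights=None):
    """Weighted mean of equal-length vectors (uniform weights if omitted)."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    dims = len(points[0])
    return [
        sum(w * p[i] for p, w in zip(points, weights)) / total
        for i in range(dims)
    ]


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical accent groups and made-up speaker counts.
accents = {"A": [0.0, 0.0], "B": [2.0, 0.0], "C": [0.0, 4.0]}
speakers = {"A": 1, "B": 1, "C": 8}

names = list(accents)
points = [accents[n] for n in names]
unweighted = centroid(points)
weighted = centroid(points, [speakers[n] for n in names])

for name in names:
    print(
        name,
        round(euclidean(accents[name], unweighted), 3),
        round(euclidean(accents[name], weighted), 3),
    )
```

In this toy data the most populous group C is the farthest from the
unweighted centre but the closest to the weighted one, so which accent
counts as "strong" depends entirely on the weighting.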
| ilyausorov wrote:
| Indeed, although the inference output of the model is based
| on the ratings input that we trained it on. And that rating
| input was done by American English native speakers, so this
| iteration of the model is centered towards those accents more
| than e.g. UK or Australian or other accents of English from
| outside the US.
| joshuaissac wrote:
| > relative to native speakers of English
|
| > Even between native speakers, I bet you could come up with
| some measure of centrality and measure accent strength as a
| distance from that
|
| Is that what BoldVoice is actually doing? At least from what the
| article is saying, it is measuring the strength of the user's
| American English accent (maybe GenAm?), and there is no
| discussion of any user choice of native accent to target.
| dmurray wrote:
| > Is that what BoldVoice is actually doing?
|
| No, I don't think it is doing that, I'm just taking issue
| with ccppurcell, who seems to believe that any definition
| of accent strength is chauvinistic.
| IshKebab wrote:
| > Accent strength is just ... distance from the accent of a
| socially dominant group.
|
| Yes, that is a good definition of accent strength.
|
| > There's no such thing as accent strength.
|
| ??! You literally just defined it.
| Goofy_Coyote wrote:
| Glad to see BoldVoice here.
|
| I've been using it for a few months, and I can confirm it's
| working.
| ilyausorov wrote:
| Happy to see a happy BoldVoice user. Please don't hesitate to
| reach out to our team with feedback or thoughts on how we can
| continue to improve your learning journey. Helping you succeed
| is our #1 priority!
| sonny3690 wrote:
| This is some insanely cool work. It's going to help so many
| people.
| ilyausorov wrote:
| Thanks, we're doing our best!
| childintime wrote:
| I didn't find International English; it would have been interesting.
|
| Also, the USA writing convention falls short, like "who put the
| dot inside the string."
|
| crazy. Rationals "put the dot after the string". No spelling
| corrector should change that.
| Unearned5161 wrote:
| I'm always very entertained when I'm talking with someone and
| pick up on some very slight deviation from the "norm" in their
| accent. I think it shows two things: that it's near impossible to
| totally wipe that fingerprint of a past tongue, and that our ears
| are incredibly adept pieces of tooling
| SamBam wrote:
| Like others recently, I've been extremely impressed by LLMs'
| ability to play GeoGuessr, or, more generally, to geo-locate
| random snapshots that you give them, with what seem (to me) to be
| almost no context clues. (I gave ChatGPT loads of holiday
| snapshots, screenshotted to remove metadata, and it did
| amazingly.)
|
| I assume that, with enough training, we could get similarly
| accurate guesses of a person's linguistic history from their
| voice data.
|
| Obviously it would be extremely tricky for lots of people. For
| instance, many people think I sound English or Irish. I grew up
| in France to American parents who both went to Oxford and spent
| 15 years in England. I wouldn't be surprised, though, if a well-
| trained model could do much better on my accent than "you sound
| kinda Irish."
| chris_va wrote:
| I bet you are right.
|
| I had a forensic linguistics TA during college who was able to
| identify the island in southeast Asia one of the students grew
| up on, and where they moved to in the UK as a teenager before
| coming to the US (if I am remembering this story right).
|
| From what I gather, there are a lot of clues in how we speak
| that most brains edit out when parsing language.
| dhosek wrote:
| Or the classic scene in Mrs Doubtfire where Pierce Brosnan
| attempts to locate the origin of Robin Williams's fake
| English accent.
| ilyausorov wrote:
| We actually did something like this for non-native English
| speakers a few months back. Check out https://accentoracle.com
| (most mind-blowing if you're a non native English speaker)
| nmeofthestate wrote:
| I'm 42% Arabic apparently! And 20% Russian. Got an 81%
| American accent level. I guess it is tuned to non-native-
| English speaker accents.
| ilyausorov wrote:
| Was that right? Or what is the correct native language it
| should have predicted? Note the %s in the accent breakdown
| section are prediction probabilities
| SamBam wrote:
| Well, it says I'm Finnish. But now I have a new game, where I
| put on my best Italian or Russian or Greek or Australian
| accent and try to see how close I am.
|
| I'm terrible, according to the program. My Italian is Russian
| or Hungarian or Swedish, my Australian is English.
|
| New party game unlocked.
| ilyausorov wrote:
| Amazing! If you can make it go viral again too, I will love
| you!
| AJoxo wrote:
| I've been building that exact game
|
| accentgame.xyz
| owenthejumper wrote:
| Wow that was actually accurate
| nmstoker wrote:
| Yes, although I believe this is a speaker embedding model here,
| so not LLM related.
|
| This kind of speech clustering has been possible for years -
| the exciting point with their model here is how it's highly
| focused on accents alone. Here's a video of mine from 2020 that
| demonstrated this kind of voice clustering in the Mozilla TTS
| repo (sadly the code got broken + dropped after a refactoring).
| Bokeh made it possible to directly click on points in a cluster
| and have them play
|
| https://youtu.be/KW3oO7JVa7Q?si=1w-4pU5488WxYL3l
|
| note: take care when listening as the audio level varies a bit
| (sorry!)
| ilyausorov wrote:
| Correct, not LLM
| dhosek wrote:
| I've seen some online quizzes based on regional variations
| in accent (does root rhyme with foot or boot?) and vocabulary
| (what do you call a sweet fizzy beverage) that did a great job
| of locating where my Facebook friends back in the day grew up.
| It got me a bit off largely because while I grew up in Chicago,
| I had spent most of my adult life in Los Angeles so I tend to
| prefer "freeway" to "expressway" (changing that answer moved me
| from Rockford to Chicago).
| dgan wrote:
| Wow, I've always wanted to know an objective measure of my Russian
| accent in French. I've been living here for a long, long time and
| some people tell me it's impossible to recognise where I come
| from. I'd like to put that to the test.
| oezi wrote:
| Did you publish that accent dataset somewhere?
| ilyausorov wrote:
| No, the dataset isn't published beyond what you see on the 2D
| visualization. Sorry.
| AJoxo wrote:
| you may be interested in Mozilla's Common Voice dataset
| ccheever wrote:
| This is really cool.
|
| Just had an employee at our company start expensing BoldVoice.
| Being able to be understood more easily is a big deal for global
| remote employees.
|
| (Note - I am a small investor in BoldVoice)
| runelohrhauge wrote:
| This is fascinating work. Love seeing how you're combining
| machine learning with practical coaching to support real accent
| improvement. The concept of an "accent fingerprint" is especially
| clever, and the visualization of progress in latent space really
| brings it to life. Excited to see where you take this next!
| WhitneyLand wrote:
| The hear my own voice without an accent thing is a really cool
| party trick.
|
| I'd consider making this feature available free with super low
| friction, maybe no signup required, to get some viral traction.
| ilyausorov wrote:
| What if it was already available? Try it out at
| https://accentfilter.com!
| PaulDavisThe1st wrote:
| Hmmm. Initially impressive but upon retries and reflection
| ... not that great. It doesn't even maintain timing ...
| unless that's part of the transform.
| ilyausorov wrote:
| Indeed yeah that's one of the key weaknesses of the
| approach that we're using. It overrides the speaker's
| cadence and accent while keeping their voice profile /
| timbre in place. Different techniques may not do this but
| also may not copy over the accent to the resulting clip as
| effectively. So far we're using this to support pedagogical
| (and lead-gen) use cases where we think it works
| well enough.
| PaulDavisThe1st wrote:
| Let's put it a different way. I grew up in the UK till
| 24. I've lived in the USA for 36 years. The UK/US accent
| conversions dramatically altered my voice/accent; the AU
| one left it mostly unchanged.
|
| This is offensive :))
| rayrah wrote:
| Cool stuff
| sardines wrote:
| How's the "accent conversion model" work? Is it all embedding
| based?
|
| If so--and if you want to transfer-learn new downstream models
| from embeddings--then seems to me you are onto a very effective
| way of doing data augmentation. It's expensive to do data
| augmentation on raw waveforms since you always need to run the
| STFT again; but if you've pre-computed & cached embeddings and
| can do data augmentation there, it would be super fast.
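
A rough sketch of augmentation done directly on cached embeddings, so
no waveform or STFT step is repeated. The comment does not name a
technique; interpolation between two cached vectors plus small Gaussian
jitter is an assumption made here for illustration. The cache keys,
vector values, and `sigma` are hypothetical.

```python
import random


def jitter(embedding, sigma, rng):
    """Add independent Gaussian noise to every dimension."""
    return [x + rng.gauss(0.0, sigma) for x in embedding]


def interpolate(a, b, t):
    """Linear blend of two vectors: t=0 gives a, t=1 gives b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def augment(cache, n_new, sigma=0.01, seed=0):
    """Create n_new synthetic embeddings from pairs of cached ones."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    keys = sorted(cache)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(keys, 2)
        mixed = interpolate(cache[a], cache[b], rng.random())
        out.append(jitter(mixed, sigma, rng))
    return out


# Hypothetical cache of pre-computed 3-D embeddings keyed by clip name.
cache = {
    "clip_a": [0.1, 0.9, 0.0],
    "clip_b": [0.4, 0.2, 0.7],
    "clip_c": [0.8, 0.1, 0.3],
}
augmented = augment(cache, n_new=5)
print(len(augmented), "synthetic embeddings")
```

Blending two clips also blends whatever labels they carry, so a
downstream model would need matching soft labels or same-class pairs.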
| avalys wrote:
| I (an American from suburban Connecticut) was recently in London
| for an event and someone misheard me. Another Londoner said "It's
| because of your accent!", which of course was nonsense to me.
| What accent?
|
| I'd be really interested to play with this tool and see what it
| thinks of my accent. Can it tell where I grew up? Can it tell
| what my parents' native languages are (not English!)
|
| A free tool like this would be great marketing for this company.
| ilyausorov wrote:
| We did build two free tools, which are geared towards non-
| native English speakers. You can find them at
| https://accentoracle.com and https://accentfilter.com. They're
| less effective for English native speakers, but could still be
| fun.
| dhosek wrote:
| What I find interesting is that it seems that folks from the UK
| tend to focus on consonants in distinguishing accents while in
| the US we distinguish more on vowels.
___________________________________________________________________
(page generated 2025-05-06 23:00 UTC)