[HN Gopher] Accents in latent spaces: How AI hears accent streng...
       ___________________________________________________________________
        
       Accents in latent spaces: How AI hears accent strength in English
        
       Author : ilyausorov
       Score  : 164 points
       Date   : 2025-05-06 14:07 UTC (8 hours ago)
        
 (HTM) web link (accent-strength.boldvoice.com)
 (TXT) w3m dump (accent-strength.boldvoice.com)
        
       | treetalker wrote:
       | This is cool and one of the applications of LLMs that I'm
       | actually looking forward to: accent training when acquiring a new
       | language, particularly hearing what you would sound like without
       | an accent!
       | 
       | That said, I found the recording of Victor's speech after
       | practicing with the recording of his own unaccented voice to be
       | far less intelligible than his original recording.
       | 
       | Looking forward to seeing the developments in this particular
       | application.
        
         | ilyausorov wrote:
         | Fair point! When Victor tried to speed up to speak as fast as
         | Coach Eliza, while it sounded somewhat less accented, a few
         | parts of the phrase did get less intelligible. 10 minutes of
         | practice is only a start after all.
         | 
         | Interesting to note that we're also developing a separate
         | measure of intelligibility that will give a separate sense of
         | how intelligible versus accented something is.
        
       | georgewsinger wrote:
       | This is so cool. Real-time accent feedback is something language
       | learners have never had throughout all of human history, until
       | now.
       | 
       | Along similar lines, it would be useful to map a speaker's vowels
       | in vowel-space (and likewise for consonants?) to compare native
       | to non-native speakers.
       | 
       | I can't wait until something like this is available for Japanese.
        
         | pjc50 wrote:
         | > something language learners have never had throughout all of
         | human history
         | 
         | .. unless they had access to a native speaker and/or vocal
         | coach? While an automated Henry Higgins is nifty, it's not
         | something humans haven't been able to do themselves.
        
           | anadalakra wrote:
           | Native speakers are less helpful at this than you might
           | think. Speech coaches are absolutely the way to go, but
           | they're outside the price range for most people ($200+/hr for
           | a good coach). BoldVoice gives coach-level feedback and
           | instruction at a price point that everyone can access, on
           | demand.
        
         | ilyausorov wrote:
         | That's a fascinating idea! Definitely something to try out for
         | our team. We actively and continuously do all sorts of
         | experiments with our machine learning models to be able to
         | extract the most useful insights. We will definitely share if
         | we find something useful here.
        
       | mckirk wrote:
       | Is it just me, or did the sound files get hugged-to-death?
        
       | pjc50 wrote:
       | What the vector-space data gets right, and what the human
       | commentary tends not to, is the idea that accents are a complex
       | statistical distribution. You should be careful about the concept
       | of a "default" or "neutral" accent. Telecommunications has spent
       | the 20th century flattening accents together, as has accent
       | discrimination. There's always the tendency for people to say
       | "_my_ accent is the neutral standard against which all others
       | should be measured".
        
         | lurk2 wrote:
         | > There's always the tendency for people to say "my accent is
         | the neutral standard against which all others should be
         | measured".
         | 
         | You can measure this by mutual intelligibility with other
         | accent groupings.
        
         | ilyausorov wrote:
         | For sure, and I don't think we ever use the term default or
         | neutral. The "American English accent of our expert accent
         | coach Eliza" is just that -- it's one accent.
         | 
         | As a learning platform that provides instruction to our users,
         | we do need to set some kind of direction in our pedagogy, but
         | we 100% recognize that there isn't just 1 American English
         | accent, and there's lots of variance.
        
       | fxtentacle wrote:
       | What a great AI use-case! At first, I felt excited ...
       | 
       | But then I read their privacy policy. They want permission to
       | save all of my audio interactions for all eternity. It's so sad
       | that I will never try out their (admittedly super cool) AI tech.
        
         | anadalakra wrote:
         | You can reach out and request your data to be deleted at any
         | time.
        
           | fxtentacle wrote:
           | "if you wish to opt out of future collection of voice
           | samples, you may do so by disabling voice-related features in
           | the BoldVoice app. Please note that this may limit the
           | functionality of certain services."
           | 
           | Yeah, I can opt out. By not using any voice-related feature
           | in their voice training app.
        
             | anadalakra wrote:
             | If you're still actively using the app, the voice will be
             | retained and processed so that you can receive instant
             | feedback, and also so that you receive additional
             | personalized practice items and video lessons based on your
             | speech needs. If you don't want the samples saved "in
             | perpetuity", you can request them to be deleted once you
             | decide that you're done with the application. Hope this
             | helps!
        
       | joshjhargreaves wrote:
       | Damn, this is really cool.
        
         | oscar120 wrote:
         | thanks!
        
       | vessenes wrote:
       | This is super cool.
       | 
       | A suggestion and some surprise: I'm surprised by your assertion
       | that there's no clustering. I see the representation shows no
       | clustering, and believe you that there is therefore no broad
       | high-dimensional clustering. I also agree that the demo where
       | Victor's voice moves closer to Eliza's sounds more native.
       | 
       | But, how can it be that you can show directionality toward
       | "native" without clustering? I would read this as a _problem_
       | with my embedding, not a feature. Perhaps there are some smaller-
       | dimensional sub-axes that do encode what sort of accent someone
       | has?
       | 
       | Suggestion for the BoldVoice team: if you'd like to go viral, I
       | suggest you dig into American idiolects -- two that are hard not
       | to talk about / opine on / retweet are AAVE and Gay male speech
       | (not sure if there's a more formal name for this, it's what
       | Wikipedia uses).
       | 
       | I'm in a mixed race family, and we spent a lot of time playing
       | with ChatGPT's AAVE abilities which have, I think sadly, been
       | completely nerfed over the releases. Chat seems to have no sense
       | of shame when it says speaking like one of my kids is harmful; I
       | imagine the well intentioned OpenAI folks were sort of thinking
       | the opposite when they cut it out. It seems to have a list of
       | "okay" and "bad" idiolects baked in - for instance, it will give
       | you a thick Irish accent, a Boston accent, a NY/Bronx accent, but
       | no Asian/SE Asian accents.
       | 
       | I like the idea of an idiolect-manager, something that could help
       | me move my speech more or less toward a given idiolect. Similarly
       | England is a rich minefield of idiolects, from scouse to highly
       | posh.
       | 
       | I'm guessing you guys are aimed at the call center market based
       | on your demo, but there could be a lot more applications! Voice
       | coaches in Hollywood (the good ones) charge hundreds of dollars
       | per hour, so there's a valuable if small market out there for
       | much of this. Thanks for the demo and write up. Very cool.
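
One way to reconcile a usable direction toward "native" with the absence
of clusters: if accent strength varies continuously along some axis of the
embedding, the points form a single smooth cloud, yet that axis can still
be recovered from rated examples. The sketch below uses synthetic data
only and assumes nothing about BoldVoice's actual model. The dimension
count, `true_axis`, the noise levels, and the gradient-descent linear
probe are all illustrative choices.

```python
import random

random.seed(0)
DIM = 8    # hypothetical embedding size
N = 500    # number of synthetic "speakers"

# Hypothetical accent axis (unit length): strength varies along it.
true_axis = [1.0] + [0.0] * (DIM - 1)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Every coordinate is Gaussian noise, so the cloud is one blob with no
# clusters. The rating is a noisy linear readout along the axis.
data = []
for _ in range(N):
    x = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    y = dot(true_axis, x) + random.gauss(0.0, 0.1)
    data.append((x, y))

# Fit a linear probe w by plain gradient descent on squared error.
w = [0.0] * DIM
lr = 0.1
for _ in range(100):
    grad = [0.0] * DIM
    for x, y in data:
        err = dot(w, x) - y
        for i in range(DIM):
            grad[i] += err * x[i]
    w = [wi - lr * g / N for wi, g in zip(w, grad)]

# Cosine similarity between the fitted direction and the true axis.
cosine_to_true = dot(w, true_axis) / (dot(w, w) ** 0.5)
print(f"cosine between fitted and true axis: {cosine_to_true:.3f}")
```

A 2D projection of such a cloud would show no groupings, while moving a
point along `w` still changes its predicted strength.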
        
         | BalinKing wrote:
         | (Minor nitpick, but I think "dialect" is a more appropriate
         | word than "idiolect" here--at least according to Wikipedia,
         | "idiolect" refers to a _single_ person's way of speaking,
         | whereas AAVE et al. are shared and are therefore considered
         | dialects.)
        
           | vessenes wrote:
           | OK, good read for me here. Based on your feedback and some
           | research, I think I should have used 'sociolect' for both in
           | that I was less complaining about ChatGPT's unwillingness to
           | use, say, finna, in a sentence, and more complaining about
           | the vocalized accents. Anyway good catch, thanks!
        
             | retrac wrote:
             | Sociolect is the right term for a dialect used by a
             | particular social group. A related idea is "register" when
             | multiple related and mutually understandable standards
             | exist, and are used in different contexts.
        
         | pjc50 wrote:
         | > It seems to have a list of "okay" and "bad" idiolects baked
         | in
         | 
         | We're back to "AI safety actually means brand safety": inept
         | pushback against being made into an automated racism factory
         | with their name on it.
        
           | vessenes wrote:
           | 100%
        
       | adhsu01 wrote:
       | Super cool work, congrats BoldVoice team! I've always thought
       | that one of the non-obvious applications of voice
       | cloning/matching is the ability to show a language learner what
       | they would sound like with a more native accent.
        
         | oscar120 wrote:
         | this^
        
         | ilyausorov wrote:
         | This and more exciting features are coming to the BoldVoice app
         | soon!
        
       | asveikau wrote:
       | Victor's problem isn't really the vowels or pacing. The final
       | consonants are soft or not really audible. I am not hearing the
       | /NG/ of "long" as the most marked example. It sounds closer to
       | "law". In his "improved" recording he hasn't fixed this.
       | 
       | I sometimes see content on social media encouraging people to
       | sound more native or improve their accent. But IMO it's perfectly
       | ok to have an accent, as long as the speech meets some baseline
       | of intelligibility. (So Victor needs to work on "long" but not
       | "days".) I've even come across people who are trying to mimic a
       | native accent but lose intelligibility, where they'd sound better
       | with their foreign accent. (An example I've seen is a native
       | Spanish speaker trying to imitate the American accent's
       | intervocalic T and D, and I don't understand them. A Spanish /t/
       | or /d/ would be different from most English language accents, but
       | be way more understandable.)
        
         | anadalakra wrote:
         | "If Victor wanted to move beyond this point, the sound-by-sound
         | phonetic analysis available in the BoldVoice app would allow
         | him to understand the patterns in pronunciation and stress that
         | contribute to Eliza's accent and teach him how to apply them in
         | his own speech."
         | 
         | Indeed Victor would likely receive a personalized lesson and
         | practice on the NG sound on the app.
        
         | JoshTko wrote:
         | Thank you for pinpointing my confusion/disconnect about the lack
         | of improvement that I was sensing. There was an improvement in
         | pacing and cadence, yes, but that was not the main challenge
         | with Victor's accent. Visually I'd say Victor improved by at
         | most 5% and not 50% as indicated by the visualization. In some
         | regards it was even harder to understand than the original due
         | to speed and cadence without improvement in core pronunciation.
        
         | gxs wrote:
         | Yeah, as long as it's intelligible an accent is perfectly fine
         | 
         | It's also perfectly fine to want to sound like a native speaker
         | - whether it be because they are self conscious, think it will
         | benefit them in some way, or simply want to feel like they are
         | speaking "correctly"
         | 
         | Sorry to pick on you, it's just amazing to me how sensitive we
         | are to "inclusivity" to the point where we almost discourage
         | people wanting to fit in
        
           | orbital-decay wrote:
           | Intelligibility heavily depends on what you expect to hear,
           | and that depends on your native language or even locality.
           | Even a tiny amount of French accent in English makes it sound
           | like gibberish to me (but not others, and I don't have this
           | issue with other thick accents). I'm sure my native accent is
           | also incompatible with someone else's ears. That's the reason
           | people pay accent coaches.
        
             | gxs wrote:
             | Yes, it should go without saying that an accent is
             | perfectly fine provided it's intelligible in whatever
             | context you're in
        
           | matsemann wrote:
           | Being legible also means to cater to your audience. I work in
           | an English-speaking company in a country where English isn't
           | the native language, with loads of non-native speakers from
           | around the world. Sometimes the native/best English speakers
           | are the ones being misunderstood, because they use idioms or
           | advanced words. None of us are bad at English, and I don't
           | mean that I need to "dumb it down" (if anything, verbally I'm
           | one of the worse ones), but I don't feel like I'm missing
           | out by speaking simply with an accent.
        
             | dhosek wrote:
             | Generalizing from my own experience, it's easier for me to
             | understand a non-native Spanish speaker than a native
             | Spanish speaker and I would guess that the same applies
             | with ESL speakers. One thing I found really fascinating is
             | that even though I'd never studied French[1], I actually had
             | an easier time understanding a conversation between my ex-
             | wife and her aunt in French than when they spoke Spanish in
             | which I was functional (my skill in the language has gone
             | up a great deal since then so that I now read fluently, and
             | speak and listen reasonably well, albeit less well than I
             | would like).
             | 
             | 1. Thanks to my kids studying French on Duolingo and my
             | joining them, I can no longer say that I've never studied
             | it.
        
       | wbroo wrote:
       | Very interesting! Have you tested for other factors like speaking
       | speed, emotional tone, or microphone quality to see what else is
       | (or isn't) influencing model perception?
        
         | ilyausorov wrote:
         | For sure we did! The training data we used for this was
         | purposely highly varied to account for these various factors so
         | they don't cause too much bias in the model. But there's also
         | an error rate regardless of how good you make it. We keep
         | improving!
        
       | ccppurcell wrote:
       | Oh pssh. There's no such thing as accent strength. There's only
       | accent distance. Accent strength is just an artefact of distance
       | from the accent of a socially dominant group.
        
         | semiquaver wrote:
         | What a silly nitpick. You're just using different words to say
         | the same thing.
        
         | ilyausorov wrote:
         | Sure, that's fair. We apply labels that have a connotation of
         | strength based on the distance, but the underlying calculation
         | is indeed based on distance.
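
The relationship described above, a distance underneath and a strength
label on top, can be sketched as follows. This is an illustration only:
the use of cosine distance, the cut-off values in `LABELS`, and the label
names are assumptions, not the product's actual calculation, which
according to the thread is trained on human ratings.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical cut-offs; real ones would be calibrated against ratings.
LABELS = [(0.1, "light"), (0.3, "moderate"), (math.inf, "strong")]

def strength_label(embedding, reference):
    """Map distance from a reference accent embedding to a label."""
    d = cosine_distance(embedding, reference)
    for upper, label in LABELS:
        if d < upper:
            return label

reference = [1.0, 0.0]  # stand-in for the reference speaker's embedding
print(strength_label([1.0, 0.6], reference))
```

The label is only a name for a distance bucket, which is why "strength"
and "distance" describe the same quantity here.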
        
         | dmurray wrote:
         | The article defines accent strength in precisely this way, as
         | the difference "relative to native speakers of English".
         | 
         | That group has a vast range of accents, but it's believable
         | that that range occupies an identifiable part of the multi-
         | dimensional accent space, and has very little overlap with, for
         | example, beginner ESL students from China.
         | 
         | Even between native speakers, I bet you could come up with some
         | measure of centrality and measure accent strength as a distance
         | from that. And if language families exist upon a continuum -
         | there must be some point on that continuum where you are no
         | longer speaking English, but say Scots or Frisian or Nigerian
         | Creole instead. Accents close to those points are objectively
         | stronger.
         | 
         | But there is a lot of freedom in how you measure centrality -
         | if you weight by number of speakers, you might expect to get
         | some mid-American or mid-Atlantic accent, but wind up with the
         | dialect of semi-literate Hyderabad call centre workers.
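
The point about freedom in measuring centrality can be made concrete
with a toy example. The 2D "accent embeddings" for groups A, B, and C and
the speaker counts below are made up for illustration. The sketch only
shows that weighting the centroid by number of speakers can move the
center, and therefore change which accents count as far from it.

```python
def centroid(points, weights):
    """Weighted mean of a list of equal-length vectors."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[i] for p, w in zip(points, weights)) / total
            for i in range(dim)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical group embeddings and speaker counts (not real data).
accents = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [4.0, 3.0]}
speakers = {"A": 10, "B": 10, "C": 80}

names = list(accents)
points = [accents[n] for n in names]
unweighted = centroid(points, [1] * len(names))
weighted = centroid(points, [speakers[n] for n in names])

for n in names:
    print(n,
          round(distance(accents[n], unweighted), 2),
          round(distance(accents[n], weighted), 2))
```

With these numbers, group C sits far from the unweighted center but
close to the speaker-weighted one, so its measured "strength" depends
entirely on the weighting chosen.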
        
           | ilyausorov wrote:
           | Indeed, although the inference output of the model is based
           | on the ratings input that we trained it on. And that rating
           | input was done by American English native speakers, so this
           | iteration of the model is centered towards those accents more
           | than e.g. UK or Australian or other accents of English from
           | outside the US.
        
           | joshuaissac wrote:
           | > relative to native speakers of English
           | 
           | > Even between native speakers, I bet you could come up with
           | some measure of centrality and measure accent strength as a
           | distance from that
           | 
           | Is that what BoldVoice is actually doing? At least from what
           | the article is saying, it is measuring the strength of the
           | user's American English accent (maybe GenAm?), and there is
           | no discussion of any user choice of native accent to target.
        
             | dmurray wrote:
             | > Is that what BoldVoice is actually doing?
             | 
             | No, I don't think it is doing that, I'm just taking issue
             | with ccppurcell, who seems to believe that any definition
             | of accent strength is chauvinistic.
        
         | IshKebab wrote:
         | > Accent strength is just ... distance from the accent of a
         | socially dominant group.
         | 
         | Yes, that is a good definition of accent strength.
         | 
         | > There's no such thing as accent strength.
         | 
         | ??! You literally just defined it.
        
       | Goofy_Coyote wrote:
       | Glad to see BoldVoice here.
       | 
       | I've been using it for a few months, and I can confirm it's
       | working.
        
         | ilyausorov wrote:
         | Happy to see a happy BoldVoice user. Please don't hesitate to
         | reach out to our team with feedback or thoughts on how we can
         | continue to improve your learning journey. Helping you succeed
         | is our #1 priority!
        
       | sonny3690 wrote:
       | This is some insanely cool work. It's going to help so many
       | people.
        
         | ilyausorov wrote:
         | Thanks, we're doing our best!
        
       | childintime wrote:
       | I didn't find international english, would have been interesting.
       | 
       | Also, the USA writing convention falls short, like "who put the
       | dot inside the string."
       | 
       | crazy. Rationals "put the dot after the string". No spelling
       | corrector should change that.
        
       | Unearned5161 wrote:
       | I'm always very entertained when I'm talking with someone and
       | pick up on some very slight deviation from the "norm" in their
       | accent. I think it shows two things: that it's near impossible to
       | totally wipe that fingerprint of a past tongue, and that our ears
       | are incredibly adept pieces of tooling
        
       | SamBam wrote:
       | Like others recently, I've been extremely impressed by LLMs'
       | ability to play GeoGuessr, or, more generally, to geo-locate
       | random snapshots that you give them, with what seem (to me) to be
       | almost no context clues. (I gave ChatGPT loads of holiday
       | snapshots, screenshotted to remove metadata, and it did
       | amazingly.)
       | 
       | I assume that, with enough training, we could get similarly
       | accurate guesses of a person's linguistic history from their
       | voice data.
       | 
       | Obviously it would be extremely tricky for lots of people. For
       | instance, many people think I sound English or Irish. I grew up
       | in France to American parents who both went to Oxford and spent
       | 15 years in England. I wouldn't be surprised, though, if a well-
       | trained model could do much better on my accent than "you sound
       | kinda Irish."
        
         | chris_va wrote:
         | I bet you are right.
         | 
         | I had a forensic linguistics TA during college who was able to
         | identify the island in southeast Asia one of the students grew
         | up on, and where they moved to in the UK as a teenager before
         | coming to the US (if I am remembering this story right).
         | 
         | From what I gather, there are a lot of clues in how we speak
         | that most brains edit out when parsing language.
        
           | dhosek wrote:
           | Or the classic scene in Mrs Doubtfire where Pierce Brosnan
           | attempts to locate the origin of Robin Williams's fake
           | English accent.
        
         | ilyausorov wrote:
         | We actually did something like this for non-native English
         | speakers a few months back. Check out https://accentoracle.com
         | (most mind-blowing if you're a non native English speaker)
        
           | nmeofthestate wrote:
           | I'm 42% Arabic apparently! And 20% Russian. Got an 81%
           | American accent level. I guess it is tuned to non-native-
           | English speaker accents.
        
             | ilyausorov wrote:
             | Was that right? Or what is the correct native language it
             | should have predicted? Note the %s in the accent breakdown
             | section are prediction probabilities
        
           | SamBam wrote:
           | Well, it says I'm Finnish. But now I have a new game, where I
           | put on my best Italian or Russian or Greek or Australian
           | accent and try to see how close I am.
           | 
           | I'm terrible, according to the program. My Italian is Russian
           | or Hungarian or Swedish, my Australian is English.
           | 
           | New party game unlocked.
        
             | ilyausorov wrote:
             | Amazing! If you can make it go viral again too, I will love
             | you!
        
             | AJoxo wrote:
             | I've been building that exact game
             | 
             | accentgame.xyz
        
           | owenthejumper wrote:
           | Wow that was actually accurate
        
         | nmstoker wrote:
         | Yes, although I believe this is a speaker embedding model here,
         | so not LLM related.
         | 
         | This kind of speech clustering has been possible for years -
         | the exciting point with their model here is how it's highly
         | focused on accents alone. Here's a video of mine from 2020 that
         | demonstrated this kind of voice clustering in the Mozilla TTS
         | repo (sadly the code got broken + dropped after a refactoring).
         | Bokeh made it possible to directly click on points in a cluster
         | and have them play
         | 
         | https://youtu.be/KW3oO7JVa7Q?si=1w-4pU5488WxYL3l
         | 
         | note: take care when listening as the audio level varies a bit
         | (sorry!)
        
           | ilyausorov wrote:
           | Correct, not LLM
        
         | dhosek wrote:
         | I've seen some online quizzes based on regional variations
         | in accent (does root rhyme with foot or boot?) and vocabulary
         | (what do you call a sweet fizzy beverage) that did a great job
         | of locating where my Facebook friends back in the day grew up.
         | It got me a bit off largely because while I grew up in Chicago,
         | I had spent most of my adult life in Los Angeles so I tend to
         | prefer "freeway" to "expressway" (changing that answer moved me
         | from Rockford to Chicago).
        
       | dgan wrote:
       | wow always wanted to know an objective measure of my Russian
       | accent in French. I've been living here for a long, long time and
       | some people tell me it's impossible to recognise where I come
       | from. I'd like to put that to the test
        
       | oezi wrote:
       | Did you publish that accent dataset somewhere?
        
         | ilyausorov wrote:
         | No, the dataset isn't published beyond what you see on the 2D
         | visualization. Sorry.
        
         | AJoxo wrote:
         | you may be interested in Mozilla's Common Voice dataset
        
       | ccheever wrote:
       | This is really cool.
       | 
       | Just had an employee at our company start expensing BoldVoice.
       | Being able to be understood more easily is a big deal for global
       | remote employees.
       | 
       | (Note - I am a small investor in BoldVoice)
        
       | runelohrhauge wrote:
       | This is fascinating work. Love seeing how you're combining
       | machine learning with practical coaching to support real accent
       | improvement. The concept of an "accent fingerprint" is especially
       | clever, and the visualization of progress in latent space really
       | brings it to life. Excited to see where you take this next!
        
       | WhitneyLand wrote:
       | The hear my own voice without an accent thing is a really cool
       | party trick.
       | 
       | I'd consider making this feature available free with super low
       | friction, maybe no signup required, to get some viral traction.
        
         | ilyausorov wrote:
         | What if it was already available? Try it out at
         | https://accentfilter.com!
        
           | PaulDavisThe1st wrote:
           | Hmmm. Initially impressive but upon retries and reflection
           | ... not that great. It doesn't even maintain timing ...
           | unless that's part of the transform.
        
             | ilyausorov wrote:
             | Indeed yeah that's one of the key weaknesses of the
             | approach that we're using. It overrides the speaker's
             | cadence and accent while keeping their voice profile /
             | timbre in place. Different techniques may not do this but
             | also may not copy over the accent to the resulting clip as
             | effectively. So far we're using this to support pedagogical
             | (and lead-gen) use cases where we think it works
             | well enough.
        
               | PaulDavisThe1st wrote:
               | Let's put it a different way. I grew up in the UK till
               | 24. I've lived in the USA for 36 years. The UK/US accent
               | conversions dramatically altered my voice/accent; the AU
               | one left it mostly unchanged.
               | 
               | This is offensive :))
        
       | rayrah wrote:
       | Cool stuff
        
       | sardines wrote:
       | How does the "accent conversion model" work? Is it all embedding
       | based?
       | 
       | If so--and if you want to transfer-learn new downstream models
       | from embeddings--then seems to me you are onto a very effective
       | way of doing data augmentation. It's expensive to do data
       | augmentation on raw waveforms since you always need to run the
       | STFT again; but if you've pre-computed & cached embeddings and
       | can do data augmentation there, it would be super fast.
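
The idea of augmenting cached embeddings instead of raw waveforms can be
sketched as below. This is a minimal illustration, not a description of
any existing pipeline: the `cache` contents, the clip names, the noise
scale `sigma`, and the choice of Gaussian jitter plus linear
interpolation are all assumptions. Whether such perturbations preserve
accent labels would need to be checked on real data.

```python
import random

def jitter(embedding, sigma=0.01, rng=random):
    """Add small Gaussian noise to a cached embedding."""
    return [v + rng.gauss(0.0, sigma) for v in embedding]

def mixup(a, b, lam):
    """Linearly interpolate two embeddings; lam is the weight on a."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

# Stand-in for embeddings computed once by the expensive audio front end.
cache = {
    "clip_001": [0.2, -0.1, 0.7],
    "clip_002": [0.0, 0.4, 0.5],
}

rng = random.Random(0)  # seeded so runs are repeatable
augmented = [jitter(cache["clip_001"], rng=rng) for _ in range(4)]
augmented.append(mixup(cache["clip_001"], cache["clip_002"], lam=0.5))

print(len(augmented), "augmented vectors from", len(cache), "cached clips")
```

Each new training vector costs a few arithmetic operations per
dimension, with no audio decoding or spectrogram computation involved.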
        
       | avalys wrote:
       | I (an American from suburban Connecticut) was recently in London
       | for an event and someone misheard me. Another Londoner said "It's
       | because of your accent!", which of course was nonsense to me.
       | What accent?
       | 
       | I'd be really interested to play with this tool and see what it
       | thinks of my accent. Can it tell where I grew up? Can it tell
       | what my parents' native languages are (not English!)
       | 
       | A free tool like this would be great marketing for this company.
        
         | ilyausorov wrote:
         | We did build two free tools, which are geared towards non-
         | native English speakers. You can find them at
         | https://accentoracle.com and https://accentfilter.com. They're
         | less effective for English native speakers, but could still be
         | fun.
        
         | dhosek wrote:
         | What I find interesting is that it seems that folks from the UK
         | tend to focus on consonants in distinguishing accents while in
         | the US we distinguish more on vowels.
        
       ___________________________________________________________________
       (page generated 2025-05-06 23:00 UTC)