[HN Gopher] The Seamless Communication models
___________________________________________________________________
The Seamless Communication models
Author : skadamat
Score : 537 points
Date : 2023-12-01 14:53 UTC (8 hours ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| infotainment wrote:
| It's amazing how far text to speech has come in the past few
| years, but what I'm wondering is when this tech will finally make
| it into local TTS engines baked into the OS (e.g. for screen
| readers, etc.).
| PartiallyTyped wrote:
| The accessibility nerd in me is excited!
| callalex wrote:
| This is already built into recent iOS devices and it's called
| Live Captions.
| freedomben wrote:
| Same with Android (Pixel phones at least).
|
| I'm the most excited for an open source one though, and it
| would be incredible if this could become it. I do 95% of my
| computing on desktop Linux and it sucks being behind.
| coffeebeqn wrote:
| We can't be that far off from almost perfect real-time
| translation. There is of course some latency to hear and
| process the speech.
| mrob wrote:
| Differences in verb-subject-object word order will always add
| latency. If you want to translate from German, with the verb at
| the end, to Welsh, where the verb goes at the start, you'll
| have to wait for the complete sentence before you can begin.
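|
| Research systems soften this with policies like "wait-k":
| read k source words, then emit one target word per additional
| word read. A rough Python sketch of the idea, where
| `translate_prefix` is a hypothetical stand-in for a
| prefix-to-prefix decoder:
|
|   def wait_k_stream(source_words, k, translate_prefix):
|       # Read k source words before emitting anything, then
|       # emit one target word per additional word read.
|       target = []
|       for i in range(1, len(source_words) + 1):
|           if i >= k:
|               target.append(translate_prefix(source_words[:i], target))
|       # Once the source ends, flush the remaining tail words.
|       for _ in range(k - 1):
|           target.append(translate_prefix(source_words, target))
|       return target
|
| A larger k trades more latency for more context; verb-final to
| verb-initial pairs like German->Welsh effectively force a
| large k.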
| tralarpa wrote:
| It's very impressive what simultaneous interpreters can do.
| They don't wait for the end of the sentence.
| numpad0 wrote:
| Yeah they backtrack on branch prediction failures.
| dylan604 wrote:
| What kind of Heartbleed that must introduce.
| Vecr wrote:
| You mean Meltdown/Spectre?
| dylan604 wrote:
| probably, but you got the gist anyways
| MrsPeaches wrote:
| Even they struggle with jokes though.
|
| This may be apocryphal, but I've heard that in formal
| settings (e.g. the UN) they won't translate it and will
| instead give instructions on when to laugh.
| d3m0t3p wrote:
| Not necessarily true. For the first few sentences you won't
| be able to do it, but afterwards, once the context is
| established, you don't really need to wait for the verb; you
| can predict it. For example, if you are speaking about
| cleaning the house and you detail that you have cleaned the
| kitchen, the stove, and so on, you can predict the verb from
| only the start of the sentence. I don't have any source to
| back this up, but it sounds plausible.
| gberger wrote:
| What if the predicted verb was incorrect, but the model has
| already translated the incorrect prediction? How does it
| tell you about a mistake?
| mrandish wrote:
| A good approach might be to start with how top notch,
| ultra-experienced human translators handle corrections
| for real-time scenarios, for example, the expert
| translators that do the ear monitors at the United
| Nations. I've worked with a few such real-time
| translators when preparing keynote speeches, and they seemed
| to have rigorous processes that ran quite deep.
| Probably a ton of domain expertise to be captured there.
|
| That said, I suspect that real-time language translation
| is always going to be somewhat imperfect due to its
| nature. Non-real-time translation of literature is still
| a subjective art form even at the very high-end of human
| expertise.
| shkkmo wrote:
| Once you start predicting what someone is going to say you
| are no longer translating their speech
| Teever wrote:
| Yeah but then you're just introducing branch mispredictions
| which will cause latency and potential confusion down the
| line.
|
| It's all a trade off.
|
| Either way it's extremely exciting that we get to even
| discuss this stuff as real possibilities.
| Innervisio wrote:
| Although that's true, and considering what "mrob" also
| replied, this will never mean full translation every time,
| all the time. It will work within specific environments and
| linguistic expectations.
|
| I've been learning german since 8 years, and the amount of
| expressions and different ways to say things around the country
| is impressive. There'll be an "interpretative" real-time
| translation, but it won't guarantee full understanding in so
| many cases, maybe ever.
|
| Another thing, common to all languages, is context, and I
| believe this is difficult to address.
|
| Nevertheless, it's impressive how far we've come, and I
| acknowledge the usability of these tools. However, human
| knowledge will always be crucial and primordial if we want to
| guarantee full understanding.
| InCityDreams wrote:
| >I've been learning german since 8 years,
|
| "Since", as used here, would lead me to guess you are not a
| native English speaker?
| WhatsName wrote:
| Has anyone compared this to NLLB (also Meta) yet?
| trovas wrote:
| In the paper, the reported results show a very similar level
| of quality.
| jkw wrote:
| We're the same team! We have some comparisons in the paper.
| ukuina wrote:
| Next step is combining the output with few-sample speech
| synthesis so the output is in the original speaker's voice!
| modeless wrote:
| This does that already. At least, to a first approximation.
| Voice cloning is not that great in general right now.
| blovescoffee wrote:
| The voice cloning worked pretty well for me. From English to
| Spanish, I noticed that the first few words sounded more like
| me than the last few words. Also, it doesn't sound like how I
| speak in Spanish, but that's expected.
| coffeebeqn wrote:
| Voice cloning works pretty well already, but not necessarily
| from a single 10-second sample as the source data. If you can
| give it a few hours of data it'll work much better.
| modeless wrote:
| Do you have examples of it working well? I haven't heard
| anything that really impressed me. Nothing close to a good
| human impersonator. We're a long, long way from replacing
| voice actors, even considering the rapid rate of progress.
| kaycebasques wrote:
| Besides the obvious good news about making it easier for people
| to communicate with each other across languages, it's also
| exciting to me that we're trending towards a world where I can
| tap into all the knowledge that only exists on the non-English
| web. I'm sure there are vast troves of programming knowledge in
| the Japanese-only web for example. The Chinese-only and Russian-
| only web are obvious candidates too but presumably those are
| harder to access for other reasons.
| nickreese wrote:
| My wife was training to be a professional voice actor to do
| dubbing in several languages when we met.
|
| I told her then that the industry would be disrupted by AI before
| she retired.
|
| Glad she pivoted. Really impressive results.
| 0_____0 wrote:
| It won't replace high-end talent; I don't think models can
| replicate the nuance for a long time. However, the entire
| low-to-mid end of the market is going to get nuked from low
| Earth orbit.
| Shish2k wrote:
| I wonder which will happen first - AI evolves to work well at
| the high-end, or high-end humans retire and there's nobody
| left in the low-to-mid end to fill their shoes...
| callalex wrote:
| Given the modern trend of on-screen actors doing voice
| work, I think there will be a supply of talent for at least
| a few more generations.
| crakenzak wrote:
| It will absolutely replace high-end talent. Anything a human
| can do, a model will eventually do 10x better -- especially
| in such a narrow and well-defined domain.
| sushisource wrote:
| Did you hear the output examples? Yeah, I think not. I
| mean, it's definitely on the way, but if you need quality
| acting in your dub, there's no way you're going with this.
| ygjb wrote:
| These are models specially tuned and sized for near real-
| time, instant translation. It would be naive to think
| that there aren't technical creatives building and
| training models tuned for expressiveness and nuance in a
| more controlled environment.
| crakenzak wrote:
| Maybe not in the current state of the model, but judging
| by the rate of improvement we're all seeing it's just a
| matter of time (and data+compute+research obv).
| dvngnt_ wrote:
| I think the key word is "will".
|
| A few more years of improvements, if they happen, could be
| disruptive.
| dontupvoteme wrote:
| That's what they gave us plebs. To think they don't have
| a superior one they can sell...
| chrismorgan wrote:
| It won't _replace_ it, but it's very likely to _supplant_ it,
| just about destroying the segment by reducing demand by being
| _good enough_ and so much cheaper, especially as people get
| more used to it.
|
| Typesetting. Music engraving. Bookbinding. The quality of all
| these fields has been materially harmed by advancements.
|
| Computer typesetting has, by and large, been a significant
| regression, though the gap has largely been made up now if
| you make the right choices.
|
| Published music scores used to be set by experts. Now they're
| set by novices using software that is mechanical in method
| and generally quite insipid. Most are _atrocious_ compared to
| the old masters, and mediocre at best compared to the typical
| published scores from a hundred years ago; and very few
| popular scores are really good (... and if they are, there's
| a reasonably high chance they've used GNU LilyPond, which has
| focused on this problem). But the barrier for entry is _so
| much lower_ , and people have got used to the inferior
| results, so I don't know if _anyone_ engraves music the old
| way, and even people that know better largely just shrug and
| make do with the new. Like with computer typesetting, there
| is hope because things _have_ slowly improved. But most will
| continue to be mediocre.
|
| Books used to be bound with cold glue. It takes time to set,
| but the results are very good, supple and long-lasting. Then
| along came hot-melt glue, and it's just _so_ much friendlier
| for cheap manufacturing because books are finished within a
| few minutes instead of a day or two, that I don't think
| _anyone_ produces books the old way any more, even though the
| results are _abysmal_ in comparison (compare the binding _and
| reading experience_ of a paperback from the '40s or '50s with
| one from the turn of the century; no one after tasting the
| old will desire the new; for he says, the old is good). But
| they're just (barely) good enough. Unlike the other two, I
| don't think there's any hope here--the regressive advancement
| crowded out the superior but dearer option so that no place
| was found for it.
| pclmulqdq wrote:
| You can still get relatively good published music scores
| from a few of the old German shops (Schirmer, Henle, etc.),
| but they are very expensive. They are a joy to use when
| playing, though, since the music is very clearly laid out
| and page turns are in the perfect place, etc. Finale and
| Sibelius are controllable enough that you can use them to
| do fantastic layout, but many people either do not
| understand how to make a score readable or don't care
| enough.
| TeMPOraL wrote:
| That, and what GP describes, is what I see as the overall
| trend of the market: hollowing out the middle. It's not
| just about technology (though it plays a big role), but about
| all the optimization coming from competitive pressure -
| materials, processes, business models, marketing.
|
| What seems to universally happen is, the market
| bifurcates - one part is in a race to the bottom, the
| other (much smaller) aims for super premium tier
| (overpriced quality), because only those two positions
| are sustainable, once the race-to-the-bottom side drags
| all the economies of scale with it. So as a consumer, you
| get to choose between cheap low-quality garbage that's
| barely fit for purpose, and rare, super-expensive,
| professional/elite high-end products. There is no option
| for "good value for reasonable price".
|
| This has been happening to everything - software,
| furniture, construction, electronics, vehicles, _food_ ,
| you name it.
| RowanH wrote:
| I'm using AI for training videos for my startup. I'm never
| going back to voice actors outside of primary marketing
| videos. The sheer convenience of the write/listen/tweak cycle
| on scripts is insane. In minutes you can do a voiceover that
| would previously have taken hours of work plus days of delay.
|
| Sure, the final result sounds slightly robotic. 99% of people
| wouldn't care, and you can get more training videos done,
| faster, for a fraction of the cost.
|
| [Edit] And I'll add that the difference between 6 months ago
| and today is noticeable. I imagine every 6 months we can just
| re-download updated voiceovers, and every 6 months they will
| sound just slightly more polished.
| ggregoire wrote:
| > I told her then that the industry would be disrupted by AI
| before she retired.
|
| Yes. I just discovered there is a text-to-speech addon [1] (now
| a few months old) for World of Warcraft that adds voices for
| every NPC in the game... It is so impressive and such a game
| changer (pun intended) that I naively asked in the chat of the
| Twitch stream I was watching "when did Blizzard add voices to
| the NPCs??". For a moment I really thought Blizzard had
| contracted actors, but no, someone like you and me just used
| AI to generate realistic voices for every character in the
| game. I don't think it's ready yet to completely replace
| actors in video games (surely it will in the near future
| though), but voice acting is so expensive to do that I can see
| studios and developers in 2024 already using this tech for all
| the optional dialogues and secondary characters' voices.
|
| [1] https://www.curseforge.com/wow/addons/voiceover
| lyu07282 wrote:
| Another recent example: The Finals uses AI voice generation
| for real-time game announcements.
|
| https://youtu.be/kZ87wiHps9s
| freedomben wrote:
| I've wondered at what point this would happen. I think it
| could now, but from what I've read the voice actor unions are
| able to prevent it currently (at least for AAA games or non-
| indie devs). Many of them have agreements/contracts in place
| for the foreseeable future, and being the first big company
| to replace them is a heap of terrible press that nobody is
| going to want to touch. I think it's the same reason
| Hollywood reached the AI agreement recently too.
| Halong wrote:
| My wife is paying our mortgage teaching English on Preply. I'm
| extremely worried about where we'll be in 10 years.
| ilaksh wrote:
| What did she pivot to? I don't think any currently existing job
| is really safe in the medium-to-long term.
| Jayakumark wrote:
| How does this compare to whisper-large-v3 on STT?
| trovas wrote:
| I work on Seamless. You can see the results in the paper. M4Tv2
| is significantly ahead (Whisper Large v3: 16.9 BLEU vs. M4Tv2:
| 26.6). These are averages over 81 X->English directions.
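|
| For anyone who wants to reproduce this kind of comparison on
| their own outputs, the sacrebleu package computes corpus BLEU;
| a minimal sketch (the file names are made up, and this is not
| the paper's exact evaluation pipeline):
|
|   import sacrebleu
|
|   # One hypothesis/reference pair per line, in the same order.
|   hyps = open("m4tv2_outputs.en").read().splitlines()
|   refs = open("references.en").read().splitlines()
|   print(sacrebleu.corpus_bleu(hyps, [refs]).score)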
| 999900000999 wrote:
| Can't wait for someone to roll a language tutor out with this
| tech.
|
| Everyone gets a personal tutor for hours a day.
|
| I would absolutely love a VR game where I just need to work in
| China or Mexico all day and pick up the language that way.
| modeless wrote:
| This is what I'd like to build (the tutor part at least, not
| the VR game part yet). I'm planning to extend my current
| English only rough prototype[1] to support Mandarin. (I happen
| to be learning Mandarin myself at the moment, and there are a
| bunch of open source bilingual Mandarin LLMs and speech
| synthesizers from China to choose from.)
|
| I think a lot of people are working on similar things right
| now. I know of one called http://yourteacher.ai
|
| [1] https://apps.microsoft.com/detail/9NC624PBFGB7
| siraben wrote:
| Is there a high quality speech synthesizer (ideally local)
| for Mandarin you have found? There are some subtleties with
| tone sandhi rules and how they interact with prosody that I
| feel are lacking with current TTS voices I've tried.
| modeless wrote:
| The first one I plan to try is
| https://github.com/netease-youdao/EmotiVoice
|
| I don't have the expertise to judge the quality of Mandarin
| pronunciation myself, being a beginner. But it sounds OK in
| English and it's made by native Mandarin speakers in China
| so I expect that it sounds better in Mandarin than English.
| siraben wrote:
| Sounds pretty good, although still lacking in natural-
| sounding tone sandhi (e.g. try "yi xia"; it should be
| yi2xia4 instead of yi1xia4).
| gattr wrote:
| I love the idea of LLMs being super-efficient language
| tutors. And you have a good point; coming soon: "We've been
| getting a lot of these tourists here lately, they're eerily
| fluent, but all seem to have the same minor speech
| impediment" (read: messed-up weights in a commonly used
| speech model).
| siraben wrote:
| I've been using ChatGPT 4 to translate and explain
| various texts in Mandarin and it's been very on point
| (checking with native speakers from time to time, or
| internet searches). As expected, it has trouble with
| slang and cross-language loanwords from time to time.
| However, for languages with much less information online,
| it hallucinates like crazy.
|
| > coming soon: "We've been getting a lot of these
| tourists here lately, they're eerily fluent, but all seem
| to have the same minor speech impediment"
|
| Haha, if that came to pass, it would still be a far
| better outcome than our current situation of completely
| blind machine translation (especially for various Asian
| languages that are very sensitive to phrasing) and
| mispronunciation by non-native speakers.
| bityard wrote:
| > all seem to have the same minor speech impediment
|
| Ah, that is called an accent.
| dontupvoteme wrote:
| Kind of. Accents are typically derived from the
| intersection of natural languages, specifically which
| ones you learned the phonetics of first (with the
| exception of the Mid-Atlantic accent...).
|
| This would be something quite novel, as the speech
| irregularities would not have their origin in people.
|
| I don't know what you would call it, but it needs at least
| some adjective before "accent" to differentiate it, IMO.
| rnjesus wrote:
| the azure neural tts voices in chinese are the best i've
| heard, specifically the "xiaochen" voice. i use it in anki
| daily to generate sentences for my mandarin decks with an
| api key/plugin. it's not something you run locally of
| course, but they have a decent enough free tier.
|
| i'm hoping a voice as realistic as this becomes a local app
| soon, but i've not found anything that's nearly as natural
| sounding yet. (also, honorable mention to chatgpt's "sky."
| she pronounces mandarin with a funnily american accent, but
| it sounds natural and not as robotic as the open-source
| alternatives i've tried)
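|
| For reference, synthesizing a clip with that voice via the
| Azure Speech SDK looks roughly like this; the key/region are
| placeholders, and the voice name is assumed from the comment
| above:
|
|   import azure.cognitiveservices.speech as speechsdk
|
|   cfg = speechsdk.SpeechConfig(subscription="YOUR_KEY",
|                                region="eastus")
|   cfg.speech_synthesis_voice_name = "zh-CN-XiaochenNeural"
|   synth = speechsdk.SpeechSynthesizer(speech_config=cfg)
|   # Speak a Mandarin sentence through the default speaker.
|   synth.speak_text_async("你好，今天天气怎么样？").get()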
| meowtimemania wrote:
| There's already a few of them. Check out https://hallo.ai
| 999900000999 wrote:
| I wouldn't feel good about anything that's not focused on a
| single language.
|
| You end up with the Duolingo problem where you know to say
| the names of 20 different fruits but not how to introduce
| yourself.
| apwell23 wrote:
| > You end up with the Duolingo problem where you know to
| say the names of 20 different fruits but not how to
| introduce yourself.
|
| Not sure if this is a Duolingo problem. There are modules
| in Duolingo specifically for saying your name. I think it's
| the travel module.
| coldtea wrote:
| Never seen that in Duolingo. It starts with the basics and
| phrases, not random useless vocabulary.
| cptskippy wrote:
| I was going to Italy and started using Duolingo to try
| and help. I learned such useful phrases as "the children
| have bread".
| gs17 wrote:
| Duo has a different problem for me. The lack of focus means
| some languages don't get features. Chinese still doesn't
| have Stories (there's an unofficial version of it, but
| we've been waiting _years_ ).
| numpad0 wrote:
| (The Duolingo problem, as I understand it: Duolingo is
| designed around the premise that, by exposing your
| subconscious to a small set of words and phrases in the
| target language, your brain should be able to trivially
| construct output shims from Universal Grammar, which must
| exist, to the desired language; but that doesn't work in
| practice, and you end up with only the small set of words
| and phrases your subconscious has recorded.)
| massimokris wrote:
| Duolingo's problem is not that they have a bunch of
| languages; it's that achieving fluency in a target language
| is about being able to produce/generate phrases, and they
| just have you consume and sort words and phrases. With any
| AI language tutor, the student must produce phrases in order
| to practice, and that is what advances them on the path to
| fluency.
| jahewson wrote:
| Isn't having the AI do it for you better than having the AI
| teach humans to do it?
| dylan604 wrote:
| Sure, if you're not into personal growth. Not everyone wants
| to become the useless bit of lard sitting in a chair while a
| computer does everything for them. Yet. Some of us still like
| to do the actual things, but just need some assistance along
| the way. We still have a bit of time before we're all the
| humanoids from Wall-E
| ericmcer wrote:
| Yeah thats why I mill my own grain and am getting into
| textiles.
| djvdq wrote:
| I love it when people use these pathetic extreme examples
| when they don't have any meaningful arguments.
| ericmcer wrote:
| That isn't an extreme example at all, people used to mill
| grain and make clothing by hand, now we don't. We somehow
| are not sitting around getting fat even though technology
| takes care of those tasks.
|
| The parent's suggestion is that if we don't have to learn
| languages, that will lead to us all lying down drinking
| Big Gulps while robot slaves take care of us. Their take
| is the extreme example. People have literally made this
| same suggestion about every technological advance and it
| never comes true.
| TeMPOraL wrote:
| > _We still have a bit of time before we're all the
| humanoids from Wall-E_
|
| Obligatory reminder that the movie itself explains that
| people are what they are _not_ because of their lifestyle,
| but because of the time spent in low-gravity environment.
| dylan604 wrote:
| not sure that really matters to the point
| modeless wrote:
| Even a perfect human translator following you around wouldn't
| be anywhere near as good as knowing the language yourself.
| whoisburbansky wrote:
| It depends on what your goal is; for some tasks it's possible
| that getting the AI to do it is best, but, e.g. the existence
| of auto-pilot doesn't mean that hobbyist pilots wouldn't
| benefit from/enjoy exercising the same skills manually.
| swatcoder wrote:
| _Maybe_ prior to fluency, for something like an odd business
| or tourist trip.
|
| But there's a point in language learning where you can come
| to express yourself directly in a new language without
| intermediary "thinking" in your first tongue. The
| communicative and expressive potential of that mode is much
| higher than trying to squeeze one's intent through any kind
| of translation, machine or internal.
|
| Plus, you know, it's fun.
| j33zusjuice wrote:
| Not necessarily. It depends on the use case. For taking a
| vacation, having an AI that can instantly translate to your
| native language would be amazing. That'd solve a lot of real
| world problems, no doubt.
|
| However, translation has a great deal of subjectivity
| embedded in it, particularly when there aren't 1:1
| translations. Case-in-point: there are many English
| translations of the Christian bible, all similar enough, but
| there are enormous variations in some cases. And there are at
| least as many branches of Christianity as there are English
| translations of the Bible. Some of them strictly recommend
| the same translation, and they still disagree on the meaning
| of various passages.
|
| Besides the problems inherent to translation, learning
| another language gives you another paradigm of thinking. The
| words we use, the way we construct sentences, etc., all
| impact our view of the world. Here's a paper that discusses
| the impact of the over-reliance on English in cognitive
| sciences, and how this has downstream effects: https://www.sc
| iencedirect.com/science/article/pii/S136466132...
|
| Learning languages as an adult also has protective benefits.
| It reduces the probability of Alzheimer's (maybe dementia,
| overall?).
| coldtea wrote:
| In the way that watching porn is better than having sex.
| advaith08 wrote:
| I've seen a lot of these, but none for Indian languages. Would
| love to try an Indian language one!
| 999900000999 wrote:
| Are Indian languages hard for English speakers?
| thinkingtoilet wrote:
| I'm learning Hindi and there are some things that are easy
| (a phonetic alphabet, nothing like 7 different sounds for
| 'ough') but the sentence structure is very different and
| can be hard to get right. Pronunciation isn't too bad for
| the most part, but there are a few tricky things, for
| example four different 't' sounds and four different 'd'
| sounds.
| The hardest part is that there really aren't that many
| resources. Even though Hindi is the third most spoken
| language in the world, you will find far more resources for
| many of the less spoken European languages.
| tmountain wrote:
| Started a project to do this a while back. It's pretty fleshed
| out:
|
| https://www.parcero.ai/
|
| I could integrate this instead of Polly pretty easily.
| bilsbie wrote:
| I think it would be so ironic if advanced AI ended up simply
| teaching us new languages quickly instead of translating for
| us.
| toomuchtodo wrote:
| Might be able to generate a better language than what we
| have.
| bilsbie wrote:
| Good point. Maybe they invent a better language and easily
| teach it to everyone.
| dontupvoteme wrote:
| Finally Esperanto has a use case!
| spaceywilly wrote:
| To me the key functionality for any language learning app is
| giving you feedback on your pronunciation and general
| understanding. I've been using Duolingo to learn Mandarin and
| when I try to speak to anyone it's difficult for them to
| understand me, because my pronunciation is all wrong. The app
| is just feeding info to me one way, and I can try my best to
| recreate what I'm hearing, but there's no way to know if I'm
| messing it up. They do have a speaking feature but it doesn't
| work very well, certainly not to the same level as speaking
| with a real person who is fluent in the language and having
| them correct you.
| throwaway4aday wrote:
| As a quick solution, you should try recording yourself
| speaking and then listen to it to check your pronunciation
| against some reference. So for example, find a YouTube video
| in the language you're learning that also has good subtitles
| (use https://filmot.com/ ) and listen to how they say the
| phrase and then record yourself saying the same phrase and
| play it back and compare.
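|
| A minimal record-and-playback loop for that kind of self-
| checking, using the sounddevice package (sample rate and
| duration are arbitrary choices):
|
|   import sounddevice as sd
|
|   RATE, SECONDS = 16000, 8
|   take = sd.rec(SECONDS * RATE, samplerate=RATE, channels=1)
|   sd.wait()              # block until the recording finishes
|   sd.play(take, RATE)    # play it back and compare by ear
|   sd.wait()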
| dog321 wrote:
| I practiced for a long time using the below pronunciation
| trainer and I get a ton of compliments from native speakers
| on how accurate my pronunciation is.
|
| https://fluent-forever.com/product/fluent-forever-pronunciat...
| inbread wrote:
| I built just this a month ago with the Azure AI speech API,
| which is already pretty good at multilingual speech.
|
| https://github.com/adrianmfi/gpt-tutor
|
| I look forward to testing whether switching to Seamless can
| improve it further; Seamless supporting nearly 100 languages
| is a nice improvement.
| jbird11 wrote:
| Absolutely, what I've noticed is that the current apps are
| great for beginners but after a certain point the only way to
| improve your ability to speak a new language is to well...
| speak it. I built Proseable to help people move beyond the
| generic how to order a coffee or ask to go to the bathroom, and
| have more meaningful conversations in the real world. Check it
| out!
|
| https://www.proseable.com/
| Jeff_Brown wrote:
| > game
|
| Yes! Better yet, you're a spy, or a hostage negotiator, or the
| leader of any kind of enterprise (army, business, aid
| organization) ...
|
| Programming games like that will resemble directing improv
| theater. You can't program every response; you'll have to
| instead fit each character with beliefs and motivations.
|
| I can hardly wait.
| dontupvoteme wrote:
| For Language Acquisition, Input Is All You Need. (Mostly)
|
| What would be really cool is something that can autodub videos
| or audio into your target language. The hardest problem in
| learning languages other than English is often finding content
| to consume in them.
|
| Disclaimer: I am a Krashenist, so this take is biased.
| massimokris wrote:
| I built one for people in Latam to practice languages in a
| conversational way through a WhatsApp chat
| https://wa.me/+5491162951713?text=hola%20Speakeasy
| flanbiscuit wrote:
| I would love a game that helped you learn a language (not
| necessarily VR though as I don't have that equipment). The game
| drops you into a world (a country of the language the game is
| meant to teach you) where no one speaks your language and you
| have to figure out what people are saying in order to fulfill
| quests. You get some hints, like maybe you have a simple
| translation guide in your inventory or sometimes you meet
| people who can speak a few words of your language. That would
| motivate me to learn faster than self-taught tutorials.
|
| I'd love to learn French and the game would take place in
| locations all around modern France.
|
| It would have to have a good story. Maybe something in the
| style of the Professor Layton series could be interesting, or
| something more open world.
| dwighttk wrote:
| and the language tutor company could have you pilot around a
| menial labor droid while you are learning...
| zbyforgotp wrote:
| But will people use them?
| pnut wrote:
| I was hoping to find out that the actor's voice in the demo
| video was generated, or that he had recorded the video
| speaking in another language or something.
|
| That would have been the knockout punch.
| polygamous_bat wrote:
| "The Babel fish is small, yellow, leech-like, and probably the
| oddest thing in the Universe. It feeds on brainwave energy
| received not from its own carrier, but from those around it. It
| absorbs all unconscious mental frequencies from this brainwave
| energy to nourish itself with. It then excretes into the mind of
| its carrier a telepathic matrix formed by combining the conscious
| thought frequencies with nerve signals picked up from the speech
| centres of the brain which has supplied them. The practical
| upshot of all this is that if you stick a Babel fish in your ear
| you can instantly understand anything said to you in any form of
| language. The speech patterns you actually hear decode the
| brainwave matrix which has been fed into your mind by your Babel
| fish. "Now it is such a bizarrely improbable coincidence
| that something so mind-bogglingly useful could have evolved
| purely by chance that some thinkers have chosen to see it as a
| final and clinching proof of the non-existence of God.
| "The argument goes something like this: 'I refuse to prove that I
| exist,' says God, 'for proof denies faith, and without faith, I
| am nothing.' 'But, says Man, the Babel fish is a dead giveaway,
| isn't it? It could not have evolved by chance. It proves you
| exist, and, by your own arguments, you don't. QED.' 'Oh dear,'
| says God, 'I hadn't thought of that,' and vanishes in a puff of
| logic."
| fassssst wrote:
| Try the demo here, you record a video of yourself and it does
| voice cloning and a comparison:
|
| https://seamless.metademolab.com/expressive/?utm_source=meta...
| ceejayoz wrote:
| > This research demo is not open to residents of, or those
| accessing the demo from, the States of Illinois or Texas.
|
| Interesting mix.
| solardev wrote:
| Illinois has a facial recognition / cloud biometrics ban.
| Familiar face detection for doorbells etc. isn't allowed
| there. Wonder if Texas has something similar?
| ceejayoz wrote:
| Ah, that makes sense.
|
| In Texas it seems to be part of AG Paxton's culture war
| stuff.
| https://www.texastribune.org/2022/05/12/texas-face-filters-i...
| aschla wrote:
| Likely related to biometrics laws. I know Illinois has
| restrictions on the collection of biometrics, not sure about
| Texas. Facebook in particular paid out a significant amount
| of money in a class action in Illinois, I know because I got
| a chunk of change from it.
| dylan604 wrote:
| which you mean someone took a dime and carved off a piece
| of it, and then sent you a piece of paper with postage that
| cost more than the value of that chunk? yeah, we all got
| hosed by that one too i'd imagine
| ceejayoz wrote:
| https://www.nbcchicago.com/news/local/illinois-facebook-user...
|
| > According to the Settlement Administrator, payments to
| class members between $200 to $400 started going in the
| mail May 9.
|
| I got a $0.19 check from an iTunes settlement once, but
| this wasn't one of those cases.
| jlund-molfese wrote:
| It's because of
| https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004&C...
|
| Facebook has had to pay out hundreds of millions of dollars
| in settlements for related class-action lawsuits, and rather
| than trying to get informed consent, they're deciding not to
| collect biometrics from residents of those states.
| SillyUsername wrote:
| And that demo is now overloaded and fails to translate the
| input :D
| teacpde wrote:
| As someone working in tech and following the progression
| of AI, I believe I have the right expectations. But it still
| feels surreal seeing myself speaking a foreign language in my
| own speech style.
| wedn3sday wrote:
| Well, that was spectacularly bad. It failed to translate a
| single word from English -> Spanish. Admittedly I was using
| George Carlin's favorites, but if you're trying to have an
| expressive language translator that refuses to translate
| "fuck", then what you've got is bullshit.
| StrangeDoctor wrote:
| Any more info about the watermarking? Can only Meta make the
| determination?
|
| Edit: I can't find the weights, but if I'm reading the paper
| right, anyone could train their own detector.
| hadyelsahar wrote:
| Hey! An RS from the Meta Seamless team here.
|
| Yes, we chose not to release the watermark detector to
| safeguard against adversarial attacks. This decision helps
| prevent any attempts by malicious users to erase the
| watermark.
|
| The watermark generator and detector are trained together.
| One can use the information in our paper to train one's own
| generator and detector models; however, in that case the
| watermark signature created will be distinct from the one we
| use to protect our Seamless translation models. This approach
| ensures each model maintains its unique security features.
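|
| To make the joint-training idea concrete, here is a conceptual
| PyTorch sketch with toy architectures on synthetic audio; this
| illustrates the general technique only, not Meta's actual
| implementation:
|
|   import torch
|   import torch.nn as nn
|
|   # Generator adds a quiet residual; detector classifies
|   # marked vs. unmarked audio. Both are trained end to end.
|   generator = nn.Conv1d(1, 1, kernel_size=9, padding=4)
|   detector = nn.Sequential(
|       nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
|       nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1))
|   opt = torch.optim.Adam(
|       list(generator.parameters()) + list(detector.parameters()),
|       lr=1e-4)
|   bce = nn.BCEWithLogitsLoss()
|
|   for step in range(100):
|       audio = torch.randn(8, 1, 16000)        # stand-in for speech
|       marked = audio + 0.01 * generator(audio)
|       logits = detector(torch.cat([audio, marked]))
|       labels = torch.cat([torch.zeros(8, 1), torch.ones(8, 1)])
|       # Detectability loss plus an imperceptibility penalty.
|       loss = bce(logits, labels) + (marked - audio).pow(2).mean()
|       opt.zero_grad(); loss.backward(); opt.step()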
| StrangeDoctor wrote:
| Thanks for clarifying, and seems like a completely reasonable
| approach. Thanks for the great work.
| gagabity wrote:
| I had pretty terrible results when I tried English -> Swahili
| using the Hugging Face M4T V2 Spaces; it pretty much doesn't
| work most of the time and I just get English back in a
| different voice. Expressive, on the other hand, only has a few
| languages, it seems.
|
| It would be nice if they could lay out what exactly is missing
| in terms of data to make a language work better; while the
| actual AI bit is out of reach for most of us, maybe we could
| provide more data.
|
| There is also a 60 sec limit, and I wonder whether this is a
| Hugging Face limitation or Seamless's.
| yorwba wrote:
| > maybe we could provide more data.
|
| If you want to contribute by recording yourself speaking
| Swahili, https://commonvoice.mozilla.org/sw is the place to go.
| Although Meta has access to much larger data sets, they
| nonetheless use Common Voice as a "known good" source. E.g. the
| paper on their SONAR speech encoder reports experiments on
| Common Voice data, coincidentally involving Swahili
| https://ai.meta.com/research/publications/sonar-sentence-lev...
| whbrown wrote:
| Can anyone help demystify the licensing?
|
| Besides the ACCEPTABLE_USE_POLICY, there's a CC BY-NC 4.0
| (NonCommercial) license, a 'SEAMLESS_LICENSE' (NonCommercial),
| but also an MIT license? It would seem these other licenses
| contradict the MIT license; could somebody help clarify how
| these all interact in practice?
| dankle wrote:
| MIT for the code, NonCommercial for the trained models I bet.
| disattention wrote:
| The license details are listed on the project GitHub
|
| https://github.com/facebookresearch/seamless_communication#l...
| jeffbee wrote:
| How will Meta put these models into practice? I understand why
| Google and Apple have models for their mobile OS users, but I
| don't understand where users for Meta speech models come from.
| Are they planning to show Instagram videos with English narration
| in French or what?
| solardev wrote:
| Ads in any language!
| polygamous_bat wrote:
| Ads and Reels (their TikTok competitor) I imagine would be the
| primary use-case. Imagine spreading the "wonders" of TikTok-
| like videos to the non-$native_language-speaking world.
| dylan604 wrote:
| But isn't it a TikTok shtick to use an obviously fake
| voice in your video?
| crakenzak wrote:
| They have arguably the most diverse userbase of any company,
| with users from pretty much every single country + language
| across all their services & apps. I could easily imagine a
| handful of use cases where a high-performing universal
| translation model would be incredibly useful.
| spacemanspiff01 wrote:
| The metaverse will not have any language barriers...
| beders wrote:
| I'm thrilled to see the progress made in the last 30 years.
|
| As a student in the mid-90s I worked on a system called Verbmobil
| at the German Research Center for AI; it did speech-to-speech
| translation for English, German and Japanese in a very limited
| domain.
|
| This was done via "classical" NLP: You had to model the domain
| with concepts, you needed sentence parsers, semantic engines,
| speech-to-text hand-crafted for 3 languages etc.
|
| As it turns out, this approach is/was a dead-end.
| kapp_in_life wrote:
| Neat. How translatable are tones of voice for intent across
| languages? For example, does a person doing a "nerdy"
| voice (nasally, whiny, etc.) in English translate to the
| "nerdy" stereotype for a French speaker? It seems to do very
| well on whispers, which made me wonder what could be next.
| jeffbee wrote:
| If you don't speak the language into which these models
| translate your inputs, how do you know if or why the model has
| generated, without being commanded to do so, a campy American
| gay male sociolect, or an African American regional accent, or
| some other thing that may convey unintended meaning to native
| listeners?
| apwell23 wrote:
| .
| jvolkman wrote:
| The Google Translate app has a conversation mode.
| wg0 wrote:
| And just the other day StyleTTS[0].
|
| Just text to speech has gone too far. Audio books would be mainly
| generated on the fly like this?
|
| I think some RPGs in some 5 years time might have something like
| this:
|
| - A text file that outlines characters and a loose plot/story
| line. Human-written.
|
| - 3D Mesh Generation based on character description via
| Transformers based models. Auto generated.
|
| - Dialogues for each NPC via LLM.
|
| - This TTS engine again based on such models.
|
| Result: almost unlimited replayability. Or even edit the text
| file and have a new world based on a new storyline, with
| characters having different personas.
|
| [0]. https://news.ycombinator.com/item?id=38335255
| mpalmer wrote:
| How has TTS gone too far?
| wg0 wrote:
| Came a long way, that is - from the days of, if I recall
| correctly, the Windows 98 screen reader.
| TheCaptain4815 wrote:
| The demo is so much fun to use. I can't wait for all these
| technologies to start integrating into filmmaking / games.
| anonzzzies wrote:
| How far from a real-time Star Trek translator? Whisper is fast
| enough and light enough, LLMs are getting there, so it's close
| isn't it?
| Sol- wrote:
| Seems like there will always be latency, because it's not
| possible to easily stream across languages that have different
| structures. You need to wait a bit before you can start
| faithfully translating the meaning.
|
| They also mention it in one of the videos about the streaming
| variant of their translator. But I guess the ~2s delay they
| mention is close enough for practical purposes.
|
| I feel like for personal relationships where true real-time is
| required, having a computer intermediary would be weird anyway
| and you have to learn the language, at least for the time being
| and as long as personal relationships are still relevant (in
| the post-AI world they might not be).
| forgot_old_user wrote:
| > You need to wait a bit before you can start faithfully
| translating the meaning
|
| I guess it's possible that the AI learns about a specific
| person over time? That way it can be confident about what's
| being said as soon as the person starts saying it.
| ziptron wrote:
| If you are multilingual but have young children and plan to
| continue residing in your current English-speaking country for
| the foreseeable future, are you opting to teach your children
| those additional languages, or are you adhering to the idea
| that they can always learn those languages later if necessary,
| considering it might not be essential (especially with models
| like this)?
| esafak wrote:
| It is easier to learn multiple languages when you are young.
| robga wrote:
| There isn't a lot of good evidence behind this popular
| conception.
|
| If anything, the evidence is that it isn't true; see
| https://journals.plos.org/plosone/article?id=10.1371/journal...
|
| Any apparent causality of age of acquisition seems to be a
| proxy for hours of exposure. It may well be that it is easier
| for young people to rack up a lot of exposure to a second
| language, but there is not much evidence that age plays much
| of a factor between people of different ages who had the same
| degree of exposure.
| debugnik wrote:
| > we argue that the late learners resort to computationally
| less efficient processing strategies when confronted with
| (lexically determined) syntactic constructions different
| from the L1.
|
| > we show that the ERP signal in response to grammatical
| violations depends on the AoA of an L2 learner, as well as
| on the regularity of the structure under investigation. In
| (lexically determined) syntactic constructions different
| from the L1, we found a gradual change in processing
| strategies that varies by AoA, with a native-like effect
| for early learners and a less efficient neural processing
| strategy for later starters.
|
| Although they do clarify that these effects _could_ be
| confounded with age of acquisition instead of it being the
| cause.
| navbaker wrote:
| Seamless Streaming looks really promising! We just had a new
| employee start a few months back with profound hearing loss and
| our company had no idea what to do with him from an accessibility
| standpoint. They threw out solutions like Dragon, not realizing
| those solutions are not real-time.
|
| He ended up rolling his own solution by standing up Whisper in
| one of our clusters and writing a basic front end and API to take
| his laptop's mic input and chunk it every few seconds to send to
| the model and get back text in pseudo-realtime. We got him a
| pretty beefy Alienware so he wouldn't be tied to the cluster
| GPUs. I can't wait to see what he does with these new models!
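|
| A minimal sketch of that kind of chunked pipeline using the
| open source openai-whisper and sounddevice packages (chunk
| length and model size are arbitrary choices, not his actual
| setup):
|
|   import sounddevice as sd
|   import whisper
|
|   model = whisper.load_model("base.en")
|   RATE, CHUNK_SECONDS = 16000, 5   # Whisper expects 16 kHz mono
|
|   while True:
|       audio = sd.rec(CHUNK_SECONDS * RATE, samplerate=RATE,
|                      channels=1, dtype="float32")
|       sd.wait()                    # block until chunk is captured
|       text = model.transcribe(audio.flatten(), fp16=False)["text"]
|       print(text, flush=True)      # pseudo-realtime captions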
| cgb223 wrote:
| Just wanted to say you're a great employer to be so incredibly
| accommodating, to the point that you got them an Alienware and
| let them roll their own accessibility solution.
|
| We need more support for employees like this!
| cced wrote:
| Second this!
|
| Also, what about Apple's latest M3 series chips? Are these in
| the same realm as an Alienware in terms of AI compute?
| jackson1442 wrote:
| I think the general consensus on Apple Silicon is that
| they're great _for a laptop_, but still aren't going to
| beat a dedicated graphics card + high-end CPU like an
| i9/Ryzen 9. The biggest thing going for Apple is performance
| per watt, though, which is critical for a laptop.
| cjbprime wrote:
| I think this is missing the main reason to use Apple
| Silicon, which is that your dedicated graphics card
| probably has 24GB or less of RAM, whereas e.g. an M2
| Ultra Mac Studio can have 192GB of RAM with a far
| superior memory bandwidth to anything on x86. This is
| important because even a "small" LLM like Llama2 13B
| would require quantization to fit in the 24GB RAM that
| the dedicated graphics card will give you, whereas the
| Mac could run Llama2 70B without quantization (at FP16).
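|
| A quick back-of-envelope check of that claim, assuming FP16
| weights at 2 bytes per parameter:
|
|   params = 70e9                 # Llama2 70B
|   gib = params * 2 / 2**30      # FP16 = 2 bytes per weight
|   print(round(gib))             # ~130 GiB: fits in 192 GB,
|                                 # nowhere near 24 GB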
| aftbit wrote:
| Whisper doesn't need that much RAM though.
| willy_k wrote:
| They definitely are in terms of energy efficiency
| nodja wrote:
| They're better than most consumer x86 CPUs but worse than
| using a GPU. Where they shine is when the ML model can't
| fit in the GPU's VRAM, since you have better options for RAM
| size with Macs.
| romwell wrote:
| >Just wanted to say you're a great employer to be so
| incredibly accommodating to the point you get them an
| Alienware
|
| So gracious, to give a software developer some hardware to
| run the software they _need to work_, that costs a whopping
| _nothing_ more than what other people in the industry get on
| average.
|
| >and let them roll an accessibility solution
|
| "You're such a good employer! You let your employee build
| _their own_ accessibility ramp to the back entrance _in their
| own time_ , and _even_ got them a mortar spatula to do so! "
| We need more support for employees like this!
|
| >We need more support for employees like this!
|
| And less support for _employers_ like this.
| Solvency wrote:
| Not sure why you're being downvoted. Literally the
| equivalent of building your own ramp.
| freedomben wrote:
| I didn't downvote, but I considered doing so because
| nowhere that I saw in GP does it say _in his own time_,
| and that's a critical piece of the equation.
| Hallucinating that datum means they got the argument
| wrong and, worse, they were harshly critical of the
| company based on that _wrongly assumed_ information.
|
| It reminds me of the Homer Simpson quote, "I don't mind
| being called a liar when I'm lying, or about to lie, or
| just finished lying, but NOT WHEN I'M TELLING THE TRUTH!"
| I would be equally critical if it was warranted, but when
| it isn't it's deeply unfair to the accused.
|
| If the person _wanted_ to build their own ramp, and the
| employer let them do it on the clock, that's a
| completely different scenario than the employee having to
| come in during their off-hours to build the ramp just so
| they can go to work.
| qkeast wrote:
| Awesome! I love hearing about places making the effort to be
| inclusive.
|
| As someone who's profoundly deaf myself, another less technical
| approach is to install Rogue Amoeba's Loopback, and use it to
| pipe audio from a given app into a tool like Google Meet or
| Otter.ai using the Loopback device as the audio source. This
| effectively provides real time captions for anything running on
| your existing machine.
| tuukkah wrote:
| Clever use of Google Meet as a tool! Also, Google Pixel
| phones now provide realtime captions to any speech playing on
| the phone (Accessibility > Live Caption). You can also choose
| a "preferred language" and the captions will be automatically
| translated to that language from other languages.
| jallmann wrote:
| Google Chrome [1] also has captioning built-in [2], so this
| could also work from a plain page that hooks into the
| loopback device. Pretty sure it's using the same speech-to-
| text backend that Google Meet uses.
|
| The nice thing about Chrome feature is you can move the
| caption box around and keep it in the foreground while doing
| other things, although styling options seem limited (the text
| might be a little small for some).
|
| [1] on desktop, not sure about mobile
|
| [2] via chrome://settings/accessibility -> Live Caption
| romwell wrote:
| >Awesome! I love hearing about places making the effort to be
| inclusive.
|
| The extent of the effort being getting their employee a
| slightly-more-expensive-than-average tool that would enable
| them to do their job better _regardless_ of the disability?
|
| Such inclusive, much pat-yourself-on-the-back, wow.
|
| "We gave our woodworking shop employee a quality saw so that
| they'd make _their own_ accessibility ramps! "
| callalex wrote:
| What would you have them do instead?
| qkeast wrote:
| I have literally been told in job interviews that the
| company would not be "allowed" to hire me because I'm
| hearing impaired, so yes, making an effort to support an
| employee's disability and their needs is worth recognizing.
| RogerL wrote:
| So what? Okay, in the case of a ramp, if you need one you
| probably are going to have difficulty building one. So pay
| employee Sally to build it instead, absolutely.
|
| But hearing loss does not impair standing up servers and
| software. They can pay the employee who is probably the
| expert at this, the guy with the hearing loss, or go task
| Emil with doing it to... avoid 'appearances'?
| pawelduda wrote:
| That's very nice of you
| romwell wrote:
| >He ended up rolling his own solution
|
| >That's very nice of you
|
| ...doesn't compute.
|
| What exactly was nice here?
| diab0lic wrote:
| > We got him a pretty beefy Alienware so he wouldn't be
| tied to the cluster GPUs.
|
| Probably this.
| lovich wrote:
| Y'all should turn that into a product, or at least open source
| it and get the positive PR while helping others.
| FloatArtifact wrote:
| > Y'all should turn that into a product, or at least open
| source it and get the positive PR + helping others
|
| There you go. https://github.com/dictation-toolbox/dragonfly
| kylixz wrote:
| I recommend checking out: https://talonvoice.com/
| FloatArtifact wrote:
| It's not open source nor does the author intend to open the
| stack.
| aftbit wrote:
| Check out Willow! It does essentially this, using WebRTC. It
| doesn't handle the near-real-time response yet, but it does
| stream the audio to the server and the change would be pretty
| minor.
| FloatArtifact wrote:
| > Check out Willow! It does essentially this, using WebRTC.
| It doesn't handle the near-real-time response yet, but it
| does stream the audio to the server and the change would be
| pretty minor.
|
| Simple voice-to-text is not what's needed for dictating
| commands. Unless I can load commands on the fly and decode
| utterances, it won't be that useful.
|
| The client would need to be able to send its commands to the
| server on the fly.
| FloatArtifact wrote:
| The problem with Whisper is that it's not really optimized for
| command recognition versus general dictation.
|
| - Whisper processes 30-second audio chunks, so if you process 5
| seconds of audio you have to pad it out with 25 seconds of
| silence. Hence a loss of efficiency, with CPU/GPU cycles
| wasted on 25 seconds per chunk in the case above.
|
| - Whisper most likely can't performantly handle hundreds of
| commands, much less a thousand.
|
| - Whisper doesn't handle short commands very well, nor does it
| accurately post-process commands out of free dictation
| utterances.
|
| Command dictation should be weighted higher than general
| dictation when decoding.
|
| I work with a little under 1500 of commands dragon naturally
| speaking. DNS is hot garbage as a program despite it has the
| best accuracy to date with the feature of commands and
| dictation in one utterance. You get to pay $750 for the
| privilege m
|
| I've yet to see a free and open source speech recognition
| engine that can handle both dictation and commands with a high
| degree of accuracy.
|
| Please please let me know if there's alternatives out there. I
| would definitely pay to support an open source project like
| this that focuses on command and dictation.
|
| Most open source solutions out there nowadays focus so
| much on IoT command recognition with intents. That's not well
| suited for controlling your computer with grammars containing
| voice commands.
| novok wrote:
| Is the 30s input size set by the model or by the programs that
| wrap the model? Is it how it's trained?
| bakkoting wrote:
| It's a property of the model itself.
|
| > Input audio is split into 30-second chunks, converted
| into a log-Mel spectrogram, and then passed into an
| encoder.
|
| https://openai.com/research/whisper
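|
| The fixed window is visible in the reference implementation's
| own helpers; a minimal example (the file name is made up):
|
|   import whisper
|
|   model = whisper.load_model("base")
|   audio = whisper.load_audio("clip.wav")  # 16 kHz mono
|   audio = whisper.pad_or_trim(audio)      # exactly 30 seconds
|   mel = whisper.log_mel_spectrogram(audio).to(model.device)
|   result = whisper.decode(model, mel,
|                           whisper.DecodingOptions(fp16=False))
|   print(result.text)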
| sagz wrote:
| Do they need realtime transcription?
|
| Computer: webcaptioner.com
| Android: Live Transcribe (g.co/livetranscribe)
| iOS: Live Caption with the 'mic' icon enabled.
|
| Web conferencing: Meet, Zoom, and Teams all support realtime
| CC, which is pretty good.
| londons_explore wrote:
| Does "reduce toxic words" and "promoting safer communication"
| mean that if you say something wrong about LGBTQIA+ people it
| will 'correct' what you say?
|
| I'm not sure I want the latest twitter trend to be involved in
| the design of my translator...
| jwineinger wrote:
| Their video said it was to reduce toxic word hallucinations,
| which does seem admirable/useful. I'm testing real-time
| translation in a church setting, and I've witnessed Whisper
| hallucinating profanity, which is quite undesirable.
| cgb223 wrote:
| "Toxic word hallucination" would be a great punk rock band
| name
| kelseyfrog wrote:
| It also happens to be quite hilarious.
| mortimerp9 wrote:
| Hi, I work on Seamless. What this refers to is added toxicity
| mitigation. We try to detect the level of toxicity in the input
| and make sure that the output toxicity level is not higher.
| This protects the model from making egregious errors in the
| translation.
|
| There are more details in the paper if you want and the
| mitigation code is all open source if you want to check what it
| actually does.
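|
| Schematically, the mitigation described above amounts to a
| check-and-retry loop; this sketch uses hypothetical helper
| names, not the actual seamless_communication API:
|
|   def mitigated_translate(text, translate, toxicity,
|                           max_retries=3):
|       # Never let the output be more toxic than the input.
|       budget = toxicity(text)
|       output = translate(text)
|       for _ in range(max_retries):
|           if toxicity(output) <= budget:
|               return output
|           # Re-decode, e.g. banning the flagged words.
|           output = translate(text)
|       return output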
| Domenic_S wrote:
| > _What this refers to is added toxicity mitigation._
|
| Oh, well _that_ clears it up! </snark>
|
| I don't see any definition of 'toxicity' on the landing page
| - it seems to be one of those 'I know it when I (hear) it'
| kind of words... unless there's some widely-accepted
| definition in this area of study?
| mortimerp9 wrote:
| Sorry if I wasn't clear, internally we've been talking
| about it a lot, but I forgot that it doesn't have such a
| solid definition outside of our work. Thankfully, we try to
| define it in section 7.3 of the NLLB paper:
| https://arxiv.org/pdf/2207.04672.pdf
|
| The tldr is that if you say: "Thank you for this job
| offer." you wouldn't want it to be (mis)translated as "Go
| F*k yourself.". But if you do say "Go F yourself", you
| still want it to be translated as that.
| Reubend wrote:
| That's an awesome feature. I think one of the worst possible
| outcomes of machine translation is something that ends up
| being accidentally offensive, and this is a smart way to
| mitigate that.
| fl7305 wrote:
| > one of the worst possible outcomes of machine translation
| is something that ends up being accidentally offensive
|
| The Hitchhiker's Guide To The Galaxy claims the opposite:
|
| "Meanwhile, the poor Babel fish, by effectively removing
| all barriers to communication between different races and
| cultures, has caused more and bloodier wars than anything
| else in the history of creation."
| SoftTalker wrote:
| Or maybe we'll finally come around to the idea that being
| offended by _words_ doesn 't make a lot of sense.
| dontupvoteme wrote:
| How do you account for colloquial (non-English) language
| which could be naively misconstrued as toxic?
|
| e.g. "geil" (either cool or horny depending on usage) in
| German
|
| It's not fundamentally different from e.g. "wicked" in
| English, but the biggest bias that potentially all these ML
| models exhibit is a predisposition towards Anglophoneism.
| mortimerp9 wrote:
| Our goal is to have good recall, sometimes to the
| detriment of precision, so for words with multiple
| meanings, it might consider them toxic even when, in the
| actual context they are used in, they are not. The toxicity
| mitigation algorithm will search for alternative
| translations that have the correct meaning but not the
| potentially toxic word, so that there is no added toxicity
| in the output. This means that sometimes the model might
| prefer a less colloquial phrasing than what a human would.
|
| You can find details on how the multi-language creation of
| the toxicity lists was done in section 7.3 of the NLLB
| paper: https://arxiv.org/pdf/2207.04672.pdf. TLDR: it's not
| just a translation of a base English list, even if we
| started from that, each language has a curated list that
| was built by professional translators.
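|
| A toy sketch of that candidate search (again illustrative
| only; is_toxic stands in for the word-list lookup):
|
|     def pick_nontoxic(candidates, source_is_clean, is_toxic):
|         # candidates: (text, score) pairs, sorted best-first.
|         # If the source had no listed toxic words, prefer
|         # the best hypothesis that also has none.
|         if source_is_clean:
|             for text, score in candidates:
|                 if not is_toxic(text):
|                     return text  # maybe less colloquial
|         return candidates[0][0]  # else keep the top one
|
|     out = pick_nontoxic(
|         [("Go f*k yourself.", -0.9), ("Go away.", -1.3)],
|         source_is_clean=True,
|         is_toxic=lambda t: "f*k" in t,
|     )  # -> "Go away."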
| dontupvoteme wrote:
| That's significantly less myopic than I pessimistically
| assumed. Thanks!
| novok wrote:
| Is there an ability to turn it off? If you're translating an
| R-rated movie with criminals who swear a lot, is it possible
| to get output without the toxicity filtering, to make sure
| it's being translated properly?
| mortimerp9 wrote:
| It only kicks in if the output is more "toxic" than the
| input. If the input has a lot of swear words and the output
| has the same amount, then it will be left alone.
| beardicus wrote:
| the site makes it pretty clear in multiple places that they're
| talking about "added" or "hallucinated" toxicity. maybe your
| culture war outrage is misplaced?
| Domenic_S wrote:
| Ok so I know nothing about how this works. It seems like if
| the model was able to properly detect words in the first
| place, it would never hallucinate 'toxicity'; if it _can't_
| recognize the word with high probability, how will it know
| whether the speaker actually said $toxicWord or whether it
| should print something else?
|
| Perhaps it's taking a Big List of Naughty Words and weighting
| them so that the system must be "extra sure" that's what the
| speaker said, or else fall back to a G-rated word?
| numpad0 wrote:
| Maybe it's for preventing unwarranted fucks[1]? Translation
| is more than just concatenating dictionary definitions, and
| machine translations routinely make this kind of out-of-place
| but technically correct lookup.
|
| 1: https://www.google.com/search?q=engrish+fucking+sign&tbm=isc...
| mortimerp9 wrote:
| Meta employee here. The system is not perfect, or it would
| not "hallucinate"; while it's pretty good, it does sometimes
| make errors (not just hallucination, maybe just some
| mistranslation due to noise in the training data). What we
| want is to prevent these errors from introducing toxicity
| (think swear words) that wasn't in the input, as this could
| be very bad for the user. There is a separate system that
| double-checks the output (compared to the input) and tells
| the translation model to try again if it's too bad.
| madeofpalk wrote:
| Your framing of basic respect as being a "twitter trend" is...
| bizarre.
| jadbox wrote:
| Your comment seems to imply LGBTQIA+ is just a Twitter trend,
| versus people's lived experience and lifelong identity. This is
| as unnecessarily judgmental as small identities claiming that
| straight people must self-identify as cis.
|
| There is no moral superiority to deny or force label other
| people's identities. You're an attack helicopter? Great, roger
| dodger, let's go get coffee Seahawk.
|
| No one is seriously asking for litter boxes in school bathrooms
| or helicopter refueling stations.
| mpalmer wrote:
| > No one is seriously asking for litter boxes in school
| bathrooms or helicopter refueling stations.
|
| This feels a bit out-of-nowhere.
|
| My read on parent comment was that "Twitter trends" are fast-
| changing norms about what language is (un)acceptable. They
| were not saying that LGBTQIA+ identity itself is a trend.
| jadbox wrote:
| Perhaps so. In light of Russia's announcement yesterday
| labeling the "international LGBT public movement" as
| extremist, I think we should be careful what we label as
| fads or (worse) insidious activity. Source:
| https://www.themoscowtimes.com/2023/11/30/russia-bans-intern...
| mpalmer wrote:
| You seem to me to be arguing against points no one is
| making. You're taking the word "trend" and extrapolating
| it to "fad" and "insidious activity" - both of which have
| very different meanings and connotations to the phrase
| "Twitter trend".
|
| The original comment you replied to made the point that
| they don't want their own personal expression curtailed
| or modified according to someone else's opinion of
| acceptable speech.
|
| As someone who repudiates Russia's policies, I support
| and agree with their point.
| sjbase wrote:
| > Please don't use Hacker News for political or ideological
| battle. That tramples curiosity.
|
| From the hackernews guidelines
| zengid wrote:
| If "toxic word hallucinations" isn't a cyberpunk phrase I don't
| know what is.
|
| (quote from the video presentation in the link)
| spacephysics wrote:
| Oh god they're gonna censor the output. Time for musk to make a
| non-censored version lol...
| drexlspivey wrote:
| I am sorry Dave, "merde" is not in the pre-approved word list
| dontupvoteme wrote:
| I wonder if it doesn't understand the common colloquial usage
| of "geil" in German. This sounds like it is going to mess up
| natural language
| troseph wrote:
| I feel like naming something "seamless" is not dissimilar to
| calling the Titanic unsinkable.
| bsza wrote:
| "We need access to your microphone and camera to record your
| voice and translate it with your expressions."
|
| None of the videos shows any modified/lip-synced footage. There
| doesn't seem to be a reason for this thing to need access to my
| camera.
|
| Also, using it with tape over the camera doesn't seem to work
| either. (Perhaps it needs to see facial expressions in order to
| work?)
| Havoc wrote:
| Can this also do straight TTS or is it translation only? It's
| not quite clear to me from the site.
| tambourine_man wrote:
| Every video on this page is a bit out of sync with the audio.
| Combined with the blandness of the facial expressions and the
| whole mood in general, I kept waiting for the moment when the
| video would disclose that everything in it was created by AI.
| nextworddev wrote:
| RiP elevenlabs?
| Reubend wrote:
| Wow, after trying out the demo, I'm floored by how high quality
| this is. The translations worked perfectly, the voice cloning
| was "good enough", and the emotions conveyed in my voice were
| retained pretty accurately.
|
| I don't think this would fool anyone into thinking I was a
| real native speaker of the target language, but for casual
| conversation this would work pretty much perfectly. It
| basically avoids all of the
| traditional pitfalls of machine translation, like the unnatural
| robotic voice that it outputs, the slow translation speed and
| huge latency for realtime conversation, and the loss of emotion.
| stephc_int13 wrote:
| As a French native speaker, I am surprised by the low quality
| (frankly ridiculous) voice of the French translation example.
|
| Especially because the head of AI at Meta is a French guy AFAIK
| (Yann LeCun).
| sangnoir wrote:
| They are optimizing for speed (low latency)
| yread wrote:
| Does the spanish expressive sample sound muffled for others too?
| And the french sounds super mechanical. Hopefully, it's more
| impressive the other way.
|
| Also: "This research demo is not open to residents of, or those
| accessing the demo from, the States of Illinois or Texas"
| dentalperson wrote:
| Yes, they all have significant 'ghosting' artifacts where the
| harmonics are a bit fuzzy if you listen closely. AFAIK all of
| the recent neural speech engines have this, from SoundStream to
| EnCodec, especially in low latency causal setups. Wavenet was a
| bit better in that regard but has fallen out of style due to
| complexity and the lack of a bottleneck. It seems like
| something diffusion post processing would be able to clean up.
| TacticalCoder wrote:
| The "expressive" example in french exhibits a _thick_ accent
| which bothers me more than the mechanical aspect of the non-
| expressive french example.
|
| It's not dissimilar to some kind of "ch'ti" / "chtimi" accent,
| or a Belgian-French accent (which itself resembles the ch'ti
| accent heard in parts of the north of France): "Ne partez
| pooooo" (with a drawn-out "a" that sounds nearly like an "o";
| that's not proper French at all) instead of "Ne partez pas"
| ("don't leave").
|
| That said, I'll take the non-expressive voice any day over
| subtitles when watching video in a language I don't
| understand: it's clearly good enough.
| grogenaut wrote:
| Illinois is possibly because they don't allow storage of
| biometric data without express permission and I believe
| explicit usage restrictions. So I bet they're keeping all of
| your utterances, which would violate that law.
| iFire wrote:
| LICENSE
|
| Attribution-NonCommercial 4.0 International
|
| https://github.com/facebookresearch/seamless_communication/b...
| iFire wrote:
| Took me 2 minutes to find the GitHub.
| nathanfig wrote:
| Impressive work, really excited for this.
|
| I will note though that I feel safer getting an occasional bad
| word than I do having a translator straight up deceive me.
|
| For example, "what the fuck" in English->Spanish is giving "que
| diablos" output. Definitely toning down the meaning there.
|
| If someone says something mean to me, I want to know it.
| jonathanlb wrote:
| This may be an intentional decision given that there are
| several ways to say "what the fuck" in Spanish, such as "que
| mierda" or "que carajos". And that's not including regional
| expressions like "que cono" or "que chingados". So, saying "que
| diablos" may be the most common expression across dialects
| conveying the same meaning.
| nathanfig wrote:
| Yeah could be, I still need to read the paper to better
| understand the safety tuning.
|
| Would be interesting to see some work stress-testing the
| ability to convey ill-intent across multiple languages.
| Accurately conveying ill-intent is safety-critical for the
| person being threatened.
| trinovantes wrote:
| Currently Steam bans games from using AI-generated assets (for
| good reason). I wonder if they'll back track on this or carve
| exceptions because this tech seems really useful for indie devs
| to add voice work to their otherwise silent games.
| yjftsjthsd-h wrote:
| Very speculative amateur opinion: My understanding is that
| Valve didn't exactly ban AI, they banned AI that was fed
| copyrighted works that could possibly make the results
| copyright infringement (
| https://www.theverge.com/2023/7/1/23781339/valve-steam-ai-ar...
| ). (Side note: Regardless of individual views on whether AIs
| are just copyright regurgitators or not, I can understand Valve
| being cautious until courts have actually decided.) So _if_
| speech models can be made purely from assets that their
| creators can prove they have the rights to use, it would
| probably be easy enough to get it approved.
| ChuckMcM wrote:
| I look forward to the day where I'm wearing my headphones in a
| foreign land and hearing all of the discussions in my own
| language.
|
| The "universal translator" which was part of Star Trek and a lot
| of other Sci-Fi I was exposed to as a kid was something I was
| really fascinated with. My Dad worked as a simultaneous
| French->English translator and sadly spent long hours away from
| home and, as a kid, I started trying to build a translator so
| that it could do his work and he could be home more.
|
| Translation is important work and one that could help a lot of
| people. It's my hope that we get to the point where these models
| work entirely on locally carried resources.
| sacvnsune wrote:
| If I am not wrong, Google Pixel Buds offer a live translate
| feature.
| echelon wrote:
| Not in the voice of the original speaker.
| stevenicr wrote:
| now if I could just get the pixel buds tech to remove the
| voice of the original speaker and translate some youtube
| videos from thick accent english into no accent am-english.
| ChuckMcM wrote:
| This is a really interesting use case. I could definitely
| see this as a service for content providers to get more
| reach and I think you could justify a subscription price
| for the service based on this.
|
| By creating and keeping speaker-specific tonal ranges and
| profiles, you maintain better cohesion in the final product.
| keerthiko wrote:
| Obligatory, not directed at you in particular since I'm
| sure you mean no offense, but just voicing a pet peeve:
|
| I grew up bilingual outside the US, and speak English
| with a hybrid British/Indian/Middle Eastern accent (with
| some of my personal quirks, and mixing increasing amounts
| of various American accents over time). I can understand
| English in nearly any accent (Singaporean, Chinese,
| Vietnamese, Indian, Nigerian, eastern European) as long
| as the words involved are globally used and the grammar
| is passably Queen's. Especially after hearing it for
| about an hour. And people who natively speak English with
| these various accents usually can understand my English
| better than they can an average American accent. Yet in
| this country, my accent is belittled, despite being
| perfectly understood and more versatile. Even by others
| who don't speak with the American accent!
|
| This is the problem of the "default accent" anywhere
| being referred to as "no accent", and therefore anything
| deviating is considered "having an accent". This makes
| "accent" a negative trait, scaling from 0-bad to heavy-
| bad. But if the vernacular were such that we said
| "American accent" instead of "no accent", then no one's
| accent is bad, just unfamiliar.
|
| Most of my non-American peers who were raised on English
| have a better command of the language than my American
| ones, yet they are mocked for their accents as if they
| don't know the language, when in reality it's the
| Americans' lack of familiarity with the language (as it's
| used globally) that prevents them from comprehending it.
|
| So yes, put in more work, the world is shrinking and
| English is the global language (for better or worse).
| What you're saying is spoken from a position of privilege
| because the culture allows you to mock others' accents
| and imply your version of it is the correct one that
| everyone else should put in work to provide you with,
| rather than the other way around.
|
| Every time you hear English with an accent other than
| British, American or Australian, remember that it usually
| means the speaker knows at least one entire other
| language as well, probably one that you would sound like
| an idiot if you tried to speak it. Don't be rude or
| dismissive of their command of English.
|
| In fact, you were so close -- you called it a "no accent
| am-english", when you could have just called it what it
| is -- "an american accent".
| freedomben wrote:
| I'm not OP, but doing what you did is a pet peeve of
| _mine_:
|
| > _What you're saying is spoken from a position of
| privilege because the culture allows you to mock others'
| accents and imply your version of it is the correct one
| that everyone else should put in work to provide you
| with, rather than the other way around._
|
| > _Every time you hear English with an accent other than
| British, American or Australian, remember that it usually
| means the speaker knows at least one entire other
| language as well, probably one that you would sound like
| an idiot if you tried to speak it. Don't be rude or
| dismissive of their command of English._
|
| This is so uncharitable an interpretation of GP that it
| makes me wonder if it's Poe's Law at play and you're
| actually trolling. Nevertheless, I will assume you are
| being serious and address your comments as such.
|
| You clearly have some deeply held frustrations (at a
| minimum), but unless you have a history with GP and
| therefore a _lot_ more context on them than I do from
| just reading these comments, or unless GP edited their
| post between your reading it and my writing this, then
| you are majorly projecting upon them based purely upon
| negative stereotypes that you harbor against Americans.
| If I've missed the mocking or rude dismissiveness you
| refer to, then please point it out with a direct quote so
| I can further examine what you are referring to.
|
| There definitely are people (and definitely some
| Americans, though it's certainly not monopolized by them.
| I was once ridiculed by locals in Mexico City for my
| terrible Spanish) who "mock" accents and are generally
| assholes who don't appreciate the difficulty of speaking
| a non-native language, and many of them would deserve the
| criticism you've levelled at GP. But in unloading those
| accusations and chastisement at a person without cause, I
| don't think you're behaving any better than the people
| you would criticize.
| archagon wrote:
| I don't think it's unreasonable to remind people that a
| "default" accent does not exist, and that AI-editing an
| accent out starts to feel a bit like dystopian identity
| erasure and homogenization. Even if we scope ourselves to
| Americans speaking English as a first language, there are
| dozens of diverse accents across the country.
| ChuckMcM wrote:
| I think this is one of those times when my Mom,
| understanding my desire to be understood and to ask
| questions about motives and related understanding, would
| observe the, oblivious to me, effect of inflaming the
| conversation and say, "Charles, this is not the time."
| :-)
| archagon wrote:
| I don't like seeing a comment that's relatively
| reasonable get greyed out just because it grinds
| somebody's gears. Alas, I only have one counter-downvote
| to give, so I feel obliged to comment.
| stevenicr wrote:
| My original statement was about wanting a translator
| device, hardware or software, so I could understand and
| learn better.
|
| There was no desire for identity erasure or
| homogenization: leave anyone's voice the way it is
| online, just give me an option to translate it. I added
| more about my issue downthread.
|
| Diverse accents across the country - absolutely! Which
| is why I said 'no accent am-english' (for me, as I can't
| learn well outside that), assuming that if this tech
| exists it could help me, and perhaps be tweaked to change
| to other accents for other people, as also mentioned in
| my downthread reply.
| stevenicr wrote:
| I appreciate your sharing, and stating that you assume I
| meant no offense, and that your thoughts are not directed
| at me specifically.
|
| I could have been more specific, but my request was for
| the tech to vary, which I think would lead to specific
| options for different people.
|
| And actually to be even more.. not sure the word.. I want
| 'the Chicago accent' I think it's called, or midwest / no
| accent. Personally as much as I enjoy some entertainment
| from Jersey / NY accents, I would not volunteer to watch
| tutorials on tech taught by the Sopranos cast - as funny
| as that might be (and I get that if you are from the NE,
| you may be learning just fine being taught in such a
| language style).
|
| As annoying as some of the Cali style of language is, I can
| understand the words and meanings without squinting my
| ears and spending double the brain cycles trying to
| understand the words, while then interpreting the
| meaning, and then trying to put together concepts for
| understanding new ways of coding or using tech.
|
| I've run into folks in Louisiana that I could not
| understand at all and had to ask for an interpreter at a
| gas station. From Florida to Chicago to Seattle down to
| Miss and Ala - I can hear what people are saying and
| learn without spending lots of extra energy trying to
| understand.
|
| With that being said, I understand there are parts around
| Miami where accents may be thicker (or not) - and with
| some folks even if using the rights words and grammar, I
| may need to slow down the speech to actually learn if
| they were teaching a class.
|
| The slow down and speed up options already exist with
| youtube.
|
| "So yes, put in more work"
|
| - I do try a bit. I don't mind accents with some folks
| and media. For example I can listen to and enjoy Shankar
| sharing via the 'hidden brain' series, partially because
| his accent is limited but also because the media requires
| less thought intensity.
|
| I have tried many youtubes, and bought a few courses
| taught from folks in India and other places where I just
| could not muster the energy. I literally squint with my
| ears and feel like my head gets hot trying to decipher
| what is being said, translate into what is meant, and how
| it should create new patterns of understanding in my
| brain.
|
| I can only do that for so long and I am done. Now I just
| skip any learning video that has non-am English speakers.
| When I consider courses to sign up for or buy, I have to
| research the authors / speakers and find video of them to
| hear the audio, because I just can't learn well that way.
|
| "other than British," - True story, a few years ago I had
| to call an ISP in Britain(?) and the person I got to to
| file an issue with, I could not understand them. I had
| ask 'what did you just say' many times. I laughed at
| myself for even thinking of saying 'can you slow down and
| speak clearer English please' - I mean, crazy... I was
| paying by the minute for the long distance at the time
| and it ended up being a 25 minute call that could of been
| 10 if I had a magic translate without accent device.
|
| "a position of privilege because the culture allows you
| to mock others' accents"
|
| - This is truly not about mocking accents, this is truly
| about my lack of ability to learn well.
|
| Yes, I would definitely sound like an idiot trying to
| speak another language. Like I said, I do not learn as
| well as some others.
|
| Truly not my intent to be rude. I apologize if the
| shortness came off that way, I was trying to be brief in
| the hope that there's a chance that some tech like this
| exists and someone here could point me to it. Before I
| posted, I DDG'ed it and found a couple of things
| attempting to be in that space with a 'speak to sales'
| type of 'you'll never afford this' button for info.
|
| I will never be dismissive of anyone's command of
| English, or other spoken language, or computer language
| or anything like that. There is no way for me to know
| someone else's situation and circumstances led them to
| their current command of whatever language. If someone is
| trying to learn more at any age; I applaud and encourage
| them - being rude or dismissive does not encourage more
| learning.
|
| "no accent am-english", when you could have just called
| it what it is -- "an american accent". - Well maybe, but
| actually I meant to be more specific, as mentioned a bit
| above - I mean '"no accent" American accent' - because
| there are plenty 'American accent' types that I would
| want removed by a magic earpiece to make it easier for me
| to understand and learn.
| keerthiko wrote:
| I appreciate the thoughtful reply. I don't think you're
| rude, and I get what you're saying as someone who thinks
| a lot about accents and languages. However, I still think
| you missed my point.
|
| There is no "no accent". An accent is a baseline feature
| of intelligible human speech, like a voice, or a volume,
| or a language. You can't say stuff without those
| features. When you say "the Chicago accent", or the
| "Midwest accent", that's an accent! Not "no accent".
|
| I understand it's common usage to refer to the default
| "radio accent" as "no accent", but in a country like
| America, all kinds of people with all kinds of accents
| speak English. Reinforcing an expectation that a certain
| (usu. majority-white-spoken) one is the "default" by
| referring to it as "no accent", implicitly suggests all
| others are erroneous affectations, even if I trust that
| is not your personal intent.
|
| All that said, I think your idea for a translation device
| capable of revocalizing what is said with an unfamiliar
| accent into one you are used to is not a bad one, and
| likely easier than translating between languages while
| retaining expressiveness.
| TheHumanist wrote:
| Babel Fish
| dimitrios1 wrote:
| Another lesson we can learn from sci-fi is that very often
| different species on a planet would have their tribal / local
| languages and dialects but all spoke a common tongue. I think
| this is the more humanizing approach, rather than delegating
| even more of our fleshly processing power to machines.
| somewhereoutth wrote:
| This seems to be what is happening in Europe (and perhaps
| more generally across the globe), with English being the
| common tongue.
|
| Question is, what will happen to the tribal / local
| languages? Will they survive?
| Cthulhu_ wrote:
| It varies. A lot of local languages have gone extinct
| already. There are linguists hard at work trying to document
| / record dying languages, but it won't be the same as
| living the language from childhood.
| micromacrofoot wrote:
| then of course, there's always Darmok and Jalad at Tanagra
| rangestransform wrote:
| how am i supposed to talk shit with my friends about other
| people in public then
| flanbiscuit wrote:
| I'm curious to know how well these models can pick up slang.
| Maybe if you talk shit in as thick a slang as you can it
| won't be able to give a good enough translation.
| kredd wrote:
| With my bi/trilingual friends who speak the same languages,
| we intermix them to make our point more clear. Don't think
| models will be good enough for mixes for a few more years,
| so we're safe!
| smcin wrote:
| Can you show us an example of such a sentence?
| kredd wrote:
| Hm, think of things like "On va bruncher" (we're going to
| brunch). The word "brunch" doesn't exist in French, but
| we add suffixes to fit it into the sentence. Very common
| in Montreal. My French isn't good enough to do that on
| the fly, but my francophone friends do that all the time.
|
| In my other languages that I am actually fluent in, it's
| kinda the same -- you use specific suffixes to soften or
| embolden your point and so on. Maybe add "exclamation
| making sounds in specific language" too. Eventually your
| nouns and verbs end up in different languages, with
| different suffixes where it "makes sense", yet the person
| whom you're talking to will "get it".
|
| Would be curious to try the new Seamless model on such
| speeches.
| bertil wrote:
| This is extremely common for every new technology:
| "upload," "download," "stream," "google," "FaceTime,"
| most code patterns, all the new ML apps, "venmo" or
| whatever the name of the app you use for payment is,
| etc.: all of those are taken as is, have a verb ending
| slapped on, and it's good enough. That's true in German,
| Danish, Dutch, French, Italian, and Spanish.
|
| The only thing that doesn't work is if you talk to people
| too young to remember Skype. Then you feel old.
| dontupvoteme wrote:
| I'd love to see a map of how it matches up to regional
| English/British accents and their slang.
| fasquoika wrote:
| Reinventing Polari is certainly one way to make yourself
| less understood...
| ugh123 wrote:
| learn Klingon?
| bertil wrote:
| Klingon is definitely going to be in the top 50 languages
| covered...
| csa wrote:
| Speak in metaphor and/or code.
|
| I've been in mixed language communities in which I wasn't
| sure who spoke what, and I have found this to be quite
| effective when done right.
|
| Good time to reference the ST:TNG "Darmok" episode and quotes
| like "Darmok and Jalad at Tanagra".
| buryat wrote:
| get better at double speak
| https://en.wikipedia.org/wiki/Doublespeak
| baby wrote:
| I'm wearing the Rayban Meta right now and they are already
| mind-blowing; I can already talk to that Meta AI assistant
| seamlessly. I bet one of the future iterations will have
| exactly this.
| figers wrote:
| Curious, what do you ask it besides take a picture / video or
| what's the weather?
|
| I have a pair and have only asked it that so far...
| diob wrote:
| The problem is you need a full sentence, plus surrounding
| sentences to properly translate a lot of things (aka context
| matters).
|
| So no matter what, conversations in your native speech would
| have to be delayed before translation.
| ChuckMcM wrote:
| I think I could adapt to that. But it would be an interesting
| experiment.
| ItsMattyG wrote:
| My understanding is that they trained a separate model to
| specifically estimate when they have enough context to begin
| translating, as a skilled translator would.
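|
| The simplest baseline in the literature is "wait-k": lag a
| fixed k tokens behind the speaker. The Seamless paper
| describes a learned read/write policy instead, but the
| control flow has the same shape. A Python sketch, where
| translate_prefix is a made-up stand-in for an incremental
| decoder:
|
|     def wait_k_stream(source_tokens, translate_prefix, k=3):
|         emitted = []
|         for i in range(len(source_tokens)):
|             if i < k - 1:
|                 continue  # READ: not enough context yet
|             # WRITE: extend the output given the prefix
|             emitted = translate_prefix(source_tokens[:i + 1],
|                                        emitted)
|         return emitted  # (tail flushing omitted for brevity)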
| DigiDigiorno wrote:
| Even the native original version needs the proper context.
| Sometimes you need the entire sentence to figure out what the
| sentence was really about.
|
| I'm reminded of Mark Twain complaining about verbs arriving
| at the very end of sentences in German (among a myriad of
| other complaints):
|
| "The Awful German Language" - Mark Twain
| https://faculty.georgetown.edu/jod/texts/twain.german.html
| scotty79 wrote:
| Sometimes you even need a second sentence or even a few to
| understand what the first sentence was about.
| sexy_seedbox wrote:
| So then we need something like neuralink to get the whole
| thought from one's brain first, then the sentences are
| processed properly for the context, then translated before
| the speech is delivered.
| freetanga wrote:
| What most people have to say is not that interesting, and tech
| won't change that
| btbuildem wrote:
| The near-realtime aspect of this is so promising -- we're getting
| closer and closer to IRL babelfish!
|
| What I would love to see is an ability to add my own voice (yes,
| at the risk of deepfakes) so that the model could "speak" in any
| language and sound more like me, not some random voice actor it
| was trained on.
| gagabity wrote:
| Can this do speech-to-text English -> English? I get strange
| results if I do a translation to the same language. It would
| be an interesting alternative to Whisper if it could.
| I_am_tiberius wrote:
| I hope all these AI products will get privacy-focused
| alternatives more quickly than they did when web2 happened.
| mkagenius wrote:
| Yet again, Hindi (the major language in India) is not even in
| the samples. India is Facebook's largest user base (and
| probably 1/3rd of the engineers working there are Indians),
| but Facebook will never put in enough effort to contribute
| back. They only use the DAU from India in investor calls.
| cafed00d wrote:
| By "samples" do you mean examples on the marketing/landing
| page? It sure looks like the model supports many major Indian
| languages like Telugu, Tamil & Kannada.
| https://huggingface.co/facebook/seamless-m4t-v2-large
|
| Yeah, I kinda agree with the spirit of your comment; it sure
| would be nice to see a major Indian language like Telugu on
| their landing page. But that's just my Indian-person bias
| speaking.
| mkagenius wrote:
| The lack of focus shows up in the results. The models never
| perform as well on Indian languages as they do on French or
| Spanish. This goes for Google, too.
| gorbypark wrote:
| I've been trying (and mostly failing) to set up a pipeline to
| get system audio into Whisper and feed that transcription into
| a Seamless M4T text-to-text translation model. It seems like
| SeamlessStreaming is going to solve most of my issues, and
| should significantly reduce latency!
|
| My ultimate goal is to have realtime translations of video
| conferences. I've moved to a new country, and while I'm super
| privileged that most of my colleagues speak English, we still
| have a number of "all hands" meetings that I get lost in pretty
| easily.
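|
| For anyone curious, the rough shape of what I'm attempting
| (untested sketch; assumes the openai-whisper package and the
| Hugging Face transformers port of SeamlessM4T v2, and
| hardcodes the source language, which for me isn't English):
|
|     import whisper
|     from transformers import AutoProcessor, SeamlessM4Tv2Model
|
|     asr = whisper.load_model("base")
|     name = "facebook/seamless-m4t-v2-large"
|     processor = AutoProcessor.from_pretrained(name)
|     mt = SeamlessM4Tv2Model.from_pretrained(name)
|
|     def translate_chunk(wav_path, src="deu", tgt="eng"):
|         text = asr.transcribe(wav_path)["text"]  # speech->text
|         inputs = processor(text=text, src_lang=src,
|                            return_tensors="pt")
|         tokens = mt.generate(**inputs, tgt_lang=tgt,
|                              generate_speech=False)
|         return processor.decode(tokens[0].tolist()[0],
|                                 skip_special_tokens=True)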
| xnx wrote:
| This tech from Google seems similar, but doesn't have a fancy
| demo: https://blog.research.google/2023/12/unsupervised-speech-to-...
| jwineinger wrote:
| Any ideas on what kind of hardware this would require to run
| S2ST?
| gloyoyo wrote:
| This is so world changing! Exactly how I wanted to speak so
| confidently!
|
| Thank you Meta!
| mightytravels wrote:
| I like how easy it is to get going, but you need to download
| about 20GB, and S2ST needs 40GB of GPU RAM!
|
| It runs, but for any audio input I tried (20s/40s/300s clips;
| you will need to provide wav, not mp3), I get just one short
| sentence returned in the target language that seems not
| related at all to my audio input (e.g. "Tous les humains sont
| créés égaux" - French for "all humans are created equal").
|
| Seems like some default text, but it runs on full GPU for 10
| minutes. Tons of bug reports on GitHub as well.
|
| Text translation works, but I'm not sure what the context
| length of the model is. Seems short at first glance (haven't
| looked into it).
|
| Oh, and why is Whisper a dependency? It seems unnecessary if
| FB has their own model?
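|
| For the wav requirement, something like this converts an mp3
| first (assuming torchaudio with an mp3-capable backend; I'm
| guessing at 16 kHz mono, which is what these speech models
| typically expect - check the repo for the exact format):
|
|     import torchaudio
|
|     wav, sr = torchaudio.load("input.mp3")
|     mono = wav.mean(dim=0, keepdim=True)  # downmix to mono
|     mono = torchaudio.functional.resample(mono, sr, 16000)
|     torchaudio.save("input_16k.wav", mono, 16000)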
| novok wrote:
| I wonder how well this will perform for automatic comics
| translation. Current local models are pretty bad.
| MagicMoonlight wrote:
| > Automatically filters out toxic speech
|
| > Watermarking
|
| So it can't be trusted at all then
| quickthrower2 wrote:
| How did that page get camera access without my permission?
|
| Edit: by the upvote I guess it wasn't just me?
| rammer wrote:
| Marketing has been heavily involved in this page...there's at
| least one coloured person for every white photo..
| asylteltine wrote:
| It really sucks that a company so irresponsible with all your
| data is one of the leading AI companies now.
| bozhark wrote:
| I want this as a channel in our Discord.
|
| It would allow more interaction between people who don't speak
| the same language.
___________________________________________________________________
(page generated 2023-12-01 23:00 UTC)