[HN Gopher] SeamlessM4T, a Multimodal AI Model for Speech and Te...
___________________________________________________________________
SeamlessM4T, a Multimodal AI Model for Speech and Text Translation
Author : mchiang
Score : 121 points
Date : 2023-08-22 13:58 UTC (9 hours ago)
(HTM) web link (about.fb.com)
(TXT) w3m dump (about.fb.com)
| gigel82 wrote:
| The speech recognition in their demo is very, very bad (~60%
| accuracy in my empirical test, vs. 95% with WhisperCPP). The
| translation is also very inaccurate.
|
| That said, I fully support open releases and look forward to
| future versions and improvements.
| Jayakumark wrote:
| Meta is killing it with these open models. Not sure why the
| Tamil language is missing as an output option.
| [deleted]
| 1attice wrote:
| ....'M4T', ahem, might mean slightly more than you think it does
| 0cf8612b2e1e wrote:
| Will there be a whispercpp equivalent? Half the reason I love
| whisper is how dead simple it is to get running. I will take
| somewhat lower accuracy for easier operation.
|
| Edit: unless there is native speaker diarization. That would be a
| huge value add.
| jmorgan wrote:
| I'm curious about this too. Lately I've been building an open
| source tool to make pulling + running models locally easier -
| https://github.com/jmorganca/ollama - right now we work with
| the awesome llama.cpp project, however, other model types have
| definitely come up. LLMs are only a small slice of what's
| available on Hugging Face, for example.
|
| It's especially interesting how you could combine different
| model types - e.g. translation + text completion (or image
| generation) - it could be a pretty powerful combination...
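|
| As a rough sketch of what I mean (plain Hugging Face pipelines
| here rather than ollama, and the model names are only
| illustrative):
|
|     from transformers import pipeline
|
|     # Chain two model types: translate non-English input to
|     # English, then hand the English text to a text generator.
|     translate = pipeline("translation",
|                          model="Helsinki-NLP/opus-mt-fr-en")
|     generate = pipeline("text-generation", model="gpt2")
|
|     prompt = translate("Bonjour, pouvez-vous m'aider ?")
|     prompt = prompt[0]["translation_text"]
|     out = generate(prompt, max_new_tokens=40)
|     print(out[0]["generated_text"])
|
| The same idea applies to speech: transcribe/translate first,
| then feed the text to whatever downstream model you like.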
| genpfault wrote:
| They have a smaller model[1] that might be amenable to Whisper-
| ization.
|
| _Much_ smaller language matrix though.
|
| [1]:
| https://github.com/facebookresearch/seamless_communication/b...
| jimmies wrote:
| Lol, they botched the first example - the one that translates
| "Our goal is to create a more connected world" into Vietnamese.
| There is a glaring typo at the end of the sentence, "hon"
| instead of "ho." It also really messed up the pronunciation: it
| read "Chung toi" as "Chung ta" - they are totally different
| words phonetically. The pronunciation also sounds like it's
| made by someone who is mentally sick. So they botched both the
| translation and the pronunciation.
|
| That's so embarrassing - especially for something meant to show
| how good their stuff is (although I think it's probably not the
| AI's fault) - it just shows how sloppy their people are.
|
| I know they have plenty of Vietnamese engineers there. Did the
| PR dept just throw this final version of the video out without
| reviewing it with them?
| [deleted]
| msp26 wrote:
| All I want is llama-2-34b (seriously what's taking so long on
| this specific model) but this is interesting too I guess.
| crakenzak wrote:
| code: https://github.com/facebookresearch/seamless_communication
|
| paper: https://ai.meta.com/research/publications/seamless-m4t/
|
| demo: https://seamless.metademolab.com/
| fotcorn wrote:
| There is also a Hugging Face Space for some quick tests without
| downloading the model:
|
| https://huggingface.co/spaces/facebook/seamless_m4t
| lhl wrote:
| I gave it a spin a little while ago. As per usual, the install
| docs didn't quite work OOTB; here's how I got it working:
| https://llm-tracker.info/books/howto-guides/page/speech-to-t...
|
| One limitation that seems undocumented: the current code only
| supports relatively short clips, so it isn't suitable for long
| transcriptions:
|
| > ValueError: The input sequence length must be less than or
| equal to the maximum sequence length (4096), but is 99945
| instead.
| nicolashahn wrote:
| Seems like you could easily write a little bash/Python script
| to chop up the recording, batch-process each chunk, then stitch
| the results together?
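|
| Roughly like this (pydub is a real library; `transcribe_chunk`
| below is just a stand-in for whatever single-clip call you are
| already making into SeamlessM4T or another STT model):
|
|     from pydub import AudioSegment
|
|     CHUNK_MS = 30 * 1000  # ~30 s pieces, well under the limit
|
|     def transcribe_chunk(path):
|         # Stand-in: run your single-clip model call here and
|         # return the transcript text for that chunk.
|         return ""
|
|     audio = AudioSegment.from_file("recording.wav")
|     pieces = []
|     for start in range(0, len(audio), CHUNK_MS):
|         chunk = audio[start:start + CHUNK_MS]
|         chunk.export("chunk.wav", format="wav")
|         pieces.append(transcribe_chunk("chunk.wav"))
|
|     print(" ".join(pieces))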
| lhl wrote:
| Probably, although you could more easily use WhisperX and get
| the same results twice as fast and without any additional
| scripting.
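|
| For reference, basic WhisperX usage looks roughly like this
| (following its README; the model size, device, and file path
| are just examples):
|
|     import whisperx
|
|     device = "cuda"
|     model = whisperx.load_model("large-v2", device,
|                                 compute_type="float16")
|     audio = whisperx.load_audio("recording.wav")
|     result = model.transcribe(audio, batch_size=16)
|     print(result["segments"])  # timestamped segments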
| houseatrielah wrote:
| SeamlessM4T-Medium { 1.2B params, filesize 6.8 GB }. Wondering
| how it compares to OpenAI's Whisper.
| thewataccount wrote:
| 281M and 235M param models too.
|
| https://github.com/facebookresearch/seamless_communication/b...
|
| I don't really know how the metrics they list compare to
| Whisper. I'm very curious whether these are fast enough for
| realtime speech-to-text - I think Whisper technically could do
| it, but it was difficult to set up, or something like that?
| aportnoy wrote:
| Go to the blog and skip to results:
| https://ai.meta.com/blog/seamless-m4t/
| rvz wrote:
| Yet somehow, many here underestimated Meta's position in AI and
| proclaimed that Meta was dying, unimportant, and far behind in
| the AI race.
|
| How dramatically things change in one year, given how
| exaggerated the talk of Meta's collapse was in 2022.
|
| Not only are they in the lead in $0 free AI models, they are
| also at the finish line in the AI race to zero.
| jacooper wrote:
| What's the license?
| minimaxir wrote:
| CC BY-NC 4.0
| noiseinvacuum wrote:
| I was trying to figure out what it means; this is the summary
| from Bard, so take it with a grain of salt.
|
| The CC BY-NC 4.0 license allows for the following uses of the
| licensed material:
|
| * Reproduction: You can copy and distribute the licensed
| material in any medium or format.
|
| * Distribution: You can distribute the licensed material to
| others.
|
| * Public performance: You can perform the licensed material
| publicly.
|
| * Public display: You can display the licensed material
| publicly.
|
| * Modification: You can remix, transform, and build upon the
| licensed material.
|
| * Derivative works: You can create derivative works based on
| the licensed material.
|
| However, there are some restrictions on how you can use the
| licensed material under the CC BY-NC 4.0 license:
|
| * Commercial use: You cannot use the licensed material for
| commercial purposes.
|
| * Sublicensing: You cannot sublicense the licensed material.
|
| * Moral rights: The licensor retains all moral rights in the
| licensed material.
|
| Here are some examples of how the CC BY-NC 4.0 license can be
| used:
|
| * A teacher can use a CC BY-NC 4.0 licensed image in a
| presentation for their class.
|
| * A student can create a CC BY-NC 4.0 licensed remix of a
| song.
|
| * A software developer can use a CC BY-NC 4.0 licensed
| library in their open source project.
|
| * A photographer can share their photos on a CC BY-NC 4.0
| licensed website.
| edgyquant wrote:
| Please do not post output from LLMs here. It is against the
| rules, and we have plenty of knowledgeable people to answer
| questions. We all have access to these chatbots if we want
| their answers.
| minimaxir wrote:
| You could just Google it:
| https://creativecommons.org/licenses/by-nc/4.0/
| [deleted]
| version_five wrote:
| Importantly, non-commercial. Almost all of Facebook's stuff
| used to be Apache-licensed; this new stance is really shitty
| of them, and I hope it limits adoption. Deigning to allow
| others to play with models (and make improvements, give
| feedback, build an ecosystem) that only you can profit from is
| not good community behavior. I'd rather see them make it
| freemium or paid if that's their goal; this is the equivalent
| of a kid licking a cookie so the others can't eat it.
| sangnoir wrote:
| > Almost all of Facebook's stuff used to be Apache-licensed;
| this new stance is really shitty of them, and I hope it limits
| adoption
|
| The AI research environment has changed from the earlier
| default-open publication norms - unlike its competitors, FAIR
| is still releasing model weights instead of serving the models
| behind an API.
|
| > this is the equivalent of a kid licking a cookie so the
| others can't eat it.
|
| More like the other kid baking a cookie with the words
| "Free Cookie" on it so others can eat it if they are
| hungry, but can't sell it for money. It'd be foolish for
| FAIR to donate preconfigured homing-missiles to OpenAI and
| others via one-way tech transfer.
| version_five wrote:
| > It'd be foolish for FAIR to donate preconfigured homing-
| missiles to OpenAI and others via one-way tech transfer.
|
| No, they could GPL it. And I don't think they're worried about
| competition taking the models anyway; there's nothing
| particularly special about the weights or training data, just
| the compute. I think part of it is pressure from AI "safety"
| hangers-on who pretend that AI is dangerous, so only those who
| don't want to abide by license terms should have unfettered
| access. The other commercial reasons are harder to understand.
| With PyTorch they became the standard that everyone builds on;
| they could have done the same with their recent AI,
| particularly LLaMA, but they chose this silly route.
|
| Also, LLaMA has a more permissive license than this translation
| one, and is a more powerful model, so I don't really see the
| "homing missiles to OpenAI" angle.
| taneq wrote:
| True, LLaMA2 is more like "donating homing missiles to
| everyone except OpenAI, Google, and Apple."
___________________________________________________________________
(page generated 2023-08-22 23:01 UTC)