[HN Gopher] SeamlessM4T, a Multimodal AI Model for Speech and Te...
       ___________________________________________________________________
        
       SeamlessM4T, a Multimodal AI Model for Speech and Text Translation
        
       Author : mchiang
       Score  : 121 points
       Date   : 2023-08-22 13:58 UTC (9 hours ago)
        
 (HTM) web link (about.fb.com)
 (TXT) w3m dump (about.fb.com)
        
       | gigel82 wrote:
       | The speech recognition in their demo is very, very bad (~60%
       | accuracy in my empirical test, vs. ~95% with WhisperCPP). The
       | translation is also very inaccurate.
       | 
       | That said, I fully support open releases and look forward to
       | future versions and improvements.
        
       | Jayakumark wrote:
       | Meta is killing it with these open models. Not sure why Tamil
       | is missing as an output language.
        
         | [deleted]
        
       | 1attice wrote:
       | ....'M4T', ahem, might mean slightly more than you think it does
        
       | 0cf8612b2e1e wrote:
       | Will there be a whispercpp equivalent? Half the reason I love
       | whisper is how dead simple it is to get running. I will take
       | somewhat lower accuracy for easier operation.
       | 
       | Edit: unless there is native speaker diarization. That would be a
       | huge value add.
        
         | jmorgan wrote:
         | I'm curious about this too. Lately I've been building an open
         | source tool to help make pulling and running models easier
         | locally - https://github.com/jmorganca/ollama - right now we
         | work with the awesome llama.cpp project, however, other model
         | types have definitely come up. LLMs are a small section of
         | what's available on huggingface for example.
         | 
         | It's especially interesting how you could combine different
         | model types - e.g. translation + text completion (or image
         | generation) - it could be a pretty powerful combination...
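The kind of composition described above could be sketched as chaining two model calls, here as a hypothetical Python pipeline. Both callables are stand-ins injected by the caller, not real APIs from either project:

```python
def pipeline(audio, translate, complete):
    """Chain a speech-translation model into a text-completion model.

    `translate` and `complete` are hypothetical stand-ins: the first
    maps audio to translated text (e.g. a SeamlessM4T call), the
    second maps a text prompt to a completion (e.g. an LLM call).
    """
    text = translate(audio)          # speech/text -> translated text
    return complete("Summarize: " + text)  # translated text -> completion
```

Keeping the models behind plain callables like this makes it easy to swap one stage (say, translation) without touching the other.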
        
         | genpfault wrote:
         | They have a smaller model[1] that might be amenable to Whisper-
         | ization.
         | 
         |  _Much_ smaller language matrix though.
         | 
         | [1]:
         | https://github.com/facebookresearch/seamless_communication/b...
        
       | jimmies wrote:
       | Lol, they botched the first example - the one that translates
       | "Our goal is to create a more connected world" into Vietnamese.
       | It has a glaring typo at the end of the sentence, "hon" instead
       | of "ho." It also really messed up the pronunciation: it read
       | "Chung toi" as "Chung ta" - they are totally different words
       | phonetically. The pronunciation also sounds like it's made by
       | someone who is mentally unwell. So they botched both translation
       | and pronunciation.
       | 
       | That's so embarrassing - especially for something meant to show
       | how good their stuff is (although I think it's probably not the
       | AI's fault) - it just shows how sloppy their people are.
       | 
       | I know they have plenty of Vietnamese engineers there. Did the PR
       | dept just throw this final version of the video out without
       | reviewing with them?
        
         | [deleted]
        
       | msp26 wrote:
       | All I want is llama-2-34b (seriously, what's taking so long on
       | this specific model?) but this is interesting too, I guess.
        
       | crakenzak wrote:
       | code: https://github.com/facebookresearch/seamless_communication
       | 
       | paper: https://ai.meta.com/research/publications/seamless-m4t/
       | 
       | demo: https://seamless.metademolab.com/
        
         | fotcorn wrote:
         | There is also a Hugging Face Space for some quick tests without
         | downloading the model:
         | 
         | https://huggingface.co/spaces/facebook/seamless_m4t
        
       | lhl wrote:
       | I gave it a spin a little while ago. As usual, the install docs
       | didn't quite work OOTB; here's how I got it working: https://llm-
       | tracker.info/books/howto-guides/page/speech-to-t...
       | 
       | One limitation that seems undocumented: the current code only
       | supports relatively short clips, so it isn't suitable for long
       | transcriptions:
       | 
       | > ValueError: The input sequence length must be less than or
       | equal to the maximum sequence length (4096), but is 99945
       | instead.
        
         | nicolashahn wrote:
         | Seems like you could easily write a little bash/Python script
         | to chop up the recording, batch process each chunk, then
         | stitch the results together?
        
           | lhl wrote:
           | Probably, although you could more easily use WhisperX and get
           | the same results twice as fast and without any additional
           | scripting.
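The chop-and-stitch workaround suggested above could be sketched as follows. This is a hypothetical outline, not code from the repo: `transcribe` is a stand-in for whatever model call you actually make on each chunk, and the chunk size would need to be picked so each chunk stays under the 4096-token input limit:

```python
def chunk_spans(n_samples, chunk_samples, overlap_samples=0):
    """Yield (start, end) sample spans covering the whole recording.

    Chunks may overlap by `overlap_samples` to reduce the chance of
    cutting a word exactly at a boundary.
    """
    step = chunk_samples - overlap_samples
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    spans, start = [], 0
    while start < n_samples:
        end = min(start + chunk_samples, n_samples)
        spans.append((start, end))
        if end == n_samples:
            break
        start += step
    return spans

def transcribe_long(audio, chunk_samples, transcribe):
    """Transcribe a long 1-D audio sequence chunk by chunk.

    `transcribe` is a hypothetical callable mapping an audio slice to
    its text; the per-chunk results are stitched back together.
    """
    parts = [transcribe(audio[s:e])
             for s, e in chunk_spans(len(audio), chunk_samples)]
    return " ".join(p.strip() for p in parts if p.strip())
```

In practice you would likely cut at silences (e.g. using a voice-activity detector) rather than at fixed sample counts, to avoid splitting words mid-chunk.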
        
       | houseatrielah wrote:
       | SeamlessM4T-Medium { 1.2B params, filesize 6.8 GB }. Wondering
       | how it compares to OpenAI's Whisper.
        
         | thewataccount wrote:
         | 281M and 235M param models too.
         | 
         | https://github.com/facebookresearch/seamless_communication/b...
         | 
         | I don't really know how the metrics they list compare to
         | Whisper. I'm very curious whether these are fast enough for
         | realtime speech-to-text - I think Whisper technically could
         | be, but it was difficult to do, or something like that?
        
         | aportnoy wrote:
         | Go to the blog and skip to results:
         | https://ai.meta.com/blog/seamless-m4t/
        
       | rvz wrote:
       | Yet somehow, many here underestimated Meta's position in AI and
       | proclaimed that Meta was dying, unimportant, and far behind in
       | the AI race.
       | 
       | How dramatically things change in a year, given all the
       | exaggeration of Meta's collapse in 2022.
       | 
       | Not only are they in the lead in $0 free AI models, they are
       | also at the finish line in the AI race to zero.
        
       | jacooper wrote:
       | What's the license?
        
         | minimaxir wrote:
         | CC BY-NC 4.0
        
           | noiseinvacuum wrote:
           | I was trying to figure out what it means; this is the
           | summary from Bard, so take it with a grain of salt.
           | 
           | The CC BY-NC 4.0 license allows for the following uses of the
           | licensed material:
           | 
           | * Reproduction: You can copy and distribute the licensed
           | material in any medium or format.
           | 
           | * Distribution: You can distribute the licensed material to
           | others.
           | 
           | * Public performance: You can perform the licensed material
           | publicly.
           | 
           | * Public display: You can display the licensed material
           | publicly.
           | 
           | * Modification: You can remix, transform, and build upon the
           | licensed material.
           | 
           | * Derivative works: You can create derivative works based on
           | the licensed material.
           | 
           | However, there are some restrictions on how you can use the
           | licensed material under the CC BY-NC 4.0 license:
           | 
           | * Commercial use: You cannot use the licensed material for
           | commercial purposes.
           | 
           | * Sublicensing: You cannot sublicense the licensed material.
           | 
           | * Moral rights: The licensor retains all moral rights in the
           | licensed material.
           | 
           | Here are some examples of how the CC BY-NC 4.0 license can be
           | used:
           | 
           | * A teacher can use a CC BY-NC 4.0 licensed image in a
           | presentation for their class.
           | 
           | * A student can create a CC BY-NC 4.0 licensed remix of a
           | song.
           | 
           | * A software developer can use a CC BY-NC 4.0 licensed
           | library in their open source project.
           | 
           | * A photographer can share their photos on a CC BY-NC 4.0
           | licensed website.
        
             | edgyquant wrote:
             | Please do not post output from LLMs here. It is against the
             | rules and we have plenty of knowledgeable people to answer
             | questions. We all have access to these chat bots if we want
             | their answer.
        
             | minimaxir wrote:
             | You could just Google it:
             | https://creativecommons.org/licenses/by-nc/4.0/
        
             | [deleted]
        
           | version_five wrote:
           | Importantly, non-commercial. Almost all of Facebook's stuff
           | used to be Apache-licensed; this new stance is really
           | shitty of them, and I hope it limits adoption. Deigning to
           | allow others to play with models (and make improvements,
           | give feedback, build an ecosystem) that only you can profit
           | from is not good community behavior. I'd rather see them
           | make it freemium or paid if that's their goal; this is the
           | equivalent of a kid licking a cookie so the others can't
           | eat it.
        
             | sangnoir wrote:
             | > Almost all of Facebooks stuff used to be Apache, this new
             | stance is really shitty of them and I hope limits adoption
             | 
             | The AI research environment has changed from the earlier
             | default-open publication norm - unlike its competitors,
             | FAIR is still releasing model weights instead of serving
             | the models behind an API.
             | 
             | > this is the equivalent of a kid licking a cookie so the
             | others can't eat it.
             | 
             | More like the other kid baking a cookie with the words
             | "Free Cookie" on it so others can eat it if they are
             | hungry, but can't sell it for money. It'd be foolish for
             | FAIR to donate preconfigured homing-missiles to OpenAI and
             | others via one-way tech transfer.
        
               | version_five wrote:
               | > It'd be foolish for FAIR to donate preconfigured
               | homing-missiles to OpenAI and others via one-way tech
               | transfer.
               | 
               | No, they could GPL it, and I don't think they're
               | worried about competitors taking the models anyway;
               | there's nothing particularly special about the weights
               | or training data, just the compute. I think part of it
               | is pressure from AI "safety" hangers-on who pretend
               | that AI is so dangerous that only those who won't abide
               | by license terms should have unfettered access. The
               | other commercial reasons are harder to understand. With
               | PyTorch they became the standard that everyone builds
               | off of; they could have done that with their recent AI,
               | particularly LLaMA, but they chose this silly route.
               | 
               | Also, LLaMA has a more permissive license than this
               | translation model, and is a more powerful model, so I
               | don't really see the "homing missiles to OpenAI" angle.
        
               | taneq wrote:
               | True, LLaMA2 is more like "donating homing missiles to
               | everyone except OpenAI, Google, and Apple."
        
       ___________________________________________________________________
       (page generated 2023-08-22 23:01 UTC)