_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
 (HTM) Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
 (HTM)   Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model
       
       
        andy_ppp wrote 8 hours 39 min ago:
         Qwen seems to be deliberately vague about whether they are releasing
         models as open weights or not. I think largely not any more, and you
         can go on quite a wild goose chase looking for things that are implied
         to be released but are actually only available via API.
       
        mohsen1 wrote 14 hours 44 min ago:
         Having lots of success with Gemini Flash Live 2.5. I am hoping 3.0
         comes out soon. The benchmarks here claim better results than Gemini
         Live, but I have to test it. In the past I've always been disappointed
         with Qwen Omni models for my English-first use case...
       
        forgingahead wrote 21 hours 25 min ago:
        I truly enjoy how the naming conventions seem to follow how I did
        homework assignments back in the day: finalpaper-1-dec2nd,
        finalpaper-2-dec4th, etc etc.
       
        vessenes wrote 1 day ago:
        Interesting - when I asked the omni model at qwen.com what version it
        was, I got a testy "I don't have a version" and then was told my chat
        was blocked for inappropriate content. A second try asking for
        knowledge cutoff got me the more equivocal "2024, but I know stuff
        after that date, too".
        
        No idea how to check if this is actually deployed on qwen.com right
        now.
       
          mh- wrote 1 day ago:
          For what it's worth, that's not a reliable way to check what model
          you're interacting with.
       
            vessenes wrote 4 hours 51 min ago:
            It’s a good positive signal, but not a good negative one.
            
            It would be convincing if it said “I’m
            qwen-2025-12-whatever”. I agree it’s not dispositive if it
            refuses or claims to be llama 3 say. Generally most models I talk
            to do not hallucinate future versions of themselves, in fact it can
            be quite difficult to get them to use recent model designations;
            they will often autocorrect to older models silently.
       
          zamadatix wrote 1 day ago:
          > No idea how to check if this is actually deployed on qwen.com right
          now.
          
          Assuming you mean qwen.ai, when you run a query it should take you to
          chat.qwen.ai with the list of models in the top left. None of the
          options appear to be the -Omni variant (at least when anonymously
          accessing it).
       
            vessenes wrote 1 day ago:
            Thanks - yes - I did. The blog post suggests clicking the 'voice'
            icon on the bottom right - that's what I did.
       
        devinprater wrote 1 day ago:
        Wow, just 32B? This could almost run on a good device with 64 GB RAM.
        Once it gets to Ollama I'll have to see just what I can get out of
        this.
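         
         As a rough back-of-envelope (a sketch, assuming the open
         Qwen3-Omni-30B-A3B weights; the hosted Flash variant's exact size
         isn't published):
         
         # Rough weight-memory estimate for a ~30B-parameter model.
         # Ignores KV cache, the encoders, and runtime overhead.
         params = 30e9
         precisions = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}  # bytes per param
         for name, b in precisions.items():
             print(f"{name}: ~{params * b / 1024**3:.0f} GB of weights")
         # bf16: ~56 GB (tight on 64 GB), int8: ~28 GB, int4: ~14 GB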
       
          apexalpha wrote 1 day ago:
           I run these on a 48 GB Mac because of the unified memory.
       
          plipt wrote 1 day ago:
          I see that their HuggingFace link goes to some Qwen3-Omni-30B-A3B
          models that show a last updated date of September
          
          The benchmark table in their article shows 
          Qwen3-Omni-Flash-2025-12-01 (and the previous Flash) as beating
          Qwen3-235B-A22B. How is that possible if this is only a 30B-A3B
           model? Also confusing how that comparison column starts out with one
           model but switches models as you go down the table.
          
           I don't see any Flash variant listed on their Hugging Face. Am I just
           missing it, or does that name refer to a model only used for their
           API service, with no open weights to download?
       
        aschobel wrote 1 day ago:
        Looks to be API only. Bummer.
       
          readyplayeremma wrote 1 day ago:
           The models are right here, one of the first links in the post: [1]
           edit: Never mind; despite them linking it at the top, those are the
           old models. Also, the HF demo is calling their API and not using HF
           for compute.
          
 (HTM)    [1]: https://huggingface.co/collections/Qwen/qwen3-omni
       
            aschobel wrote 1 day ago:
            It is super confusing. I also thought this initially was open
            weights.
       
        terhechte wrote 1 day ago:
         Is there a way to run these Omni models on a MacBook, quantized via
         GGUF or MLX? I know I can run them in LM Studio or llama.cpp, but those
         don't have streaming microphone or streaming webcam support.
         
         Qwen usually provides example code in Python that requires CUDA and a
         non-quantized model. I wonder if there is by now a good open source
         project to support this use case?
       
          tgtweak wrote 1 day ago:
           You can probably follow the vLLM instructions for Omni here [1], then
           use the included voice demo HTML [2] to interface with it.
          
 (HTM)    [1]: https://github.com/QwenLM/Qwen3-Omni#vllm-usage
 (HTM)    [2]: https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file#laun...
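           
           For a non-streaming sanity check the plain vLLM Python API is enough;
           a minimal sketch, assuming a vLLM build with Qwen3-Omni support
           (streaming mic/webcam needs the demo server from the repo):
           
           # Minimal sketch, assuming a vLLM build that supports Qwen3-Omni.
           # Text in, text out; mic/webcam streaming needs the demo server.
           from vllm import LLM, SamplingParams
           
           llm = LLM(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
           params = SamplingParams(temperature=0.7, max_tokens=256)
           out = llm.generate(["Explain what an omni model is."], params)
           print(out[0].outputs[0].text)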
       
          mobilio wrote 1 day ago:
          Yes - there is a way:
          
 (HTM)    [1]: https://github.com/ggml-org/whisper.cpp
       
            novaray wrote 1 day ago:
            Whisper and Qwen Omni models have completely different
            architectures as far as I know
       
        stevenhuang wrote 1 day ago:
         Wayback link for those who can't reach it:
        
 (HTM)  [1]: https://web.archive.org/web/20251210164048/https://qwen.ai/blo...
       
        sim04ful wrote 1 day ago:
        The main issue I'm facing with realtime responses (speech output) is
         how to separate non-diegetic outputs (e.g. thinking, structured outputs)
        from outputs meant to be heard by the end user.
        
        I'm curious how anyone has solved this
       
          artur44 wrote 1 day ago:
           A simple way is to split the model's output stream before TTS:
           reasoning/structured tokens go into one bucket, actual user-facing
           text into another. Only the second bucket is synthesized. Most
           "thinking out loud" issues come from feeding the whole stream
           directly into audio.
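           
           Roughly something like this, as a minimal sketch (the
           <think>/</think> delimiters are illustrative; the real markers depend
           on the model's chat template):
           
           # Minimal sketch: route streamed tokens into "reasoning" and
           # "speakable" buckets; only the second is sent on to synthesis.
           # The <think>/</think> markers are illustrative.
           def split_stream(tokens):
               speakable, reasoning = [], []
               in_reasoning = False
               for tok in tokens:
                   if tok == "<think>":
                       in_reasoning = True
                   elif tok == "</think>":
                       in_reasoning = False
                   elif in_reasoning:
                       reasoning.append(tok)
                   else:
                       speakable.append(tok)
               return speakable, reasoning
           
           speakable, _ = split_stream(
               ["<think>", "check", "prices", "</think>", "Apples", "are", "$2."])
           # only `speakable` goes on to the audio stage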
       
            pugio wrote 1 day ago:
            There is no TTS here. It's a native audio output model which
            outputs audio tokens directly. (At least, that's how the other
            real-time models work. Maybe I've misunderstood the Qwen-Omni
            architecture.)
       
              artur44 wrote 1 day ago:
               True, but even with native audio-token models you still need to
               split the model's output channels. Reasoning/internal tokens
               shouldn't go into the audio stream; only user-facing content
               should be emitted as audio. The principle is the same, whether
               the last step is TTS or audio token generation.
       
                regularfry wrote 7 hours 27 min ago:
                 There's an assumption there that the audio stream contains an
                 equivalent of the <think>/</think> tokens. Every reason to
                 think it should, but without seeing the tokeniser config it's
                 a bit of a guess.
       
        Aissen wrote 1 day ago:
        Is this a new proprietary model?
       
        gardnr wrote 1 day ago:
        This is a 30B parameter MoE with 3B active parameters and is the
        successor to their previous 7B omni model. [1] You can expect this
        model to have similar performance to the non-omni version. [2] There
        aren't many open-weights omni models so I consider this a big deal. I
        would use this model to replace the keyboard and monitor in an
        application while doing the heavy lifting with other tech behind the
        scenes. There is also a reasoning version, which might be a bit amusing
        in an interactive voice chat if it pronounces the thinking tokens while
        working through to a final answer.
        
 (HTM)  [1]: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
 (HTM)  [2]: https://artificialanalysis.ai/models/qwen3-30b-a3b-instruct
       
          andy_ppp wrote 13 hours 5 min ago:
           Haha, you could hear how its mind thinks, maybe by putting a lot of
           reverb on the thinking tokens or some other effect…
       
          plipt wrote 1 day ago:
           I don't think the Flash model discussed in the article is 30B.
           
           Their benchmark table shows it beating Qwen3-235B-A22B.
           
           Does "Flash" in the name of a Qwen model indicate a
           model-as-a-service and not open weights?
       
            red2awn wrote 1 day ago:
             Flash is a closed-weight version of [1] (it is 30B but with
             additional training on top of the open-weight release).
             They deploy the Flash version on Qwen's own chat.
            
 (HTM)      [1]: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
       
              plipt wrote 1 day ago:
              Thanks
              
               Was it obvious to you from the article that it's closed weight?
               Trying to understand why I was confused. I hadn't seen the
               "Flash" designation before.
               
               Also, a 30B model can beat a semi-recent 235B with just some
               additional training?
       
                red2awn wrote 1 day ago:
                 They had a Flash variant released alongside the original open
                 weight release. It is also mentioned in Section 5 of the paper:
                 [1] For the evals, it's probably just trained on a lot of
                 benchmark-adjacent datasets compared to the 235B model. A
                 similar thing happened with another model today: [2] (a 30B
                 model trained specifically to do well in maths gets near-SOTA
                 scores)
                
 (HTM)          [1]: https://arxiv.org/pdf/2509.17765
 (HTM)          [2]: https://x.com/NousResearch/status/1998536543565127968
       
          red2awn wrote 1 day ago:
          This is a stack of models:
          
          - 650M Audio Encoder
          
          - 540M Vision Encoder
          
          - 30B-A3B LLM
          
          - 3B-A0.3B Audio LLM
          
          - 80M Transformer/200M ConvNet audio token to waveform
          
           This is a closed-weight update to their Qwen3-Omni model. They had a
           previous open-weight release, Qwen/Qwen3-Omni-30B-A3B-Instruct, and a
           closed version, Qwen3-Omni-Flash.
           
           You basically can't use this model right now since none of the open
           source inference frameworks have it fully implemented. It works with
           transformers but it's extremely slow.
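           
           Summing the listed components gives a rough total (a sketch; the
           per-component sizes are the ones above, so treat it as approximate):
           
           # Rough total from the components listed above (sizes approximate).
           components_b = {
               "audio_encoder": 0.65,      # 650M
               "vision_encoder": 0.54,     # 540M
               "llm": 30.0,                # 30B-A3B MoE
               "audio_llm": 3.0,           # 3B-A0.3B MoE
               "audio_to_waveform": 0.28,  # 80M transformer + 200M ConvNet
           }
           total_b = sum(components_b.values())
           print(f"~{total_b:.1f}B parameters total")   # ~34.5B
           # the MoE parts activate only ~3B + ~0.3B of that per token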
       
          andy_xor_andrew wrote 1 day ago:
          > This is a 30B parameter MoE with 3B active parameters
          
          Where are you finding that info? Not saying you're wrong; just saying
          that I didn't see that specified anywhere in the linked page, or on
          their HF.
       
            plipt wrote 1 day ago:
             The link [1] at the top of their article to HuggingFace goes to
             some models named Qwen3-Omni-30B-A3B that were last updated in
             September. None of them have "Flash" in the name.
            
             The benchmark table shows this Flash model beating their
             Qwen3-235B-A22B. I don't see how that is possible if it is a
             30B-A3B model.
            
            I don't see a mention of a parameter count anywhere in the article.
            Do you? This may not be an open weights model.
            
            This article feels a bit deceptive
            
 (HTM)      [1]: https://huggingface.co/collections/Qwen/qwen3-omni
       
          tensegrist wrote 1 day ago:
          > There is also a reasoning version, which might be a bit amusing in
          an interactive voice chat if it pronounces the thinking tokens while
          working through to a final answer.
          
           Last I checked (months ago), Claude used to do this.
       
          olafura wrote 1 day ago:
          Looks like it's not open source:
          
 (HTM)    [1]: https://www.alibabacloud.com/help/en/model-studio/qwen-omni#...
       
            coder543 wrote 1 day ago:
            No... that website is not helpful. If you take it at face value, it
            is claiming that the previous Qwen3-Omni-Flash wasn't open either,
            but that seems wrong? It is very common for these blog posts to get
            published before the model weights are uploaded.
       
              red2awn wrote 1 day ago:
               The previous -Flash weights are closed source. They do have
               weights for the original model, which is slightly behind in
               performance:
              
 (HTM)        [1]: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
       
                coder543 wrote 1 day ago:
                Based on things I had read over the past several months,
                Qwen3-Flash seemed to just be a weird marketing term for the
                Qwen3-Omni-30B-A3B series, not a different model. If they are
                not the same, then that is interesting/confusing.
       
                  red2awn wrote 1 day ago:
                  It is an in-house closed weight model for their own chat
                  platform, mentioned in Section 5 of the original paper: [1]
                  I've seen it in their online materials too but can't seem to
                  find it now.
                  
 (HTM)            [1]: https://arxiv.org/pdf/2509.17765
       
          gardnr wrote 1 day ago:
          I can't find the weights for this new version anywhere. I checked
          modelscope and huggingface. It looks like they may have extended the
          context window to 200K+ tokens but I can't find the actual weights.
       
            pythux wrote 1 day ago:
             They link to [1] from the blog post, but it seems to redirect to
             their main space on HF, so maybe they haven't made the model
             public yet?
            
 (HTM)      [1]: https://huggingface.co/collections/Qwen/qwen3-omni-68d100a...
       
        banjoe wrote 1 day ago:
        Wow, crushing 2.5 Flash on every benchmark is huge. Time to move all of
        my LLM workloads to a local GPU rig.
       
          skrunch wrote 12 hours 18 min ago:
           Except the image benchmarks are compared against 2.0; it seems
           suspicious that they would casually drop to an older model for those.
       
          red2awn wrote 1 day ago:
           Why would you use an Omni model for a text-only workload... There is
           Qwen3-30B-A3B.
       
          embedding-shape wrote 1 day ago:
           Just remember to benchmark it yourself first with your private task
           collection, so you can actually measure them against each other.
           Pretty much any public benchmark is unreliable at this point, and
           making model choices based on others' benchmarks is bound to leave
           you disappointed.
       
            MaxikCZ wrote 1 day ago:
             This. The last benchmarks of DSv3.2spe hinted at it beating
             basically everything, yet in my testing even Sonnet is miles ahead
             in terms of both speed and accuracy.
       
        rarisma wrote 1 day ago:
        GPT4o in the charts is crazy.
       
          BoorishBears wrote 1 day ago:
          Why? gpt-realtime is finalized gpt-4o. Gemini Live is still 2.5.
          
           Not their fault that frontier labs are letting their speech-to-speech
           offerings languish.
       
        binsquare wrote 1 day ago:
         Does anyone else find a hard-to-pin-down lifelessness in the speech of
         these voice models?
         
         Especially in the fruit pricing portion of the video for this model. It
         sounds completely normal, but I can immediately tell it is AI. Maybe
         it's the intonation or the overly stable rate of speech?
       
          vessenes wrote 1 day ago:
           I'm not convinced it's end-to-end multimodal - if it isn't, there
           will be a speech synthesis stage and some of this will be a result of
           that. You could test by having it sing or do some accents, or have it
           talk back to you in an accent you give it.
       
          Lapel2742 wrote 1 day ago:
          IMHO it's not lifeless. It's just not overly emotional. I definitely
          prefer it that way. I do not want the AI to be excited. It feels so
          contrived.
          
           On the video itself: interesting, but "ideal" was pronounced wrong in
           German. For a promotional video, they should have checked that with
           native speakers. On the other hand, it's at least honest.
       
            nunodonato wrote 1 day ago:
            I hate with a passion the over-americanized "accent" of chatgpt
            voices. Give me a bland one any day of the week
       
              wkat4242 wrote 15 hours 28 min ago:
              Yeah that overly fake-excited voice type. Doesn't work for Europe
              at all. But indeed common in American customer service scenarios.
       
          sosodev wrote 1 day ago:
          I think it's because they've crammed vision, audio, multiple voices,
          prosody control, multiple languages, etc into just 30 billion
          parameters.
          
          I think ChatGPT has the most lifelike speech with their voice models.
          They seem to have invested heavily in that area while other labs
          focused elsewhere.
       
          esafak wrote 1 day ago:
          > Sounds completely normal but I can immediately tell it is ai.
          
          Maybe that's a good thing?
       
          colechristensen wrote 1 day ago:
          I'm perfectly ok with and would prefer an AI "accent".
       
        sosodev wrote 1 day ago:
        Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at
        their documentation it doesn't seem like it does.
        
         Are there any open-weight models that do? Not talking about
         speech-to-text -> LLM -> text-to-speech btw, I mean a real voice <->
         language model.
        
        edit:
        
        It does support real-time conversation! Has anybody here gotten that to
        work on local hardware? I'm particularly curious if anybody has run it
        with a non-nvidia setup.
       
          potatoman22 wrote 1 day ago:
          From what I can tell, their official chat site doesn't have a native
          audio -> audio model yet. I like to test this through homophones
          (e.g. record and record) and asking it to change its pitch or produce
          sounds.
       
            dragonwriter wrote 19 hours 0 min ago:
             “Record and record”, if you mean the verb for persisting something
             and the noun for the thing persisted, are heteronyms (homographs
             which are not homophones). Incidentally, heteronyms are also what
             you would probably want to test for what you are talking about
             here: distinguishing homophones would test the use of context to
             understand meaning, but wouldn't test anything about whether the
             logic is working directly on audio or only on text processed from
             audio, whereas failing to distinguish heteronyms is suggestive of
             processing occurring on text, not audio directly.
       
              potatoman22 wrote 22 min ago:
              Ah I meant heteronyms. Thanks!
       
              bakeman wrote 18 hours 38 min ago:
              There are homophones of “record”, such as:
              
              “He’s on record saying he broke the record for spinning a
              record.”
       
                dragonwriter wrote 17 hours 46 min ago:
                True.
                
                 OTOH my point still stands: the thing being suggested isn't
                 testable by seeing whether or not the system is capable of
                 distinguishing homophones, but it might be by seeing whether or
                 not it distinguishes heteronyms. (The speculation that the
                 record/record distinction intended was actually a pair of
                 heteronyms, and that the error was merely the use of the word
                 “homophone” in place of “heteronym”, rather than the basic
                 logic of the comment, is somewhat tangential to the main
                 point.)
       
            djtango wrote 19 hours 6 min ago:
            Is record a homophone? At least in the UK we use different
            pronunciations for the meanings. Re-cord for the verb, rec-ord for
            the noun.
       
              potatoman22 wrote 23 min ago:
              I was mistaken about what homophone means!
       
            sosodev wrote 23 hours 1 min ago:
            Huh, you're right. I tried your test and it clearly can't
            understand the difference between homophones. That seems to imply
            they're using some sort of TTS mechanism. Which is really weird
            because Qwen3-Omni claims to support direct audio input into the
            model. Maybe it's a cost saving measure?
       
              potatoman22 wrote 21 min ago:
              To be fair, discerning heteronyms might just be a gap in its
              training.
       
              sosodev wrote 17 hours 40 min ago:
              Weirdly, I just tried it again and it seems to understand the
              difference between record and record just fine. Perhaps if
              there's heavy demand for voice chat, like after a new release,
              they load shed by using TTS to a smaller model.
              
               However, it still doesn't seem capable of producing any of the
               sounds, like laughter, that I would expect from a native voice
               model.
       
          ivape wrote 1 day ago:
          That's exciting. I doubt there are any polished voice chat local apps
          yet that you can easily plug this into (I doubt the user experience
          is "there" yet). Even stuff like Silly Tavern is near unusable, lots
          of work to be done on the local front. Local voice models are what's
          going to enable that whole Minority Report workflow soon enough
          (especially if commands and intent are determined at the local level,
          and the meat of the prompt is handled by a larger remote model).
          
           This is the part of programming that I think is the new field. There
           will be tons of work for those who can build the new workflows, which
           will need to be primarily natural-language driven.
       
            sosodev wrote 1 day ago:
            I did find this app: [1] The creator posted a little demo of it
            working with Qwen3 Omni that is quite impressive: [2] He didn't
            include any details regarding how the model was running though
            
 (HTM)      [1]: https://github.com/gabber-dev/gabber
 (HTM)      [2]: https://www.youtube.com/watch?v=5DBFVe3cLto
       
          red2awn wrote 1 day ago:
           None of the inference frameworks (vLLM/SGLang) support the full
           model, let alone on non-NVIDIA hardware.
       
            whimsicalism wrote 20 hours 36 min ago:
            Makes sense, I think streaming audio->audio inference is a
            relatively big lift.
       
              red2awn wrote 13 hours 38 min ago:
               Correct, it breaks the single-prompt, single-completion
               assumption baked into the frameworks. Conceptually it's still
               prompt/completion, but for low-latency responses you have to do
               streaming KV cache prefill with a websocket server.
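               
               In shape it looks roughly like this (a sketch: the websockets
               plumbing is real, but the engine hooks incremental_prefill /
               turn_complete / start_decode are hypothetical stand-ins for what
               the inference engine would need to expose):
               
               # Minimal sketch: prefill the KV cache as audio chunks arrive
               # instead of waiting for a complete prompt. The engine API
               # used below is hypothetical.
               import asyncio
               import websockets
               
               async def handle(ws, engine):
                   async for chunk in ws:                 # streamed audio frames
                       engine.incremental_prefill(chunk)  # extend the KV cache
                       if engine.turn_complete(chunk):    # e.g. end of speech
                           async for audio in engine.start_decode():
                               await ws.send(audio)       # stream audio back
               
               async def main(engine):
                   handler = lambda ws: handle(ws, engine)
                   async with websockets.serve(handler, "localhost", 8765):
                       await asyncio.Future()             # run forever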
       
                whimsicalism wrote 5 hours 22 min ago:
                I imagine you have to start decoding many speculative
                completions in parallel to have true low latency.
       
            AndreSlavescu wrote 1 day ago:
             We actually deployed working speech-to-speech inference that builds
             on top of vLLM as the backbone. The main thing was to support the
             "Talker" module, which is currently not supported on the qwen3-omni
             branch of vLLM.
            
            Check it out here:
            
 (HTM)      [1]: https://models.hathora.dev/model/qwen3-omni
       
              sosodev wrote 1 day ago:
              Is your work open source?
       
              red2awn wrote 1 day ago:
              Nice work. Are you working on streaming input/output?
       
                AndreSlavescu wrote 1 day ago:
                Yeah, that's something we currently support. Feel free to try
                the platform out! No cost to you for now, you just need a valid
                email to sign up on the platform.
       
                  valleyer wrote 12 hours 52 min ago:
                   I tried this out, and it's not passing the record (n.) vs.
                   record (v.) test mentioned elsewhere in this thread. (I can
                   ask it to repeat one, and it often repeats the other.) Am I
                   not enabling the speech-to-speech-ness somehow?
       
            sosodev wrote 1 day ago:
            That's unfortunate but not too surprising. This type of model is
            very new to the local hosting space.
       
          dsrtslnd23 wrote 1 day ago:
           It seems to be able to do native speech-to-speech.
       
            sosodev wrote 1 day ago:
            It does for sure. I did some more digging and it does real-time
            too. That's fascinating.
       
        mettamage wrote 1 day ago:
        I wonder if with that music analysis mode, you can also make your own
        synths
       
        dvh wrote 1 day ago:
        I asked: "How many resistors are used in fuzzhugger phantom octave
        guitar pedal?". It replied 29 resistors and provided a long list.
        Answer is 2 resistors:
        
 (HTM)  [1]: https://tagboardeffects.blogspot.com/2013/04/fuzzhugger-phanto...
       
          bongodongobob wrote 20 hours 40 min ago:
           Lol, I asked it how many rooms I have in my house and it got that
           wrong. LLMs are useless, amirite
       
          strangattractor wrote 1 day ago:
          Maybe it thinks some of those 29 are in series:)
       
          brookst wrote 1 day ago:
          Where did you try it? I don’t see this model listed in the linked
          Qwen chat.
       
          esafak wrote 1 day ago:
          This is just trivia. I would not use it to test computers -- or
          humans.
       
            littlestymaar wrote 1 day ago:
             It's a good way to assess the model with respect to hallucinations
             though.
            
            I don't think a model should know the answer, but it must be able
            to know that it doesn't know if you want to use it reliably.
       
              esafak wrote 1 day ago:
              No model is good at this yet. I'd expect the flagships to solve
              the first.
       
            parineum wrote 1 day ago:
            Everything is just trivia until you have a use for the answer.
            
             OP provided a web link with the answer; aren't these models
             supposed to be trained on all of that data?
       
              DennisP wrote 1 day ago:
               Just because it's in the training data doesn't mean the model can
               remember it. The parameters total 60 gigabytes; there's only so
               much trivia that can fit in there, so it has to do lossy
               compression.
       
              esafak wrote 1 day ago:
              There is nothing useful you can do with this information. You
              might as well memorize the phone book.
              
              The model has a certain capacity -- quite limited in this case --
              so there is an opportunity cost in learning one thing over
              another. That's why it is important to train on quality data;
              things you can build on top of.
       
                parineum wrote 20 hours 45 min ago:
                 What if you are trying to fix one of these things and need a
                 list of replacement parts?
       
                  esafak wrote 20 hours 31 min ago:
                  Not the right problem for this model. Any RAG-backed SLM
                  would do; the important part is being backed by a search
                  engine, like
                  
 (HTM)            [1]: https://google.com/ai
       
          iFire wrote 1 day ago:
          > How many resistors are used in fuzzhugger phantom octave guitar
          pedal?
          
          Weird, as someone not having a database of the web, I wouldn't be
          able to calculate either result.
       
            kaoD wrote 1 day ago:
            > as someone not having a database of the web, I wouldn't be able
            to calculate either result
            
            And that's how I know you're not an LLM!
       
            dvh wrote 1 day ago:
            "I don't know" would be perfectly reasonable answer
       
              MaxikCZ wrote 1 day ago:
               I feel like there's a time in the near future where LLMs will be
               too cautious to answer any questions they aren't sure about, and
               most of the human effort will go into pleading with the LLM to at
               least try to give an answer, which will almost always be correct
               anyway.
       
                littlestymaar wrote 1 day ago:
                 It's not going to happen, as the user would just leave the
                 platform.
                 
                 It would be better for most API usage though: for a business,
                 doing just a fraction of the job with 100% accuracy is often
                 much preferable to claiming to do 100% when 20% of it is
                 garbage.
       
                plufz wrote 1 day ago:
                 That would be great if you could have a setting like
                 temperature, 0.0-1.0, ranging from "only answer if you are 100%
                 sure" to "guess as much as you like".
       
            iFire wrote 1 day ago:
             I tend to pick things where I think the answer is in the
             introductory material, like exams that test what was taught.
       
       