[HN Gopher] LLMs can see and hear without any training
       ___________________________________________________________________
        
       LLMs can see and hear without any training
        
       Author : T-A
       Score  : 152 points
       Date   : 2025-04-26 13:38 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | underdeserver wrote:
       | Paper: https://arxiv.org/pdf/2501.18096
        
         | suddenlybananas wrote:
         | I don't understand how the title relates to the content of this
         | article at all. They're even using CLIP which definitely has
         | been trained.
        
           | dragonwriter wrote:
            | You don't have to train the LLM specifically for the tasks,
            | and even the auxiliary tools aren't trained on the tasks they
            | are used as scorers for (because they aren't doing the
            | task, just evaluating how well the LLM is), so there is no
            | task-specific training.
        
             | suddenlybananas wrote:
             | Task-specific training sure, but the title implies that
             | vision itself is not trained.
        
       | jagged-chisel wrote:
       | Computers can receive input without any programming. Not sure
       | what's interesting here.
        
         | fortran77 wrote:
         | Really? How?
        
           | skydhash wrote:
            | The base layer is just electronic circuitry. As long as there
            | is electricity it will do stuff (like a radio producing
            | noise). GPUs and CPUs are mostly software embedded in
            | hardware.
        
           | barotalomey wrote:
           | Primarily, processing input.
        
         | amelius wrote:
         | There's more to seeing and hearing than just receiving inputs.
         | 
         | Anyway, this looks like a case of human trying to understand
         | article without reading it.
        
         | dragonwriter wrote:
          | This isn't receiving input, it's generating output competitive
          | with models that have task-specific training.
         | 
          | I'm guessing the iterative approach burns a lot of tokens,
          | though that may not matter too much with 8B Llama as the LLM.
        
       | scribu wrote:
       | This seems to be a system to generate better prompts to be fed
       | into a base multimodal model.
       | 
       | Interesting, but title is definitely clickbait.
        
         | wangii wrote:
         | Exactly! There is definitely something wrong with FAIR.
        
         | throwaway4aday wrote:
          | They only did that for image generation. The more interesting
          | part is that an LLM can home in on the correct caption for an
          | image, video, or audio clip at test time, with no training,
          | using only the score as a guide. It's essentially working
          | blind, almost like the game Marco Polo, where the scorer says
          | "warmer" or "colder" while the LLM finds its way towards the
          | goal. This is an example of emergent capabilities, since there
          | are no examples of this in the training data.
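          | 
          | A rough sketch of that loop (illustrative only, not the
          | paper's code; `llm_propose` is a hypothetical stand-in for the
          | LLM, and OpenAI's `clip` package plays the scorer):
          | 
          |     import torch
          |     import clip
          | 
          |     device = "cuda" if torch.cuda.is_available() else "cpu"
          |     model, preprocess = clip.load("ViT-B/32", device=device)
          | 
          |     def clip_score(image, caption):
          |         # higher score = caption matches the image better;
          |         # `image` is a PIL.Image
          |         img = preprocess(image).unsqueeze(0).to(device)
          |         txt = clip.tokenize([caption]).to(device)
          |         with torch.no_grad():
          |             img_f = model.encode_image(img)
          |             txt_f = model.encode_text(txt)
          |         return torch.cosine_similarity(img_f, txt_f).item()
          | 
          |     def caption_search(llm_propose, image, rounds=10):
          |         # the LLM never sees the image, only the (caption,
          |         # score) history -- the scorer's "warmer"/"colder"
          |         best, best_score, history = None, float("-inf"), []
          |         for _ in range(rounds):
          |             cand = llm_propose(history)
          |             s = clip_score(image, cand)
          |             history.append((cand, s))
          |             if s > best_score:
          |                 best, best_score = cand, s
          |         return best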
        
         | matt123456789 wrote:
          | Actually, it's the name of the paper. And while the team also
          | developed and released a system that elicits the behavior by
          | doing what you described, it's entirely possible that the
          | researchers considered the capability in the title to be the
          | most important finding in their work.
        
       | EncomLab wrote:
       | My photoresistor nightlight can "see" that it is dark and it
       | "knows" to turn on the light - not only does it not have
       | training, it does not have any code!
       | 
       | And if you think that is amazing, my bi-metallic strip thermostat
       | "feels" the temperature and then modifies the environment because
       | it "knows" if it's hot to turn on the A/C, and if it's cold to
       | turn on the heat - no training or code!
       | 
       | All of this AI stuff is just unbelievably incredible - what a
       | brave new world (of word games)!
        
         | JoBrad wrote:
         | The nightlight and thermostat's response to stimulus is nowhere
         | near analyzing a picture of a clock tower and responding with
         | "Image of a city's tallest, historic landmark with a sepia
         | filter." To me, recognizing the umbrella in the spoon is one of
         | the most impressive items they list.
        
           | EncomLab wrote:
           | It's not the technology that is bad - it's the extreme
           | anthropomorphizing language that's used to describe it.
        
             | horacemorace wrote:
             | It might be bad if its behavior wasn't so anthropomorphic.
        
       | viraptor wrote:
       | That looks like a classic Actor/Critic setup, yet it's not
       | mentioned even once in the paper. Am I missing some large
       | difference here?
        
         | oneseven wrote:
         | Yes, apparently they've developed new names: Generator and
         | Scorer. This feels a bit like "Tai's Model"
         | https://news.ycombinator.com/item?id=17863514
        
           | lukeinator42 wrote:
           | Haha "Tai's Model" is absolutely hilarious, that gave me a
           | good chuckle. I checked and it currently is cited 568 times.
        
         | dawnofdusk wrote:
         | In actor/critic the actor and critic are normally learned,
         | i.e., their weights are adjusted during the process. The paper
         | is correct that their method is zero-shot, but it doesn't
         | mention that their method is essentially equivalent to a few
         | rounds of training but then discarding the training update.
         | 
         | Anyone who works with deep architectures and momentum-based
         | optimizers knows that the first few updates alone provide large
         | improvements in loss. In this paper the breakthrough is that
         | computing these first few updates at test time enables one to
         | describe the algorithm as "without training" and therefore
         | attract hype.
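          | 
          | That claim is easy to see in a toy example (nothing to do with
          | the paper's setup, just plain momentum SGD on a random linear
          | regression problem):
          | 
          |     import torch
          | 
          |     x = torch.randn(256, 10)
          |     y = x @ torch.randn(10, 1)  # synthetic targets
          |     w = torch.zeros(10, 1, requires_grad=True)
          |     opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)
          | 
          |     for step in range(5):
          |         loss = ((x @ w - y) ** 2).mean()
          |         opt.zero_grad()
          |         loss.backward()
          |         opt.step()
          |         # the loss typically falls sharply in the first steps
          |         print(step, round(loss.item(), 4))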
        
           | fc417fc802 wrote:
           | > discarding the training update
           | 
           | But they aren't updating the model weights. They're
           | iteratively updating the prompt. It's automating the process
           | that humans use with generative models.
           | 
           | Agreed that it's conceptually equivalent though.
        
       | JoBrad wrote:
       | Exactly how little training is "without any"? I'm assuming that
       | companies haven't been spending billions trying to train LLMs to
       | better understand things when they can do it without any
       | training.
        
       | 3rdworldeng wrote:
       | Find me Jose Monkey will do that too :-)
        
       | sega_sai wrote:
        | The paper certainly contradicts my expectation from the title,
        | i.e., it does not present an LLM that can generate images
        | without ever having had access to images.
        
       | vessenes wrote:
       | I've read the paper and the skeptical comments here, to wit: it's
       | just an actor/critic pipeline by another name.
       | 
       | I'll bite and say this is actually interesting -- and the paper
       | title is misleading.
       | 
       | What they've done here is hooked up a text-only LLM to multimodal
       | critics, given it (mostly) an image diffusion generation task,
       | and asked it to improve its prompting of the multimodal
       | generation by getting a set of scores back.
       | 
       | This definitely works, based on their outputs. Which is to say,
       | LLMs can, zero shot, with outside tool feedback, iteratively
       | improve their prompting using only that tooling feedback.
       | 
       | Why is this interesting? Well, this did not work in the GPT-3
       | era; it seems to do so now. I see this as an interesting line to
       | be added in the 'model capabilities' box as our models get larger
       | and more sophisticated -- the LLMs can perform some sort of
       | internally guided search against a black box generator and use a
       | black box scorer to improve at inference time.
       | 
        | That's pretty cool. It's also generalizable, and I think it's
        | worth keeping in mind, on the stack of possible approaches for,
        | say, agentic coding, that you can use a critic not just to
        | 'improve' generated output, but most likely to do some guided
        | search through output space.
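        | 
        | As a sketch, the generalizable version of that loop looks like a
        | critic-guided beam search; `propose` and `critic` here are
        | hypothetical callables standing in for the LLM and the black-box
        | scorer, not anything from the paper's code:
        | 
        |     import heapq
        | 
        |     def guided_search(seed, propose, critic, beam=3, rounds=4):
        |         # frontier holds (score, candidate) pairs
        |         frontier = [(critic(seed), seed)]
        |         for _ in range(rounds):
        |             candidates = []
        |             for score, cand in frontier:
        |                 # the LLM revises each candidate, seeing only
        |                 # the candidate and its score
        |                 for new in propose(cand, score):
        |                     candidates.append((critic(new), new))
        |             if not candidates:
        |                 break
        |             # keep the `beam` best candidates by critic score
        |             frontier = heapq.nlargest(beam, candidates,
        |                                       key=lambda t: t[0])
        |         return max(frontier, key=lambda t: t[0])[1]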
        
         | nightski wrote:
          | Are they using the same diffusion models as the GPT-3 era?
         | Meaning is it the LLM that has improved or is it the diffusion
         | model? I know it's probably a foolish take but I am really
         | skeptical of the "larger models will solve all our problems"
         | line of thinking.
        
           | vessenes wrote:
           | They don't compare in the paper. I will say I experimented
            | extensively with GPT-3 era LLMs on improving output by trying
           | to guide early diffusion models with critical responses. It
           | was a) not successful, and b) pretty clear to me that GPT-3
           | didn't "get" what it was supposed to be doing, or didn't have
           | enough context to keep all this in mind, or couldn't process
           | it properly, or some such thing.
           | 
            | This paper has ablations, although I didn't read that
            | section, so you could see where they say the effectiveness
            | comes from. I bet, though, that it's emergent from a bunch
            | of different places.
           | 
           | FWIW, I don't think LLMS will solve all our problems, so I
           | too am skeptical of that claim. I'm not skeptical of the
           | slightly weaker "larger models have emergent capabilities and
           | we are probably not done finding them as we scale up".
        
             | tomrod wrote:
             | > FWIW, I don't think LLMS will solve all our problems, so
             | I too am skeptical of that claim. I'm not skeptical of the
             | slightly weaker "larger models have emergent capabilities
             | and we are probably not done finding them as we scale up".
             | 
              | 100% agree. I'd classify the time now as identifying the
              | limits of what they can functionally do, though, and it's
              | a lot!
        
         | skydhash wrote:
         | > _I think is worth keeping in mind on the stack of possible
         | approaches for, say agentic coding, that you can use a critic
         | to not just 'improve' generated output, but most likely do some
         | guided search through output space._
         | 
          | The one issue I keep finding with those approaches is that
          | there are already good tools for the problem, but we keep
          | searching for wasteful approaches because of "natural
          | language", for something humans are not going to interact with
          | without a good deal of training.
         | 
          | I do understand the hope of getting LLMs to do the bulk of the
          | work, and then, after an audit, we fix the errors. But both
          | the audit and the fixes will require the same mental energy as
          | writing the code in the first place. And possibly more time.
         | 
          | Specialist tools are always more expansive and offer more
          | control than general-public tools. Most agentic coding
          | approaches offer general interfaces instead of specialized
          | ones, but redirect you to a bespoke and badly designed
          | specialized interface whenever you want to do anything useful.
        
           | vessenes wrote:
            | I hear that. Counterpoint - if all you have is a Phillips-
            | head screwdriver, all you have is a Phillips-head
            | screwdriver. On the other hand, if all you have is a six-
            | axis CNC mill, well, then you have a lot.
           | 
           | I think of this less as audit misses, and more as developing
           | a permanently useful tool. For open model weights, humanity
           | will not (unless we're talking real zombie apocalypse
           | scenarios) lose these weights. They are an incredible global
           | asset, so making them more generally useful and figuring out
           | how to use them is super helpful.
        
             | skydhash wrote:
             | Maybe they are useful. But I think there's more usefulness
             | in specialized databases and optimized approaches than
             | betting everything on big llms models. Kinda like deriving
             | linting rules and combining it with a rule engines to catch
             | errors. Efficient and useful instead of continuously
             | running a big llm model.
        
             | walleeee wrote:
             | While it is hard to argue with the wisdom of crystallizing
             | intellectual capital into our tools, I do wonder if these
             | models might be as likely to diminish as to develop the
             | person using them, in which case we trade an implement's
             | iterative improvement for ours, in a way
        
               | vessenes wrote:
               | Monks in the Middle Ages: "The Printing Press will
               | destroy people's ability to memorize."
               | 
               | This was accurate. But mostly humans gained from books. I
               | think we will develop the social technology to use these
               | tools over time; giving some things up and gaining
               | others.
               | 
               | If we don't, the Amish can just take over and be like
               | "Stupid English, using the devil's weights." :)
        
         | jorvi wrote:
         | > zero shot
         | 
         | I really wish we would find a different term for this.
         | 
         | Doing something always takes at least one attempt, i.e. "one
         | shotting". "Zero shotting" is an oxymoron, which makes it a
         | term that only creates more confusion rather than succinctly
         | conveying something.
        
           | airstrike wrote:
           | It's a shot from position zero
        
             | nmstoker wrote:
             | No it isn't. The number of shots (examples) is zero.
        
           | 42lux wrote:
           | We say Sure Shot.
        
           | hawk_ wrote:
           | Array indexing can start at 0 or 1.
        
             | layer8 wrote:
             | For an array of zero shots, the indexing doesn't matter.
        
           | Izkata wrote:
           | "One shot" is simply about the action itself, but it says
           | nothing about how much preparation was done beforehand. "Zero
           | shot" additionally implies without training or preparation.
           | 
           | TCGs have a related "zero turn win" concept, where the
           | opponent goes first and you win without getting a turn due to
           | the set of cards you randomly drew and being able to activate
           | them on the opponent's turn.
        
           | vessenes wrote:
           | I think of a shot as an example, _not_ a try: "One shot" is
           | "One example". Zero shot is "Zero examples". I don't love it,
           | but I don't hate it, got a better word for it?
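            | 
            | Concretely (made-up prompts, just to show what gets counted):
            | 
            |     # zero-shot: zero worked examples in the prompt
            |     zero_shot = "Translate to French: good morning"
            | 
            |     # one-shot: one worked example, then the task
            |     one_shot = ("English: thank you -> French: merci\n"
            |                 "Translate to French: good morning")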
        
           | quantadev wrote:
           | My favorite AI term to ridicule is the recent "Test Time
           | Compute" nonsense, which has nothing whatsoever to do with
           | testing. It literally just means "inference time".
           | 
            | And if I hear someone say "banger", "cooking", "insane", or
            | "crazy" one more time, I'm going to sledgehammer my
            | computer. Can't someone under 40 please pick up a book and
            | read? Yesterday Sam Altman tried to coin "Skillsmaxxing" in
            | a tweet. I threw my coffee cup at my laptop.
        
             | ks2048 wrote:
              | Speaking of old-timers and "inference time" - there was a
              | time when "inference" meant inferring parameters from data
              | (i.e. training). And now it means "test time" (or maybe
              | the difference is the statistics community vs. the ML
              | community).
             | 
             | e.g. Bishop's textbook says:
             | 
             | 5.2.4 Inference and decision
             | 
              | We have broken the classification problem down into two
              | separate stages, the inference stage in which we use
              | training data to learn a model for p(Ck|x) and the
              | subsequent decision stage in which we use these posterior
              | probabilities to make optimal class assignments.
        
       | nico wrote:
       | To people curious or skeptical if this could be called "seeing"
       | or "hearing", I recommend listening to the Batman podcast episode
       | on NPR (https://www.npr.org/2015/01/23/379134306/batman-pt-1)
       | 
        | Through the story and experience of a blind man, they end up
        | getting into the question of what it means to see.
       | 
        | The podcast is pretty straightforward, but it does end up
        | showing that defining "seeing" is a philosophical question
        | rather than one with a simple, obvious answer.
        
       | TheCoreh wrote:
       | Is the LLM essentially playing "Wordle" with an external system
       | that rates the quality of its output, gradually climbing the
       | score ladder until it produces good results?
        
       | robocop_legacy wrote:
        | I think there is potentially a powerful method here.
        | Specifically, the optimal context for a given task can be saved,
        | and a meta-learner can be trained to map the task to the
        | context. This would allow fine-tuning a model for some specific
        | task without retraining the LLM. For example, generating an SEM
        | image of some material with a specified porosity and grain size.
        
       | v-rt wrote:
       | "without training" _describes transfer learning_
        
         | v01rt wrote:
         | hey what the hell? it said the username was taken?? bug???
        
       | v01rt wrote:
       | "without training" _describes transfer learning with an actor /
       | critic approach_
        
       | alex1138 wrote:
       | I just remember Zuck's comments about AI and how the idea of it
       | dooming our species is a bit silly, etc
       | 
       | This is the wrong approach to take. At minimum you have to say
       | things like "well yes we're always on the lookout for this kind
       | of thing". With him? Not a care in the world
        
       ___________________________________________________________________
       (page generated 2025-04-26 23:00 UTC)