[HN Gopher] LLMs can see and hear without any training
___________________________________________________________________
LLMs can see and hear without any training
Author : T-A
Score : 152 points
Date : 2025-04-26 13:38 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| underdeserver wrote:
| Paper: https://arxiv.org/pdf/2501.18096
| suddenlybananas wrote:
| I don't understand how the title relates to the content of this
| article at all. They're even using CLIP which definitely has
| been trained.
| dragonwriter wrote:
| You don't have to train the LLM specifically for the tasks, and
| even the auxiliary tools aren't trained on the tasks they are
| used as scorers for (because they aren't doing the task, just
| evaluating how well the LLM is doing), so there is no
| task-specific training.
| suddenlybananas wrote:
| Task-specific training sure, but the title implies that
| vision itself is not trained.
| jagged-chisel wrote:
| Computers can receive input without any programming. Not sure
| what's interesting here.
| fortran77 wrote:
| Really? How?
| skydhash wrote:
| The base layer is just electronic circuitry. As long as there is
| electricity it will do stuff (like a radio producing noise).
| GPUs and CPUs are mostly software embedded in hardware.
| barotalomey wrote:
| Primarily, processing input.
| amelius wrote:
| There's more to seeing and hearing than just receiving inputs.
|
| Anyway, this looks like a case of a human trying to understand
| an article without reading it.
| dragonwriter wrote:
| This isn't receiving input, it's generating output competitive
| with models that have task-specific training.
|
| I'm guessing the iterative approach burns a lot of tokens,
| though that may not matter too much with 8B Llama as the LLM.
| scribu wrote:
| This seems to be a system to generate better prompts to be fed
| into a base multimodal model.
|
| Interesting, but title is definitely clickbait.
| wangii wrote:
| Exactly! There is definitely something wrong with FAIR.
| throwaway4aday wrote:
| They only did that for image generation. The more interesting
| part is that an LLM can approach or find the correct caption
| for an image, video, or audio clip at test time, with no
| training, using only the score as a guide. It's essentially
| working blind, almost like the game Marco Polo, where the scorer
| says "warmer" or "colder" while the LLM finds its way towards
| the goal. This is an example of emergent capabilities, since
| there are no examples of this in the training data.
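| A minimal sketch of that warmer/colder loop (hypothetical helper
| names; not the paper's actual code) - the LLM only ever sees its
| own past guesses and their scores, never the image itself:
|
|   import random
|
|   def optimize_caption(generate, score, rounds=10, keep=5):
|       # generate: prompt -> list of candidate captions (the LLM)
|       # score: caption -> float (e.g. CLIP image-text similarity)
|       pool = []
|       prompt = "Propose 10 diverse one-sentence image captions."
|       for _ in range(rounds):
|           pool += [(score(c), c) for c in generate(prompt)]
|           pool = sorted(pool, reverse=True)[:keep]
|           feedback = "\n".join(f"{s:.3f}: {c}" for s, c in pool)
|           prompt = ("Past captions and scores (higher is warmer):\n"
|                     + feedback +
|                     "\nPropose 10 new captions likely to score higher.")
|       return pool[0][1]
|
|   # Stub demo so the control flow runs without any models:
|   words = ["clock", "tower", "city", "sepia", "old", "photo"]
|   fake_llm = lambda p: [" ".join(random.sample(words, 3))
|                         for _ in range(10)]
|   fake_score = lambda t: (sum(w in t for w in ("clock", "tower"))
|                           + random.random())
|   print(optimize_caption(fake_llm, fake_score))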
| matt123456789 wrote:
| Actually, it's the title of the paper. And while the team also
| developed and released a system to elicit the behavior by doing
| what you described, it's entirely possible that the researchers
| considered the title to describe the most important finding in
| their work.
| EncomLab wrote:
| My photoresistor nightlight can "see" that it is dark and it
| "knows" to turn on the light - not only does it not have
| training, it does not have any code!
|
| And if you think that is amazing, my bi-metallic strip thermostat
| "feels" the temperature and then modifies the environment because
| it "knows" if it's hot to turn on the A/C, and if it's cold to
| turn on the heat - no training or code!
|
| All of this AI stuff is just unbelievably incredible - what a
| brave new world (of word games)!
| JoBrad wrote:
| The nightlight and thermostat's response to stimulus is nowhere
| near analyzing a picture of a clock tower and responding with
| "Image of a city's tallest, historic landmark with a sepia
| filter." To me, recognizing the umbrella in the spoon is one of
| the most impressive items they list.
| EncomLab wrote:
| It's not the technology that is bad - it's the extreme
| anthropomorphizing language that's used to describe it.
| horacemorace wrote:
| It might be bad if its behavior weren't so anthropomorphic.
| viraptor wrote:
| That looks like a classic Actor/Critic setup, yet it's not
| mentioned even once in the paper. Am I missing some large
| difference here?
| oneseven wrote:
| Yes, apparently they've developed new names: Generator and
| Scorer. This feels a bit like "Tai's Model"
| https://news.ycombinator.com/item?id=17863514
| lukeinator42 wrote:
| Haha "Tai's Model" is absolutely hilarious, that gave me a
| good chuckle. I checked and it currently is cited 568 times.
| dawnofdusk wrote:
| In actor/critic the actor and critic are normally learned,
| i.e., their weights are adjusted during the process. The paper
| is correct that their method is zero-shot, but it doesn't
| mention that their method is essentially equivalent to a few
| rounds of training but then discarding the training update.
|
| Anyone who works with deep architectures and momentum-based
| optimizers knows that the first few updates alone provide large
| improvements in loss. In this paper the breakthrough is that
| computing these first few updates at test time enables one to
| describe the algorithm as "without training" and therefore
| attract hype.
| fc417fc802 wrote:
| > discarding the training update
|
| But they aren't updating the model weights. They're
| iteratively updating the prompt. It's automating the process
| that humans use with generative models.
|
| Agreed that it's conceptually equivalent though.
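| A toy contrast of the two loops (illustrative only; hypothetical
| helpers, not from the paper):
|
|   def actor_critic_step(actor, critic, optimizer, x):
|       loss = -critic(actor(x))  # critic's score is a training signal
|       loss.backward()           # gradients flow into the actor's weights
|       optimizer.step()          # the model itself changes
|
|   def test_time_step(llm, scorer, prompt):
|       best = max(llm(prompt), key=scorer)  # weights never change
|       # only the prompt is "updated":
|       return "Best so far: " + best + ". Improve on it."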
| JoBrad wrote:
| Exactly how little training is "without any"? I'm assuming that
| companies haven't been spending billions trying to train LLMs to
| better understand things when they can do it without any
| training.
| 3rdworldeng wrote:
| Find me Jose Monkey will do that too :-)
| sega_sai wrote:
| The paper certainly contradicts my expectation from the title.
| That is, it does not present an LLM that can generate images
| without any prior access to images.
| vessenes wrote:
| I've read the paper and the skeptical comments here, to wit: it's
| just an actor/critic pipeline by another name.
|
| I'll bite and say this is actually interesting -- and the paper
| title is misleading.
|
| What they've done here is hooked up a text-only LLM to multimodal
| critics, given it (mostly) an image diffusion generation task,
| and asked it to improve its prompting of the multimodal
| generation by getting a set of scores back.
|
| This definitely works, based on their outputs. Which is to say,
| LLMs can, zero-shot, iteratively improve their prompting using
| only feedback from outside tools.
|
| Why is this interesting? Well, this did not work in the GPT-3
| era; it seems to do so now. I see this as an interesting line to
| be added in the 'model capabilities' box as our models get larger
| and more sophisticated -- the LLMs can perform some sort of
| internally guided search against a black box generator and use a
| black box scorer to improve at inference time.
|
| That's pretty cool. It's also generalizable, and I think it's
| worth keeping in mind, on the stack of possible approaches for,
| say, agentic coding, that you can use a critic not just to
| 'improve' generated output, but most likely to do some guided
| search through output space.
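| One concrete reading of "guided search" (a sketch under assumed
| interfaces - e.g. for agentic coding, generate would propose
| patches and critic could be the fraction of tests passing; the
| names are hypothetical):
|
|   def beam_search(generate, critic, seed, width=3, depth=4):
|       beam = [seed]
|       for _ in range(depth):
|           candidates = [c for b in beam for c in generate(b)]
|           beam = sorted(candidates, key=critic, reverse=True)[:width]
|       return beam[0]  # best output found, not just a "fixed" one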
| nightski wrote:
| Are they using the same diffusion models as in the GPT-3 era?
| Meaning is it the LLM that has improved or is it the diffusion
| model? I know it's probably a foolish take but I am really
| skeptical of the "larger models will solve all our problems"
| line of thinking.
| vessenes wrote:
| They don't compare in the paper. I will say I experimented
| extensively with GPT-3 era LLMs on improving output by trying
| to guide early diffusion models with critical responses. It
| was a) not successful, and b) pretty clear to me that GPT-3
| didn't "get" what it was supposed to be doing, or didn't have
| enough context to keep all this in mind, or couldn't process
| it properly, or some such thing.
|
| This paper has ablations, although I didn't read that section,
| so you can see where they say the effectiveness comes from. My
| bet is that it's emergent from a bunch of different places.
|
| FWIW, I don't think LLMs will solve all our problems, so I
| too am skeptical of that claim. I'm not skeptical of the
| slightly weaker "larger models have emergent capabilities and
| we are probably not done finding them as we scale up".
| tomrod wrote:
| > FWIW, I don't think LLMs will solve all our problems, so
| I too am skeptical of that claim. I'm not skeptical of the
| slightly weaker "larger models have emergent capabilities
| and we are probably not done finding them as we scale up".
|
| 100% agree. I'd classify the time now as identifying the
| limits of what they can functionally do, though, and it's a
| lot!
| skydhash wrote:
| > _I think it's worth keeping in mind, on the stack of possible
| approaches for, say, agentic coding, that you can use a critic
| not just to 'improve' generated output, but most likely to do
| some guided search through output space._
|
| The one issue I keep finding with those approaches is that
| there are already good tools for the problem, but we keep
| searching for wasteful approaches in the name of "natural
| language", for something humans are not going to interact with
| well without a good deal of training anyway.
|
| I do understand the hope of getting LLMs to do the bulk of the
| work, and then, after audit, we fix the errors. But both audit
| and fixing will require the same mental energy as writing the
| code in the first place. And possibly more time.
|
| Specialist tools are always more expressive and offer more
| control than general public tools. Most agentic-coding
| approaches offer general interfaces instead of specialized
| ones, then redirect you to a bespoke and badly designed
| specialized interface whenever you want to do anything useful.
| vessenes wrote:
| I hear that. Counterpoint: if all you have is a Phillips-head
| screwdriver, all you have is a Phillips-head screwdriver. On
| the other hand, if you have a six-axis CNC mill, well, then
| you have a lot.
|
| I think of this less as audit misses, and more as developing
| a permanently useful tool. For open model weights, humanity
| will not (unless we're talking real zombie apocalypse
| scenarios) lose these weights. They are an incredible global
| asset, so making them more generally useful and figuring out
| how to use them is super helpful.
| skydhash wrote:
| Maybe they are useful. But I think there's more usefulness
| in specialized databases and optimized approaches than
| betting everything on big llms models. Kinda like deriving
| linting rules and combining it with a rule engines to catch
| errors. Efficient and useful instead of continuously
| running a big llm model.
| walleeee wrote:
| While it is hard to argue with the wisdom of crystallizing
| intellectual capital into our tools, I do wonder if these
| models might be as likely to diminish as to develop the
| person using them, in which case we trade an implement's
| iterative improvement for ours, in a way
| vessenes wrote:
| Monks in the Middle Ages: "The Printing Press will
| destroy people's ability to memorize."
|
| This was accurate. But mostly humans gained from books. I
| think we will develop the social technology to use these
| tools over time; giving some things up and gaining
| others.
|
| If we don't, the Amish can just take over and be like
| "Stupid English, using the devil's weights." :)
| jorvi wrote:
| > zero shot
|
| I really wish we would find a different term for this.
|
| Doing something always takes at least one attempt, i.e. "one
| shotting". "Zero shotting" is an oxymoron, which makes it a
| term that only creates more confusion rather than succinctly
| conveying something.
| airstrike wrote:
| It's a shot from position zero
| nmstoker wrote:
| No it isn't. The number of shots (examples) is zero.
| 42lux wrote:
| We say Sure Shot.
| hawk_ wrote:
| Array indexing can start at 0 or 1.
| layer8 wrote:
| For an array of zero shots, the indexing doesn't matter.
| Izkata wrote:
| "One shot" is simply about the action itself, but it says
| nothing about how much preparation was done beforehand. "Zero
| shot" additionally implies without training or preparation.
|
| TCGs have a related "zero turn win" concept, where the
| opponent goes first and you win without getting a turn due to
| the set of cards you randomly drew and being able to activate
| them on the opponent's turn.
| vessenes wrote:
| I think of a shot as an example, _not_ a try: "One shot" is
| "One example". Zero shot is "Zero examples". I don't love it,
| but I don't hate it. Got a better word for it?
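| For instance (hypothetical prompts), the count is of worked
| examples included in the prompt, not of attempts:
|
|   zero_shot = "Classify the sentiment of: 'I loved it.'"
|
|   one_shot = ("Review: 'Terrible service.' -> negative\n"
|               "Review: 'I loved it.' ->")
|
|   few_shot = ("Review: 'Terrible service.' -> negative\n"
|               "Review: 'Best meal ever.' -> positive\n"
|               "Review: 'I loved it.' ->")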
| quantadev wrote:
| My favorite AI term to ridicule is the recent "Test Time
| Compute" nonsense, which has nothing whatsoever to do with
| testing. It literally just means "inference time".
|
| And if I hear someone say "banger", "cooking", "insane", or
| "crazy" one more time, I'm going to sledgehammer my computer.
| Can't someone under 40 please pick up a book and read?
| Yesterday Sam Altman tried to coin "Skillsmaxxing" in a tweet.
| I threw my coffee cup at my laptop.
| ks2048 wrote:
| Speaking of old-timers and "inference time": there was a time
| when "inference" meant inferring parameters from data (i.e.,
| training). And now it means "test time" (or maybe the
| difference is the statistics community vs. the ML community).
|
| e.g. Bishop's textbook says:
|
| 5.2.4 Inference and decision
|
| We have broken the classification problem down into two
| separate stages, the inference stage in which we use
| training data to learn a model for p(Ck|x) and the
| subsequent decision stage in which we use these posterior
| probabilities to make optimal class assignments.
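| In symbols (a paraphrase of Bishop's two stages, not a quote):
|
|   \[
|   \text{inference: } \mathcal{D} \mapsto p(C_k \mid x), \qquad
|   \text{decision: } \hat{k}(x) = \arg\max_k \, p(C_k \mid x)
|   \]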
| nico wrote:
| To people curious or skeptical if this could be called "seeing"
| or "hearing", I recommend listening to the Batman podcast episode
| on NPR (https://www.npr.org/2015/01/23/379134306/batman-pt-1)
|
| Through the story and experience of a blind man, they end up
| getting into the question of what does it mean to see
|
| The podcast is pretty straightforward, but it does end up showing
| that defining "seeing" is a philosophical question, rather than a
| simple obvious answer
| TheCoreh wrote:
| Is the LLM essentially playing "Wordle" with an external system
| that rates the quality of its output, gradually climbing the
| score ladder until it produces good results?
| robocop_legacy wrote:
| I think there is potentially a powerful method here.
| Specifically, the optimal context for a given task can be
| saved, and a meta-learner can be trained to map the task to
| the context. This would allow fine-tuning a model for some
| specific task without retraining the LLM. For example,
| generating an SEM image of some material with a specified
| porosity and grain size.
| v-rt wrote:
| "without training" _describes transfer learning_
| v01rt wrote:
| hey what the hell? it said the username was taken?? bug???
| v01rt wrote:
| "without training" _describes transfer learning with an actor /
| critic approach_
| alex1138 wrote:
| I just remember Zuck's comments about AI and how the idea of it
| dooming our species is a bit silly, etc.
|
| This is the wrong approach to take. At minimum you have to say
| things like "well, yes, we're always on the lookout for this
| kind of thing". With him? Not a care in the world.
___________________________________________________________________
(page generated 2025-04-26 23:00 UTC)