[HN Gopher] V*: Guided Visual Search as a Core Mechanism in Mult...
___________________________________________________________________
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Author : jonbaer
Score : 36 points
Date : 2024-01-16 17:34 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| dbish wrote:
| Interesting, I wonder if there's a similar approach that can be
| taken for complex tasks in the "multimodal" virtual environment
| of a computer desktop or phone, working across/between multiple
| applications and web pages with various inputs and interactions.
| Taking a more general approach that might be closer to how humans
| work than ACT-1 and the like.
___________________________________________________________________
(page generated 2024-01-16 23:00 UTC)