[HN Gopher] V*: Guided Visual Search as a Core Mechanism in Mult...
       ___________________________________________________________________
        
       V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
        
       Author : jonbaer
       Score  : 36 points
       Date   : 2024-01-16 17:34 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | dbish wrote:
       | Interesting, I wonder if there's a similar approach that can be
       | taken for complex tasks in the "multimodal" virtual environment
       | of a computer desktop or phone, working across/between multiple
       | applications and web pages with various inputs and interactions.
       | Taking a more general approach that might be closer to how humans
       | work than ACT-1 and the like.
        
       ___________________________________________________________________
       (page generated 2024-01-16 23:00 UTC)