[HN Gopher] ScreenAI: A visual LLM for UI and visually-situated ...
       ___________________________________________________________________
        
       ScreenAI: A visual LLM for UI and visually-situated language
       understanding
        
       Author : gfortaine
       Score  : 123 points
       Date   : 2024-04-09 17:15 UTC (5 hours ago)
        
 (HTM) web link (research.google)
 (TXT) w3m dump (research.google)
        
       | namanyayg wrote:
       | I was looking for something similar recently and had found
       | CogAgent[0] that looks quite interesting, has anyone tried
       | anything similar?
       | 
       | 0. https://github.com/THUDM/CogVLM?tab=readme-ov-file#gui-
       | agent...
        
         | maciejgryka wrote:
          | I haven't read through it yet, but there's Ferret-UI from
          | Apple (mobile-specific, but I think a lot of the learnings
          | are generic)
         | https://arxiv.org/abs/2404.05719
        
       | S0y wrote:
        | I find it quite ironic that Google is among the biggest
        | players in creating solutions that actively contribute to
        | defeating its very own anti-automation software.
       | 
       | Makes you wonder if the goal of their captcha system was ever
       | really to stop people from botting.
        
         | warthog wrote:
         | huh exactly what I thought when I saw this
        
           | passion__desire wrote:
            | Isn't creating a dataset for this the easiest part? We
            | have the HTML source text and how it is rendered, with
            | all the intermediate info about tags, CSS layout, etc.
            | available from most modern browsers.
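The comment above sketches a plausible pipeline: render HTML in a browser and pair the screenshot with per-element layout info (tags, bounding boxes). A minimal sketch of the annotation-serialization half in Python, where the element records are hard-coded stand-ins for what a headless browser (e.g. via getBoundingClientRect) would report, and the annotation format is purely illustrative, not ScreenAI's actual schema:

```python
# Sketch: turn per-element layout info, as a browser could report it,
# into a textual screen annotation for training data. The records below
# are stand-ins for real browser output; the format is hypothetical.

def annotate(elements):
    """Render a list of {tag, text, box} records as one annotation string."""
    parts = []
    for el in elements:
        x, y, w, h = el["box"]
        parts.append(f'{el["tag"].upper()} "{el["text"]}" at ({x}, {y}, {w}, {h})')
    return "; ".join(parts)

# Stand-in for what a headless browser (e.g. Playwright/Selenium) would dump:
dom_dump = [
    {"tag": "button", "text": "Submit", "box": (10, 20, 80, 24)},
    {"tag": "a", "text": "Help", "box": (100, 20, 40, 16)},
]

print(annotate(dom_dump))
```

In a real pipeline the `dom_dump` list would come from evaluating JavaScript in the rendered page, so the layout info reflects actual CSS computation rather than static markup.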
        
         | spxneo wrote:
          | that would be the 4d chess move: imagine when you get a
          | captcha with "click 1 thing from things that do not fly"
          | but you're actually helping select drone targets somewhere
          | in the Middle East
        
           | htrp wrote:
           | https://xkcd.com/1897/
        
           | notahacker wrote:
           | this is what happens at the intersection of unlimited VC
           | money for "AI" and wannabe entrepreneurs that read Ender's
           | Game and thought "business opportunity here"
        
         | echelon wrote:
         | "Remove the ads from this page"
         | 
         | I can't wait for AI to become the ultimate ad-removal tool.
         | 
         | There might be an arms race, but the anti-ad side will win as
         | long as there isn't a unilateral winner (strongest models,
         | biggest platform).
         | 
         | There will be enough of a shake up to the current regime --
         | search, browsers, etc. -- that there is opportunity for new
         | players to attack multiple fronts. Given choice, I don't think
         | users will accept a hamstrung interface that forces a subpar
         | experience.
         | 
         | We basically just need to make sure Google, Microsoft/OpenAI,
         | or some other industry giant doesn't win or that we don't wind
         | up living under a cabal of just a few players.
         | 
         | I'm already hopefully imagining AI agents working for us to not
         | just remove advertising noise, but to actively route around all
         | of the times and places we're taken advantage of. That would be
         | an excellent future.
        
           | notduncansmith wrote:
           | "It looks like this entire article is an advertorial piece
           | for a book. Would you still like to read it?"
        
           | Xenoamorphous wrote:
           | Everything behind a paywall then?
        
       | piecerough wrote:
       | "We are also releasing three new datasets: Screen Annotation to
       | evaluate the layout understanding capability of the model, as
       | well as ScreenQA Short and Complex ScreenQA for a more
       | comprehensive evaluation of its QA capability."
       | 
       | Looks useful to me for replicating some things. Good stuff!
        
       | f38zf5vdt wrote:
       | Google claims SoTA but it appears that, according to Apple, they
       | may already be out of date: https://arxiv.org/abs/2404.05719
        
         | moandcompany wrote:
         | The core aspects of this research, datasets, and use cases
         | discussed here have been in progress for quite a long time at
         | Google (it's been WIP for many many years). The same can
         | probably be said of Apple's paper though!
         | 
         | Congrats to all the folks involved :)
        
       | twobitshifter wrote:
        | Is this similar to Apple's ReALM?
        
       | ilaksh wrote:
       | How does this compare to the new GPT-4-turbo vision or Claude 3
       | Opus vision? Also, is this open source or can we access it with
       | Vertex AI?
        
         | abrichr wrote:
         | We haven't been able to use Claude 3 Opus vision yet because
         | we're in Canada, but GPT-4-V works extremely well (when
         | combined with Segment Anything). See:
         | https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (work in
         | progress).
         | 
         | Unfortunately we can't compare it to ScreenAI directly since as
         | far as I can tell it is not generally available. However
         | ScreenAI does not appear to use a separate segmentation step,
         | which we needed to implement in order to get good results.
        
           | spxneo wrote:
           | can you elaborate on "extremely well"? where is it currently
           | falling short?
        
       | abrichr wrote:
       | At OpenAdapt we have had excellent results combining Segment
       | Anything Model (SAM) with GPT-4 for screen understanding.
       | 
       | Work-in-progress:
       | https://github.com/OpenAdaptAI/OpenAdapt/pull/610
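The SAM + GPT-4 combination described above is a two-stage pipeline: first segment the screenshot into candidate regions, then query a vision-language model about each region. A toy sketch of that shape in Python, with SAM replaced by a naive fixed-grid segmenter and the GPT-4V call by a stub (neither implementation appears in the source, and the function names are hypothetical, not OpenAdapt's API):

```python
# Two-stage screen-understanding sketch: segment first, then describe.
# SAM is stubbed out with a fixed-grid segmenter and the vision-model
# call with a placeholder, purely to show the pipeline's structure; a
# real implementation would call segment-anything and a vision API.

def grid_segments(width, height, cols=2, rows=2):
    """Stand-in for SAM: split the screenshot into a fixed grid of boxes."""
    cw, ch = width // cols, height // rows
    return [(c * cw, r * ch, cw, ch) for r in range(rows) for c in range(cols)]

def describe_region(box):
    """Stub for a vision-language model call on one cropped region."""
    return f"region at {box}"

def understand_screen(width, height):
    # 1. Segment the screenshot into candidate UI regions.
    boxes = grid_segments(width, height)
    # 2. Query the model per region and collect structured results.
    return [{"box": b, "description": describe_region(b)} for b in boxes]

print(len(understand_screen(1920, 1080)))  # 4 regions from a 2x2 grid
```

The separate segmentation step matters because, as the comment notes, prompting the vision model on the whole screenshot at once tends to give worse grounding than querying it per region.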
        
         | spxneo wrote:
         | MIT license too (getting rare)! mad respects, thank you.
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:00 UTC)