[HN Gopher] ScreenAI: A visual LLM for UI and visually-situated ...
___________________________________________________________________
ScreenAI: A visual LLM for UI and visually-situated language
understanding
Author : gfortaine
Score : 123 points
Date : 2024-04-09 17:15 UTC (5 hours ago)
(HTM) web link (research.google)
(TXT) w3m dump (research.google)
| namanyayg wrote:
| I was looking for something similar recently and had found
| CogAgent[0] that looks quite interesting, has anyone tried
| anything similar?
|
| 0. https://github.com/THUDM/CogVLM?tab=readme-ov-file#gui-
| agent...
| maciejgryka wrote:
| I haven't read through it yet, but there's Ferret-UI from Apple
| (mobile-specific, but I think a lot of learnings are generic)
| https://arxiv.org/abs/2404.05719
| S0y wrote:
| I find it quite ironic that Google are the biggest players in
| creating solutions that actively contribute to defeating their
| very own anti-automation software.
|
| Makes you wonder if the goal of their captcha system was ever
| really to stop people from botting.
| warthog wrote:
| huh exactly what I thought when I saw this
| passion__desire wrote:
| Isn't creating a dataset for this the easiest part? We have the
| source text of the HTML and how it is rendered, with all the
| intermediate info about tags, CSS layout, etc. available from
| most modern browsers.
| spxneo wrote:
| that would be the 4d chess move: imagine when you get a captcha
| with "click 1 thing from things that do not fly" but you're
| actually helping select drone targets somewhere in the Middle East
| htrp wrote:
| https://xkcd.com/1897/
| notahacker wrote:
| this is what happens at the intersection of unlimited VC
| money for "AI" and wannabe entrepreneurs that read Ender's
| Game and thought "business opportunity here"
| echelon wrote:
| "Remove the ads from this page"
|
| I can't wait for AI to become the ultimate ad-removal tool.
|
| There might be an arms race, but the anti-ad side will win as
| long as there isn't a unilateral winner (strongest models,
| biggest platform).
|
| There will be enough of a shake up to the current regime --
| search, browsers, etc. -- that there is opportunity for new
| players to attack multiple fronts. Given choice, I don't think
| users will accept a hamstrung interface that forces a subpar
| experience.
|
| We basically just need to make sure Google, Microsoft/OpenAI,
| or some other industry giant doesn't win or that we don't wind
| up living under a cabal of just a few players.
|
| I'm already hopefully imagining AI agents working for us to not
| just remove advertising noise, but to actively route around all
| of the times and places we're taken advantage of. That would be
| an excellent future.
| notduncansmith wrote:
| "It looks like this entire article is an advertorial piece
| for a book. Would you still like to read it?"
| Xenoamorphous wrote:
| Everything behind a paywall then?
| piecerough wrote:
| "We are also releasing three new datasets: Screen Annotation to
| evaluate the layout understanding capability of the model, as
| well as ScreenQA Short and Complex ScreenQA for a more
| comprehensive evaluation of its QA capability."
|
| Looks useful to me for replicating some things. Good stuff!
| f38zf5vdt wrote:
| Google claims SoTA but it appears that, according to Apple, they
| may already be out of date: https://arxiv.org/abs/2404.05719
| moandcompany wrote:
| The core aspects of this research, datasets, and use cases
| discussed here have been in progress for quite a long time at
| Google (it's been WIP for many many years). The same can
| probably be said of Apple's paper though!
|
| Congrats to all the folks involved :)
| twobitshifter wrote:
| is this similar to Apple's ReALM?
| ilaksh wrote:
| How does this compare to the new GPT-4-turbo vision or Claude 3
| Opus vision? Also, is this open source or can we access it with
| Vertex AI?
| abrichr wrote:
| We haven't been able to use Claude 3 Opus vision yet because
| we're in Canada, but GPT-4-V works extremely well (when
| combined with Segment Anything). See:
| https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (work in
| progress).
|
| Unfortunately we can't compare it to ScreenAI directly since, as
| far as I can tell, it is not generally available. However,
| ScreenAI does not appear to use a separate segmentation step,
| which we needed to implement in order to get good results.
| spxneo wrote:
| can you elaborate on "extremely well"? where is it currently
| falling short?
| abrichr wrote:
| At OpenAdapt we have had excellent results combining Segment
| Anything Model (SAM) with GPT-4 for screen understanding.
|
| Work-in-progress:
| https://github.com/OpenAdaptAI/OpenAdapt/pull/610
| spxneo wrote:
| MIT license too (getting rare)! mad respects, thank you.
___________________________________________________________________
(page generated 2024-04-09 23:00 UTC)