ScreenAI: A visual language model for UI and visually-situated language understanding

March 19, 2024

Posted by Srinivas Sunkara and Gilles Baechler, Software Engineers, Google Research

We introduce ScreenAI, a vision-language model for user interfaces and infographics that achieves state-of-the-art results on UI- and infographics-based tasks. We are also releasing three new datasets: Screen Annotation, to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability.

Screen user interfaces (UIs) and infographics, such as charts, diagrams, and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), which offers an opportunity to build a single model that can understand, reason about, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge.

To that end, we introduce "ScreenAI: A Vision-Language Model for UI and Infographics Understanding". ScreenAI improves upon the PaLI architecture with the flexible patching strategy from pix2struct. We train ScreenAI on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify UI element information (i.e., type, location, and description) on a screen. These text annotations provide large language models (LLMs) with screen descriptions, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (WebSRC and MoTIF), and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size. We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability.

ScreenAI

ScreenAI's architecture is based on PaLI, composed of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a vision transformer (ViT) that creates image embeddings and a multimodal encoder that takes the concatenation of the image and text embeddings as input. This flexible architecture allows ScreenAI to solve vision tasks that can be recast as text+image-to-text problems. On top of the PaLI architecture, we employ the flexible patching strategy introduced in pix2struct: instead of using a fixed-grid pattern, the grid dimensions are selected such that they preserve the native aspect ratio of the input image. This enables ScreenAI to work well across images of various aspect ratios.
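To make the patching idea concrete, here is a minimal sketch of how an aspect-ratio-preserving grid can be chosen under a fixed patch budget. The function name, patch size, and budget are illustrative assumptions in the spirit of pix2struct's variable-resolution patching, not the exact procedure used in ScreenAI.

```python
import math

def aspect_preserving_grid(height: int, width: int,
                           patch_size: int = 16, max_patches: int = 1024):
    """Pick a patch grid that keeps the input image's native aspect ratio.

    Sketch only: scale the image so the rows x cols grid fits a fixed patch
    budget, rather than resizing every screenshot to one square grid.
    """
    # Largest uniform scale whose resulting grid still fits the patch budget.
    scale = math.sqrt(max_patches * patch_size**2 / (height * width))
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    return rows, cols  # rows/cols stays close to height/width

# A tall mobile screenshot and a wide desktop screenshot get different grids,
# each matching its own aspect ratio instead of being squashed to a square.
print(aspect_preserving_grid(2400, 1080))  # portrait phone screen
print(aspect_preserving_grid(1080, 1920))  # landscape desktop screen
```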
The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. First, self-supervised learning is applied to automatically generate data labels, which are then used to train ViT and the language model. ViT is frozen during the fine-tuning stage, where most data used is manually labeled by human raters.

ScreenAI model architecture.

Data generation

To create a pre-training dataset for ScreenAI, we first compile an extensive collection of screenshots from various devices, including desktops, mobile, and tablets. This is achieved by using publicly accessible web pages and following the programmatic exploration approach used for the RICO dataset for mobile apps. We then apply a layout annotator, based on the DETR model, that identifies and labels a wide range of UI elements (e.g., image, pictogram, button, text) and their spatial relationships. Pictograms undergo further analysis using an icon classifier capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed through icons. For icons that are not covered by the classifier, and for infographics and images, we use the PaLI image captioning model to generate descriptive captions that provide contextual information. We also apply an optical character recognition (OCR) engine to extract and annotate textual content on screen. We combine the OCR text with the previous annotations to create a detailed description of each screen.

A mobile app screenshot with generated annotations that include UI elements and their descriptions, e.g., TEXT elements also contain the text content from OCR, IMAGE elements contain image captions, LIST_ITEMs contain all their child elements.

LLM-based data generation

We enhance the pre-training data's diversity using PaLM 2 to generate input-output pairs in a two-step process. First, screen annotations are generated using the technique outlined above, then we craft a prompt around this schema for the LLM to create synthetic data. This process requires prompt engineering and iterative refinement to find an effective prompt. We assess the generated data's quality through human validation against a quality threshold.

You only speak JSON. Do not write text that isn't JSON.
You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them?
The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows:
questions: [
  {{question: the question, answer: the answer}},
  ...
]
{THE SCREEN SCHEMA}

A sample prompt for QA data generation.
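As a minimal sketch of this second step, the snippet below wraps a screen schema in the sample prompt and parses the model's JSON reply into QA pairs. The `llm_complete` callable, the wrapper function, and the error handling are assumptions standing in for the actual PaLM 2 API and validation pipeline.

```python
import json

# Prompt adapted from the sample above; "{screen_schema}" stands in for the
# textual screen description produced by the annotation pipeline.
QA_PROMPT = (
    "You only speak JSON. Do not write text that isn't JSON.\n"
    "You are given the following mobile screenshot, described in words. "
    "Can you generate 5 questions regarding the content of the screenshot "
    "as well as the corresponding short answers to them?\n"
    "The answer should be as short as possible, containing only the "
    "necessary information. Your answer should be structured as follows:\n"
    "questions: [{{question: the question, answer: the answer}}, ...]\n"
    "{screen_schema}\n"
)

def generate_qa_pairs(screen_schema: str, llm_complete) -> list[dict]:
    """Ask an LLM for QA pairs about a screen and parse its JSON reply.

    `llm_complete` is a placeholder for whatever text-generation API is
    available (the blog uses PaLM 2); it takes a prompt string and returns
    the model's raw text output.
    """
    raw = llm_complete(QA_PROMPT.format(screen_schema=screen_schema))
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return []  # drop malformed generations; kept samples are human-validated
    return reply.get("questions", [])
```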
By combining the natural language capabilities of LLMs with a structured schema, we simulate a wide range of user interactions and scenarios to generate synthetic, realistic tasks. In particular, we generate three categories of tasks:

* Question answering: The model is asked to answer questions regarding the content of the screenshots, e.g., "When does the restaurant open?"
* Screen navigation: The model is asked to convert a natural language utterance into an executable action on a screen, e.g., "Click the search button." (see the sketch below)
* Screen summarization: The model is asked to summarize the screen content in one or two sentences.

Block diagram of our workflow for generating data for QA, summarization and navigation tasks using existing ScreenAI models and LLMs. Each task uses a custom prompt to emphasize desired aspects, such as questions that involve counting or reasoning.

LLM-generated data: examples for screen QA, navigation and summarization. For navigation, the action bounding box is displayed in red on the screenshot.
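For the screen navigation examples above, a training target has to tie an utterance like "Click the search button" to a concrete element and its location. The dataclass, field names, and toy string-matching resolver below are purely illustrative assumptions; the blog does not spell out the exact action format, and ScreenAI itself predicts the action directly from the screenshot and utterance rather than via lookup.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClickAction:
    """Hypothetical navigation target: which element to click and where it is."""
    element_id: str
    bbox: tuple  # (x0, y0, x1, y1) in normalized screen coordinates

def utterance_to_target(utterance: str, elements: list) -> Optional[ClickAction]:
    """Toy resolver that matches an utterance against element descriptions
    from a screen schema, only to illustrate what a target could contain."""
    text = utterance.lower()
    for el in elements:
        if el.get("description", "").lower() in text:
            return ClickAction(element_id=el["id"], bbox=tuple(el["bbox"]))
    return None

elements = [
    {"id": "btn_search", "description": "search button",
     "bbox": (0.82, 0.05, 0.95, 0.10)},
]
print(utterance_to_target("Click the search button", elements))
```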
Experiments and results

As previously mentioned, ScreenAI is trained in two stages: pre-training and fine-tuning. Pre-training data labels are obtained using self-supervised learning, while fine-tuning data labels come from human raters.

We fine-tune ScreenAI using public QA, summarization, and navigation datasets and a variety of tasks related to UIs. For QA, we use well-established benchmarks in the multimodal and document understanding field, such as ChartQA, DocVQA, Multi-page DocVQA, InfographicVQA, OCR-VQA, WebSRC, and ScreenQA. For navigation, datasets used include Referring Expressions, MoTIF, MUG, and Android in the Wild. Finally, we use Screen2Words for screen summarization.

Along with the fine-tuning datasets, we evaluate the fine-tuned ScreenAI model using three novel benchmarks:

1. Screen Annotation: Evaluates the model's layout annotation and spatial understanding capabilities.
2. ScreenQA Short: A variation of ScreenQA where the ground-truth answers have been shortened to contain only the relevant information, which aligns better with other QA tasks.
3. Complex ScreenQA: Complements ScreenQA Short with more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios.

The fine-tuned ScreenAI model achieves state-of-the-art results on various UI- and infographic-based tasks (WebSRC and MoTIF) and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size. ScreenAI achieves competitive performance on Screen2Words and OCR-VQA. Additionally, we report results on the new benchmark datasets introduced here to serve as a baseline for further research.

Comparing model performance of ScreenAI with state-of-the-art (SOTA) models of similar size.

Next, we examine ScreenAI's scaling capabilities and observe that across all tasks, increasing the model size improves performance, and the improvements have not saturated at the largest size.

Model performance increases with size, and the performance has not saturated even at the largest size of 5B parameters.

Conclusion

We introduce the ScreenAI model along with a unified representation that enables us to develop self-supervised learning tasks leveraging data from all these domains. We also illustrate the impact of data generation using LLMs and investigate improving model performance on specific aspects by modifying the training mixture. We apply all of these techniques to build multi-task trained models that perform competitively with state-of-the-art approaches on a number of public benchmarks. However, we also note that our approach still lags behind large models, and further research is needed to bridge this gap.

Acknowledgements

This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma. We thank Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and discussions, along with Rahul Aralikatte, Hao Cheng, and Daniel Kim for their support in data preparation. We also thank Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their leadership, vision and support. We are very grateful to Tom Small for helping us create the animation in this post.

Labels: Human-Computer Interaction and Visualization, Machine Intelligence