https://research.google/blog/a-return-to-hand-written-notes-by-learning-to-read-write/
A return to hand-written notes by learning to read & write

October 28, 2024

Blagoj Mitrevski, Software Engineer, and Andrii Maksai, Staff Software Engineer, Google Research

We present a model that converts photos of handwriting into a digital format reproducing the component pen strokes, without the need for specialized equipment.

Quick links

* Paper
* GitHub Repo
* HuggingFace Repo
* Additional Info

Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in vectorized form. However, a substantial gap remains between digital note-taking and traditional pen-and-paper note-taking, a practice still favored by a majority of people. Bridging this gap by converting a note taker's physical writing into a digital form is a process called derendering. The result is a sequence of strokes, or trajectories of a writing instrument like a pen or finger, recorded as points and stored digitally. This is also known as an "online" representation of writing, or "digital ink".
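To make this representation concrete, here is a minimal illustrative sketch of digital ink as a list of strokes, each a list of timestamped points. The type names are our own, not taken from the InkSight codebase:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    # Ordered (x, y, t) samples along one pen-down trajectory;
    # t is the time in seconds since the stroke began.
    points: List[Tuple[float, float, float]]

@dataclass
class Ink:
    # An ink is an ordered list of strokes, e.g., a handwritten word.
    strokes: List[Stroke]

# Example: a lowercase "i" written in two strokes (the body, then the dot).
ink = Ink(strokes=[
    Stroke(points=[(6.0, 5.0, 0.00), (6.0, 0.0, 0.12)]),
    Stroke(points=[(6.0, 8.0, 0.30), (6.1, 8.1, 0.31)]),
])
```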
The conversion to digital ink offers users who still prefer traditional handwritten notes access to their notes in digital form. Simply applying optical character recognition (OCR) would only transcribe the writing into a text document. Capturing the handwritten document as a collection of strokes instead makes it possible to reproduce it in a form that can be edited freely by hand in a natural way, and to create documents with a realistic look that captures the writer's handwriting style, rather than a plain collection of text. This representation allows the user to later inspect, modify, or complete their handwritten notes, giving the notes enhanced durability, seamless organization, and integration with other digital content (images, text, links) or digital assistance. For these reasons, the field has gained significant interest in both academia and industry, with software solutions that digitize handwriting and hardware solutions that leverage smart pens or special paper for capture. The need for additional hardware and an accompanying software stack is, however, an obstacle to wider adoption, as it both creates onboarding friction and carries additional expense for the user.

With this in mind, in "InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write", we propose an approach to derendering that can take a picture of a handwritten note and extract the strokes that generated the writing, without the need for specialized equipment. We also remove the reliance on typical geometric constructs, where gradients, contours, and shapes in an image are used to extract writing strokes. Instead, we train the model to build an understanding of "reading", so it can recognize written words, and of "writing", so it can output strokes that resemble handwriting. This results in a more robust model that performs well across diverse scenarios and appearances, including challenging lighting conditions, occlusions, and more. You can access the model and the inference code on our GitHub repo.

Overview

The key goal of this approach is to capture the stroke-level trajectory details of handwriting. The user can then store the resulting strokes in the note-taking app of their choice.

[Figure] Left: offline handwriting. Right: output digital ink (online handwriting). In every word, character colors transition from red to purple, following the rainbow sequence (ROYGBIV). Within each stroke, the shade progresses from darker to lighter.

Under the hood, we apply an off-the-shelf OCR model to identify handwritten words, then use the model to convert them to strokes. To foster reproducibility, reusability, and ease of adoption, we combine the widely used and readily available ViT encoder with an mT5 encoder-decoder.

Challenges

While the fundamental concept of derendering appears straightforward (training a model that generates digital ink representations from input images), the practical implementation for arbitrary input images presents two significant challenges:

1. Limited supervised data: Acquiring paired data with corresponding images and ground-truth digital ink for supervised training is expensive and time-consuming. To our knowledge, no datasets with sufficient variety exist for this task.
2. Scalability to large images: The model must effectively handle arbitrarily large input images with varying resolutions and amounts of content.

Method

Learning to read and write

To address the first challenge while avoiding onerous data collection, we propose a multi-task training setup that combines recognition and derendering tasks. This enables the model to generalize to derendering across varied styles of input images, and injects into the model both semantic understanding and knowledge of the mechanics of handwriting. The approach thus differs from methods that rely on geometric constructs, where gradients, contours, and shapes in an image are used to extract writing strokes. Learning to read enhances the model's ability to precisely locate and extract textual elements from images. Learning to write ensures that the resulting vector representation, the digital ink, closely aligns with the typical human approach to writing in terms of physical dynamics and stroke order. Combined, these allow us to train a model in the absence of large amounts of paired samples, which are difficult to obtain.

System workflow

One solution to the scalability problem would be to train a model with very high-resolution input images and very long output sequences, but this is computationally prohibitive. Instead, we break the derendering of a page of notes into three steps: (1) run OCR to extract word-level bounding boxes, (2) derender each word separately, and (3) replace the offline (pixel) representation of each word with the derendered strokes, using the color coding described above to improve visualization.

To narrow the domain gap between synthetic images of rendered inks and real photos, we augment the data in tasks that take rendered ink as input. Augmentation randomizes the ink angle, color, and stroke width, and adds Gaussian noise and cluttered backgrounds, as sketched below.
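The sketch below illustrates those augmentations, assuming strokes are lists of (x, y) points and rendered images are grayscale uint8 NumPy arrays; the function names and parameter ranges are our own choices, not values from the paper:

```python
import math
import random
import numpy as np

def rotate_ink(strokes, max_angle_deg=15.0):
    """Rotate all stroke points around the origin by a random angle."""
    theta = math.radians(random.uniform(-max_angle_deg, max_angle_deg))
    c, s = math.cos(theta), math.sin(theta)
    return [[(c * x - s * y, s * x + c * y) for (x, y) in stroke]
            for stroke in strokes]

def sample_render_params():
    """Randomize the appearance parameters passed to the rasterizer."""
    return {
        "color": tuple(random.randint(0, 255) for _ in range(3)),
        "stroke_width": random.uniform(1.0, 4.0),
    }

def add_gaussian_noise(image, sigma=8.0):
    """Add zero-mean Gaussian pixel noise to a rendered uint8 bitmap."""
    noisy = image.astype(np.float32) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def composite_background(image, background):
    """Paste dark ink pixels over a cluttered background crop."""
    mask = image < 128  # assumes dark ink on light paper
    out = background.copy()
    out[mask] = image[mask]
    return out
```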
Vision-language model for digital ink

We create a training mixture that comprises five task types. The first two are derendering tasks (i.e., they generate digital ink output): one uses only an image as input, while the other uses both an image and the accompanying text recognized by the OCR model. The next two are recognition tasks that produce text output, the first using real images and the second, synthetic ones. The fifth task combines recognition and derendering, producing mixed text-and-ink output. Each task type uses a task-specific input text, enabling the model to distinguish between tasks during both training and inference. Below are examples of a derendering task and a recognition task.

[Figure] Derendering with text: takes an image and a text input, and outputs the ink that would generate that text in the style of the image.

[Figure] Recognition of synthetic images: takes an image and recognizes what is written within it.

To train the system, we pair images of text with corresponding digital ink. The digital ink is sampled from real-time writing trajectories and represented as a sequence of strokes. Each stroke is a sequence of points, obtained by sampling the writing or drawing trajectory at a constant rate (e.g., 50 points per second). The corresponding image is created by rendering the ink, that is, creating a bitmap at a prespecified resolution. This establishes a pixel-stroke correspondence that is a precursor for the model's input-output pairs.

A further necessary step, and a unique one for this modality, is the ink tokenizer, which represents the points in a format friendly to a large language model (LLM). Each point is converted into two tokens, one each encoding its x and y coordinates, and the token sequence for each stroke begins with a token b, signifying the beginning of the stroke.

[Figure] Illustration of ink tokenization for a single-stroke ink. The dark red line depicts the ink stroke, with numbered circles marking the points sampled after time resampling; the color gradient of the sampled points (1-7) indicates point order. Each point is represented by two tokens encoding its x coordinate (left half of the shaded box) and y coordinate (right half). The token sequence begins with b, signifying the beginning of the stroke, followed by the tokens for the coordinates of the sampled points.
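A minimal sketch of that tokenization scheme follows, assuming point coordinates have already been resampled in time and normalized to [0, 1]; the quantization grid size and the token spellings are our illustrative choices, not the model's actual vocabulary:

```python
def tokenize_ink(strokes, grid_size=224):
    """strokes: list of strokes, each a list of (x, y) points in [0, 1]."""
    tokens = []
    for stroke in strokes:
        tokens.append("b")  # begin-of-stroke marker
        for x, y in stroke:
            # One token per quantized coordinate: two tokens per point.
            tokens.append(f"x{min(int(x * grid_size), grid_size - 1)}")
            tokens.append(f"y{min(int(y * grid_size), grid_size - 1)}")
    return tokens

# A single-stroke ink with two sampled points:
print(tokenize_ink([[(0.1, 0.5), (0.2, 0.55)]]))
# ['b', 'x22', 'y112', 'x44', 'y123']
```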
Results

To evaluate the performance of our approach, we first collected an evaluation dataset. We started with OCR data, then added paired samples that we collected manually by asking people to trace the text images they were shown (human-generated traces). We then trained three variants of the model: Small-p (~340M parameters; "-p" for "public" setup), Small-i ("-i" for "in-house"), and Large-i (~1B parameters), and compared our approach to a General Virtual Sketching (GVS) baseline. We show that the vector representations produced by our system are both semantically and geometrically similar to the input images, and are similar to human-generated digital ink data, as measured by both automatic and human evaluations.

Qualitative evaluation

We show the performance of our models and GVS on two public evaluation datasets, IAM and IMGUR5K, and on an out-of-domain dataset of sketches. Our models mostly produce results that accurately reflect the text content while disregarding semantically irrelevant background. They can also handle occlusions, highlighting the benefit of the learned reading prior. In contrast, GVS produces multiple duplicate strokes and has difficulty distinguishing between background and foreground. Our Large-i model is further able to retain more details and accommodate more diverse image styles. See the paper for more examples.

[Figure] Comparison of GVS, Small-i, Small-p, and Large-i on two public evaluation datasets (rows 1-3: IAM; rows 4-6: IMGUR5K).

[Figure] Out-of-domain behavior: sketch derendering for Small-p, Small-i, Large-i, and GVS. Our models are mostly able to derender simple sketches, although they still exhibit significant artifacts, such as extraneous or misaligned strokes.

Quantitative evaluation

The field has not yet established metrics or benchmarks for quantitative evaluation of this task, so we conducted both human and automated evaluations to compare the similarity of our model outputs to the original images and to human-generated digital inks. Here we present the human evaluation results; numerous other results from automated evaluations, along with an ablation study, can be found in our paper.

We performed a human evaluation of the quality of the derendered inks produced by the three model variants. We used the "golden" human-traced data from the HierText dataset as the control group and the output of our models on these samples as the experimental group.

[Figure] Comparison of the performance of our models (Small-p, Small-i, and Large-i) and manual tracing on two text samples of varying difficulty from the HierText dataset.

In the figure above, notice the error in the quote for all models on the top row (the double-quote mark), which the human tracing got right. On the bottom row the situation is reversed: the human tracing focuses solely on the main word, missing most other elements. The human tracing is also not perfectly aligned with the underlying image, emphasizing the complexity and tracing difficulty of the handwritten parts of the HierText dataset.

Evaluators were shown the original image alongside a rendered digital ink sample, which was either model-generated or human-traced (unknown to the evaluators). They were asked two questions: (1) Is the digital ink output a reasonable tracing of the input image? (Answers: "Yes, it's a good tracing"; "It's an okay tracing, but has some small errors"; "It's a bad tracing, has some major artifacts".) (2) Could this digital ink output have been produced by a human? (Answers: "Yes" or "No".) The evaluation included 16 individuals familiar with digital ink but not involved in this research. Each sample was evaluated by three raters, and their answers were aggregated by majority voting, as sketched below.

[Figure] Human evaluation results.

The results show that the majority of derendered inks generated with the Large-i model perform about as well as human-generated ones. Moreover, 87% of the Large-i outputs were rated as good or as having only small errors.
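The per-sample aggregation is plain majority voting over the three raters' answers; a minimal sketch, with answer labels abbreviated for illustration:

```python
from collections import Counter

def majority_vote(ratings):
    """Return the most common answer among the raters (ties broken arbitrarily)."""
    return Counter(ratings).most_common(1)[0][0]

answers = ["good", "okay, small errors", "good"]
print(majority_vote(answers))  # "good"
```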
Conclusion

In this work we present a first-of-its-kind approach to converting photos of handwriting into digital ink. We propose a training setup that works without paired training data. We show that our method is robust to a variety of inputs, can work on full handwritten notes, and generalizes to out-of-domain sketches to some extent. Furthermore, our approach does not require complex modeling and can be constructed from standard building blocks.

Acknowledgements

We want to thank all the authors of this work, Arina Rak, Julian Schnitzler, and Chengkun Li, who formed a student team working with Google Research for the duration of the project, as well as Claudiu Musat, Henry Rowley, and Jesse Berent. All authors, with the exception of the student team, are now part of Google DeepMind.

Labels:

* Generative AI
* Human-Computer Interaction and Visualization
* Machine Perception