https://github.com/clovaai/donut

clovaai / donut Public

Official Implementation of OCR-free Document Understanding
Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV
2022

arxiv.org/abs/2111.15664


             Donut: Document Understanding Transformer

  Official Implementation of Donut and SynthDoG | Paper | Slide | Poster

 Introduction

Donut, the Document understanding transformer, is a method for
document understanding built on an OCR-free, end-to-end Transformer
model. Donut does not require off-the-shelf OCR engines or APIs, yet
it achieves state-of-the-art performance on various visual document
understanding tasks, such as visual document classification and
information extraction (a.k.a. document parsing). In addition, we
present SynthDoG, a Synthetic Document Generator, which makes model
pre-training flexible across languages and domains.

Our academic paper, which describes our method in detail and provides
full experimental results and analyses, can be found here:

    OCR-free Document Understanding Transformer.
    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung
    Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han,
    Seunghyun Park. In ECCV 2022.

 Pre-trained Models and Web Demos

Gradio web demos are available!

  * You can run the demo with the ./app.py file.
  * Sample images are available at ./misc, and more receipt images
    are available at the CORD dataset link.
  * Web demos are available from the links in the following table.

Task                                Sec/Img  Score  Trained Model                        Demo
CORD (Document Parsing)             0.7      91.3   donut-base-finetuned-cord-v2 (1280)  gradio space web demo, google colab demo
                                    0.7      91.1   donut-base-finetuned-cord-v1 (1280)
                                    1.2      90.9   donut-base-finetuned-cord-v1-2560
Train Ticket (Document Parsing)     0.6      98.7   donut-base-finetuned-zhtrainticket   google colab demo
RVL-CDIP (Document Classification)  0.75     95.3   donut-base-finetuned-rvlcdip         gradio space web demo, google colab demo
DocVQA Task1 (Document VQA)         0.78     67.5   donut-base-finetuned-docvqa          gradio space web demo, google colab demo

The links to the pre-trained backbones are here:

  * donut-base: trained with 64 A100 GPUs (~2.5 days), number of
    layers (encoder: {2,2,14,2}, decoder: 4), input size 2560x1920,
    swin window size 10, IIT-CDIP (11M) and SynthDoG (English,
    Chinese, Japanese, Korean, 0.5M x 4).
  * donut-proto: (preliminary model) trained with 8 V100 GPUs (~5
    days), number of layers (encoder: {2,2,18,2}, decoder: 4), input
    size 2048x1536, swin window size 8, and SynthDoG (English,
    Japanese, Korean, 0.4M x 3).

Please see our paper for more details.

 SynthDoG datasets

The links to the SynthDoG-generated datasets are here:

  * synthdog-en: English, 0.5M.
  * synthdog-zh: Chinese, 0.5M.
  * synthdog-ja: Japanese, 0.5M.
  * synthdog-ko: Korean, 0.5M.

To generate synthetic datasets with our SynthDoG, please see
./synthdog/README.md and our paper for details.

 Updates

2022-11-14: New version 1.0.9 is released (pip install donut-python
--upgrade). See the 1.0.9 Release Notes.

2022-08-12: Donut is also available at huggingface/transformers
(contributed by @NielsRogge). donut-python loads the pre-trained
weights from the official branch of the model repositories. See the
1.0.5 Release Notes.

2022-08-05: A well-executed hands-on tutorial on Donut is published
at Towards Data Science (written by @estaudere).

2022-07-20: First commit. We release our code, model weights,
synthetic data and generator.

 Software installation

pip install donut-python

or clone this repository and install the dependencies:

git clone https://github.com/clovaai/donut.git
cd donut/
conda create -n donut_official python=3.7
conda activate donut_official
pip install .

We tested donut with:

  * torch == 1.11.0+cu113
  * torchvision == 0.12.0+cu113
  * pytorch-lightning == 1.6.4
  * transformers == 4.11.3
  * timm == 0.5.4

 Getting Started

 Data

This repository assumes the following dataset structure:

> tree dataset_name
dataset_name
+-- test
|   +-- metadata.jsonl
|   +-- {image_path0}
|   +-- {image_path1}
|             .
|             .
+-- train
|   +-- metadata.jsonl
|   +-- {image_path0}
|   +-- {image_path1}
|             .
|             .
+-- validation
    +-- metadata.jsonl
    +-- {image_path0}
    +-- {image_path1}
              .
              .

> cat dataset_name/test/metadata.jsonl
{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
     .
     .

  * The metadata.jsonl file is in JSON Lines text format, i.e.,
    .jsonl. Each line consists of
      + file_name : relative path to the image file.
      + ground_truth : string format (json dumped); the dictionary
        contains either gt_parse or gt_parses. Other fields
        (metadata) can be added to the dictionary but will not be
        used.
  * donut interprets all tasks as a JSON prediction problem. As a
    result, all donut model training shares the same pipeline. For
    training and inference, the only thing to do is to prepare
    gt_parse or gt_parses for the task in the format described below.
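
As a minimal sketch of the layout described above, the following
helper writes a metadata.jsonl file in which ground_truth is a
JSON-dumped string wrapping gt_parse. The function name
write_metadata and its arguments are hypothetical and not part of
donut; only the file name and the two keys come from this README.

```python
import json
from pathlib import Path


def write_metadata(split_dir, samples):
    """Write one JSON line per (file_name, gt_parse) pair to split_dir/metadata.jsonl."""
    split_dir = Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    lines = []
    for file_name, gt_parse in samples:
        # ground_truth is itself a JSON-dumped string, so the parse is encoded twice.
        ground_truth = json.dumps({"gt_parse": gt_parse})
        lines.append(json.dumps({"file_name": file_name, "ground_truth": ground_truth}))
    out_path = split_dir / "metadata.jsonl"
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```

For example, write_metadata("dataset_name/train", [("img0.png",
{"class": "presentation"})]) would produce a single line for a
classification sample, assuming img0.png sits in the same directory.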

 For Document Classification

The gt_parse follows the format of {"class" : {class_name}}, for
example, {"class" : "scientific_report"} or {"class" :
"presentation"}.

  * Google colab demo is available here.
  * Gradio web demo is available here.

 For Document Information Extraction

The gt_parse is a JSON object that contains full information of the
document image, for example, the JSON object for a receipt may look
like {"menu" : [{"nm": "ICE BLACKCOFFEE", "cnt": "2", ...}, ...],
...}.

  * More examples are available at CORD dataset.
  * Google colab demo is available here.
  * Gradio web demo is available here.
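
The training log in the Training section prints predictions as tagged
sequences such as <s_nm>Lemon Tea (L)</s_nm>. The sketch below shows
one way a nested gt_parse could be flattened into that form. The
function name json_to_tokens and its rules (keys become
<s_key>...</s_key>, list items are joined with <sep/>) are assumptions
inferred from that log output, not the library's actual conversion
code, and the sample dictionary is a hypothetical reconstruction.

```python
def json_to_tokens(obj):
    """Flatten a nested gt_parse into a tagged string (illustrative only)."""
    if isinstance(obj, dict):
        # Each key wraps its value in an opening and a closing tag.
        return "".join(
            f"<s_{key}>{json_to_tokens(value)}</s_{key}>" for key, value in obj.items()
        )
    if isinstance(obj, list):
        # Assumption: sibling items are separated by <sep/>.
        return "<sep/>".join(json_to_tokens(item) for item in obj)
    return str(obj)


# Hypothetical gt_parse for a one-item receipt.
receipt = {
    "menu": {"nm": "Lemon Tea (L)", "cnt": "1", "price": "25.000"},
    "total": {"total_price": "25.000", "cashprice": "30.000", "changeprice": "5.000"},
}
print(json_to_tokens(receipt))
```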

 For Document Visual Question Answering

The gt_parses follows the format of [{"question" :
{question_sentence}, "answer" : {answer_candidate_1}}, {"question" :
{question_sentence}, "answer" : {answer_candidate_2}}, ...], for
example, [{"question" : "what is the model name?", "answer" :
"donut"}, {"question" : "what is the model name?", "answer" :
"document understanding transformer"}].

  * DocVQA Task1 has multiple answers, hence gt_parses should be a
    list of dictionaries, each containing a question-answer pair.
  * Google colab demo is available here.
  * Gradio web demo is available here.
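
A small sketch of how the gt_parses list could be assembled when one
question has several accepted answers. The helper name
build_vqa_ground_truth is hypothetical; the keys gt_parses, question
and answer follow the format described above.

```python
import json


def build_vqa_ground_truth(question, answers):
    """Return the JSON-dumped ground_truth string for one question with many answers."""
    # One dictionary per answer candidate, each repeating the question.
    gt_parses = [{"question": question, "answer": answer} for answer in answers]
    return json.dumps({"gt_parses": gt_parses})


ground_truth = build_vqa_ground_truth(
    "what is the model name?",
    ["donut", "document understanding transformer"],
)
print(ground_truth)
```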

 For (Pseudo) Text Reading Task

The gt_parse looks like {"text_sequence" : "word1 word2 word3 ... "}.

  * This task is also a pre-training task of the Donut model.
  * You can use our SynthDoG to generate synthetic images for the
    text reading task with proper gt_parse. See ./synthdog/README.md
    for details.
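
All four task formats share the same metadata.jsonl container, so a
single reader can serve them. The following sketch decodes one line
and normalizes the result to a list of parses. The function name
load_parses is hypothetical, and treating "either gt_parse or
gt_parses" as "exactly one of the two" is an assumption made here for
validation purposes.

```python
import json


def load_parses(metadata_line):
    """Return (file_name, parses) for one metadata.jsonl line; parses is always a list."""
    row = json.loads(metadata_line)
    # ground_truth is a JSON-dumped string, so it needs a second decode.
    ground_truth = json.loads(row["ground_truth"])
    has_single = "gt_parse" in ground_truth
    has_multi = "gt_parses" in ground_truth
    if has_single == has_multi:
        # Assumption: exactly one of the two keys should be present.
        raise ValueError("expected exactly one of gt_parse or gt_parses")
    if has_multi:
        parses = ground_truth["gt_parses"]
        if not isinstance(parses, list):
            raise ValueError("gt_parses must be a list")
    else:
        parses = [ground_truth["gt_parse"]]
    return row["file_name"], parses
```

Returning a list in both cases lets downstream code loop over parses
without branching on the task type.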

 Training

This is the configuration used in our experiment to train the Donut
model on the CORD dataset. We ran this with a single NVIDIA A100 GPU.

python train.py --config config/train_cord.yaml \
                --pretrained_model_name_or_path "naver-clova-ix/donut-base" \
                --dataset_name_or_paths '["naver-clova-ix/cord-v2"]' \
                --exp_version "test_experiment"
  .
  .
Prediction: <s_menu><s_nm>Lemon Tea (L)</s_nm><s_cnt>1</s_cnt><s_price>25.000</s_price></s_menu><s_total><s_total_price>25.000</s_total_price><s_cashprice>30.000</s_cashprice><s_changeprice>5.000</s_changeprice></s_total>
Answer: <s_menu><s_nm>Lemon Tea (L)</s_nm><s_cnt>1</s_cnt><s_price>25.000</s_price></s_menu><s_total><s_total_price>25.000</s_total_price><s_cashprice>30.000</s_cashprice><s_changeprice>5.000</s_changeprice></s_total>
Normed ED: 0.0
Prediction: <s_menu><s_nm>Hulk Topper Package</s_nm><s_cnt>1</s_cnt><s_price>100.000</s_price></s_menu><s_total><s_total_price>100.000</s_total_price><s_cashprice>100.000</s_cashprice><s_changeprice>0</s_changeprice></s_total>
Answer: <s_menu><s_nm>Hulk Topper Package</s_nm><s_cnt>1</s_cnt><s_price>100.000</s_price></s_menu><s_total><s_total_price>100.000</s_total_price><s_cashprice>100.000</s_cashprice><s_changeprice>0</s_changeprice></s_total>
Normed ED: 0.0
Prediction: <s_menu><s_nm>Giant Squid</s_nm><s_cnt>x 1</s_cnt><s_price>Rp. 39.000</s_price><s_sub><s_nm>C.Finishing - Cut</s_nm><s_price>Rp. 0</s_price><sep/><s_nm>B.Spicy Level - Extreme Hot Rp. 0</s_price></s_sub><sep/><s_nm>A.Flavour - Salt & Pepper</s_nm><s_price>Rp. 0</s_price></s_sub></s_menu><s_sub_total><s_subtotal_price>Rp. 39.000</s_subtotal_price></s_sub_total><s_total><s_total_price>Rp. 39.000</s_total_price><s_cashprice>Rp. 50.000</s_cashprice><s_changeprice>Rp. 11.000</s_changeprice></s_total>
Answer: <s_menu><s_nm>Giant Squid</s_nm><s_cnt>x1</s_cnt><s_price>Rp. 39.000</s_price><s_sub><s_nm>C.Finishing - Cut</s_nm><s_price>Rp. 0</s_price><sep/><s_nm>B.Spicy Level - Extreme Hot</s_nm><s_price>Rp. 0</s_price><sep/><s_nm>A.Flavour- Salt & Pepper</s_nm><s_price>Rp. 0</s_price></s_sub></s_menu><s_sub_total><s_subtotal_price>Rp. 39.000</s_subtotal_price></s_sub_total><s_total><s_total_price>Rp. 39.000</s_total_price><s_cashprice>Rp. 50.000</s_cashprice><s_changeprice>Rp. 11.000</s_changeprice></s_total>
Normed ED: 0.039603960396039604
Epoch 29: 100%|#############| 200/200 [01:49<00:00,  1.82it/s, loss=0.00327, exp_name=train_cord, exp_version=test_experiment]
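
The "Normed ED" values in the log above read as a normalized edit
distance between prediction and answer. As a rough sketch, assuming it
means the Levenshtein distance divided by the length of the longer
string (this README does not spell out the definition, and any
preprocessing applied before scoring is not reproduced here), it could
be computed like this. Both function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normed_edit_distance(prediction, answer):
    """Edit distance scaled to [0, 1] by the longer string's length (assumed definition)."""
    longest = max(len(prediction), len(answer))
    if longest == 0:
        return 0.0
    return edit_distance(prediction, answer) / longest
```

Under this definition, identical strings score 0.0, which is
consistent with the first two samples in the log.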

Some important arguments:

  * --config : config file path for model training.
  * --pretrained_model_name_or_path : string format, model name in
    Hugging Face modelhub or local path.
  * --dataset_name_or_paths : string format (json dumped), list of
    dataset names in Hugging Face datasets or local paths.
  * --result_path : file path to save model outputs/artifacts.
  * --exp_version : used for experiment versioning. The output files
    are saved at {result_path}/{exp_version}/*
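
Because --dataset_name_or_paths expects a JSON-dumped list, it can be
easier to build the command programmatically than to quote it by hand.
The sketch below assembles the argument list using only the flags
listed above; the helper name build_train_command is hypothetical.

```python
import json
import shlex


def build_train_command(config, pretrained, datasets, exp_version, result_path=None):
    """Return the train.py invocation as a list of arguments."""
    args = [
        "python", "train.py",
        "--config", config,
        "--pretrained_model_name_or_path", pretrained,
        # The dataset list is passed as one JSON-dumped string.
        "--dataset_name_or_paths", json.dumps(datasets),
        "--exp_version", exp_version,
    ]
    if result_path is not None:
        args += ["--result_path", result_path]
    return args


command = build_train_command(
    "config/train_cord.yaml",
    "naver-clova-ix/donut-base",
    ["naver-clova-ix/cord-v2"],
    "test_experiment",
)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can be handed to subprocess.run directly, which
avoids shell quoting issues around the JSON string.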

 Test

With the trained model, test images and ground truth parses, you can
get inference results and accuracy scores.

python test.py --dataset_name_or_path naver-clova-ix/cord-v2 --pretrained_model_name_or_path ./result/train_cord/test_experiment --save_path ./result/output.json
100%|#############| 100/100 [00:35<00:00,  2.80it/s]
Total number of samples: 100, Tree Edit Distance (TED) based accuracy score: 0.9129639764131697, F1 accuracy score: 0.8406020841373987

Some important arguments:

  * --dataset_name_or_path : string format, the target dataset name
    in Hugging Face datasets or local path.
  * --pretrained_model_name_or_path : string format, the model name
    in Hugging Face modelhub or local path.
  * --save_path: file path to save predictions and scores.

 How to Cite

If you find this work useful to you, please cite:

@inproceedings{kim2022donut,
  title     = {OCR-Free Document Understanding Transformer},
  author    = {Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}

 License

MIT license

Copyright (c) 2022-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
