https://github.com/clovaai/donut
Donut : Document Understanding Transformer
Official Implementation of Donut and SynthDoG | Paper | Slide |
Poster
Introduction
Donut, Document understanding transformer, is a new method of
document understanding that utilizes an OCR-free end-to-end
Transformer model. Donut does not require off-the-shelf OCR engines/
APIs, yet it shows state-of-the-art performance on various visual
document understanding tasks, such as visual document classification
or information extraction (a.k.a. document parsing). In addition, we
present SynthDoG, a Synthetic Document Generator, which helps make
the model's pre-training flexible across various languages and domains.
Our academic paper, which describes our method in detail and provides
full experimental results and analyses, can be found here:
OCR-free Document Understanding Transformer (arxiv.org/abs/2111.15664).
Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung
Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han,
Seunghyun Park. In ECCV 2022.
Pre-trained Models and Web Demos
Gradio web demos are available!
* You can run the demo locally via the ./app.py file.
* Sample images are available at ./misc, and more receipt images are
available at the CORD dataset link.
* Web demos are available from the links in the following table.
| Task | Sec/Img | Score | Trained Model | Demo |
| --- | --- | --- | --- | --- |
| CORD (Document Parsing) | 0.7 | 91.3 | donut-base-finetuned-cord-v2 (1280) | gradio space web demo, google colab demo |
| CORD (Document Parsing) | 0.7 | 91.1 | donut-base-finetuned-cord-v1 (1280) | gradio space web demo, google colab demo |
| CORD (Document Parsing) | 1.2 | 90.9 | donut-base-finetuned-cord-v1-2560 | gradio space web demo, google colab demo |
| Train Ticket (Document Parsing) | 0.6 | 98.7 | donut-base-finetuned-zhtrainticket | google colab demo |
| RVL-CDIP (Document Classification) | 0.75 | 95.3 | donut-base-finetuned-rvlcdip | gradio space web demo, google colab demo |
| DocVQA Task1 (Document VQA) | 0.78 | 67.5 | donut-base-finetuned-docvqa | gradio space web demo, google colab demo |
The links to the pre-trained backbones are here:
* donut-base: trained with 64 A100 GPUs (~2.5 days), number of
layers (encoder: {2,2,14,2}, decoder: 4), input size 2560x1920,
swin window size 10, IIT-CDIP (11M) and SynthDoG (English,
Chinese, Japanese, Korean, 0.5M x 4).
* donut-proto: (preliminary model) trained with 8 V100 GPUs (~5
days), number of layers (encoder: {2,2,18,2}, decoder: 4), input
size 2048x1536, swin window size 8, and SynthDoG (English,
Japanese, Korean, 0.4M x 3).
Please see our paper for more details.
SynthDoG datasets
The links to the SynthDoG-generated datasets are here:
* synthdog-en: English, 0.5M.
* synthdog-zh: Chinese, 0.5M.
* synthdog-ja: Japanese, 0.5M.
* synthdog-ko: Korean, 0.5M.
To generate synthetic datasets with our SynthDoG, please see
./synthdog/README.md and our paper for details.
Updates
2022-11-14 New version 1.0.9 is released (pip install donut-python
--upgrade). See the 1.0.9 Release Notes.
2022-08-12 Donut is also available at huggingface/transformers
(contributed by @NielsRogge); donut-python loads the pre-trained
weights from the official branch of the model repositories. See the
1.0.5 Release Notes. A loading sketch via transformers follows below.
2022-08-05 A well-executed hands-on tutorial on Donut was published
at Towards Data Science (written by @estaudere).
2022-07-20 First commit. We release our code, model weights,
synthetic data, and generator.
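
Since Donut is integrated into huggingface/transformers, a fine-tuned
checkpoint can also be loaded without this repository. The following is a
minimal sketch, assuming the transformers DonutProcessor and
VisionEncoderDecoderModel classes and the CORD checkpoint from the table
above; the image path is a placeholder.

import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("path/to/receipt.png").convert("RGB")  # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is conditioned on a task-specific start prompt.
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)
sequence = processor.batch_decode(outputs)[0]
# Strip special tokens and the task start token before converting to JSON.
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))  # decoded token sequence -> JSON parse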
Software installation
pip install donut-python
or clone this repository and install the dependencies:
git clone https://github.com/clovaai/donut.git
cd donut/
conda create -n donut_official python=3.7
conda activate donut_official
pip install .
We tested donut with:
* torch == 1.11.0+cu113
* torchvision == 0.12.0+cu113
* pytorch-lightning == 1.6.4
* transformers == 4.11.3
* timm == 0.5.4
Getting Started
Data
This repository assumes the following dataset structure:
> tree dataset_name
dataset_name
+-- test
| +-- metadata.jsonl
| +-- {image_path0}
| +-- {image_path1}
| .
| .
+-- train
| +-- metadata.jsonl
| +-- {image_path0}
| +-- {image_path1}
| .
| .
+-- validation
+-- metadata.jsonl
+-- {image_path0}
+-- {image_path1}
.
.
> cat dataset_name/test/metadata.jsonl
{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
.
.
* The metadata.jsonl file is in JSON Lines text format, i.e.,
.jsonl. Each line consists of:
+ file_name : the relative path to the image file.
+ ground_truth : a string (JSON dumped); the dictionary
contains either gt_parse or gt_parses. Other fields
(metadata) can be added to the dictionary but will not be
used.
* Donut interprets all tasks as a JSON prediction problem. As a
result, all Donut model training shares the same pipeline. For
training and inference, the only thing to do is to prepare
gt_parse or gt_parses for the task in the format described below
(see the sketch after this list).
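
For illustration, here is a minimal sketch of writing a metadata.jsonl
file in the expected format; the file names and parse contents below are
hypothetical placeholders, not files from this repository.

import json

# Hypothetical samples: (relative image path, gt_parse dictionary).
samples = [
    ("images/receipt_0.png", {"menu": [{"nm": "ICE BLACKCOFFEE", "cnt": "2"}]}),
    ("images/receipt_1.png", {"menu": [{"nm": "LEMON TEA (L)", "cnt": "1"}]}),
]

with open("dataset_name/train/metadata.jsonl", "w") as f:
    for file_name, gt_parse in samples:
        line = {
            "file_name": file_name,  # relative path to the image file
            # ground_truth is a JSON-dumped string containing gt_parse
            # (or gt_parses); extra metadata fields would be ignored.
            "ground_truth": json.dumps({"gt_parse": gt_parse}),
        }
        f.write(json.dumps(line) + "\n")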
For Document Classification
The gt_parse follows the format of {"class" : {class_name}}, for
example, {"class" : "scientific_report"} or {"class" :
"presentation"}.
* Google colab demo is available here.
* Gradio web demo is available here.
For Document Information Extraction
The gt_parse is a JSON object that contains the full information of
the document image; for example, the JSON object for a receipt may
look like {"menu" : [{"nm": "ICE BLACKCOFFEE", "cnt": "2", ...}, ...],
...}.
* More examples are available at CORD dataset.
* Google colab demo is available here.
* Gradio web demo is available here.
For Document Visual Question Answering
The gt_parses follows the format of [{"question" :
{question_sentence}, "answer" : {answer_candidate_1}}, {"question" :
{question_sentence}, "answer" : {answer_candidate_2}}, ...], for
example, [{"question" : "what is the model name?", "answer" :
"donut"}, {"question" : "what is the model name?", "answer" :
"document understanding transformer"}].
* DocVQA Task1 has multiple answers, hence gt_parses should be a
list of dictionaries, each containing a question and answer pair
(a sketch follows this list).
* Google colab demo is available here.
* Gradio web demo is available here.
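
For instance, a single DocVQA metadata line can be constructed as below;
the file name is a hypothetical placeholder and the question/answers
repeat the example above.

import json

# Multiple acceptable answers to one question become multiple
# {"question", "answer"} pairs inside gt_parses.
gt_parses = [
    {"question": "what is the model name?", "answer": "donut"},
    {"question": "what is the model name?", "answer": "document understanding transformer"},
]
line = {
    "file_name": "images/docvqa_0.png",  # hypothetical placeholder
    "ground_truth": json.dumps({"gt_parses": gt_parses}),
}
print(json.dumps(line))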
For (Pseudo) Text Reading Task
The gt_parse looks like {"text_sequence" : "word1 word2 word3 ... "}
* This task is also a pre-training task of the Donut model.
* You can use our SynthDoG to generate synthetic images for the
text reading task with proper gt_parse. See ./synthdog/README.md
for details.
Training
This is the configuration used in our experiments for Donut model
training on the CORD dataset. We ran this with a single NVIDIA A100 GPU.
python train.py --config config/train_cord.yaml \
--pretrained_model_name_or_path "naver-clova-ix/donut-base" \
--dataset_name_or_paths '["naver-clova-ix/cord-v2"]' \
--exp_version "test_experiment"
.
.
Prediction: Lemon Tea (L)125.00025.00030.0005.000
Answer: Lemon Tea (L)125.00025.00030.0005.000
Normed ED: 0.0
Prediction: Hulk Topper Package1100.000100.000100.0000
Answer: Hulk Topper Package1100.000100.000100.0000
Normed ED: 0.0
Prediction: Giant Squidx 1Rp. 39.000C.Finishing - CutRp. 0B.Spicy Level - Extreme Hot Rp. 0A.Flavour - Salt & PepperRp. 0Rp. 39.000Rp. 39.000Rp. 50.000Rp. 11.000
Answer: Giant Squidx1Rp. 39.000C.Finishing - CutRp. 0B.Spicy Level - Extreme HotRp. 0A.Flavour- Salt & PepperRp. 0Rp. 39.000Rp. 39.000Rp. 50.000Rp. 11.000
Normed ED: 0.039603960396039604
Epoch 29: 100%|#############| 200/200 [01:49<00:00, 1.82it/s, loss=0.00327, exp_name=train_cord, exp_version=test_experiment]
Some important arguments:
* --config : config file path for model training.
* --pretrained_model_name_or_path : string format, model name in
Hugging Face modelhub or local path.
* --dataset_name_or_paths : string format (json dumped), list of
dataset names in Hugging Face datasets or local paths.
* --result_path : file path to save model outputs/artifacts.
* --exp_version : used for experiment versioning. The output files
are saved at {result_path}/{exp_version}/*
Test
With a trained model, test images, and ground truth parses, you can
get inference results and accuracy scores (a single-image inference
sketch follows the argument list below).
python test.py --dataset_name_or_path naver-clova-ix/cord-v2 --pretrained_model_name_or_path ./result/train_cord/test_experiment --save_path ./result/output.json
100%|#############| 100/100 [00:35<00:00, 2.80it/s]
Total number of samples: 100, Tree Edit Distance (TED) based accuracy score: 0.9129639764131697, F1 accuracy score: 0.8406020841373987
Some important arguments:
* --dataset_name_or_path : string format, the target dataset name
in Hugging Face datasets or local path.
* --pretrained_model_name_or_path : string format, the model name
in Hugging Face modelhub or local path.
* --save_path: file path to save predictions and scores.
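
Beyond batch evaluation with test.py, a trained checkpoint can be used
for one-off inference. The following is a minimal sketch following the
pattern in test.py and app.py; the image path is a placeholder, and the
<s_cord-v2> prompt assumes the CORD fine-tuning above (other tasks use
their own start token).

import torch
from PIL import Image
from donut import DonutModel

model = DonutModel.from_pretrained("./result/train_cord/test_experiment")
if torch.cuda.is_available():
    model.half()
    model.to("cuda")
model.eval()

image = Image.open("path/to/receipt.png").convert("RGB")  # placeholder path
# The prompt token selects the task the model was fine-tuned on.
output = model.inference(image=image, prompt="<s_cord-v2>")
print(output["predictions"][0])  # predicted JSON parse of the document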
How to Cite
If you find this work useful to you, please cite:
@inproceedings{kim2022donut,
title = {OCR-Free Document Understanding Transformer},
author = {Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2022}
}
License
MIT license
Copyright (c) 2022-present NAVER Corp.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.