https://github.com/deepdoctection/deepdoctection

deepdoctection / deepdoctection Public


A Repo For Document AI

License

Apache-2.0 license
332 stars 31 forks
29 branches 12 tags
Latest commit

JaMe76, 164c64e, Apr 26, 2023
Merge pull request #141 from frivas-at-navteca/master

Adding HFLayoutLmv3TokenClassifier to the list of token classifiers

Git stats

  * 1,030 commits

Files

.github
configs
deepdoctection
docker
docs
scripts
tests
tests_d2
.gitignore
.readthedocs.yaml
LICENSE
Makefile
README.md
mkdocs.yml
requirements.txt
setup.cfg
setup.py
README.md


                        A Document AI Package

deepdoctection is a Python library that orchestrates document
extraction and document layout analysis tasks using deep learning
models. It does not implement models itself. Instead, it lets you build
pipelines from well-established libraries for object detection,
OCR and selected NLP tasks, and it provides an integrated framework for
fine-tuning, evaluating and running models. For more specific text
processing tasks, use one of the many other great NLP libraries.

deepdoctection focuses on applications and is made for those who want
to solve real world problems related to document extraction from PDFs
or scans in various image formats.

Check the demo of a document layout analysis pipeline with OCR on 
Hugging Face spaces.

 Overview

deepdoctection provides model wrappers of supported libraries for
various tasks to be integrated into pipelines. Its core function does
not depend on any specific deep learning library. Selected models for
the following tasks are currently supported:

  * Document layout analysis including table recognition in
    Tensorflow with Tensorpack, or PyTorch with Detectron2,
  * OCR with support for Tesseract, DocTr (Tensorflow and PyTorch
    implementations available) and a wrapper to an API for a
    commercial solution,
  * Text mining for native PDFs with pdfplumber,
  * Language detection with fastText,
  * Deskewing and rotating images with jdeskew,
  * Document and token classification with all LayoutLM models
    provided by the Transformers library. (Yes, you can use any
    LayoutLM model with any of the provided OCR or pdfplumber tools
    straight away!) Check the notebook repo or the documentation on
    how to train a model on your custom task or how to set up a
    pipeline,
  * Table detection and table structure recognition with
    table-transformer. You can try a pipeline using this script.

On top of that, deepdoctection provides methods for pre-processing
model inputs, such as cropping or resizing, and for post-processing
results, such as validating duplicate outputs, relating words to
detected layout segments or ordering words into contiguous text. The
output is in JSON format, which you can customize further.
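
One of the post-processing steps named above, relating words to detected layout segments, can be illustrated with a small sketch. This is not deepdoctection's implementation: the function names, the (x1, y1, x2, y2) box convention, the 0.5 coverage threshold and the naive top-to-bottom, left-to-right ordering are all assumptions made for illustration.

```python
def intersection_area(a, b):
    """Area of overlap between two boxes given as (x1, y1, x2, y2)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def assign_words_to_segments(words, segments, min_coverage=0.5):
    """Attach each word to the layout segment that covers most of it.

    words:    list of (text, box) tuples (hypothetical OCR output)
    segments: dict mapping a segment name to its box (hypothetical layout output)
    Returns a dict mapping each segment name to its words joined into text.
    """
    assigned = {name: [] for name in segments}
    for text, box in words:
        word_area = (box[2] - box[0]) * (box[3] - box[1])
        if word_area <= 0:
            continue  # skip degenerate boxes
        best_name, best_coverage = None, 0.0
        for name, segment_box in segments.items():
            coverage = intersection_area(box, segment_box) / word_area
            if coverage > best_coverage:
                best_name, best_coverage = name, coverage
        if best_name is not None and best_coverage >= min_coverage:
            # Keep (y, x) so words can be sorted top-to-bottom, left-to-right.
            assigned[best_name].append((box[1], box[0], text))
    return {
        name: " ".join(text for _, _, text in sorted(items))
        for name, items in assigned.items()
    }
```

Sorting by the top-left corner only works for cleanly aligned lines; real documents usually need a more tolerant line-grouping step.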

Have a look at the introduction notebook in the notebook repo for an
easy start.

Check the release notes for recent updates.

 Models

deepdoctection or its support libraries provide pre-trained models
that are in most cases available at the Hugging Face Model Hub
or that will be automatically downloaded once requested. For
instance, you can find pre-trained object detection models from the
Tensorpack or Detectron2 framework for coarse layout analysis, table
cell detection and table recognition.

 Datasets and training scripts

Training is a substantial part of getting pipelines ready for a
specific domain, be it document layout analysis, document
classification or NER. deepdoctection provides training scripts for
models. These scripts are based on the trainers of the library that
hosts the model code. Moreover, deepdoctection hosts code for some
well-established datasets like Publaynet, which makes it easy to
experiment. It also contains mappings from widely used data formats
like COCO, and it has a dataset framework (akin to datasets), so that
setting up training on a custom dataset becomes very easy. This
notebook shows you how to do this.
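
To give an idea of what a mapping from the COCO format involves, here is a minimal sketch in plain Python. It does not use deepdoctection's own mapping functions; `coco_to_records` and the shape of the returned records are hypothetical. The only fact relied on is that COCO stores bounding boxes as `[x, y, width, height]`.

```python
def coco_to_records(coco):
    """Group COCO annotations by image and convert boxes to corner format.

    coco: a dict with the standard COCO keys "images", "annotations"
          and "categories".
    Returns one record per image with boxes as [x1, y1, x2, y2].
    """
    category_names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    records = {
        image["id"]: {"file_name": image["file_name"], "annotations": []}
        for image in coco["images"]
    }
    for ann in coco["annotations"]:
        x, y, width, height = ann["bbox"]  # COCO convention: top-left corner plus size
        records[ann["image_id"]]["annotations"].append(
            {
                "category": category_names[ann["category_id"]],
                "box": [x, y, x + width, y + height],
            }
        )
    return list(records.values())
```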

 Evaluation

deepdoctection comes equipped with a framework that allows you to
evaluate the predictions of one or several models in a pipeline
against some ground truth. Check again here how it is done.

 Inference

Once a pipeline is set up, it takes only a few lines of code to
instantiate it, and a simple for loop processes all pages through
the pipeline.

import deepdoctection as dd
from IPython.core.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer similar to the Hugging Face space demo

df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline
df.reset_state()  # trigger some initialization

doc = iter(df)
page = next(doc)  # process the first page

# Visualize the detected layout of the page
image = page.viz()
plt.figure(figsize=(25, 17))
plt.axis('off')
plt.imshow(image)

# Render the first detected table as HTML (in a notebook)
HTML(page.tables[0].html)

# Print the extracted text of the page
print(page.get_text())
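
The for loop mentioned above can be sketched as follows. `collect_pages` is a hypothetical helper, not part of deepdoctection. It relies only on the calls shown in the snippet above (`analyze`, `reset_state`, `get_text`, `tables`, `html`) and takes the analyzer as an argument, so it can be tried without loading any models.

```python
def collect_pages(analyzer, path):
    """Run every page of a document through the pipeline and collect the results.

    analyzer: an object like the one returned by dd.get_dd_analyzer()
    path:     path to a PDF or a directory of scans
    """
    df = analyzer.analyze(path=path)  # set up the pipeline
    df.reset_state()                  # trigger initialization before iterating
    results = []
    for page in df:                   # each iteration processes one page
        results.append(
            {
                "text": page.get_text(),
                "tables": [table.html for table in page.tables],
            }
        )
    return results


# Usage, assuming deepdoctection and its model dependencies are installed:
#
#   import deepdoctection as dd
#   pages = collect_pages(dd.get_dd_analyzer(), "/path/to/your/doc.pdf")
#   print(pages[0]["text"])
```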

 Documentation

Extensive documentation is available, containing tutorials,
design concepts and the API. We want to present things as
comprehensively and understandably as possible. However, we are aware
that there are still many areas where significant improvements can be
made in terms of clarity, grammar and correctness. We look forward to
every hint and comment that increases the quality of the
documentation.

 Requirements

[Diagram: requirements overview]

Everything listed below the deepdoctection layer in the overview is a
necessary requirement and has to be installed separately.

  * Linux or macOS. (Windows is not supported but there is a
    Dockerfile available)
  * Python >= 3.8
  * PyTorch >= 1.8 or Tensorflow >= 2.9 and CUDA. If you want to run
    the models provided by Tensorpack a GPU is required. You can run
    on PyTorch with a CPU only.
  * deepdoctection uses Python wrappers for Poppler to convert PDF
    documents into images.
  * With respect to the Deep Learning framework, you must decide
    between Tensorflow and PyTorch.
  * Tesseract OCR engine will be used through a Python wrapper. The
    core engine has to be installed separately.
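
The checklist above can be turned into a quick environment check. This is a sketch using only the Python standard library. The helper names are hypothetical, looking for the `pdftoppm` and `tesseract` executables on the PATH is an assumption about how Poppler and Tesseract are installed, and preferring PyTorch when both frameworks are present is an arbitrary choice. Minimum framework versions and CUDA are not checked.

```python
import importlib.util
import shutil
import sys


def check_requirements():
    """Report which of the listed requirements appear to be available."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "torch": importlib.util.find_spec("torch") is not None,
        "tensorflow": importlib.util.find_spec("tensorflow") is not None,
        # Assumption: the Tesseract and Poppler command line tools are on PATH.
        "tesseract": shutil.which("tesseract") is not None,
        "poppler": shutil.which("pdftoppm") is not None,
    }


def suggest_extra(report):
    """Pick the pip extra ("pt" or "tf") matching the installed framework."""
    if report.get("torch"):
        return "pt"  # arbitrary preference when both frameworks are installed
    if report.get("tensorflow"):
        return "tf"
    return None


if __name__ == "__main__":
    report = check_requirements()
    for name, available in report.items():
        print(f"{name}: {'ok' if available else 'missing'}")
    extra = suggest_extra(report)
    if extra:
        print(f"Suggested install: pip install deepdoctection[{extra}]")
```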

 Installation

We recommend using a virtual environment. You can install the package
via pip or from source. Bug fixes or enhancements will be deployed to
PyPi every 4 to 6 weeks.

 Install with pip from PyPi

Depending on which Deep Learning library you have available, use the
following installation option:

For Tensorflow, run

pip install deepdoctection[tf]

For PyTorch,

first install Detectron2 separately as it is not distributed via
PyPi. Check the instruction here. Then run

pip install deepdoctection[pt]

This will install deepdoctection with all dependencies listed above
the deepdoctection layer. Use this setting if you want to get
started or want to explore all features.

If you want more control over your installation and are
looking for fewer dependencies, install deepdoctection with the
basic setup only.

pip install deepdoctection

This will ignore all model libraries (layers above the deepdoctection
layer in the diagram), and you will be responsible for installing them
yourself. Note that you will not be able to run any pipeline with
this setup.

For further information, please consult the full installation
instructions.

 Installation from source

Download the repository or clone via

git clone https://github.com/deepdoctection/deepdoctection.git

To get started with Tensorflow, run:

cd deepdoctection
pip install ".[tf]"

Installing the full PyTorch setup from source will also install
Detectron2 for you:

cd deepdoctection
pip install ".[source-pt]"

 Credits

We thank all libraries that provide high quality code and pre-trained
models. Without them, it would have been impossible to develop this
framework.

 Problems

We try hard to eliminate bugs. We also know that the code is not free
of issues. We welcome all issues relevant to this repo and try to
address them as quickly as possible.

 If you like deepdoctection ...

...you can easily support the project by making it more visible.
Leaving a star or a recommendation will help.

 License

Distributed under the Apache 2.0 License. Check LICENSE for
additional information.


Topics

python nlp ocr tensorflow pytorch document-parser 
document-layout-analysis table-recognition table-detection 
document-understanding publaynet layoutlm document-ai 
document-image-analysis pubtabnet


Watchers

3 watching


Releases 12

 
v.0.22 Add support for W&B, some new attributes for Image and Page
and small bug fixes Latest
Mar 23, 2023
+ 11 releases

Contributors 4

  * @JaMe76 JaMe76 Janis Meyer
  * @LightAllWorld LightAllWorld
  * @dependabot[bot] dependabot[bot]
  * @frivas-at-navteca frivas-at-navteca Francisco Rivas

Languages

  * Python 99.4%
  * Other 0.6%
