https://github.com/deepdoctection/deepdoctection
A Document AI Package

deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models itself. Instead, it enables you to build pipelines using well-established libraries for object detection, OCR and selected NLP tasks, and it provides an integrated framework for fine-tuning, evaluating and running models. For more specific text processing tasks, use one of the many other great NLP libraries.
deepdoctection focuses on applications and is made for those who want to solve real-world problems related to document extraction from PDFs or scans in various image formats.

Check the demo of a document layout analysis pipeline with OCR on Hugging Face spaces.

Overview

deepdoctection provides model wrappers of supported libraries for various tasks to be integrated into pipelines. Its core function does not depend on any specific deep learning library. Selected models for the following tasks are currently supported:

* Document layout analysis including table recognition in Tensorflow with Tensorpack, or PyTorch with Detectron2,
* OCR with support of Tesseract, DocTr (Tensorflow and PyTorch implementations available) and a wrapper to an API for a commercial solution,
* Text mining for native PDFs with pdfplumber,
* Language detection with fastText,
* Deskewing and rotating images with jdeskew,
* Document and token classification with all LayoutLM models provided by the Transformers library. (Yes, you can use any LayoutLM model with any of the provided OCR or pdfplumber tools straight away!) Check the notebook repo or the documentation on how to train a model on your custom task or how to set up a pipeline.
* Table detection and table structure recognition with table-transformer. You can try a pipeline using this script.

On top of that, deepdoctection provides methods for pre-processing model inputs, such as cropping or resizing, and for post-processing results, such as validating duplicate outputs, relating words to detected layout segments or ordering words into contiguous text. The output is in JSON format, which you can customize further yourself.

Have a look at the introduction notebook in the notebook repo for an easy start.

Check the release notes for recent updates.
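To make the post-processing steps named in the overview more concrete (relating words to detected layout segments and ordering words into contiguous text), here is a minimal, library-independent sketch. It is not deepdoctection's implementation. The names `Word`, `coverage`, `assign_words` and `reading_order`, the box layout `(x_min, y_min, x_max, y_max)`, and the `threshold` and `line_tolerance` values are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Word:
    text: str
    box: Box


def coverage(inner: Box, outer: Box) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    width = max(0.0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    height = max(0.0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (width * height) / area if area > 0 else 0.0


def assign_words(words: List[Word], segments: Dict[str, Box],
                 threshold: float = 0.5) -> Dict[str, List[Word]]:
    """Attach each word to the segment covering most of it, if any."""
    assigned: Dict[str, List[Word]] = {name: [] for name in segments}
    if not segments:
        return assigned
    for word in words:
        best = max(segments, key=lambda name: coverage(word.box, segments[name]))
        # Words that no segment covers well enough stay unassigned.
        if coverage(word.box, segments[best]) >= threshold:
            assigned[best].append(word)
    return assigned


def reading_order(words: List[Word], line_tolerance: float = 5.0) -> str:
    """Group words into lines by their top edge, then read left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.box[1], w.box[0])):
        # Compare against the first word of the current line.
        if lines and abs(word.box[1] - lines[-1][0].box[1]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return " ".join(
        w.text for line in lines for w in sorted(line, key=lambda w: w.box[0])
    )


if __name__ == "__main__":
    words = [
        Word("world", (60, 10, 100, 20)),
        Word("Hello", (10, 12, 50, 22)),
        Word("Footer", (10, 200, 60, 210)),
    ]
    segments = {"title": (0, 0, 120, 30), "text": (0, 190, 120, 220)}
    by_segment = assign_words(words, segments)
    result = {name: reading_order(ws) for name, ws in by_segment.items()}
    print(json.dumps(result, indent=2))
```

The coverage ratio is measured against the word's own area rather than as a symmetric intersection-over-union, because a small word box inside a large segment would otherwise score close to zero.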
Models

deepdoctection or its support libraries provide pre-trained models that are in most cases available at the Hugging Face Model Hub or that will be downloaded automatically once requested. For instance, you can find pre-trained object detection models from the Tensorpack or Detectron2 framework for coarse layout analysis, table cell detection and table recognition.

Datasets and training scripts

Training is a substantial part of getting pipelines ready for a specific domain, be it document layout analysis, document classification or NER. deepdoctection provides training scripts for models that are based on trainers from the library that hosts the model code. Moreover, deepdoctection hosts code for some well-established datasets like Publaynet, which makes it easy to experiment. It also contains mappings from widely used data formats like COCO, and it has a dataset framework (akin to datasets) so that setting up training on a custom dataset becomes very easy.

This notebook shows you how to do this.

Evaluation

deepdoctection comes equipped with a framework that allows you to evaluate the predictions of a single model or of multiple models in a pipeline against some ground truth. Check again here how it is done.

Inference

Once a pipeline is set up, it takes only a few lines of code to instantiate it, and a for loop then processes all pages through the pipeline.
    import deepdoctection as dd
    from IPython.display import HTML
    from matplotlib import pyplot as plt

    # Instantiate the built-in analyzer, similar to the Hugging Face space demo
    analyzer = dd.get_dd_analyzer()

    df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline
    df.reset_state()  # trigger some initialization

    doc = iter(df)
    page = next(doc)  # process the first page

    # Visualize the detected layout on the page image
    image = page.viz()
    plt.figure(figsize=(25, 17))
    plt.axis('off')
    plt.imshow(image)

    # Render the first detected table as HTML (in a notebook)
    HTML(page.tables[0].html)

    # Print the page text in reading order
    print(page.get_text())

Documentation

Extensive documentation is available, containing tutorials, design concepts and the API. We want to present things as comprehensively and understandably as possible. However, we are aware that there are still many areas where significant improvements can be made in terms of clarity, grammar and correctness. We look forward to every hint and comment that increases the quality of the documentation.

Requirements

Everything listed below the deepdoctection layer in the overview diagram is a necessary requirement and has to be installed separately.

* Linux or macOS. (Windows is not supported, but there is a Dockerfile available.)
* Python >= 3.8
* PyTorch >= 1.8 or Tensorflow >= 2.9 and CUDA. If you want to run the models provided by Tensorpack, a GPU is required. You can run on PyTorch with a CPU only.
* deepdoctection uses Python wrappers for Poppler to convert PDF documents into images.
* With respect to the deep learning framework, you must decide between Tensorflow and PyTorch.
* The Tesseract OCR engine is used through a Python wrapper. The core engine has to be installed separately.

Installation

We recommend using a virtual environment. You can install the package via pip or from source. Bug fixes or enhancements will be deployed to PyPi every 4 to 6 weeks.
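Before installing, it can help to check which of the requirements listed above are already present. The following is a rough sketch using only the Python standard library and is not part of deepdoctection. The function name `check_environment` is hypothetical, and looking for the `tesseract` and `pdftoppm` executables on the PATH is only an assumed proxy for an installed Tesseract engine and Poppler.

```python
import importlib.util
import shutil
import sys
from typing import Dict, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 8)) -> Dict[str, bool]:
    """Report which prerequisites appear to be available on this machine."""
    report = {
        # Interpreter version, compared as (major, minor).
        "python_ok": sys.version_info[:2] >= min_python,
        # Deep learning frameworks: only checks that the package is importable,
        # not its version or whether CUDA is usable.
        "pytorch": importlib.util.find_spec("torch") is not None,
        "tensorflow": importlib.util.find_spec("tensorflow") is not None,
        # Assumption: the external tools are discoverable via these executables.
        "tesseract": shutil.which("tesseract") is not None,
        "poppler": shutil.which("pdftoppm") is not None,
    }
    # At least one of the two frameworks is needed to run a pipeline.
    report["dl_framework_ok"] = report["pytorch"] or report["tensorflow"]
    return report


if __name__ == "__main__":
    for name, available in check_environment().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

A missing entry does not necessarily mean the installation will fail, for example when a tool is installed outside the PATH, so treat the report as a hint rather than a verdict.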
Install with pip from PyPi

Depending on which deep learning library you have available, use one of the following installation options.

For Tensorflow, run

    pip install deepdoctection[tf]

For PyTorch, first install Detectron2 separately, as it is not distributed via PyPi. Check the instruction here. Then run

    pip install deepdoctection[pt]

This will install deepdoctection with all dependencies listed above the deepdoctection layer. Use this setting if you want to get started or want to explore all features.

If you want more control over your installation and are looking for fewer dependencies, install deepdoctection with the basic setup only.

    pip install deepdoctection

This will ignore all model libraries (layers above the deepdoctection layer in the diagram), and you will be responsible for installing them yourself. Note that you will not be able to run any pipeline with this setup.

For further information, please consult the full installation instructions.

Installation from source

Download the repository or clone it via

    git clone https://github.com/deepdoctection/deepdoctection.git

To get started with Tensorflow, run:

    cd deepdoctection
    pip install ".[tf]"

Installing the full PyTorch setup from source will also install Detectron2 for you:

    cd deepdoctection
    pip install ".[source-pt]"

Credits

We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible to develop this framework.

Problems

We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this repo and try to address them as quickly as possible.

If you like deepdoctection ...

...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.

License

Distributed under the Apache 2.0 License. Check LICENSE for additional information.