[HN Gopher] RAGFlow is an open-source RAG engine based on OCR and document parsing
       ___________________________________________________________________
        
       RAGFlow is an open-source RAG engine based on OCR and document
       parsing
        
       Author : marban
       Score  : 80 points
       Date   : 2024-04-01 17:50 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | esafak wrote:
       | Apparently "deep document understanding" refers to OCR and
       | structured document parsing:
       | https://github.com/infiniflow/ragflow/blob/main/deepdoc/READ...
       | 
       | Since "deep document understanding" is not a term of art, I would
       | have just said "OCR and document parsing".
       | 
       | How well does it work? Please include benchmarks. You may be
       | interested in
       | 
       | https://paperswithcode.com/sota/optical-character-recognitio...
       | 
       | https://paperswithcode.com/task/document-layout-analysis
       | 
       | The models seem to be closed source, hosted here:
       | https://huggingface.co/InfiniFlow/deepdoc
        
         | dang wrote:
         | Ok we've taken deep document understanding out of the title
         | above. Thanks!
        
         | kergonath wrote:
         | I am curious about the performance of their OCR and layout and
         | table detection. Hopefully it's on par with Amazon, Google, or
         | Microsoft's tools.
        
       | gardenfelder wrote:
        | It seems to be limited to certain LLM servers, one of which is
        | OpenAI; none of the options includes e.g. Mistral or other
        | popular OSS LLMs.
       | 
       | I wonder if that will change - eventually.
       | 
       | Discord channels are named in Chinese, though there are English
       | posts.
        
         | shekhar101 wrote:
          | It's trivial to run a proxy server that routes all OpenAI
          | calls to another LLM, even local ones. See litellm-proxy.
        
         | bschmidt1 wrote:
         | I see a `LocalLLM` chat model where it looks like you can pass
         | a host/port (for example, ollama's)
        
       | bschmidt1 wrote:
       | Is there a JavaScript library? Both LlamaIndex and Langchain have
       | nice JS/TS packages on npm. Could thinly wrap a JS client around
       | this Python API but the community aspect of having an official
       | library is nice.
       | 
       | Also might be helpful to have a simple example on the README
       | showing how to fetch a document and start querying it. I would
       | try it!
        
       | NKosmatos wrote:
        | If only they supported local LLMs out of the box. I have a very
        | specific use case, but it needs to run locally, offline only.
        | Any suggestions/recommendations from fellow HN users are more
        | than welcome :-)
        
       | mpeg wrote:
       | Took me some time to figure out how to run it, but the layout
       | recogniser model hosted on huggingface is pretty good!
       | 
        | It correctly identifies tables that even paid services like the
        | AWS Textract Document Analysis API fail on - for instance,
        | single-column tables, which often confuse AWS even when they
        | have a clear header and are labelled "Table" in the text.
       | 
        | I would, however, love to know broadly what kind of documents it
        | was trained on, as my results could be pure luck; it's hard to
        | say without a proper benchmark.
       | 
        | Very nice layout recognition, although I can't quite comment on
        | the RAG performance itself. I think some of the architecture
        | decisions are odd: for example, it mixes a bunch of different
        | PDF parsers, which will all produce output of different quality,
        | and it's not clear to me which one it defaults to, as the
        | default seems to differ across places in the code (the simple
        | parser defaults to PyPDF2, which is not a great option).
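
One common way to avoid the "which parser am I actually using?" problem the commenter describes is to put every parser behind a single interface with the default declared in exactly one place. A minimal sketch of that pattern (the parser names and bodies are placeholders, not RAGFlow's code):

```python
# Sketch: a registry of interchangeable PDF parsers with one explicit default,
# so the chosen backend is visible in a single place instead of scattered
# through the codebase. Parser implementations here are trivial placeholders.
from typing import Callable, Dict

# Maps a parser name to a function from raw bytes to extracted text.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register(name: str):
    """Decorator that adds a parser function to the registry under `name`."""
    def wrap(fn: Callable[[bytes], str]):
        PARSERS[name] = fn
        return fn
    return wrap


@register("simple")
def simple_parser(data: bytes) -> str:
    # Placeholder for e.g. a PyPDF2-based text extractor.
    return data.decode("utf-8", errors="replace")


@register("layout")
def layout_parser(data: bytes) -> str:
    # Placeholder for a layout-aware, model-based parser.
    return data.decode("utf-8", errors="replace").upper()


DEFAULT_PARSER = "layout"  # the one place the default is chosen


def parse_pdf(data: bytes, parser: str = DEFAULT_PARSER) -> str:
    """Parse with the named backend; callers can override per call."""
    return PARSERS[parser](data)
```

With this shape, every call site either uses the single documented default or names its backend explicitly, so output quality differences between parsers are at least traceable.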
        
       ___________________________________________________________________
       (page generated 2024-04-01 23:00 UTC)