[HN Gopher] Show HN: Morphik - Open-source RAG that understands ...
___________________________________________________________________
Show HN: Morphik - Open-source RAG that understands PDF images,
runs locally
Hey HN, we're Adi and Arnav. A few months ago, we hit a wall trying
to get LLMs to answer questions over research papers and instruction
manuals. Everything worked fine until the answer lived inside an
image or diagram embedded in the PDF. Even GPT-4o flubbed it (we
recently tried o3 with the same task, and surprisingly it flubbed it
too). Naive RAG pipelines just pulled in some text chunks and
ignored the rest.

We took an invention disclosure PDF
(https://drive.google.com/file/d/1ySzQgbNZkC5dPLtE3pnnVL2rW_9...)
containing an IRR-vs-frequency graph and asked GPT "From the graph,
at what frequency is the IRR maximized?". We originally tried this
on GPT-4o, but while writing this post used the new natively
multimodal model o4-mini-high. After a 30-second thinking pause, it
asked for clarifications, then churned out buggy code, pulled data
from the wrong page, and still couldn't answer the question. We
wrote up the full story with screenshots here:
https://docs.morphik.ai/blogs/gpt-vs-morphik-multimodal

We got frustrated enough to try fixing it ourselves. We built
Morphik to do multimodal retrieval over documents like PDFs, where
images and diagrams matter as much as the text. To do this, we use
ColPali-style embeddings, which treat each document page as an image
and generate multi-vector representations. These embeddings capture
layout, typography, and visual context, allowing retrieval to pull
back a whole table or schematic, not just nearby tokens. Combined
with vector search, this retrieves the exact pages with the relevant
diagrams and passes them as images to the LLM. With that pipeline,
an 8B Llama 3.1 vision model running locally can answer the
question!
"Which EGFR inhibitors at 50 mg showed >= 30% tumor reduction?" We
correctly returned the right tables and plots, but still hit a
bottleneck, we weren't able to join the dots across multiple
reports. So we built a knowledge graph: we tag entities in both
text and images, normalize synonyms (Erlotinib - EGFR inhibitor),
infer relations (e.g. administered_at, yields_reduction), and
stitch everything into a graph. Now a single query could traverse
that graph across documents and surface a coherent, cross-document
answer along with the correct pages as images. To illustrate that,
and just for fun, we built a graph of 100 Paul Graham's essays
here: https://pggraph.streamlit.app/ You can search for various
nodes, (eg. startup, sam altman, paul graham and see corresponding
connections). In our system, we create graphs and store the
relevant text chunks along with the entities, so on querying, we
can extract the relevant entity, do a search on the graph and pull
in the text chunks of all connected nodes, improving cross document
queries. For longer or multi-turn queries, we added persistent KV
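Here is a minimal sketch of that expansion step using networkx; the
entities, relations, and chunk ids are made up for illustration:

    import networkx as nx

    g = nx.DiGraph()
    # Nodes carry the ids of the text chunks they were extracted
    # from; edges carry the inferred relation.
    g.add_node("Erlotinib", chunks=["chunk_12", "chunk_47"])
    g.add_node("EGFR inhibitor", chunks=["chunk_03"])
    g.add_node("30% tumor reduction", chunks=["chunk_51"])
    g.add_edge("Erlotinib", "EGFR inhibitor", relation="is_a")
    g.add_edge("Erlotinib", "30% tumor reduction",
               relation="yields_reduction")

    def expand(entity, hops=1):
        # Pull in the chunks of every node within `hops` of the
        # entity, regardless of edge direction.
        nearby = nx.single_source_shortest_path_length(
            g.to_undirected(), entity, cutoff=hops)
        return [c for n in nearby
                for c in g.nodes[n].get("chunks", [])]

    print(expand("Erlotinib"))  # chunks from Erlotinib + neighbors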
For longer or multi-turn queries, we added persistent KV caching,
which stores the intermediate key-value states from the
transformer's attention layers. Instead of recomputing attention
over the shared context from scratch on every query, we reuse those
cached states, speeding up repeated queries and letting us handle
much longer context windows.
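The mechanism, sketched with Hugging Face transformers (gpt2 is just
a stand-in; the persistent version stores the same states durably):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Pay the attention cost of the long shared prefix exactly once.
    prefix = tok("<long document text>", return_tensors="pt")
    with torch.no_grad():
        out = model(prefix.input_ids, use_cache=True)
    cache = out.past_key_values  # per-layer key/value states

    # A follow-up query reuses the cached prefix instead of
    # re-encoding the whole document.
    query = tok(" Q: where is IRR maximized?", return_tensors="pt")
    with torch.no_grad():
        out = model(query.input_ids, past_key_values=cache,
                    use_cache=True)
    next_token = out.logits[0, -1].argmax()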
We're open-source under the MIT Expat license:
https://github.com/morphik-org/morphik-core

Would love to hear your RAG horror stories, what worked, what
didn't, and any feedback on Morphik. We're here for it.
Author : Adityav369
Score : 98 points
Date : 2025-04-22 16:18 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| DavidPP wrote:
| I'm currently building an internal tool using SurrealDB directly,
| but I'm curious to try Morphik since it implements features I
| haven't had time to figure out yet. (For example, I started with
| hardcoded schemas, and I like how you support both fixed and
| dynamic ones.)
|
| Minor nitpick, but the README for your ui-component project under
| ee says:
|
| "License This project is part of Morphik and is licensed under
| the MIT License."
|
| However, your ee folder has an "enterprise" license, not the MIT
| license.
| Adityav369 wrote:
| Thanks for pointing that out! Fixed it.
|
| For the metadata extraction, we save the extracted fields as a
| Column(JSONB) on each document, which allows the metadata schema
| to be changed on the fly; a sketch is below.
|
| Although I keep wondering if it would have been better to use
| something like MongoDB for this part, just because it's more
| natural.
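|
| A minimal sketch of that column with SQLAlchemy (the model and
| field names are illustrative, not Morphik's actual schema):
|
|     from sqlalchemy import Column, Integer, String
|     from sqlalchemy.dialects.postgresql import JSONB
|     from sqlalchemy.orm import declarative_base
|
|     Base = declarative_base()
|
|     class Document(Base):
|         __tablename__ = "documents"
|         id = Column(Integer, primary_key=True)
|         filename = Column(String)
|         # Arbitrary per-document metadata; no migration needed
|         # when new keys appear at ingestion time.
|         doc_metadata = Column(JSONB, default=dict)
|
|     # Postgres can then filter on JSONB keys directly, e.g.:
|     # session.query(Document).filter(
|     #     Document.doc_metadata["category"].astext == "pharma")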
|
| Please let me know if you have questions and how it works out
| for you.
| trollbridge wrote:
| If it's MIT open source, what does the paid part apply to?
| Adityav369 wrote:
| The paid part applies to the ui-component, which provides a chat
| user interface. The core code, SDK, and API are all under the MIT
| license.
| Imanari wrote:
| Looks really nice! How does it handle tables?
| Adityav369 wrote:
| We have two ingestion pathways: 1. regular OCR + text embeddings;
| 2. ColPali. We've observed that ColPali does a much better job
| with tables, since it can encode positional information and layout
| as well.
| th0ma5 wrote:
| Whenever I ask people who want to use such features at scale which
| figure could be out of place or have a transposed digit, it
| generally makes the project evaporate.
| MitPitt wrote:
| Should I use this if I don't plan on working with pdfs? What's
| the best RAG currently?
| Adityav369 wrote:
| Depends on your document types.
|
| If you're working with plain .txt files, then plain RAG built on
| top of any vector database can suffice, depending on your queries
| (if they directly reference the text, or can be made to, then
| similarity search is good enough). If your queries are cross-
| document, retrieving a higher number of chunks with plain RAG
| might also do a good job.
|
| If you have tables, images, etc., then using a better extraction
| mechanism (maybe unstructured, or other document processors) and
| then creating the embeddings can also work well.
|
| I'd say if your docs are simple, then just building your own
| pipeline on top of a vector db is good! A sketch of that plain
| pipeline is below.
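|
| A minimal sketch of that kind of pipeline with chromadb (the
| collection name and documents are just examples):
|
|     import chromadb
|
|     client = chromadb.Client()
|     col = client.create_collection("docs")
|
|     # Ingest: chroma embeds the chunks with its default
|     # embedding function.
|     col.add(
|         ids=["c1", "c2"],
|         documents=["Erlotinib is an EGFR inhibitor.",
|                    "The trial ran at 50 mg daily."],
|     )
|
|     # Retrieve: similarity search over the chunks, then feed the
|     # top hits to an LLM as context.
|     hits = col.query(query_texts=["Which EGFR inhibitors?"],
|                      n_results=2)
|     print(hits["documents"])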
| Alifatisk wrote:
| How could I extract rectangles from PDF and then do something
| like this?
| Adityav369 wrote:
| Do you mean ingesting the extracted rectangles/bounding boxes?
| We're actually working on bounding boxes; this is a good insight
| and we can add it to the product. However, the way we ingest is
| literally converting each page to an image and then embedding
| that, so the text, layout, and diagrams are all encoded in. Would
| like to know what the exact use case is, so we can help you
| better.
| codegeek wrote:
| "We're open-source under the MIT Expat license"
|
| Not quite. You should clarify a bit more. The README has this
| about their license.
|
| "Certain features - such as Morphik Console - are not available
| in the open-source version. Any feature in the ee namespace is
| not available in the open-source version and carries a different
| license. Any feature outside that is open source under the MIT
| expat license."
| Adityav369 wrote:
| Thanks, we should have been clearer. The part in ee is our UI,
| which can be used for testing or in dev environments. The main
| code, including the API, SDK, and the entire backend logic, is MIT
| Expat.
| thot_experiment wrote:
| I'd love to have something like this but calling a cloud is a no-
| go for me. I have a half baked tool that a friend of mine and I
| applied to the Mozilla Builders Grant with (didn't get in), it's
| janky and I don't have time to work on it right now but it does
| the thing. I also find myself using OpenWebUI's context RAG stuff
| sometimes but I'd really like to have a way to dump all of my
| private documents into a DB and have search/RAG work against them
| locally, preferably in a way that's agnostic of the LLM backend.
|
| Does such a project exist?
| Adityav369 wrote:
| You can run this fully locally using Ollama for inference,
| although you'll need larger models and a beefy machine for great
| results. On my end Llama 3.2 8B does a good job on technical docs,
| but the bigger the better lol.
| w10-1 wrote:
| The architecture sounds very, very promising. Normalizing
| entities and relations to put in a graph for RAG sounds great.
| (I'm still a bit unclear on ingesting or updating existing
| graphs.)
|
| Curious about the suitability of this for PDFs that are conference
| presentation slides vs. academic papers. Is this sensitive or
| tunable to such distinctions?
|
| Looking for tests/validation; are they all in the evaluation
| folder? A Pharma example would be great.
|
| Thank you for documenting the telemetry. I appreciate the ee
| commercialization dance :)
| Adityav369 wrote:
| For ingesting graphs, you can define a filter or certain document
| ids. When updating, we look at whether any other docs have been
| added matching that filter (or you can specify new doc ids). We
| then do entity and relationship extraction again, and do entity
| resolution against the existing graph to merge the two.
|
| Creating graphs and entity resolution are both tunable with
| overrides: you can specify domain-specific prompts and overrides
| (will add a pharma example!)
| (https://docs.morphik.ai/python-sdk/create_graph#parameters). I
| tried to add code here, but it was formatting badly, sorry for the
| redirect.
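|
| Roughly what the call looks like with the Python SDK (parameter
| names here are from memory, check the linked docs for the exact
| signature):
|
|     from morphik import Morphik
|
|     db = Morphik("morphik://...")  # your connection URI
|
|     # Build a graph from every doc matching a metadata filter;
|     # prompt overrides for extraction and resolution are also
|     # supported per the docs.
|     graph = db.create_graph(
|         name="pharma_graph",
|         filters={"category": "pharma"},
|     )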
| jkc101 wrote:
| Looks cool! What are the compute requirements or recommendations
| for self-hosting Morphik? What are the scaling limits? Can you
| provide a sense for latencies for ingestion and retrieval as the
| index size grows?
| Adityav369 wrote:
| Depending on the use case, it happily runs on my MacBook Air M2
| (16GB RAM, using MPS) for small PDFs, and searching over 100-150
| documents with ColPali takes 2-ish minutes. Very rough numbers.
| Ingestion takes around 15-20 seconds a page, which is on the
| slower end. On an A100, ingestion with ColPali takes 4-5 seconds
| per page (we haven't performance-optimized, or tuned batch sizes,
| yet tho). Without ColPali it is much faster. Ingestion speed
| doesn't change much as the index grows.
|
| I'd be happy to report back after some testing, we are looking
| to optimize more of this soon, as speed is somewhat of a
| missing piece at the moment.
| breadislove wrote:
| I uploaded a file and it's been processing for over an hour now.
| No failure or anything. Maybe you should look into that.
| Adityav369 wrote:
| Yeah, we had an overload on the ingestion queue. If you try again,
| it will be much faster, as we just moved to a beefier machine.
| (The previous ingestion will still complete, since it's in the
| queue, but new ones will be faster.)
___________________________________________________________________
(page generated 2025-04-22 23:00 UTC)