[HN Gopher] Show HN: R2R V2 - An open source RAG engine with prod...
       ___________________________________________________________________
        
       Show HN: R2R V2 - An open source RAG engine with prod features
        
       Hi HN! We're building R2R [https://github.com/SciPhi-AI/R2R], an
       open source RAG answer engine that is built on top of
       Postgres+Neo4j. The best way to get started is with the docs -
       https://r2r-docs.sciphi.ai/introduction.  This is a major update
       from our V1 which we have spent the last 3 months intensely
       building after getting a ton of great feedback from our first Show
       HN (https://news.ycombinator.com/item?id=39510874). We changed our
       focus to building a RAG engine instead of a framework, because this
       is what developers asked for the most. To us this distinction meant
       working on an opinionated system instead of layers of abstractions
       over providers. We built features for multimodal data ingestion,
       hybrid search with reranking, advanced RAG techniques (e.g. HyDE),
       automatic knowledge graph construction alongside the original goal
       of an observable RAG system built on top of a RESTful API that we
       shared back in February.  What's the problem? Developers are
       struggling to build accurate, reliable RAG solutions. Popular tools
       like LangChain are complex, overly abstracted, and lack crucial
       production features such as user/document management,
       observability, and a default API. There was a big thread about this
       a few days ago: _Why we no longer use LangChain for building our AI
       agents_ (https://news.ycombinator.com/item?id=40739982)  We
       experienced these challenges firsthand while building a large-scale
       semantic search engine, having users report numerous hallucinations
       and inaccuracies. This highlighted that search+RAG is a difficult
       problem. We're convinced that these missing features, and more, are
       essential to effectively monitor and improve such systems over
       time.  Teams have been using R2R to develop custom AI agents with
       their own data, with applications ranging from B2B lead generation
       to research assistants. Best of all, the developer experience is
       much improved. For example, we have recently seen multiple teams
       use R2R to deploy a user-facing RAG engine for their application
       within a day. By day 2 some of these same teams were using their
       generated logs to tune the system with advanced features like
       hybrid search and HyDE.  Here are a few examples of how R2R can
       outperform classic RAG with semantic search only:  1. "What were
       the UK's top exports in 2023?". R2R with hybrid search can identify
       documents mentioning "UK exports" and "2023", whereas semantic
       search alone surfaces only related concepts like trade balance
       and economic reports.  2. "List all YC founders that worked at
       Google and now
       have an AI startup." Our knowledge graph feature allows R2R to
       understand relationships between employees and projects, answering
       a query that would be challenging for simple vector search.  The
       built-in observability and customizability of R2R help you tune
       and improve your system long after launch. Our plan is to keep
       the API ~fixed while we iterate on the internal system logic,
       making it easier for developers to trust R2R for production from
       day 1.  We are currently working on the following: (1)
       improving semantic chunking through third-party providers or
       our own custom LLMs; (2) training a custom model for knowledge
       graph triple extraction that will make KG construction 10x
       more efficient (this is in private beta, please reach out if
       interested!); (3) handling permissions at a more granular
       level than a single user; (4) LLM-powered online evaluation of
       system performance + enhanced analytics and metrics.  Getting
       started is easy. R2R is a lightweight repository that you can
       install locally with `pip install r2r`, or run with Docker. Check
       out our quickstart guide: https://r2r-docs.sciphi.ai/quickstart.
       Lastly, if it interests you, we are also working on a cloud
       solution at https://sciphi.ai.  Thanks a lot for taking the time to
       read! The feedback from the first ShowHN was invaluable and gave us
       our direction for the last three months, so we'd love to hear any
       more comments you have!
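        
       For illustration: hybrid search of the kind in example 1 is
       commonly implemented by fusing a keyword ranking with a
       semantic ranking, e.g. via reciprocal rank fusion. A toy
       sketch, not R2R's actual code (document ids invented):

```python
# Reciprocal rank fusion (RRF): merge a keyword ranking and a
# semantic ranking into a single ranking. Illustrative only.
def rrf_fuse(keyword_ranked, semantic_ranked, k=60):
    scores = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking):
            # 1/(k + rank) damps the influence of any single ranking
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy document ids for the "UK's top exports in 2023" query:
keyword_hits = ["uk_exports_2023", "trade_report", "economy_overview"]
semantic_hits = ["uk_exports_2023", "trade_report", "gdp_summary"]
fused = rrf_fuse(keyword_hits, semantic_hits)
```

       Documents ranked highly by both retrievers float to the top,
       which is what lets the exact-phrase matches beat merely
       related concepts.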
        
       Author : ocolegro
       Score  : 172 points
       Date   : 2024-06-26 13:27 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Kluless wrote:
       | Interesting. Can you talk a bit about how the process is
       | faster/better optimized for the dev teams? Sounds like there's a
       | big potential to accelerate time to MVP.
        
         | ocolegro wrote:
         | Sure, happy to.
         | 
         | R2R is built around a RESTful API and is dockerized, so
         | devs can get started on app development immediately.
         | 
         | The system was designed so that devs can typically scale data
         | ingestion up to provider bottlenecks w/out extra work.
         | 
         | We have implemented user-level permissions and high-level
         | document management alongside the vector db; most devs
         | would otherwise have to build these themselves in a
         | production setting, on top of the API and ingestion
         | scaling.
         | 
         | Lastly, we also log every search and RAG completion that flows
         | through the system. This is really important to find weaknesses
         | and tune the system over time. Most devs end up needing an
         | observability solution for their RAG.
         | 
         | All of these connect to an open source developer dashboard that
         | allows you to see uploaded files, test different configs, etc.
         | 
         | These basic features mean that devs can spend more time
         | iterating on and customizing application-specific features
         | like custom data ingestion, hybrid search and advanced RAG.
        
       | wmays wrote:
       | What's the benefit over langchain? Or other bigger platforms?
        
         | ocolegro wrote:
         | I'm just seeing this now.
         | 
         | The key advantages are covered in my response to Kluless
         | above:
         | 
         | R2R is built around a RESTful API and is dockerized, so
         | devs can get started on app development immediately.
         | 
         | The system was designed so that devs can typically scale data
         | ingestion up to provider bottlenecks w/out extra work.
         | 
         | We have implemented user-level permissions and high-level
         | document management alongside the vector db; most devs
         | would otherwise have to build these themselves in a
         | production setting, on top of the API and ingestion
         | scaling.
         | 
         | Lastly, we also log every search and RAG completion that flows
         | through the system. This is really important to find weaknesses
         | and tune the system over time. Most devs end up needing an
         | observability solution for their RAG.
         | 
         | All of these connect to an open source developer dashboard that
         | allows you to see uploaded files, test different configs, etc.
         | 
         | These basic features mean that devs can spend more time
         | iterating on and customizing application-specific features
         | like custom data ingestion, hybrid search and advanced RAG.
        
       | davedx wrote:
       | I've checked out quite a few RAG projects now and what I haven't
       | seen really solved is ingestion, it's usually like "this is an
       | endpoint or some connectors, have fun!".
       | 
       | How do I do a bulk/batch ingest of say, 10k html documents into
       | this system?
        
         | namanyayg wrote:
         | What do you want to do with the data after ingesting?
        
         | shepardrtc wrote:
         | LlamaIndex can ingest directories if you want to do bulk.
        
         | vintagedave wrote:
         | I'd like to know this too. A quick: "take these docs as input,
         | ingest and save, now sit there providing an API to get results"
         | service guide.
        
           | ocolegro wrote:
           | Take a look here -
           | https://r2r-docs.sciphi.ai/quickstart#ingest-data and here
           | https://r2r-docs.sciphi.ai/cookbooks/client-server#ingest-
           | do...
           | 
           | Since multiple people have requested we are pushing a quick
           | change to make this emphasized in the docs.
        
             | vintagedave wrote:
             | Thank you. My own comment giving a quickstart scenario
             | was downvoted :(
             | https://news.ycombinator.com/item?id=40801453 but I saw
             | you kindly replied to it! Thank you, I appreciate that.
        
         | ocolegro wrote:
         | All the pipelines are async, so for ingestion we have typically
         | seen that R2R can saturate the vector db or embedding provider.
         | We don't yet have backpressure so it is up to the client to
         | rate limit.
         | 
         | Ingestion is pretty straightforward, you can call R2R directly
         | or use the client-server interface to pass the html files in
         | directly to the ingest_files endpoint
         | (https://r2r-docs.sciphi.ai/api-
         | reference/endpoint/ingest_fil...).
         | 
         | The data parsers are all fairly simple and easy to customize.
         | Right now we use bs4 for handling HTML but have been
         | considering other approaches.
         | 
         | What specific features around ingestion have you found lacking?
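         | 
         | A minimal client-side throttle along those lines (the
         | ingest_one coroutine below is a hypothetical stand-in for
         | the real HTTP call to the ingest endpoint):

```python
# Since the server applies no backpressure, cap the number of
# in-flight ingest requests with an asyncio semaphore.
import asyncio

async def ingest_one(path):
    await asyncio.sleep(0.01)  # placeholder for the real HTTP call
    return f"ingested {path}"

async def ingest_all(paths, max_in_flight=8):
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(path):
        async with sem:  # at most max_in_flight requests at once
            return await ingest_one(path)

    return await asyncio.gather(*(bounded(p) for p in paths))

results = asyncio.run(ingest_all([f"doc_{i}.html" for i in range(20)]))
```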
        
           | davedx wrote:
           | Thanks, I'll give it a try!
        
       | sandeepnmenon wrote:
       | Could you provide more details on the multimodal data ingestion
       | process? What types of data can R2R currently handle, and how are
       | non-text data types embedded? Can the ingestion be streaming from
       | logs?
        
         | ocolegro wrote:
         | Currently R2R has out of the box logic for the following:
         | 
         | csv, docx, html, json, md, pdf, pptx, txt, xlsx, gif, jpg, png,
         | svg, mp3, mp4.
         | 
         | There are a lot of good questions around ingestion today, so we
         | will likely figure out how to intelligently expand this.
         | 
         | For mp3s we use Whisper to transcribe; for videos we
         | transcribe with Whisper and sample frames to "describe"
         | with a multimodal model. For images we likewise generate a
         | thorough text description -
         | https://r2r-docs.sciphi.ai/cookbooks/multimodal
         | 
         | We have been testing multi-modal embedding models and open
         | source models to do the description generation. If anyone has
         | suggestions on SOTA techniques that work well at scale we would
         | love to chat and work to implement these. Long run we'd like
         | the system to be able to handle multi-modal data locally.
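         | 
         | The per-filetype routing described above might look
         | roughly like this (parser names and placeholders are
         | illustrative, not R2R's internals):

```python
# Hypothetical dispatch-by-extension sketch of multimodal ingestion.
def parse_text(data):
    return data.decode("utf-8")

def parse_audio(data):
    # placeholder: a real system would run Whisper transcription here
    return "<transcription>"

def parse_image(data):
    # placeholder: a real system would ask a multimodal model for a
    # thorough text description of the image here
    return "<image description>"

PARSERS = {
    "txt": parse_text, "md": parse_text, "html": parse_text,
    "mp3": parse_audio, "mp4": parse_audio,
    "png": parse_image, "jpg": parse_image, "gif": parse_image,
}

def ingest(filename, data):
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported file type: {ext}")
    return PARSERS[ext](data)
```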
        
       | ldjkfkdsjnv wrote:
       | This looks great, will be giving it a shot today. Not to
       | throw cold water on the release, but I have been looking at
       | different RAG platforms. Anyone have any insight into which
       | is the flagship?
       | 
       | It really seems like document chunking is not a problem that can
       | be solved well generically. And RAG really hinges on which
       | documents get retrieved/the correct metadata.
       | 
       | Current approaches around this seem to use a reranker, where
       | we fetch a ton of information and prune it down. But still,
       | document splitting is tough, especially when you start to
       | add transcripts of videos that can be a few hours long.
        
       | vintagedave wrote:
       | > R2R is a lightweight repository that you can install locally
       | with `pip install r2r`, or run with Docker
       | 
       | Lightweight is good, and running it without having to deal with
       | Docker is excellent.
       | 
       | But your quickstart guide is still huge! It feels very much not
       | "quick". How do you:
       | 
       | * Install via Python
       | 
       | * Throw a folder of documents at it
       | 
       | * Have it sit there providing a REST API to get results?
       | 
       | Eg suppose I have an AI service already, so I throw up a private
       | Railway instance of this as a Python app. There's a DB somewhere.
       | As simple as possible. I can mimic it at home just running a
       | local Python server. How do I do that? _That's_ the real
       | quickstart.
        
         | ocolegro wrote:
         | You are right that the quickstart is pretty large; we will
         | think about how we can trim it to show only the essentials.
         | 
         | What you are requesting is pretty easy: just launch the
         | server and use the client directly. The code would look
         | like this:
         | 
         | ```python
         | from r2r import R2RClient
         | 
         | base_url = "http://localhost:8000"  # or other
         | client = R2RClient(base_url)
         | 
         | # load my_file_paths
         | ...
         | 
         | response = client.ingest_files(file_paths=my_file_paths)
         | # optionally set metadata, document ids, etc.:
         | # https://r2r-docs.sciphi.ai/api-reference/endpoint/ingest_fil...
         | ```
        
           | vintagedave wrote:
           | Thank you! I appreciate that, that's a good mini-start,
           | i.e. quickstart :)
           | 
           | I have an AI service that I need to add RAG to, running
           | as a direct Python server, and I can see running this as
           | a second service being very useful. Much appreciated.
        
       | jonathan-adly wrote:
       | This is excellent. I have been running a very similar stack
       | for 2 years, and you've got all the tricks of the trade:
       | pgvector, HyDE, web search + document search, and a good
       | dashboard with logs and analytics.
       | 
       | I am leaving my position, and I recommended this to basically
       | replace me with a junior dev who can just hit the API endpoints.
        
         | michaelmior wrote:
         | As someone with no experience with RAG in production, I'm
         | curious how effective you've found HyDE to be in practice.
        
           | ocolegro wrote:
           | I can't answer for the kindly poster above (ty), but from our
           | experience techniques like HyDE are great when you are
           | getting a lot of comparative questions.
           | 
           | For instance, if a user asks "How does A compare to B" then
           | the query expansion element of HyDE is incredibly useful. The
           | actual value of translating queries into answers for
           | embedding is a bit unclear, since most embedding models we
           | are using have been fine-tuned to map queries onto answers.
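           | 
           | A toy sketch of the HyDE flow being discussed (the llm
           | and embed callables are stubs, purely illustrative):

```python
# HyDE: embed a *hypothetical answer* to the query instead of the
# query itself, so retrieval matches answer-shaped chunks.
def hyde_search(query, llm, embed, index):
    hypothetical = llm(f"Write a short passage answering: {query}")
    qvec = embed(hypothetical)  # embed the answer, not the question
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # index is a list of (doc_id, vector) pairs
    return max(index, key=lambda item: dot(qvec, item[1]))[0]

# Toy stand-ins for a real LLM and embedding model:
llm = lambda prompt: "The UK's top exports included cars and medicine."
embed = lambda text: [text.count("exports"), text.count("cars")]
index = [("exports_doc", [1, 1]), ("weather_doc", [0, 0])]
best = hyde_search("What were the UK's top exports?", llm, embed, index)
```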
        
           | qeternity wrote:
           | Not GP, but HyDE is a crutch for poor semantic indexing
           | imho. Most people just take raw chunks and embed those. You
           | really need a robust preprocessing pipeline.
        
       | hubraumhugo wrote:
       | Do you also see the ingestion process as the key challenge for
       | many RAG systems to avoid "garbage in, garbage out"? How does R2R
       | handle accurate data extraction for complex and diverse document
       | types?
       | 
       | We have a customer who has hundreds of thousands of unstructured
       | and diverse PDFs (containing tables, forms, checkmarks, images,
       | etc.), and they need to accurately convert these PDFs into
       | markdown for RAG usage.
       | 
       | Traditional OCR approaches fall short in many of these cases, so
       | we've started using a combined multimodal LLM + OCR approach that
       | has led to promising accuracy and consistency at scale (ping me
       | if you want to give this a try). The RAG system itself is not a
       | big pain point for them, but the accurate and efficient
       | extraction and structuring of the data is.
        
         | ocolegro wrote:
         | We agree that ingestion and extraction are a big part of the
         | problem for building high quality RAG.
         | 
         | We've talked to a lot of different developers about these
         | problems and haven't found a general consensus on what features
         | are needed, so we are still evaluating advanced approaches.
         | 
         | For now our implementation is more general and designed to work
         | across a variety of documents. R2R was designed to be very easy
         | to override with your own custom parsing logic for these
         | reasons.
         | 
         | Lastly, we have been focusing a lot of our effort on knowledge
         | graphs since they provide an alternative way to enhance RAG
         | systems. We are training our own model for triples extraction
         | that will combine with the automatic knowledge graph
         | construction in R2R. We are planning to release this in the
         | coming weeks and are currently looking for beta testers [we
         | made a signup form, here - https://forms.gle/g9x3pLpqx2kCPmcg6
         | for anyone interested]
        
           | tootie wrote:
           | I'm actually curious what the common patterns for RAG have
           | been. I see a lot of progress in tooling but I have seen
           | relatively few use cases or practical architectures
           | documented.
        
         | machiaweliczny wrote:
         | Try sonnet 3.5 image understanding.
        
           | ocolegro wrote:
           | Have you tried it out yet? How does it compare with gpt-4o?
        
         | LifeIsBio wrote:
         | I want to second this. It seems like document chunking is the
         | most difficult part of the pipeline at this point.
         | 
         | You gave the example of unstructured PDF, but there are
         | challenges with structured docs as well. We've run into docs
         | that are hard to chunk because of deeply nested and
         | repeated structure. For example, there might be a long
         | experimental protocol with multiple steps; at the end of each
         | step, there's a table "Debugging" for troubleshooting anything
         | that might have gone wrong in that step. The debugging table is
         | a natural chunk, except that once chunked there are a dozen
         | such tables that are semantically similar when decoupled from
         | their original context and position in the tree structure of
         | the document.
         | 
         | This is one example, but there are many other cases where key
         | context for a chunk is nearby in a structured sense, but far
         | away in the flattened document, and therefore completely lost
         | when chunking.
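         | 
         | One common mitigation, sketched here purely for
         | illustration: prepend each chunk with its breadcrumb of
         | ancestor headings so that context survives chunking.

```python
# Prefix chunks with their heading path so a dozen near-identical
# "Debugging" tables stay distinguishable after chunking.
def contextualize(chunks):
    """chunks: list of (heading_path, text) pairs."""
    out = []
    for path, text in chunks:
        breadcrumb = " > ".join(path)
        out.append(f"[{breadcrumb}] {text}")
    return out

chunks = [
    (["Protocol", "Step 3", "Debugging"], "If the gel is blank, re-stain."),
    (["Protocol", "Step 7", "Debugging"], "If the gel is blank, re-stain."),
]
labeled = contextualize(chunks)
```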
        
           | ocolegro wrote:
           | Is this an example that could benefit from something like
           | knowledge graph construction or structured entity extraction?
           | 
           | I'm just curious because we have theorized and seen in
           | practice that extraction is a way to answer questions which
           | require connected information across disparate chunks, like
           | you can see in the simple cookbook here
           | [https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph].
           | 
           | Or do you think this is something that can just be solved
           | with more advanced multimodal ingestion?
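           | 
           | As a toy illustration of the idea (triples and names
           | invented for the example):

```python
# Why a knowledge graph helps with "YC founders who worked at Google
# and now have an AI startup": intersect relations over extracted
# triples, which flat chunk-level vector search struggles to do.
TRIPLES = [
    ("alice", "worked_at", "Google"),
    ("alice", "founded", "startup_x"),
    ("startup_x", "category", "AI"),
    ("bob", "worked_at", "Microsoft"),
    ("bob", "founded", "startup_y"),
    ("startup_y", "category", "AI"),
]

def ai_founders_from(company):
    ex_employees = {s for s, r, o in TRIPLES
                    if r == "worked_at" and o == company}
    ai_companies = {s for s, r, o in TRIPLES
                    if r == "category" and o == "AI"}
    return sorted(s for s, r, o in TRIPLES
                  if r == "founded" and s in ex_employees
                  and o in ai_companies)

answer = ai_founders_from("Google")
```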
        
           | cyanydeez wrote:
           | I think an LLM could be successful if it wasn't just
           | textually aware, but also spatially aware. Like, we know
           | these things just chew through forum posts like this
           | one. Knowing where the user name goes, the body of text,
           | the submit button, etc., might be foundational for
           | actual problem in, problem out.
        
         | lacoolj wrote:
         | I run into the same issue with an internal company RAG:
         | all unstructured data in PDFs. Even once converted to
         | markdown, they still need fine-tuning and a lot of manual
         | intervention.
         | 
         | It feels like we are inching closer to automating this type of
         | thing, or at the very least brute-forcing it in like the LLM
         | race is trying to do with bigger models and larger contexts.
         | 
         | Will have to play with this over a weekend and see what it
         | might help me with :)
        
           | ocolegro wrote:
           | Awesome - interested to hear your thoughts / feelings after
           | you get a chance to try it out.
        
         | constantinum wrote:
         | Anyone here exploring the extraction/parsing problem for
         | RAG, do try LLMWhisperer [1].
         | 
         | Try it with complex layout documents ->
         | https://pg.llmwhisperer.unstract.com/
         | 
         | If anyone wants to solve for RAG right from loading from
         | source, extraction, and sending processed data to
         | destination/API, try Unstract [2] (it is open-source)
         | 
         | [1] https://unstract.com/llmwhisperer/
         | 
         | [2] https://github.com/Zipstack/unstract
        
         | davedx wrote:
         | Danswer supports PDF natively; I've been trialing it and
         | it works pretty well.
        
         | cyanydeez wrote:
         | PaddleOCR seemed to be a good library for locating and
         | translating text. I've been puzzling over how to translate
         | something like a simple letter form into a LLM translatable
         | format.
         | 
         | I think the serious problem is most of these LLMs are
         | already built on top of garbage, so you're already at the
         | "garbage in" stage and just trying to match that as best
         | you can.
        
           | serjester wrote:
           | I built a library around this problem [1]. I recently did
           | some experimenting with PaddleOCR but found the results very
           | underwhelming (no spacing between text) - seems like it's
           | heavily optimized for Chinese. There was a 3 year old GitHub
           | issue around it, and it seems like it still has this
           | issue out of the box. I'd be curious to hear other
           | people's experience
           | with it.
           | 
           | [1] https://github.com/Filimoa/open-parse/
        
         | cpursley wrote:
         | I'm really interested in learning more about this (multimodal
         | LLM + OCR approach for PDFs), do you have a writeup anywhere or
         | something open source?
        
       | causal wrote:
       | Have you integrated with any popular chat front-ends, e.g.
       | OpenWebUI?
        
         | ocolegro wrote:
         | No, not yet; I've had difficulty getting these different
         | providers to work together on integrations. If you have
         | any suggestions, we are all ears.
         | 
         | In the meantime we've built our own dashboard, which shows
         | ingested documents and has a customizable chat interface -
         | https://github.com/SciPhi-AI/R2R-Dashboard.
         | 
         | It's still a bit rough though.
        
       | hdjsvdjue7 wrote:
       | I can't wait to try it after work. How would one link it to
       | ollama?
        
         | ocolegro wrote:
         | See the guide here -
         | https://r2r-docs.sciphi.ai/cookbooks/local-rag
         | 
         | we have instructions for getting set up and running with
         | ollama. It should be pretty smooth.
        
       | SubiculumCode wrote:
       | I've been interested in building a RAG for my documents,
       | but as an academic project I do not have the funds to spend
       | on the costly APIs that a lot of RAG projects out there
       | depend on - not just the LLM part, but also the reranking,
       | chunking, etc., like those from Cohere.
       | 
       | Can R2R be built with all processing steps using local
       | "open" models?
        
         | ocolegro wrote:
         | Yes, there is a guide to running R2R with local models here -
         | https://r2r-docs.sciphi.ai/cookbooks/local-rag
        
           | SubiculumCode wrote:
           | awesome!
        
       | p1esk wrote:
       | " What were the UK's top exports in 2023?"
       | 
       | "List all YC founders that worked at Google and now have an AI
       | startup."
       | 
       | How to check the accuracy of the answers? Is there some kind of a
       | detailed trace of how the answer was generated?
        
         | ocolegro wrote:
         | Great question. I can talk about how we handle the more
         | challenging "List all YC founders that worked at Google
         | and now have an AI startup."
         | 
         | For this we have a target dataset (the YC co directory) that we
         | have around 100 questions over. We have found that when feeding
         | an entire company listing in along with a single question we
         | can get an accurate single answer (needle in haystack problem).
         | 
         | So to build our evaluation dataset we feed each question with
         | each sample into the cheapest LLM we can find that reliably
         | handles the job. We then aggregate the results.
         | 
         | This is not perfect but it allows us to have a way to benchmark
         | our knowledge graph construction and querying strategy so that
         | we can tune the system ourselves.
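         | 
         | Sketched very roughly (the judge callable is a stub for
         | the cheap-LLM call; names are invented):

```python
# Build an evaluation answer set: ask each question against each
# record with a cheap LLM "judge", then aggregate per-record verdicts.
def build_answers(questions, records, judge):
    answers = {}
    for q in questions:
        hits = [rec["name"] for rec in records if judge(q, rec)]
        answers[q] = sorted(hits)  # aggregate the per-record results
    return answers

records = [
    {"name": "StartupA", "bio": "founder worked at Google, building AI"},
    {"name": "StartupB", "bio": "founder worked at a bank"},
]
# Toy judge; a real one would prompt an LLM with (question, record).
judge = lambda q, rec: "Google" in rec["bio"] and "AI" in rec["bio"]
answers = build_answers(
    ["ex-Google founders with an AI startup"], records, judge)
```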
        
           | p1esk wrote:
           | OK, so you have a way to evaluate the accuracy and convince
             | yourself that it probably works as expected. But what about
           | me, a user? How can I check that the question I asked was
           | answered correctly?
        
             | GTP wrote:
             | I think there's no substitute for doing your own research
             | and comparing the results.
        
               | p1esk wrote:
               | I just want to avoid putting one black box on top of
               | another if possible.
        
       | FriendlyMike wrote:
       | Is there a way to work with source code? I've been looking for a
       | rag solution that can understand the graph of code. For example
       | "what analytics events get called when I click submit"
        
         | ocolegro wrote:
         | No, we don't have any explicit code graph tools. Sourcegraph
         | might be a good starting point for you, their SCIP indices are
         | pretty nice
        
       | jhoechtl wrote:
       | Get neo4j out and count me in. No need for that resource hog.
        
         | ocolegro wrote:
         | It's an optional dep, used for KGs.
        
           | Onawa wrote:
           | What about swapping out neo4j for EdgeDB? Then you get to
           | keep using Postgres with pgvector, and get a knowledge
           | graph all in one shot.
        
       | mentos wrote:
       | Seems like there is an opportunity to make this as easy to use as
       | Dropbox.
        
         | ocolegro wrote:
         | yes, I think so.
        
       | GTP wrote:
       | How does this compare with Google's NotebookLM?
        
         | shreyaspgkr wrote:
         | There are many exciting products that enable users to
         | perform RAG on their own data; the growing number of use
         | cases highlights the need for developer-friendly tools to
         | build such applications.
         | 
         | While building our own RAG system with existing tools, we
         | encountered numerous challenges in experimentation, deployment,
         | and analysis. This led us to create our own solution that is
         | truly developer-friendly.
         | 
         | You can check our docs for more details:
         | https://r2r-docs.sciphi.ai/introduction
        
       | taylorbuley wrote:
       | I could see myself considering this. And not just because it's
       | got a great project name.
        
       | vanillax wrote:
       | The quick start is definitely not quick. You really should
       | provide a batteries-included docker compose with a Postgres
       | image ( docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0 )
       | 
       | If I want to use the dashboard I have to clone another repo?
       | 'git clone git@github.com:SciPhi-AI/R2R-Dashboard.git' ? Why
       | not make it available in a docker container, so that if I'm
       | only interested in RAG I can plug into the docker container
       | for the dashboard?
       | 
       | This project feels like a collection of a lot of things that
       | isn't really providing any extra ease to development. It
       | feels more like joining a new company and trying to find all
       | the repos and set everything up.
       | 
       | This really looks cool, but I'm struggling to figure out if
       | it's an SDK or a suite of apps or both; in the latter case
       | the suite of apps is really confusing if I still have to
       | write all the Python, so it feels more like an SDK?
       | 
       | Perhaps provide a better "1 click" install experience to
       | preview/showcase all the features, and then let devs
       | leverage R2R later...
        
         | ocolegro wrote:
         | Thanks, this is really solid feedback - we will make a
         | more complete docker image to make setup easier/faster.
         | 
         | Think of R2R as an SDK with an out of the box admin dashboard /
         | playground that you can plug into.
        
           | rahimnathwani wrote:
           | The installation instructions should be:
           | 
           | 1. Download this docker compose file.
           | 
           | 2. Run docker compose using this command.
           | 
           | 3. Upload your first file (or folder) of content using this
           | command.
           | 
           | It's fine to have to pip install the client, but it might be
           | worth also providing an example curl command for uploading an
           | HTML/text/PDF file.
           | 
           | The quickstart confused me because it started with python -m
           | r2r.quickstart.example or something. It wasn't clear why I
           | need to run some quickstart example, or how I would specify
           | the location of my doc(s) or what command to run to index
           | docs for real. Sure I could go read the source, but then it's
           | not really a quick start.
           | 
           | Also it would be good to know:
           | 
           | - how to control chunk size when uploading a new document
           | 
           | - what type(s) of search are supported. You mention something
           | about hybrid search, but the quickstart example doesn't
           | explain how to choose the type of search (I guess it defaults
           | to vector search).
           | 
           | HTH
        
             | ocolegro wrote:
             | Thanks, I agree that would be a more streamlined
             | introduction.
             | 
             | The quickstart clearly has too much content in retrospect,
             | and the feedback here makes it clear we should simplify.
        
       | haolez wrote:
       | On a side note, is there an open source RAG library that's not
       | bound to a rising AI startup? I couldn't find one and I have a
       | simple in-house implementation that I'd like to replace with
       | something more people use.
        
       ___________________________________________________________________
       (page generated 2024-06-26 23:00 UTC)