[HN Gopher] RAG at scale: Synchronizing and ingesting billions o...
       ___________________________________________________________________
        
       RAG at scale: Synchronizing and ingesting billions of text
       embeddings
        
       Author : picohen
       Score  : 60 points
       Date   : 2023-10-09 19:44 UTC (3 hours ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | joewferrara wrote:
        | This is a great article about the technical difficulties of
        | building a RAG system at scale from an engineering perspective,
        | where performance is about speed and compute. A topic that is
        | not addressed is how to evaluate a RAG system, where performance
        | is about whether the system retrieves the correct context and
        | answers questions accurately. A RAG system should be built
        | so that the different parts (retriever, embedder, etc.) can easily
       | be taken out and modified to improve the performance of the RAG
       | system at answering questions accurately. Whether a RAG system is
       | answering questions accurately should be assessed during
       | development and then continuously monitored.
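        | 
        | A minimal sketch of such a check, assuming a small hand-labelled
        | set of (question, relevant chunk id) pairs and a hypothetical
        | retriever.search() that returns ranked chunk ids; it can run in
        | CI during development and periodically in production:
        | 
        |     def recall_at_k(eval_set, retriever, k=5):
        |         # Fraction of questions whose labelled chunk appears in
        |         # the top-k retrieved results.
        |         hits = 0
        |         for question, relevant_id in eval_set:
        |             retrieved_ids = retriever.search(question, top_k=k)
        |             if relevant_id in retrieved_ids:
        |                 hits += 1
        |         return hits / len(eval_set)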
        
         | ac2u wrote:
          | Yeah, especially if you're experimenting with training and
          | applying a matrix to the embeddings generated by an
          | off-the-shelf model to help it surface subtleties unique to
          | your domain.
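          | 
          | A rough sketch of that idea, assuming you already have paired
          | query/document embeddings from the off-the-shelf model (here W
          | is fit with plain least squares; a contrastive objective is
          | another common choice):
          | 
          |     import numpy as np
          | 
          |     def fit_adapter(Q, D):
          |         # Q: query embeddings, D: embeddings of the documents
          |         # each query should retrieve (both n x d).
          |         # Least-squares fit of a d x d matrix W so Q @ W ~ D.
          |         W, *_ = np.linalg.lstsq(Q, D, rcond=None)
          |         return W
          | 
          |     def adapt(vecs, W):
          |         # Apply the learned matrix, then re-normalise so the
          |         # adapted vectors still work for cosine search.
          |         out = vecs @ W
          |         return out / np.linalg.norm(out, axis=1, keepdims=True)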
        
         | ddematheu wrote:
         | Co-author of the article here.
         | 
          | You are right. Retrieval accuracy is important as well. On that
          | front, are there any tools you have found useful for validating
          | retrieval accuracy?
         | 
          | In our current architecture, all the different pieces within
          | the RAG ingestion pipeline are modifiable, so loading, chunking
          | and embedding can each be improved independently.
         | 
          | As part of our development process, we have also started
          | building tools that we don't cover as much in the article,
          | including a pre-processing and embeddings playground
          | (https://www.neum.ai/post/pre-processing-playground) for
          | testing different combinations of modules against a piece of
          | text. The idea is that you can establish your ideal pipeline /
          | transformations, which can then be scaled.
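          | 
          | Conceptually the playground boils down to a grid search over
          | modules (the chunkers, embedders and score_retrieval names
          | below are hypothetical stand-ins, not our actual API):
          | 
          |     def compare_pipelines(text, chunkers, embedders,
          |                           score_retrieval):
          |         # chunkers / embedders: dicts of named callables;
          |         # score_retrieval: any retrieval metric you trust,
          |         # e.g. recall@k against a labelled set.
          |         results = {}
          |         for chunk_name, chunk_fn in chunkers.items():
          |             chunks = chunk_fn(text)
          |             for embed_name, embed_fn in embedders.items():
          |                 vectors = [embed_fn(c) for c in chunks]
          |                 results[(chunk_name, embed_name)] = \
          |                     score_retrieval(chunks, vectors)
          |         # The best-scoring combination is what gets scaled out.
          |         return max(results, key=results.get)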
        
           | visarga wrote:
           | Did you consider pre-processing each chunk separately to
           | generate useful information - summary, title, topics - that
           | would enrich embeddings and aid retrieval? Embeddings only
            | capture surface form: "Third letter of second word" won't
            | match the embedding for the letter "t". Info has surface and
            | depth.
           | We get depth through chain-of-thought, but that requires
           | first digesting raw text with an LLM.
           | 
           | Even LLMs are dumb during training but smart during
           | inference. So to make more useful training examples, we need
           | to first "study" them with a model, making the implicit
           | explicit, before training. This allows training to benefit
           | from inference-stage smarts.
           | 
           | Hopefully we avoid cases where "A is B" fails to recall "B is
           | A" (the reversal curse). The reversal should be predicted
           | during "study" and get added to the training set, reducing
           | fragmentation. Fragmented data in the dataset remains
           | fragmented in the trained model. I believe many of the
           | problems of RAG are related to data fragmentation and
           | superficial presentation.
           | 
            | A RAG system should have an ingestion LLM step for retrieval
            | augmentation, and probably hierarchical summarisation up to a
            | decent level. This adds insight to the system by processing
            | the raw documents into a more useful form.
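            | 
            | A minimal sketch of that ingestion step, with llm() and
            | embed() as hypothetical callables:
            | 
            |     def enrich_chunk(chunk, llm, embed):
            |         # Make the implicit explicit before embedding: ask
            |         # the model for a summary and topics, then embed the
            |         # enriched text rather than the raw chunk alone.
            |         summary = llm("Summarise in one sentence:\n" + chunk)
            |         topics = llm("List 3-5 topics, comma separated:\n"
            |                      + chunk)
            |         enriched = (f"Topics: {topics}\nSummary: {summary}"
            |                     f"\n\n{chunk}")
            |         return {"text": chunk,
            |                 "enriched": enriched,
            |                 "vector": embed(enriched)}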
        
             | dartos wrote:
             | Do you have any more resources on this topic? I'm currently
             | very interested in scaling and verifying RAG systems.
        
       | dluc wrote:
       | We are also developing an open-source solution for those who
       | would like to test it out and/or contribute, it can be consumed
       | as a web service, or embedded into .NET apps. The project is
       | codenamed "Semantic Memory" (available in GitHub) and offers
       | customizable external dependencies, such as using Azure Queues,
       | RabbitMQ, or other alternatives, and options for Azure Cognitive
       | Search, Qdrant (with plans to include Weaviate and more). The
       | architecture is similar, with queues and pipelines.
       | 
       | We believe that enabling custom dependencies and logic, as well
       | as the ability to add/remove pipeline steps, is crucial. As of
       | now, there is no definitive answer to the best chunk size or
       | embedding model, so our project aims to provide the flexibility
       | to inject and replace components and pipeline behavior.
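        | 
        | Conceptually (this is a sketch in pseudo-Python, not the actual
        | .NET API), a pipeline is just an ordered list of named,
        | replaceable handlers:
        | 
        |     # Hypothetical handlers; each takes and returns the
        |     # in-flight document, so steps can be added, removed or
        |     # swapped without touching the rest of the pipeline.
        |     pipeline = [
        |         ("extract", extract_text),
        |         ("chunk", chunk_text),
        |         ("embed", embed_chunks),
        |         ("store", save_to_vector_db),
        |     ]
        | 
        |     def run(document, steps):
        |         for name, step in steps:
        |             document = step(document)
        |         return document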
       | 
        | Regarding scalability, LLM text generators and GPUs remain a
        | limiting factor in this area as well. LLMs hold great potential
        | for analyzing input data, and I believe the focus should be less
        | on the speed of queues and storage and more on finding the
        | optimal way to integrate LLMs into these pipelines.
        
       | juxtaposicion wrote:
        | We're also building a billion-scale pipeline for indexing
        | embeddings. Like the author, most of our pain has been scaling.
        | If you only had to do millions, this whole pipeline would be
        | about 100 LoC. But billions? Our system is at 20k LoC and
        | growing.
       | 
        | The biggest surprise to me here is using Weaviate at the scale of
        | billions -- my understanding was that this would require
        | tremendous amounts of memory (on the order of a TB of RAM), which
        | is prohibitively expensive ($10-50k/month for that much memory).
       | 
       | Instead, we've been using Lance, which stores its vector index on
       | disk instead of in memory.
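        | 
        | For anyone curious, a rough sketch with the LanceDB Python
        | client (schema and index parameters are illustrative, not
        | recommendations; check the docs for current method names):
        | 
        |     import lancedb
        | 
        |     db = lancedb.connect("./lancedb-data")   # on-disk dataset
        |     table = db.create_table("chunks", data=[
        |         {"vector": [0.1] * 768, "text": "example", "doc_id": "d1"},
        |     ])
        |     # Disk-resident IVF-PQ index instead of an in-memory graph
        |     # (in practice the table needs enough rows to train it).
        |     table.create_index(num_partitions=256, num_sub_vectors=96)
        |     hits = table.search([0.1] * 768).limit(10).to_list()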
        
         | bryan0 wrote:
          | We've been using pgvector at the 100M scale without any major
          | problems so far, but I guess it depends on your specific use
          | case. We've also been using Elasticsearch dense vector fields,
          | which also seem to scale well. It's pricey, of course, but we
          | already have it in our infra, so it works well for us.
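          | 
          | For reference, a minimal pgvector setup of the kind described
          | (via psycopg2; the dimension, index type and list count are
          | illustrative):
          | 
          |     import psycopg2
          | 
          |     conn = psycopg2.connect("dbname=rag")
          |     cur = conn.cursor()
          |     cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
          |     cur.execute("""CREATE TABLE IF NOT EXISTS chunks (
          |                      id bigserial PRIMARY KEY,
          |                      body text,
          |                      embedding vector(768))""")
          |     # Approximate index; 'lists' is usually tuned to roughly
          |     # sqrt(row count) once the table is populated.
          |     cur.execute("""CREATE INDEX IF NOT EXISTS chunks_embedding_idx
          |                    ON chunks
          |                    USING ivfflat (embedding vector_cosine_ops)
          |                    WITH (lists = 1000)""")
          |     conn.commit()
          | 
          |     qvec = "[" + ",".join(["0.1"] * 768) + "]"
          |     cur.execute("SELECT id, body FROM chunks "
          |                 "ORDER BY embedding <=> %s::vector LIMIT 10",
          |                 (qvec,))
          |     rows = cur.fetchall()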
        
         | ddematheu wrote:
         | Co-author of article here.
         | 
          | Yeah, a ton of the time and effort has gone into building
          | robustness and observability into the process. When dealing
          | with millions of files, it is imperative to be able to recover
          | from a failure halfway through.
         | 
          | RE: Weaviate: Yeah, we needed to use large amounts of memory
          | with Weaviate, which has been a drawback from a cost
          | perspective, but which from a performance perspective delivers
          | on our customers' requirements. (On Weaviate we also explored
          | using product quantization.)
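          | 
          | For anyone curious, enabling PQ on an existing class looks
          | roughly like this with the v3 Python client (class name and
          | values are illustrative; check the Weaviate PQ docs for the
          | full set of config keys):
          | 
          |     import weaviate
          | 
          |     client = weaviate.Client("http://localhost:8080")
          |     # PQ compresses vectors held in memory at some recall cost.
          |     client.schema.update_config("Chunk", {
          |         "vectorIndexConfig": {
          |             "pq": {"enabled": True, "segments": 96,
          |                    "trainingLimit": 100000},
          |         },
          |     })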
         | 
          | What kind of performance have you gotten with Lance, both on
          | ingestion and retrieval? Is disk retrieval fast enough?
        
       ___________________________________________________________________
       (page generated 2023-10-09 23:00 UTC)