[HN Gopher] RAG at scale: Synchronizing and ingesting billions o...
___________________________________________________________________
RAG at scale: Synchronizing and ingesting billions of text
embeddings
Author : picohen
Score : 60 points
Date : 2023-10-09 19:44 UTC (3 hours ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| joewferrara wrote:
| This is a great article about the technical difficulties of
| building a RAG system at scale from an engineering perspective.
| Performance there means speed and compute. A topic that is not
| addressed is how to evaluate a RAG system, where performance
| means whether the system retrieves the correct context and
| answers questions accurately. A RAG system should be built
| so that the different parts (retriever, embedder, etc) can easily
| be taken out and modified to improve the performance of the RAG
| system at answering questions accurately. Whether a RAG system is
| answering questions accurately should be assessed during
| development and then continuously monitored.
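|
| As a rough illustration of that assessment step, a minimal eval
| harness (hypothetical names, assuming a small labeled set of
| question / relevant-chunk pairs and a swappable retriever) could
| look like:
|
|     # Measures recall@k for whichever retriever is plugged in.
|     def recall_at_k(eval_set, retrieve, k=5):
|         hits = 0
|         for question, relevant_chunk_id in eval_set:
|             ids = [c.id for c in retrieve(question, top_k=k)]
|             hits += int(relevant_chunk_id in ids)
|         return hits / len(eval_set)
|
|     # Re-run the same eval whenever the retriever, embedder or
|     # chunking strategy changes, and keep monitoring it after:
|     # score = recall_at_k(eval_set, retrieve_v2, k=5)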
| ac2u wrote:
| Yeah, especially if you're experimenting with training and
| applying a matrix to the embeddings generated by an off-the-shelf
| model to help it surface subtleties unique to your domain.
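|
| A rough numpy sketch of what I mean (the transform W would be
| learned on domain-specific pairs; names and shapes here are just
| illustrative):
|
|     import numpy as np
|
|     d = 768
|     # Learned d x d transform applied on top of the
|     # off-the-shelf embeddings (hypothetical file).
|     W = np.load("domain_transform.npy")
|
|     def domain_embed(text, base_embed):
|         v = np.asarray(base_embed(text))  # shape (d,)
|         v = W @ v                          # project into domain space
|         return v / np.linalg.norm(v)       # re-normalize for cosine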
| ddematheu wrote:
| Co-author of the article here.
|
| You are right. Retrieval accuracy is important as well. From an
| accuracy perspective, are there any tools you have found useful
| for validating retrieval accuracy?
|
| In our current architecture, all the different pieces within
| the RAG ingestion pipeline are modifiable to be able to improve
| loading, chunking and embedding.
|
| As part of our development process, we have started to build
| other tools that we don't talk about as much in the article,
| including a pre-processing and embeddings playground
| (https://www.neum.ai/post/pre-processing-playground) to test
| different combinations of modules against a piece of text. The
| idea is that you can establish your ideal pipeline /
| transformations, which can then be scaled.
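|
| Conceptually (not our actual code), the pipeline is just composed
| of interchangeable stages, so any one of them can be swapped out
| and re-tested against the same sample document:
|
|     from dataclasses import dataclass
|     from typing import Callable, List
|
|     @dataclass
|     class Pipeline:
|         load: Callable[[str], str]                # source -> text
|         chunk: Callable[[str], List[str]]         # text -> chunks
|         embed: Callable[[List[str]], List[list]]  # chunks -> vectors
|
|         def run(self, source):
|             text = self.load(source)
|             chunks = self.chunk(text)
|             return list(zip(chunks, self.embed(chunks)))
|
|     # e.g. compare combinations before scaling out (hypothetical
|     # stage functions):
|     # a = Pipeline(load_pdf, chunk_by_tokens, embed_openai).run(f)
|     # b = Pipeline(load_pdf, chunk_by_md, embed_minilm).run(f)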
| visarga wrote:
| Did you consider pre-processing each chunk separately to
| generate useful information - summary, title, topics - that
| would enrich embeddings and aid retrieval? Embeddings only
| capture surface form. "Third letter of second word" won't
| match embedding for letter "t". Info has surface and depth.
| We get depth through chain-of-thought, but that requires
| first digesting raw text with an LLM.
|
| Even LLMs are dumb during training but smart during
| inference. So to make more useful training examples, we need
| to first "study" them with a model, making the implicit
| explicit, before training. This allows training to benefit
| from inference-stage smarts.
|
| Hopefully we avoid cases where "A is B" fails to recall "B is
| A" (the reversal curse). The reversal should be predicted
| during "study" and get added to the training set, reducing
| fragmentation. Fragmented data in the dataset remains
| fragmented in the trained model. I believe many of the
| problems of RAG are related to data fragmentation and
| superficial presentation.
|
| A RAG system should have an ingestion LLM step for retrieval
| augmentation and probably hierarchical summarisation up to a
| decent level. That step adds insight to the system by processing
| the raw documents into a more useful form.
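|
| To sketch that ingestion step (the prompt and model call are
| placeholders; the point is that the generated summary/title/topics
| get embedded and stored alongside the raw chunk):
|
|     # Hypothetical enrichment run once per chunk at ingestion.
|     def enrich_chunk(chunk_text, llm, embed):
|         enrichment = llm(
|             "Summarize the following text, give it a short "
|             "title, and list its main topics:\n\n" + chunk_text
|         )
|         return {
|             "text": chunk_text,
|             "enrichment": enrichment,
|             # Embed raw text plus enrichment so retrieval can
|             # match on "depth", not just surface wording.
|             "vector": embed(chunk_text + "\n" + enrichment),
|         }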
| dartos wrote:
| Do you have any more resources on this topic? I'm currently
| very interested in scaling and verifying RAG systems.
| dluc wrote:
| We are also developing an open-source solution for those who
| would like to test it out and/or contribute, it can be consumed
| as a web service, or embedded into .NET apps. The project is
| codenamed "Semantic Memory" (available in GitHub) and offers
| customizable external dependencies, such as using Azure Queues,
| RabbitMQ, or other alternatives, and options for Azure Cognitive
| Search, Qdrant (with plans to include Weaviate and more). The
| architecture is similar, with queues and pipelines.
|
| We believe that enabling custom dependencies and logic, as well
| as the ability to add/remove pipeline steps, is crucial. As of
| now, there is no definitive answer to the best chunk size or
| embedding model, so our project aims to provide the flexibility
| to inject and replace components and pipeline behavior.
|
| Regarding scalability, LLM text generators and GPUs remain a
| limiting factor in this area as well. LLMs hold great potential
| for analyzing input data, and I believe the focus should be less
| on the speed of queues and storage and more on finding the
| optimal way to integrate LLMs into these pipelines.
| juxtaposicion wrote:
| We're also building a billion-scale pipeline for indexing
| embeddings. Like the author, most of our pain has been scaling.
| If you only had to do millions, this whole pipeline would be
| ~100 LoC. But billions? Our system is at 20k LoC and growing.
|
| The biggest surprise to me here is using Weaviate at the scale
| of billions -- my understanding was that this would require
| tremendous memory (on the order of a TB of RAM), which is
| prohibitively expensive ($10-50k/month for that much memory).
|
| Instead, we've been using Lance, which stores its vector index on
| disk instead of in memory.
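|
| For reference, the LanceDB usage is roughly the following (exact
| index parameters depend on version and dataset; vectors and the
| index live on disk rather than in RAM):
|
|     import lancedb
|
|     db = lancedb.connect("./embeddings.lance")
|     table = db.create_table("chunks", data=[
|         {"id": "doc1-0", "text": "...", "vector": [0.1] * 768},
|     ])
|
|     # IVF-PQ index built on disk; values here are illustrative.
|     table.create_index(num_partitions=256, num_sub_vectors=96)
|
|     hits = table.search([0.1] * 768).limit(5).to_list()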
| bryan0 wrote:
| We've been using pgvector at the 100M scale without any major
| problems so far, but I guess it depends on your specific use
| case. We've also been using Elasticsearch dense vector fields,
| which also seem to scale well; it's pricey, but we already have
| it in our infra, so it works well.
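|
| For anyone curious, the pgvector setup behind that is essentially
| the following (schema and index parameters are illustrative;
| 'lists' and probes need tuning at the 100M scale):
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=rag")  # illustrative DSN
|     cur = conn.cursor()
|
|     cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
|     cur.execute("""
|         CREATE TABLE IF NOT EXISTS chunks (
|             id bigserial PRIMARY KEY,
|             text text,
|             embedding vector(768));""")
|     cur.execute(
|         "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
|         "ON chunks USING ivfflat (embedding vector_cosine_ops) "
|         "WITH (lists = 1000);")
|
|     # Nearest neighbours by cosine distance for a query vector.
|     qvec = "[" + ",".join(["0.1"] * 768) + "]"
|     cur.execute(
|         "SELECT id, text FROM chunks "
|         "ORDER BY embedding <=> %s::vector LIMIT 5;", (qvec,))
|     rows = cur.fetchall()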
| ddematheu wrote:
| Co-author of article here.
|
| Yeah, a ton of the time and effort has gone into building
| robustness and observability into the process. When dealing with
| millions of files and a failure halfway through, it is imperative
| to be able to recover.
|
| RE: Weaviate: Yeah, we needed to use large amounts of memory
| with Weaviate, which has been a drawback from a cost
| perspective, but which from a performance perspective delivers
| on the requirements of our customers. (On Weaviate we explored
| using product quantization.)
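|
| For those curious about the PQ piece, in recent Weaviate versions
| it is enabled per class via vectorIndexConfig, roughly like the
| sketch below (exact keys depend on client and server versions;
| values are illustrative, and PQ trades some recall for memory):
|
|     import weaviate
|
|     client = weaviate.Client("http://localhost:8080")
|     client.schema.create_class({
|         "class": "Chunk",
|         "vectorIndexConfig": {
|             "pq": {
|                 "enabled": True,
|                 "trainingLimit": 100000,  # vectors to train codebook
|                 "segments": 96,           # sub-vector segments
|             }
|         },
|     })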
|
| What kind of performance have you gotten with Lance, both on
| ingestion and retrieval? Is disk retrieval fast enough?
___________________________________________________________________
(page generated 2023-10-09 23:00 UTC)