[HN Gopher] Pg_vectorize: Vector search and RAG on Postgres
___________________________________________________________________
Pg_vectorize: Vector search and RAG on Postgres
Author : samaysharma
Score : 269 points
Date : 2024-03-06 08:34 UTC (14 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| rgrieselhuber wrote:
| Tembo has been doing very interesting work.
| KingOfCoders wrote:
| Indeed! Everyone who pushes Postgres is a win in my book.
| DSingularity wrote:
| Why?
| isoprophlex wrote:
| Because it empowers developers instead of salespeople?
| chaps wrote:
| Neat! Any notable gotchas we should know about?
| chuckhend wrote:
| RAG can cost a lot of money if not done thoughtfully. Most
| embedding and chat completion model providers charge by the
| token (roughly, the number of words in the request). You'll
| pay to have the data in the database transformed into
| embeddings; that is mostly a one-time fixed cost. Then every
| time there is a search query in RAG, that question needs to
| be transformed as well. The chat completion model (like
| GPT-4) will charge for the number of tokens in the request
| plus the number of tokens in the response.
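| As a back-of-the-envelope sketch of that cost split (the
| per-1K-token prices below are made-up placeholders, not any
| provider's actual rates):
|
|     # hypothetical prices, USD per 1K tokens
|     EMBED = 0.0001
|     CHAT_IN = 0.01
|     CHAT_OUT = 0.03
|
|     def one_time_embedding_cost(corpus_tokens):
|         # pay once to embed every document in the table
|         return corpus_tokens / 1000 * EMBED
|
|     def per_query_cost(q_tokens, ctx_tokens, ans_tokens):
|         # each search embeds the question, then pays for the
|         # prompt (question + retrieved context) and the answer
|         return (q_tokens / 1000 * EMBED
|                 + (q_tokens + ctx_tokens) / 1000 * CHAT_IN
|                 + ans_tokens / 1000 * CHAT_OUT)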
|
| Self-hosting can be a big advantage for cost control, but it
| can be complicated too. Tembo.io's managed service provides
| privately hosted embedding models, but does not have hosted
| chat completion models yet.
| nostrebored wrote:
| Do you work for Tembo?
|
| I only ask because making consumption-based charges for
| embedding models seem like the norm is off to me -- there
| are a ton of open-source, self-managed embedding models that
| you can use.
|
| You can spin up a tool like Marqo and get a platform that
| handles making the embedding calls and chunking strategy as
| well.
| jonplackett wrote:
| I'm using pgvector in Supabase and it seems great in
| prototype form.
|
| Has anyone tried using it at scale? How does it do vs
| Pinecone / Cloudflare's vector search?
| carlossouza wrote:
| Pg_vector is great! I'm using it at scale here:
|
| https://trendingpapers.com/
|
| But it's Postgres + pg_vector only, no Supabase
| kiwicopple wrote:
| fwiw, we have plenty of companies using pgvector at scale, in
| production - some of them migrating their workload directly
| from pinecone. I'll see if I can dig up some of the high-level
| numbers later today
| philippemnoel wrote:
| I read this paper claiming that there are no fundamental
| limitations to Postgres' vector support down the line. Pretty
| excited for the work Supabase, Neon, Tembo, ourselves at
| ParadeDB and many others are doing to keep pushing it
| forward.
|
| Definitely more and more vector production workloads coming
| to Postgres
|
| Paper: https://www.cs.purdue.edu/homes/csjgwang/pubs/ICDE24_V
| ecDB.p...
| falling_myshkin wrote:
| There's an issue in the pgvector repo about someone having
| several ~10-20 million row tables and getting acceptable
| performance with the right hardware and some performance
| tuning: https://github.com/pgvector/pgvector/issues/455
|
| I'm in the early stages of evaluating pgvector myself, but
| having used Pinecone I currently like pgvector better
| because it is open source. The indexing algorithm is clear;
| one can understand and modify the parameters. Furthermore,
| the database is PostgreSQL, not a proprietary document
| store. When the other data in the problem is stored
| relationally, it is very convenient to have the vectors
| stored like this as well. And PostgreSQL has good
| observability and metrics. I think when it comes to
| flexibility for specialized applications, pgvector seems
| like the clear winner. But I can definitely see Pinecone's
| appeal if vector search is not a core component of the
| problem/business, as it is very easy to use and scales very
| easily.
| whakim wrote:
| I've used pgvector at scale (hundreds of millions of rows) and
| while it _does_ work quite well, you'll hit some rough edges:
|
| * Reliance on vacuum means that you really have to watch/tune
| autovacuum or else recall will take a big hit, and the indices
| are huge and hard to vacuum.
|
| * No real support for doing ANN on a subset of embeddings (ie
| filtering) aside from building partial indices or hoping that
| oversampling plus the right runtime parameters get the job done
| (very hard to reason about in practice), which doesn't fit many
| workloads.
|
| * Very long index build times makes it hard to experiment with
| HNSW hyperparameters. This might get better with parallel index
| builds, but how much better is still TBD.
|
| * Some minor annoyances around database API support in Python,
| e.g. you'll usually have to use a DBAPI-level cursor if you're
| working with numpy arrays on the application side.
|
| That being said, overall the pgvector team are doing an awesome
| job considering the limited resources they have.
| tosh wrote:
| Are there any examples for when RAG powered by vector search
| works really well?
|
| I tried best practices like having the LLM formulate an
| answer and using the answer for the search (instead of the
| question) and trying different chunk sizes and so on, but I
| never got it to work in a way that I would consider the
| result "good".
|
| Maybe it was because of the type of data or the capabilities of
| the model at the time (GPT 3.5 and GPT 4)?
|
| By now context windows with some models are large enough to
| fit lots of context directly into the prompt, which is
| easier to do and yields better results. It is way more
| costly, but cost is going down fast, so I wonder what this
| means for RAG + vector search going forward.
|
| Where does it shine?
| pid-1 wrote:
| I had similar experiences. Can't understand all the hype around
| RAG when the results are so bad.
| c_ehlen wrote:
| I'm using it for an internal application and the results so
| far are amazing, considering it was hacked together in a few
| hours.
|
| It helps a lot with discovery. We have some large PDFs and
| also a large amount of smaller PDFs. Simply asking a
| question and getting an answer with the exact location in
| the PDF is really helpful.
| qrios wrote:
| In our experience, simple RAG is often not that helpful, as
| the questions themselves are not represented in the vector
| space (unless you use an FAQ dataset as input). Either
| preprocessing by an LLM or specific context handling needs
| to be done.
| inductive_magic wrote:
| We're getting very solid results.
|
| Instead of performing RAG on the (vectorised) raw source texts,
| we create representations of elements/"context clusters"
| contained within the source, which are then vectorised and
| ranked. That's all I can disclose, hope that helps.
| Merik wrote:
| Thanks for your message. I should say that giving your
| comment to GPT-4, with a request for a solution architecture
| that could produce good results based on the comment,
| produced a very detailed, fascinating solution. https://chat.
| openai.com/share/435a3855-bf02-4791-97b3-4531b8...
| weird-eye-issue wrote:
| A whole lot of noise
| Merik wrote:
| Maybe, but it expanded on the idea in the vague comment and
| introduced me to the approach of embedding each sentence,
| clustering the sentences, and then taking the centroid of
| each cluster as the embedding to index/search against. I had
| not thought of doing that before.
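| A minimal sketch of that idea, assuming sentence-transformers
| and scikit-learn are available (the model name and cluster
| count here are arbitrary choices):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.cluster import KMeans
|
|     def cluster_centroids(sentences, n_clusters=5):
|         model = SentenceTransformer("all-MiniLM-L6-v2")
|         emb = model.encode(sentences)  # one vector per sentence
|         labels = KMeans(n_clusters=n_clusters).fit_predict(emb)
|         # one centroid per cluster; index/search against these
|         return [emb[labels == k].mean(axis=0)
|                 for k in range(n_clusters)]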
| isoprophlex wrote:
| If only the thing could speak and summarize in plain
| English instead of hollow, overly verbose bulleted lists.
| falling_myshkin wrote:
| After seeing raw source text performance, I agree that
| representational learning of higher-level semantic "context
| clusters" as you say seems like an interesting direction.
| reerdna wrote:
| Sounds a little like this recent paper;
|
| "RAPTOR: Recursive Abstractive Processing for Tree-Organized
| Retrieval"
|
| https://arxiv.org/abs/2401.18059
| theolivenbaum wrote:
| We built a RAG system for one of our clients in the aviation
| industry: >20m technical support messages and associated
| answers / documentation, and we're seeing 60-80% recall for
| the top 3 documents when testing. It definitely pays off to
| use as much of the structure you find in the data as you
| can, and to combine multiple strategies (knowledge graph for
| structured data, text embeddings across data types,
| filtering and boosting based on experts' experience, etc.).
| The baseline pure RAG-only approach was under 25% recall.
| ofermend wrote:
| have you tried Vectara?
| adamcharnock wrote:
| I did a hobby RAG project a little while back, and I'll just
| share my experience here.
|
| 1. First ask the LLM to answer your questions without RAG. It is
| easy to do and you may be surprised (I was, but my data was semi-
| public). This also gives you a baseline to beat.
|
| 2. Chunking of your data needs to be smart. Just chunking
| every N characters wasn't especially fruitful. My data was a
| book, so it was hierarchical (by heading level). I would
| chunk by book section and hand it to the LLM.
|
| 3. Use the context window effectively. There is a weighted
| knapsack problem here: there are chunks of various sizes
| (chars/tokens) with various weightings (quality of match); a
| rough sketch of this follows at the end of this comment. If
| your data supports it, then the problem is also
| hierarchical. For example, I have 4 excellent matches in
| this chapter, so should I include each match, or should I
| include the whole chapter?
|
| 4. Quality of input data counts. I spent 30 minutes copy-pasting
| the entire book into markdown format.
|
| This was only a small project. I'd be interested to hear any
| other thoughts/tips.
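| Re: point 3, a rough greedy sketch of that packing step (a
| true weighted knapsack would need dynamic programming, but
| greedy by score-per-token is usually close enough here):
|
|     def pack_context(chunks, budget_tokens):
|         """chunks: list of (text, token_count, match_score)."""
|         # best score per token first
|         ranked = sorted(chunks, key=lambda c: c[2] / c[1],
|                         reverse=True)
|         picked, used = [], 0
|         for text, tokens, score in ranked:
|             if used + tokens <= budget_tokens:
|                 picked.append(text)
|                 used += tokens
|         return picked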
| nindalf wrote:
| It seems like you put a lot of thought and effort into this.
| Were you happy with the results?
|
| If you could put the entire book into the context window of
| Gemini (1M tokens), how do you think it would compare with your
| approach? It would cost like $15 to do this, so not cheap but
| also cost effective considering the time you've spent chunking
| it.
| s_tim wrote:
| But you would need to spend the $15 on every request whereas
| the RAG approach would be most likely significantly cheaper
| per request.
| adamcharnock wrote:
| At the time I was working with either GPT 3.5 or 4, and -
| looking at my old code - I was limiting myself to around 14k
| tokens.
|
| > Were you happy with the results?
|
| It was somewhat OK. I was testing the system on a
| (copyrighted) psychiatry textbook and getting feedback on
| the output from a psychotherapist. The idea was to provide a
| tool to help therapists prep for a session, rather than help
| patients directly.
|
| As usual it was somewhat helpful but a little too vague
| sometimes, or missed important specific information for the
| situation.
|
| It is possible that it could be improved with a larger
| context window, having more data to select, or different
| prompting. But the frequent response was along the lines of,
| "this is good advice, but it just doesn't drill down enough".
|
| Ultimately we found that GPT-3.5/4 could produce responses
| that matched or exceeded our RAG-based solution. This was
| surprising as it is quite domain specific, but it also
| seemed pretty clear that GPT must have been trained on data
| very similar to the (copyrighted) content we were using.
|
| Further steps would be:
|
| 1. Use other LLM models. Is it just GPT-3.5/4 that is
| reluctant to drill down?
|
| 2. Use specifically trained LLMs (/LoRA) based on the
| expected response style.
|
| I'd be careful of entering this kind of arms race. It seems
| to be a fight against mediocre results, and at any moment
| OpenAI et al. may release a new model that eats your lunch.
| CuriouslyC wrote:
| The thing you have to keep in mind is that OpenAI (and all
| the big tech companies) are risk averse. That's the reason
| that model alignment is so overbearing. The foundation
| models are also trying to be good at everything, which
| means they won't be sensitive to the nuances of specific
| types of questions.
|
| RAG and chatbot memory systems are here to stay, and they
| will always provide a benefit.
| WhitneyLand wrote:
| Did you try or consider fine-tuning GPT-3.5 as a
| complementary approach?
| kjqgqkejbfefn wrote:
| >tree-based approach to organize and summarize text data,
| capturing both high-level and low-level details.
|
| https://twitter.com/parthsarthi03/status/1753199233241674040
|
| nlm-ingestor processes documents, organizing content and
| improving readability by handling sections, paragraphs,
| links, tables, lists, and page continuations, removing
| redundancies and watermarks, and applying OCR, with
| additional support for HTML and other formats through Apache
| Tika:
|
| https://github.com/nlmatics/nlm-ingestor
| osigurdson wrote:
| I don't understand. Why build up text chunks from different,
| non-contiguous sections?
| infecto wrote:
| If those non-contiguous sections share similar
| semantic/other meaning, it can make sense from a search
| perspective to group them?
| bageler wrote:
| it starts to look like a graph problem
| osigurdson wrote:
| >> Just chunking every N characters wasn't especially fruitful
|
| Is there any science associated with creating effective
| embedding sets? For a book, you could do every sentence, every
| paragraph, every page or every chapter (or all of these
| options). Eventually people will want to just point their RAG
| system at data and everything works.
| CuriouslyC wrote:
| The easy answer is to just use a model to chunk your data
| for you. Phi-2 can chunk and annotate with pre/post summary
| context in one pass, and it's pretty fast/cheap.
|
| There is an optimal chunk size, which IIRC is ~512 tokens
| depending on some factors. You could hierarchically model
| your data with embeddings by chunking the data, then
| generating summaries of those chunks and chunking the
| summaries, and repeating that process ad nauseam until you
| only have a small number of top-level chunks.
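| A sketch of that hierarchical pass; `chunk`, `summarize`, and
| `embed` are hypothetical stand-ins for whatever chunker,
| summarization model, and embedding model you actually use:
|
|     def build_hierarchy(text, chunk, summarize, embed,
|                         max_top=10):
|         """Chunk -> summarize -> re-chunk the summaries."""
|         levels, chunks = [], chunk(text)
|         while len(chunks) > max_top:
|             levels.append([(c, embed(c)) for c in chunks])
|             summaries = [summarize(c) for c in chunks]
|             chunks = chunk("\n".join(summaries))  # next level up
|         levels.append([(c, embed(c)) for c in chunks])
|         return levels  # index every level; search any of them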
| drittich wrote:
| How does this work when there is a limited context window?
| Do you do some pre-chunking?
| CuriouslyC wrote:
| Phi can ingest 2k tokens and the optimal chunk size is
| between 512-1024 depending on the model/application, so
| you just give it a big chunk and tell it to break it down
| into smaller chunks that are semantically related,
| leaving enough room for book-end sentences to enrich the
| context of the chunk. Then you start the next big chunk
| with the remnants of the previous one that the model
| couldn't group.
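| A rough sketch of that loop; `llm_split` is a hypothetical
| call that asks the model to return coherent chunks plus the
| tail it couldn't group (word counts approximate tokens):
|
|     def rolling_chunk(text, llm_split, window=2000):
|         words, chunks, carry, pos = text.split(), [], [], 0
|         while pos < len(words):
|             take = max(window - len(carry), 1)
|             batch = carry + words[pos:pos + take]
|             pos += take
|             # returns (list of coherent chunks, leftover words)
|             grouped, carry = llm_split(" ".join(batch))
|             chunks.extend(grouped)
|         if carry:
|             chunks.append(" ".join(carry))
|         return chunks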
| drittich wrote:
| Isn't "give it a big chunk" just the same problem at a
| higher level? How do you handle, say, a book?
| CuriouslyC wrote:
| You don't need to handle a whole book, the goal is to
| chunk the book into chunks of the correct size, which is
| less than the context size of the model you're using to
| chunk it semantically. When you're ingesting data, you
| fill up the chunker model's context, and it breaks that
| up into smaller, self-relevant chunks and a remainder.
| You then start from the remainder and slurp up as much
| additional text as you can to fill the context and repeat
| the process.
| kordlessagain wrote:
| This is an example of knowledge transfer from a model. I
| used a similar approach to augment chunked texts with
| questions, summaries, and keyterms (which require
| structured output from the LLM). I haven't tried using a
| smaller model to do this as GPT3.5 is fast and cheap
| enough, but I like the idea of running a model in house to
| do things like this.
| CuriouslyC wrote:
| Pro tip: you can use small models like phi-2 to do semantic
| chunking rather than just chunking based on size, and
| chunking works even better if you book-end chunks with
| summaries of the text coming before/after to enrich the
| context.
|
| Another pro tip: you can again use a small model to
| summarize/extract RAG content to get more actual data into
| the context.
| hack_edu wrote:
| Could you share a bit more about semantic chunking with Phi?
| Any recommendations/examples of prompts?
| CuriouslyC wrote:
| Sure, it'll look something like this:
|
| """
| Task: Divide the provided text into semantically coherent
| chunks, each containing between 250-350 words. Aim to
| preserve logical and thematic continuity within each chunk,
| ensuring that sentences or ideas that belong together are
| not split across different chunks.
|
| Guidelines:
| 1. Identify natural text breaks such as paragraph ends or
| section divides to initiate new chunks.
| 2. Estimate the word count as you include content in a
| chunk. Begin a new chunk when you reach approximately 250
| words, preferring to end on a natural break close to this
| count, without exceeding 350 words.
| 3. In cases where text does not neatly fit within these
| constraints, prioritize maintaining the integrity of ideas
| and sentences over strict adherence to word limits.
| 4. Adjust the boundaries iteratively, refining your initial
| segmentation based on semantic coherence and word count
| guidelines.
|
| Your primary goal is to minimize disruption to the logical
| flow of content across chunks, even if slight deviations
| from the word count range are necessary to achieve this.
| """
| eurekin wrote:
| Is phi actually able to follow those instructions? How do
| you handle errors?
| CuriouslyC wrote:
| Whether or not it follows the instructions as written, it
| produces good output as long as the chunk size stays on the
| smaller side. You can easily validate that all the original
| text is present in the chunks and that no additional text
| has been inserted, and automatically re-prompt if it isn't.
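| A sketch of that check, with a whitespace-normalized
| comparison and a hypothetical re-promptable `chunk_with_llm`
| call:
|
|     import re
|
|     def normalize(s):
|         return re.sub(r"\s+", " ", s).strip()
|
|     def chunk_with_retry(text, chunk_with_llm, max_tries=3):
|         for _ in range(max_tries):
|             chunks = chunk_with_llm(text)
|             # joined chunks must reproduce the input exactly:
|             # nothing dropped, nothing invented
|             if normalize(" ".join(chunks)) == normalize(text):
|                 return chunks
|         raise ValueError("fall back to size-based chunking")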
| c0brac0bra wrote:
| Any comments about using Sparse Priming Representations
| for achieving similar things?
| CuriouslyC wrote:
| That looks like it'd be an adjunct strategy IMO. In most
| cases you want to have the original source material on
| tap, it helps with explainability and citations.
|
| That being said, it seems that everyone working at the
| state of the art is thinking about using LLMs to
| summarize chunks, and summarize groups of chunks in a
| hierarchical manner. RAPTOR
| (https://arxiv.org/html/2401.18059v1) was just published
| and is close to SoTA, and from a quick read I can already
| think of several directions to improve it, and that's not
| to brag but more to say how fertile the field is.
| WhitneyLand wrote:
| Wild speculation - do you think there could be any
| benefit from creating two sets of chunks with one set at
| a different offset from the first? So like, the boundary
| between chunks in the first set would be near the middle
| of a chunk in the second set?
| CuriouslyC wrote:
| No, it's better to just create summaries of all the
| chunks, and return summaries of chunks that are adjacent
| to chunks that are being retrieved. That gives you edge
| context without the duplication. Having 50% duplicated
| chunks is just going to burn context, or force you to do
| more pre-processing of your context.
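| A sketch of that retrieval step, assuming chunks are kept in
| document order with one precomputed summary per chunk:
|
|     def with_edge_context(hits, chunks, summaries):
|         """Full text for hits, summaries for neighbours."""
|         out = []
|         for i in sorted(set(hits)):
|             if i > 0:
|                 out.append(("summary", summaries[i - 1]))
|             out.append(("chunk", chunks[i]))
|             if i + 1 < len(chunks):
|                 out.append(("summary", summaries[i + 1]))
|         return out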
| politelemon wrote:
| This just isn't working for me, phi-2 starts summarizing
| the document I'm giving it. I tried a few news articles
| and blog posts. Does using a GGUF version make a
| difference?
| samier-trellis wrote:
| Can you speak a bit to the speed/slowness of doing such
| chunking? We recently started using LLMs to clean the text
| (data quality/text cleanliness is a problem for us), and it
| has increased our indexing time a lot.
| CuriouslyC wrote:
| It's going to depend on what you're running on, but phi2 is
| pretty fast so you can reasonably expect to be hitting ~50
| tokens a second. Given that, if you are ingesting a 100k
| token document you can expect it to take 30-40 minutes if
| done serially, and you can of course spread stuff in
| parallel.
| samier-trellis wrote:
| thanks for the info--good to know we aren't the only ones
| contending with speed for large documents lol
| valstu wrote:
| I assume you need to split the data into suitably sized
| database rows matching your model's max length? Or does it
| do some chunking magic automatically?
| chuckhend wrote:
| There is no chunking built into the Postgres extension yet,
| but we are working on it.
|
| It does check the context length of the request against the
| limits of the chat model before sending the request, and it
| optionally allows you to auto-trim the least relevant
| documents out of the request so that it fits the model's
| context window. IMO it's worth spending time getting chunks
| prepared, sized, and tuned for your use case, though. There
| are some good conversations above discussing methods for
| this, such as using a summarization model to create the
| chunks.
| nextaccountic wrote:
| Is this only for LLMs?
| DSingularity wrote:
| No. Anything with a vector representation is compatible.
| nextaccountic wrote:
| Ok, so I don't understand what's the difference between
| pg_vectorize and pgvector
|
| I see that pg_vectorize uses pgvector under the hood so it
| does.. more things?
| infecto wrote:
| High-level API that hand waves the embedding and LLM
| search/query implementation.
| nextaccountic wrote:
| > and LLM search/query implementation.
|
| So part of pg_vectorize is specific to LLMs?
| chuckhend wrote:
| There's an API that abstracts vector search only:
| vectorize.search(). That part is not unique to LLMs, but it
| does require selection of an embedding model. Some people
| have called embedding models LLMs.
|
| vectorize.rag() requires selection of a chat completion
| model. That's more specific to LLMs than vector search, IMO.
| chuckhend wrote:
| pg_vectorize is a wrapper around pgvector. In addition to
| what pgvector provides, vectorize provides hooks into many
| methods to generate your embeddings, implements several
| methods for keeping embeddings updated as your data grows
| or changes, etc. It also handles the transformation of your
| search query for you.
|
| For example, it creates the index for you, creates a cron
| job to keep embeddings updated (or triggers if that's what
| you prefer), and handles inserts/upserts as new data hits
| the table or existing data is updated. When you search for
| "products for mobile electronic devices", that query needs
| to be transformed into embeddings, then the vector
| similarity search needs to happen -- this is what the
| project abstracts.
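| From the application side, that whole pipeline ends up as a
| single SQL call. A sketch with psycopg; the argument names
| follow the project README and may differ in your version:
|
|     import psycopg
|
|     with psycopg.connect("postgresql://localhost/postgres") as conn:
|         rows = conn.execute(
|             """
|             SELECT * FROM vectorize.search(
|                 job_name       => 'product_search',
|                 query          => 'products for mobile electronic devices',
|                 return_columns => ARRAY['product_id', 'product_name'],
|                 num_results    => 3
|             );
|             """
|         ).fetchall()
|         # the extension embeds the query string and runs the
|         # similarity search; no client-side embedding needed
|         print(rows)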
| patresh wrote:
| The high-level API seems very smooth for quickly iterating
| on RAG experiments. It seems great for prototyping; however,
| I have doubts about whether it's a good idea to hide the LLM
| calling logic in a DB extension.
|
| Error handling when you get rate limited, the token has
| expired, or the request exceeds the token limit would be
| problematic, and from a security point of view it requires
| your DB to call OpenAI directly, which can also be risky.
|
| Personally I haven't used that many Postgres extensions, so
| perhaps these risks are mitigated in some way I don't know
| about?
| infecto wrote:
| I would agree with you. Similar to LangChain in my mind:
| some interesting ideas, but it's a lot more powerful to
| implement things on your own. I would much rather use
| pgvector directly.
| throwaway77384 wrote:
| What is RAG in this context? I only know it as red, amber,
| green...
| vmfunction wrote:
| Given the context of vector DBs, it would be Retrieval-
| Augmented Generation (RAG) [1].
|
| 1: https://medium.com/@mutahar789/optimizing-rag-a-guide-to-
| cho...
| throwaway77384 wrote:
| Thanks for the information :)
| armanboyaci wrote:
| I found this explanation:
| https://www.promptingguide.ai/techniques/rag
|
| > General-purpose language models can be fine-tuned to achieve
| several common tasks such as sentiment analysis and named
| entity recognition. These tasks generally don't require
| additional background knowledge.
|
| > For more complex and knowledge-intensive tasks, it's possible
| to build a language model-based system that accesses external
| knowledge sources to complete tasks. This enables more factual
| consistency, improves reliability of the generated responses,
| and helps to mitigate the problem of "hallucination".
|
| > Meta AI researchers introduced a method called Retrieval
| Augmented Generation (RAG) to address such knowledge-intensive
| tasks. RAG combines an information retrieval component with a
| text generator model. RAG can be fine-tuned and its internal
| knowledge can be modified in an efficient manner and without
| needing retraining of the entire model.
| throwaway77384 wrote:
| Thank you!
| mattkevan wrote:
| Funny this has come up - I've literally just finished building a
| RAG search with Postgres this morning.
|
| I run a directory site for UX articles and tools, which I've
| recently rebuilt in Django. It has links to over 5000 articles,
| making it hard to browse, so I thought it'd be fun to use RAG
| with citations to create a knowledge search tool.
|
| The site fetches new articles via RSS, which are chunked,
| embedded and added to the vector store. On conducting a search,
| the site returns a summary as well as links to the original
| articles. I'm using LlamaIndex, OpenAI and Supabase.
|
| It's taken a little while to figure out as I really didn't know
| what I was doing and there's loads of improvements to make, but
| you can try it out here: https://www.uxlift.org/search/
|
| I'd love to hear what you think.
| rjrodger wrote:
| Any chance you'd Open Source it?
| mattkevan wrote:
| Yes, I'm planning to write an article showing the code,
| mostly so I don't forget how I did it but hopefully it'll
| also be useful to others.
|
| I relied on a lot of articles and code examples when putting
| it together so I'm happy to do the same.
| WesleyJohnson wrote:
| Looking forward to this. We're running a Django codebase
| and would like to pull in Confluence articles and run GPT
| queries against them.
| gcanyon wrote:
| This is awesome! I just tried it out, and I have a few bits of
| feedback:
|
| Show the question I asked. I asked something like "What are the
| latest trends in UX for mobile?" and got a good answer that
| started with "The latest trends in UX design for mobile
| include..." but that's not the same as just listing the exact
| thing I asked.
|
| Either show a chat log or stuff it into the browser history
| or...something. There's no way to see the answer to the
| question I asked before this one.
|
| I just went back and did another search, and the results had 6
| links, while the response (only) had cites labeled [7] [6] and
| [8] in that order.
|
| Again, great stuff!
| mattkevan wrote:
| Hey thanks for trying it out and giving your feedback, much
| appreciated!
|
| Good idea about showing the question, I've just added that to
| the search result.
|
| Yes, I agree a chat history is really needed - it's on the
| list, along with a chat-style interface.
|
| Also need to figure out what's going on with the citations in
| the text - the numbers seem pretty random and don't link up
| to the cited articles.
|
| Thanks again!
| netdur wrote:
| Could you share more about your strategy or the approach you
| took for chunking the articles? I'm curious about the criteria
| or methods you used to decide the boundaries of each chunk and
| how you ensured the chunks remained meaningful for the search
| functionality. Thanks!
| mattkevan wrote:
| As this was my first attempt, I decided to take a pretty
| basic approach, see what the results were like and optimise
| it later.
|
| Content is stored in Django as posts, so I wrote a custom
| document reader that created a new LlamaIndex document for
| each post, attaching the post id, title, link and published
| date as metadata. This gave better results than just loading
| in all the content as a text or CSV file, which I tried
| first.
|
| I did try a bunch of different techniques to split the
| chunks, including by sentence count and by larger and
| smaller numbers of tokens. In the end I decided to leave it
| to the LlamaIndex default just to get it working.
| drittich wrote:
| I wrote a C# library to do this, which is similar to other
| chunking approaches that are common, like the way langchain
| does it: https://github.com/drittich/SemanticSlicer
|
| Given a list of separators (regexes), it goes through them in
| order and keeps splitting the text by them until the chunk
| fits within the desired size. By putting the higher level
| separators first (e.g., for HTML split by <h1> before <h2>),
| it's a pretty good proxy for maintaining context.
|
| Which chunk size you decide on largely depends on your data,
| so I typically eyeball a sample of the results to determine
| if the splitting is satisfactory.
|
| You can see the separators here: https://github.com/drittich/
| SemanticSlicer/blob/main/Semanti...
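| The same idea is easy to sketch in Python (a loose port of
| the approach, not the C# library's actual code):
|
|     import re
|
|     # higher-level separators first: headings, blank lines,
|     # sentence ends, then any whitespace
|     SEPS = [r"\n<h1[^>]*>", r"\n<h2[^>]*>", r"\n\n",
|             r"(?<=[.!?])\s", r"\s"]
|
|     def slice_text(text, max_len=1000, seps=SEPS):
|         if len(text) <= max_len or not seps:
|             return [text]
|         out = []
|         for part in re.split(seps[0], text):
|             if len(part) <= max_len:
|                 out.append(part)
|             else:
|                 # not small enough yet; try the next separator
|                 out.extend(slice_text(part, max_len, seps[1:]))
|         return out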
| nico wrote:
| Has anyone used SQLite for storing embeddings? Are there any
| extensions or tips for making it easier?
|
| I have a small command-line Python app that uses SQLite for
| a DB. Postgres would be huge overkill for the app.
|
| PS: is sqlite-vss good? https://github.com/asg017/sqlite-vss
| berkes wrote:
| Personally, I'd use Meilisearch in your case (though, obv, I
| know next to nothing about all the requirements and details
| of your case).
|
| Stuffing everything in PG makes sense, because PG is Good
| Enough in most situations one throws at it, but hardly "very
| good". It's just "the best" because it is already there [1].
| PG truly is "very good", maybe even "the best", at
| relational data. But for all other cases there are better
| alternatives out there.
|
| Meilisearch is small. It's not embeddable (last time I
| looked), so it does run as a separate service, but it does
| so with minimal effort and overhead. And aside from being
| blazingly fast at searching (often faster than
| Elasticsearch; definitely much easier and lighter), it has
| stellar vector storage.
|
| [1] I wrote e.g. github.com/berkes/postgres_key_value, a
| Ruby (so, slow) hashtable-like interface that uses existing
| Postgres tech, for a situation where Redis or Memcached
| would've been a much better candidate but where we already
| had Postgres and, with this library, could postpone the
| introduction, complexity and management of a Redis server.
|
| Edit: forgot this footnote.
| ComputerGuru wrote:
| Meilisearch is embeddable as of recently, but you have to do
| some legwork since it's not the intended use case.
| ComputerGuru wrote:
| We do, from years ago, with embeddings stored as blobs (well
| before sqlite-vss). No SQLite extension, but some user-
| defined functions for some distance operations; for others
| we are able to load all weights into memory and perform the
| ops there (the data is small enough).
|
| Another comment mentioned meilisearch. You might be interested
| in the fact that a recent rearchitecture of their codebase
| split it into a library that can be embedded into your app
| without needing a separate process.
| wiredfool wrote:
| Duckdb seems pretty reasonable for embeddings, as there's a
| native vector type and you can insert it from a dataframe. For
| similarity, there's array_cosine and array_dot_product.
| swalsh wrote:
| Sorry if I'm completely missing it -- I noticed in the code
| there is something around chat:
|
| https://github.com/tembo-io/pg_vectorize/blob/main/src/chat....
|
| This would lead me to believe there is some way to actually use
| SQL for not just embeddings, but also prompting/querying the
| LLM... which would be crazy powerful. Are there any examples on
| how to do this?
| chuckhend wrote:
| There is a RAG example here https://github.com/tembo-
| io/pg_vectorize?tab=readme-ov-file#...
|
| You can provide your own prompts by adding them to the
| `vectorize.prompts` table. There's an API for this in the
| works. It is poorly documented at the moment.
| falling_myshkin wrote:
| There have been a lot of these RAG abstractions posted
| recently. As someone working on this problem, it's unclear
| to me whether the calculation and ingestion of embeddings
| from source data should be abstracted into the same software
| package as their search and retrieval. I guess it probably
| depends on the complexity of the problem. This one does seem
| interesting, in that it makes intuitive sense to have a
| built-in DB extension if the source data itself is coming
| from the same place the embeddings are going. But so far I
| have preferred a separation of concerns in this respect, as
| it seems that in some cases the models will be used to
| compute embeddings outside the db context (for example, the
| user search query needs to get vectorized -- why not have
| the frontend and the backend query the same embedding
| service?). Anyone else have thoughts on this?
| chuckhend wrote:
| It's certainly up for debate and there is a lot of nuance. I
| think it can simplify the system's architecture quite a bit
| if all the consumers of the data do not need to keep track
| of which transformer model to use. After all, once the
| embeddings are first derived from the source data, any
| subsequent search query will need to use the same
| transformer model that created the embeddings in the first
| place.
|
| I think the same problem exists with classical/supervised
| machine learning. Most models' features went through some
| sort of transformation, and when it's time to call the model
| for inference those same transformations will need to happen
| again.
| pdabbadabba wrote:
| There's a fair amount of skepticism towards the efficacy of RAG
| in these comments--often in contrast to simply using a model with
| a huge context window to analyze the corpus in one giant chunk.
| But that will not be a viable alternative in all use cases.
|
| For example, one might need to analyze/search a very large corpus
| composed of many documents which, as a whole, is very unlikely to
| fit within any realistic context window. Or one might be
| constrained to only use local models and may not have access to
| models with these huge windows. Or both!
|
| In cases like these, can anyone recommend a more promising
| approach than RAG?
| make3 wrote:
| RAG is also so much faster, cheaper, and more energy
| efficient than running a giant transformer over every single
| document; it's not remotely close.
| Atotalnoob wrote:
| RAG.
|
| Large context windows can cause LLMs to get confused or to
| effectively use only part of the window. I think Google's
| new 1M context window is promising for its recall but needs
| more testing to be certain.
|
| Other LLMs have shown reduced performance on larger
| contexts.
|
| Additionally, the LLM might discard instructions and
| hallucinate.
|
| Also, if you are concatenating user input, you are opening
| yourself up to prompt injection.
|
| The larger the context, the larger the prompt injection can be.
|
| There is also the cost. Why load up 1M tokens and pay for a
| massive request when you can load a fraction that is actually
| relevant?
|
| So regardless of whether Google's 1M context is perfect and
| others can match it, I would steer away from just throwing
| everything into a massive context.
|
| To me, it's easier to mentally think of this as SQL query
| performance. Is a full table scan better or worse than using an
| index? Even if you can do a full table scan in a reasonable
| amount of time, why bother? Just do things the right way...
| softwaredoug wrote:
| I think people assume RAG will just be a vector search problem.
| You take the user's input and somehow get relevant context.
|
| It's really hard to coordinate between LLMs, a vector store,
| chunking the embeddings, turning users' chats into query
| embeddings -- and other queries -- etc. It's a complicated
| search relevance problem that's extremely multifaceted and
| use-case and domain specific. Just doing search well from a
| search bar, without all this complexity, is hard enough. And
| vector search is just one data store you might use alongside
| others (keyword search, SQL, whatever else).
|
| I say this to get across that Postgres is actually uniquely
| situated to bring a multifaceted approach that's not just
| about vector storage / retrieval.
| ofermend wrote:
| Totally agree - the "R" in RAG is about retrieval which is a
| complex problem and much more than just similarity between
| embedding vectors.
| kevinqualters wrote:
| Not sure if postgres is uniquely situated, elasticsearch can do
| everything you mentioned and much more.
| BeetleB wrote:
| Yes ... and no.
|
| You're right for many applications. Yet, for many other
| applications, simply converting each document into an
| embedding, converting the search string into an embedding, and
| doing a search (e.g. cosine similarity), is all that is needed.
|
| My first attempt at RAG was just that, and I was blown away at
| how effective it was. It worked primarily because each document
| was short enough (a few lines of text) that you didn't lose
| much by making an embedding for the whole document.
|
| My next attempt failed, for the reasons you mention.
|
| Point being: Don't be afraid to give the simple/trivial
| approach a try.
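| That trivial version really is only a few lines, e.g. with
| sentence-transformers and numpy (the model name here is an
| arbitrary choice):
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     docs = ["reset a user's password",
|             "rotate the API keys",
|             "export the monthly billing report"]
|     doc_vecs = model.encode(docs, normalize_embeddings=True)
|
|     query = "how do I change my password?"
|     q_vec = model.encode([query], normalize_embeddings=True)[0]
|     scores = doc_vecs @ q_vec        # cosine sim (unit vectors)
|     print(docs[int(np.argmax(scores))])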
| ravenstine wrote:
| Is RAG just a fancy term for sticking an LLM in front of a search
| engine?
| chuckhend wrote:
| To oversimplify, it's more like sending a block of text to
| an LLM and asking it to answer a query based on that block
| of text.
| politelemon wrote:
| I'll also attempt an answer.
|
| You chop up your documents into chunks. You create fancy
| numbers for those chunks. You take the user's question and find
| the chunks that kind of match it. You pass the user's question
| with the document to an LLM and tell it to produce a nice
| looking answer.
|
| So it's a fancy way of getting an LLM to produce a natural
| looking answer over your chunky choppy search.
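| The last step is mostly prompt assembly. A skeletal version,
| with hypothetical `search_chunks` and `complete` helpers
| standing in for the retrieval and the LLM call:
|
|     def answer(question, search_chunks, complete, k=3):
|         # best-matching chunks from the vector search
|         context = "\n\n".join(search_chunks(question, k))
|         prompt = ("Answer the question using only the context "
|                   "below.\n\nContext:\n" + context +
|                   "\n\nQuestion: " + question + "\nAnswer:")
|         return complete(prompt)  # LLM writes the final answer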
| cybereporter wrote:
| How does this compare to other vector search solutions (LanceDB,
| Chroma, etc.)? Curious to know which one I should choose.
| gijoegg wrote:
| How should developers think about using this extension versus
| PostgresML's pgml extension?
| chuckhend wrote:
| One difference between the two projects is that pg_vectorize
| does not run the embedding or chat models on the same host
| as Postgres; rather, they are always separate. The extension
| makes HTTP requests to those models, and provides background
| workers that help with that orchestration. Last I checked,
| the PostgresML extension interoperates with a Python runtime
| on the same host as Postgres.
|
| PostgresML has a bunch of features for supervised machine
| learning and MLOps... and pg_vectorize does not do any of
| that. E.g. you cannot train an XGBoost model with
| pg_vectorize, but PostgresML does a great job with that.
| PGML is a great extension; there is some overlap, but they
| are architecturally very different projects.
| samaysharma wrote:
| A few relevant blog posts on using pg_vectorize:
|
| * Doing vector search with just 2 commands
| https://tembo.io/blog/introducing-pg_vectorize
|
| * Connecting Postgres to any huggingface sentence transformer
| https://tembo.io/blog/sentence-transformers
|
| * Building a question answer chatbot natively on Postgres
| https://tembo.io/blog/tembo-rag-stack
| politelemon wrote:
| I don't know how to articulate the uncomfortable feeling I'd
| have about something 'inside' the database doing the
| download and making requests to other systems outside a
| boundary. It might be a security threat or just my
| inexperience -- how common is it for Postgres extensions to
| do this?
___________________________________________________________________
(page generated 2024-03-06 23:01 UTC)