[HN Gopher] Pg_vectorize: Vector search and RAG on Postgres
       ___________________________________________________________________
        
       Pg_vectorize: Vector search and RAG on Postgres
        
       Author : samaysharma
       Score  : 269 points
       Date   : 2024-03-06 08:34 UTC (14 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | rgrieselhuber wrote:
       | Tembo has been doing very interesting work.
        
         | KingOfCoders wrote:
         | Indeed! Everyone who pushes Postgres is a win in my book.
        
           | DSingularity wrote:
           | Why?
        
             | isoprophlex wrote:
             | Because it empowers developers instead of salespeople?
        
       | chaps wrote:
       | Neat! Any notable gotchas we should know about?
        
         | chuckhend wrote:
         | RAG can cost a lot of money if not done thoughtfully. Most
         | embedding and chat completion model providers charge by the
         | token (think number of words in the request). You'll pay to
          | have the data in the database transformed into embeddings;
          | that is mostly a one-time fixed cost. Then every time there is
          | a search query in RAG, that question needs to be transformed
          | too. The chat completion model (like GPT-4) will charge for
          | the number of tokens in the request plus the number of tokens
          | in the response.
         | 
         | Self-hosting can be a big advantage for cost control, but it
         | can be complicated too. Tembo.io's managed service provides
         | privately hosted embedding models, but does not have hosted
         | chat completion models yet.
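          | 
          | As a rough back-of-envelope sketch of that cost model (the
          | prices below are made-up placeholders, not real rates -- check
          | your provider's pricing page):
          | 
          |     EMBED_PER_1K = 0.0001   # $/1K tokens, embeddings (assumed)
          |     CHAT_IN_PER_1K = 0.03   # $/1K prompt tokens (assumed)
          |     CHAT_OUT_PER_1K = 0.06  # $/1K completion tokens (assumed)
          | 
          |     # one-time cost: embed the whole corpus
          |     corpus_tokens = 5_000_000
          |     one_time = corpus_tokens / 1000 * EMBED_PER_1K
          | 
          |     # per query: embed the question, then pay for the
          |     # prompt and the completion
          |     q_tok, prompt_tok, out_tok = 50, 3000, 500
          |     per_query = (q_tok / 1000 * EMBED_PER_1K
          |                  + prompt_tok / 1000 * CHAT_IN_PER_1K
          |                  + out_tok / 1000 * CHAT_OUT_PER_1K)
          | 
          |     print(f"embed corpus once: ${one_time:.2f}")
          |     print(f"per RAG query:     ${per_query:.4f}")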
        
           | nostrebored wrote:
           | Do you work for Tembo?
           | 
            | I only ask because making consumption-based charges for
            | embedding models seem like the norm feels off to me -- there
            | are a ton of open-source, self-managed embedding models that
            | you can use.
           | 
           | You can spin up a tool like Marqo and get a platform that
           | handles making the embedding calls and chunking strategy as
           | well.
        
       | jonplackett wrote:
        | I'm using pgvector in Supabase and it seems great in a prototype
       | form.
       | 
        | Has anyone tried using it at scale? How does it do vs Pinecone /
        | Cloudflare's vector search?
        
         | carlossouza wrote:
          | pgvector is great! I'm using it at scale here:
         | 
         | https://trendingpapers.com/
         | 
          | But it's Postgres + pgvector only, no Supabase
        
         | kiwicopple wrote:
         | fwiw, we have plenty of companies using pgvector at scale, in
         | production - some of them migrating their workload directly
         | from pinecone. I'll see if I can dig up some of the high-level
         | numbers later today
        
           | philippemnoel wrote:
            | I read this paper claiming that there are no fundamental
            | limitations to Postgres' vector support down the line. Pretty
           | excited for the work Supabase, Neon, Tembo, ourselves at
           | ParadeDB and many others are doing to keep pushing it
           | forward.
           | 
           | Definitely more and more vector production workloads coming
           | to Postgres
           | 
           | Paper: https://www.cs.purdue.edu/homes/csjgwang/pubs/ICDE24_V
           | ecDB.p...
        
         | falling_myshkin wrote:
         | There's an issue in the pgvector repo about someone having
          | several ~10-20 million row tables and getting acceptable
         | performance with the right hardware and some performance
         | tuning: https://github.com/pgvector/pgvector/issues/455
         | 
          | I'm in the early stages of evaluating pgvector myself, but
          | having used Pinecone, I currently prefer pgvector because it is
          | open source. The indexing algorithm is clear; one can
          | understand and modify the parameters. Furthermore, the database
          | is PostgreSQL, not a proprietary document store. When the other
          | data in the problem is stored relationally, it is very
          | convenient to have the vectors stored like this as well. And
          | PostgreSQL has good observability and
         | metrics. I think when it comes to flexibility for specialized
         | applications, pgvector seems like the clear winner. But I can
         | definitely see pinecone's appeal if vector search is not a core
         | component of the problem/business, as it is very easy to use
         | and scales very easily
        
         | whakim wrote:
         | I've used pgvector at scale (hundreds of millions of rows) and
         | while it _does_ work quite well, you'll hit some rough edges:
         | 
         | * Reliance on vacuum means that you really have to watch/tune
         | autovacuum or else recall will take a big hit, and the indices
         | are huge and hard to vacuum.
         | 
         | * No real support for doing ANN on a subset of embeddings (ie
         | filtering) aside from building partial indices or hoping that
         | oversampling plus the right runtime parameters get the job done
         | (very hard to reason about in practice), which doesn't fit many
         | workloads.
         | 
          | * Very long index build times make it hard to experiment with
         | HNSW hyperparameters. This might get better with parallel index
         | builds, but how much better is still TBD.
         | 
          | * Some minor annoyances around database API support in Python,
         | e.g. you'll usually have to use a DBAPI-level cursor if you're
         | working with numpy arrays on the application side.
         | 
         | That being said, overall the pgvector team are doing an awesome
         | job considering the limited resources they have.
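          | 
          | On the Python point, the usual workaround is to drop down to
          | the driver and register pgvector's type so numpy arrays are
          | adapted directly -- e.g. with psycopg 3 and the pgvector
          | Python package (table/column names here are illustrative):
          | 
          |     import numpy as np
          |     import psycopg
          |     from pgvector.psycopg import register_vector
          | 
          |     conn = psycopg.connect("dbname=mydb")
          |     register_vector(conn)  # adapt numpy arrays <-> vector
          | 
          |     query_vec = np.random.rand(1536).astype(np.float32)
          |     rows = conn.execute(
          |         "SELECT id, embedding <=> %s AS distance "
          |         "FROM items ORDER BY distance LIMIT 10",
          |         (query_vec,),
          |     ).fetchall()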
        
       | tosh wrote:
        | Are there any examples of when RAG powered by vector search
        | works really well?
       | 
        | I tried best practices like having the LLM formulate an answer
        | and using that answer for the search (instead of the question),
        | and trying different chunk sizes and so on, but never got it to
        | work in a way where I would consider the result "good".
       | 
       | Maybe it was because of the type of data or the capabilities of
       | the model at the time (GPT 3.5 and GPT 4)?
       | 
        | By now, the context windows of some models are large enough to
        | fit lots of context directly into the prompt, which is easier to
        | do and yields better results. It is far more costly, but cost is
        | going down fast, so I wonder what this means for RAG + vector
        | search going forward.
       | 
       | Where does it shine?
        
         | pid-1 wrote:
         | I had similar experiences. Can't understand all the hype around
         | RAG when the results are so bad.
        
           | c_ehlen wrote:
            | I'm using it for an internal application and the results
           | so far are amazing. Considering it was hacked together in a
           | few hours.
           | 
           | It helps a lot with discovery. We have some large PDFs and
           | also a large amount of smaller PDFs. Simply asking a
           | question, getting an answer with the exact location in the
            | PDF is really helpful.
        
             | qrios wrote:
              | From our experience, simple RAG is often not that helpful,
              | as the questions themselves are not represented in the
              | vector space (unless you use an FAQ dataset as input).
              | Either preprocessing by an LLM or specific context handling
              | needs to be done.
        
         | inductive_magic wrote:
         | We're getting very solid results.
         | 
          | Instead of performing RAG on the (vectorised) raw source texts,
         | we create representations of elements/"context clusters"
         | contained within the source, which are then vectorised and
         | ranked. That's all I can disclose, hope that helps.
        
           | Merik wrote:
           | Thanks for your message. I should say that giving your
           | comment to GPT-4, with a request for a solution architecture
           | that could produce good results based on the comment,
           | produced a very detailed, fascinating solution. https://chat.
           | openai.com/share/435a3855-bf02-4791-97b3-4531b8...
        
             | weird-eye-issue wrote:
             | A whole lot of noise
        
               | Merik wrote:
                | Maybe, but it expanded on the idea in the vague comment
                | and in doing so introduced me to the idea of embedding
                | each sentence, then clustering the sentences, then taking
                | the centroid of each cluster as the embedding to
                | index/search against. I had not thought of doing that
                | before.
        
             | isoprophlex wrote:
             | If only the thing could speak and summarize in plain
             | English instead of hollow, overly verbose bulleted lists.
        
           | falling_myshkin wrote:
           | After seeing raw source text performance, I agree that
           | representational learning of higher-level semantic "context
           | clusters" as you say seems like an interesting direction.
        
           | reerdna wrote:
           | Sounds a little like this recent paper;
           | 
           | "RAPTOR: Recursive Abstractive Processing for Tree-Organized
           | Retrieval"
           | 
           | https://arxiv.org/abs/2401.18059
        
         | theolivenbaum wrote:
         | We built a RAG system for one of our clients in the aviation
         | industry. >20m technical support messages and associated
         | answers / documentation, and we're seeing between 60-80% recall
          | for top 3 documents when testing. It definitely pays off to use
          | as much of the structure in the data as you can, and to combine
          | multiple strategies (knowledge graph for structured data, text
          | embeddings across data types, filtering and boosting based on
          | expert experience, etc.). The baseline pure RAG-only approach
          | was under 25% recall.
        
         | ofermend wrote:
         | have you tried Vectara?
        
       | adamcharnock wrote:
       | I did a hobby RAG project a little while back, and I'll just
       | share my experience here.
       | 
       | 1. First ask the LLM to answer your questions without RAG. It is
       | easy to do and you may be surprised (I was, but my data was semi-
       | public). This also gives you a baseline to beat.
       | 
        | 2. Chunking of your data needs to be smart. Just chunking every N
        | characters wasn't especially fruitful. My data was a book, so it
        | was hierarchical (by heading level). I would chunk by book
        | section and hand it to the LLM.
        | 
        | 3. Use the context window effectively. There is a weighted
        | knapsack problem here: there are chunks of various sizes
        | (chars/tokens) with various weightings (quality of match). If
        | your data supports it, then the problem is also hierarchical. For
        | example, I have 4 excellent matches in this chapter, so should I
        | include each match, or should I include the whole chapter? (A
        | rough greedy sketch is at the end of this comment.)
       | 
       | 4. Quality of input data counts. I spent 30 minutes copy-pasting
       | the entire book into markdown format.
       | 
       | This was only a small project. I'd be interested to hear any
       | other thoughts/tips.
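        | 
        | For point 3, a greedy heuristic is often good enough; a minimal
        | sketch (not a true knapsack solver):
        | 
        |     def pack_context(chunks, budget_tokens):
        |         """chunks: list of (tokens, score, text)."""
        |         picked, used = [], 0
        |         # highest score-per-token first: classic greedy
        |         for tokens, score, text in sorted(
        |                 chunks, key=lambda c: c[1] / c[0], reverse=True):
        |             if used + tokens <= budget_tokens:
        |                 picked.append(text)
        |                 used += tokens
        |         return picked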
        
         | nindalf wrote:
         | It seems like you put a lot of thought and effort into this.
         | Were you happy with the results?
         | 
         | If you could put the entire book into the context window of
         | Gemini (1M tokens), how do you think it would compare with your
         | approach? It would cost like $15 to do this, so not cheap but
         | also cost effective considering the time you've spent chunking
         | it.
        
           | s_tim wrote:
           | But you would need to spend the $15 on every request whereas
           | the RAG approach would be most likely significantly cheaper
           | per request.
        
           | adamcharnock wrote:
           | At the time I was working with either GPT 3.5 or 4, and -
           | looking at my old code - I was limiting myself to around 14k
           | tokens.
           | 
           | > Were you happy with the results?
           | 
           | It was somewhat ok. I was testing the system from a
            | (copyrighted) psychiatry textbook and getting feedback
           | on the output from a psychotherapist. The idea was to provide
           | a tool to help therapists prep for a session, rather than
           | help patients directly.
           | 
           | As usual it was somewhat helpful but a little too vague
            | sometimes, or missed important specific information for the
            | situation.
           | 
           | It is possible that it could be improved with a larger
           | context window, having more data to select, or different
           | prompting. But the frequent response was along the lines of,
           | "this is good advice, but it just doesn't drill down enough".
           | 
           | Ultimately we found that GPT3.5/4 could produce responses
           | that matched or exceeded our RAG-based solution. This was
            | surprising as it is quite domain specific, but it also seemed
            | pretty clear that GPT must have been trained on data very
            | similar to the (copyrighted) content we were using.
           | 
           | Further steps would be:
           | 
           | 1. Use other LLM models. Is it just GPT3.5/4 that is
           | reluctant to drill down?
           | 
            | 2. Use specifically trained LLMs (/LoRA) based on the
           | expected response style
           | 
           | I'd be careful of entering this kind of arms race. It seems
           | to be a fight against mediocre results, and at any moment
            | OpenAI et al. may release a new model that eats your lunch.
        
             | CuriouslyC wrote:
             | The thing you have to keep in mind is that OpenAI (and all
             | the big tech companies) are risk averse. That's the reason
             | that model alignment is so overbearing. The foundation
             | models are also trying to be good at everything, which
             | means they won't be sensitive to the nuances of specific
             | types of questions.
             | 
             | RAG and chatbot memory systems are here to stay, and they
             | will always provide a benefit.
        
             | WhitneyLand wrote:
              | Did you try or consider fine-tuning GPT-3.5 as a
             | complementary approach?
        
         | kjqgqkejbfefn wrote:
         | >tree-based approach to organize and summarize text data,
         | capturing both high-level and low-level details.
         | 
         | https://twitter.com/parthsarthi03/status/1753199233241674040
         | 
         | processes documents, organizing content and improving
         | readability by handling sections, paragraphs, links, tables,
         | lists, page continuations, and removing redundancies,
         | watermarks, and applying OCR, with additional support for HTML
         | and other formats through Apache Tika:
         | 
         | https://github.com/nlmatics/nlm-ingestor
        
           | osigurdson wrote:
           | I don't understand. Why build up text chunks from different,
           | non-contiguous sections?
        
             | infecto wrote:
             | If those non-contiguous sections share similar
             | semantic/other meaning, it can make sense from a search
             | perspective to group them?
        
               | bageler wrote:
               | it starts to look like a graph problem
        
         | osigurdson wrote:
         | >> Just chunking every N characters wasn't especially fruitful
         | 
          | Is there any science associated with creating effective
         | embedding sets? For a book, you could do every sentence, every
         | paragraph, every page or every chapter (or all of these
         | options). Eventually people will want to just point their RAG
          | system at data and have everything just work.
        
           | CuriouslyC wrote:
           | The easy answer is just use a model to chunk your data for
           | you. Phi-2 can chunk and annotate with pre/post summary
           | context in one pass, and it's pretty fast/cheap.
           | 
           | There is an optimal chunk size, which IIRC is ~512 tokens
           | depending on some factors. You could hierarchically model
           | your data with embeddings by chunking the data, then
           | generating summaries of those chunks and chunking the
            | summaries, and repeating that process ad nauseam until you
           | only have a small number of top level chunks.
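            | 
            | A minimal sketch of that hierarchical pass (summarize() is
            | whatever model call you use, e.g. phi-2; regroup() below is
            | a crude helper that packs texts into ~512-token batches):
            | 
            |     def regroup(texts, target_tokens=512):
            |         # rough 1 token ~= 0.75 words estimate
            |         groups, cur, cur_tok = [], [], 0
            |         for t in texts:
            |             tok = int(len(t.split()) / 0.75)
            |             if cur and cur_tok + tok > target_tokens:
            |                 groups.append(" ".join(cur))
            |                 cur, cur_tok = [], 0
            |             cur.append(t)
            |             cur_tok += tok
            |         if cur:
            |             groups.append(" ".join(cur))
            |         return groups
            | 
            |     def build_hierarchy(chunks, summarize, top_n=8):
            |         # levels[0] = raw chunks, levels[-1] = top of tree
            |         levels = [chunks]
            |         while len(levels[-1]) > top_n:
            |             summaries = [summarize(t) for t in levels[-1]]
            |             grouped = regroup(summaries)
            |             if len(grouped) >= len(levels[-1]):
            |                 break  # no shrinkage this round; stop
            |             levels.append(grouped)
            |         return levels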
        
             | drittich wrote:
                | How does this work when there is a limited context
                | window? Do you do some pre-chunking?
        
               | CuriouslyC wrote:
               | Phi can ingest 2k tokens and the optimal chunk size is
               | between 512-1024 depending on the model/application, so
               | you just give it a big chunk and tell it to break it down
               | into smaller chunks that are semantically related,
               | leaving enough room for book-end sentences to enrich the
               | context of the chunk. Then you start the next big chunk
               | with the remnants of the previous one that the model
               | couldn't group.
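                | 
                | Roughly like this, as a sketch (semantic_chunk()
                | stands in for the model call and is assumed to
                | return a list of chunks plus the leftover text):
                | 
                |     def ingest(text, semantic_chunk, window=2000):
                |         words = text.split()
                |         step = int(window * 0.75)  # tokens -> words
                |         chunks, carry = [], ""
                |         for i in range(0, len(words), step):
                |             block = " ".join(
                |                 [carry] + words[i:i + step]).strip()
                |             new, carry = semantic_chunk(block)
                |             chunks.extend(new)
                |         if carry:  # whatever was never grouped
                |             chunks.append(carry)
                |         return chunks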
        
               | drittich wrote:
               | Isn't "give it a big chunk" just the same problem at a
               | higher level? How do you handle, say, a book?
        
               | CuriouslyC wrote:
               | You don't need to handle a whole book, the goal is to
               | chunk the book into chunks of the correct size, which is
               | less than the context size of the model you're using to
               | chunk it semantically. When you're ingesting data, you
               | fill up the chunker model's context, and it breaks that
               | up into smaller, self relevant chunks and a remainder.
               | You then start from the remainder and slurp up as much
               | additional text as you can to fill the context and repeat
               | the process.
        
             | kordlessagain wrote:
             | This is an example of knowledge transfer from a model. I
             | used a similar approach to augment chunked texts with
             | questions, summaries, and keyterms (which require
             | structured output from the LLM). I haven't tried using a
             | smaller model to do this as GPT3.5 is fast and cheap
             | enough, but I like the idea of running a model in house to
             | do things like this.
        
         | CuriouslyC wrote:
         | Pro tip, you can use small models like phi-2 to do semantic
          | chunking rather than just chunking based on size, and chunking
         | works even better if you book-end chunks with summaries of text
         | coming before/after to enrich the context.
         | 
         | Another pro tip, you can again use a small model to
         | summarize/extract RAG content to get more actual data in the
         | context.
        
           | hack_edu wrote:
           | Could you share a bit more about semantic chunking with Phi?
           | Any recommendations/examples of prompts?
        
             | CuriouslyC wrote:
             | Sure, it'll look something like this:
             | 
             | """ Task: Divide the provided text into semantically
             | coherent chunks, each containing between 250-350 words. Aim
             | to preserve logical and thematic continuity within each
             | chunk, ensuring that sentences or ideas that belong
             | together are not split across different chunks.
             | 
             | Guidelines: 1. Identify natural text breaks such as
             | paragraph ends or section divides to initiate new chunks.
             | 2. Estimate the word count as you include content in a
             | chunk. Begin a new chunk when you reach approximately 250
             | words, preferring to end on a natural break close to this
             | count, without exceeding 350 words. 3. In cases where text
             | does not neatly fit within these constraints, prioritize
             | maintaining the integrity of ideas and sentences over
             | strict adherence to word limits. 4. Adjust the boundaries
             | iteratively, refining your initial segmentation based on
             | semantic coherence and word count guidelines.
             | 
             | Your primary goal is to minimize disruption to the logical
             | flow of content across chunks, even if slight deviations
             | from the word count range are necessary to achieve this.
             | """
        
               | eurekin wrote:
               | Is phi actually able to follow those instructions? How do
               | you handle errors?
        
               | CuriouslyC wrote:
               | Whether or not it follows the instructions as written, it
               | produces good output as long as the chunk size stays on
               | the smaller side. You can validate that all the original
               | text is present in the chunks and that no additional text
               | has been inserted easily enough and automatically re-
               | prompt.
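                | 
                | A strict version of that check (assumes the model only
                | splits text and never rewrites it):
                | 
                |     import re
                | 
                |     def norm(s):
                |         # collapse whitespace so layout doesn't count
                |         return re.sub(r"\s+", " ", s).strip()
                | 
                |     def faithful(original, chunks):
                |         # chunks, concatenated, must equal the input
                |         return norm(" ".join(chunks)) == norm(original)
                | 
                |     # if faithful(...) is False, re-prompt the model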
        
               | c0brac0bra wrote:
               | Any comments about using Sparse Priming Representations
               | for achieving similar things?
        
               | CuriouslyC wrote:
               | That looks like it'd be an adjunct strategy IMO. In most
               | cases you want to have the original source material on
               | tap, it helps with explainability and citations.
               | 
               | That being said, it seems that everyone working at the
               | state of the art is thinking about using LLMs to
               | summarize chunks, and summarize groups of chunks in a
               | hierarchical manner. RAPTOR
               | (https://arxiv.org/html/2401.18059v1) was just published
               | and is close to SoTA, and from a quick read I can already
               | think of several directions to improve it, and that's not
               | to brag but more to say how fertile the field is.
        
               | WhitneyLand wrote:
               | Wild speculation - do you think there could be any
               | benefit from creating two sets of chunks with one set at
               | a different offset from the first? So like, the boundary
               | between chunks in the first set would be near the middle
               | of a chunk in the second set?
        
               | CuriouslyC wrote:
               | No, it's better to just create summaries of all the
               | chunks, and return summaries of chunks that are adjacent
               | to chunks that are being retrieved. That gives you edge
               | context without the duplication. Having 50% duplicated
               | chunks is just going to burn context, or force you to do
               | more pre-processing of your context.
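                | 
                | The retrieval side looks roughly like this (chunks and
                | summaries are parallel lists, assumed precomputed):
                | 
                |     def with_neighbors(hit_ids, chunks, summaries):
                |         out = []
                |         for i in hit_ids:
                |             if i > 0:
                |                 out.append(summaries[i - 1])  # before
                |             out.append(chunks[i])
                |             if i + 1 < len(chunks):
                |                 out.append(summaries[i + 1])  # after
                |         return out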
        
               | politelemon wrote:
               | This just isn't working for me, phi-2 starts summarizing
               | the document I'm giving it. I tried a few news articles
               | and blog posts. Does using a GGUF version make a
               | difference?
        
           | samier-trellis wrote:
           | Can you speak a bit to the speed/slowness of doing such
           | chunking? We recently started using LLMs to clean the text
           | (data quality/text cleanliness is a problem for us), and it
           | has increased our indexing time a lot.
        
             | CuriouslyC wrote:
             | It's going to depend on what you're running on, but phi2 is
             | pretty fast so you can reasonably expect to be hitting ~50
                | tokens a second. Given that, if you are ingesting a 100k
                | token document you can expect it to take 30-40 minutes
                | (100,000 / 50 = 2,000 seconds) if done serially, and you
                | can of course spread the work out in parallel.
        
               | samier-trellis wrote:
               | thanks for the info--good to know we aren't the only ones
               | contending with speed for large documents lol
        
       | valstu wrote:
        | I assume you need to split the data into suitably sized database
        | rows matching your model's max length? Or does it do some chunking
       | magic automatically?
        
         | chuckhend wrote:
         | There is no chunking built into the postgres extension yet, but
         | we are working on it.
         | 
         | It does check the context length of the request against the
         | limits of the chat model before sending the request, and
         | optionally allows you to auto-trim the least relevant documents
         | out of the request so that it fits the model's context window.
          | IMO it's worth spending time getting chunks prepared, sized,
         | tuned for your use case though. There are some good
         | conversations above discussing methods around this, such as
         | using a summarization model to create the chunks.
        
       | nextaccountic wrote:
       | Is this only for LLMs?
        
         | DSingularity wrote:
         | No. Anything with a vector representation is compatible.
        
           | nextaccountic wrote:
            | OK, so I don't understand the difference between
            | pg_vectorize and pgvector.
            | 
            | I see that pg_vectorize uses pgvector under the hood, so it
            | does... more things?
        
             | infecto wrote:
             | High-level API that hand waves the embedding and LLM
             | search/query implementation.
        
               | nextaccountic wrote:
               | > and LLM search/query implementation.
               | 
               | So part of pg_vectorize is specific to LLMs?
        
               | chuckhend wrote:
                | There's an API that abstracts vector search only:
                | vectorize.search(). That part is not unique to LLMs, but
                | it does require selection of an embedding model. (Some
                | people have called embedding models LLMs.)
                | 
                | vectorize.rag() requires selection of a chat completion
                | model. That's more specific to LLMs than vector search,
                | IMO.
        
             | chuckhend wrote:
             | pg_vectorize is a wrapper around pgvector. In addition to
             | what pgvector provides, vectorize provides hooks into many
             | methods to generate your embeddings, implements several
             | methods for keeping embeddings updated as your data grows
             | or changes, etc. It also handles the transformation of your
             | search query for you.
             | 
              | For example, it creates the index for you, creates a cron
              | job to keep embeddings updated (or triggers, if that's what
              | you prefer), and handles inserts/upserts as new data hits
              | the table or existing data is updated. When you search for
              | "products for mobile electronic devices", that query needs
              | to be transformed into embeddings, and then the vector
              | similarity search needs to happen -- this is what the
              | project abstracts.
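              | 
              | Roughly, the flow looks like this from Python (psycopg;
              | the vectorize.* argument names here are from memory of
              | the README and may differ slightly -- treat it as a
              | sketch):
              | 
              |     import psycopg
              | 
              |     conn = psycopg.connect("dbname=mydb")
              | 
              |     # one-time setup: creates embeddings, the index,
              |     # and a job that keeps them refreshed
              |     conn.execute("""
              |         SELECT vectorize.table(
              |             job_name    => 'product_search',
              |             "table"     => 'products',
              |             primary_key => 'product_id',
              |             columns     => ARRAY['product_name',
              |                                  'description']
              |         );
              |     """)
              | 
              |     # query time: the extension embeds the query text
              |     # and runs the similarity search for you
              |     rows = conn.execute("""
              |         SELECT * FROM vectorize.search(
              |             job_name       => 'product_search',
              |             query          => 'mobile electronic devices',
              |             return_columns => ARRAY['product_name'],
              |             num_results    => 3
              |         );
              |     """).fetchall()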
        
       | patresh wrote:
       | The high level API seems very smooth to quickly iterate on
       | testing RAGs. It seems great for prototyping, however I have
       | doubts whether it's a good idea to hide the LLM calling logic in
       | a DB extension.
       | 
        | Error handling when you get rate limited, the token has expired,
        | or the input exceeds the token limit would be problematic, and
        | from a security point of view it requires your DB to directly
        | call OpenAI, which can also be risky.
        | 
        | Personally I haven't used that many Postgres extensions, so
        | perhaps these risks are mitigated in some way I don't know about?
        
         | infecto wrote:
          | I would agree with you. Similar to LangChain in my mind: some
          | interesting ideas, but it's a lot more powerful to implement it
          | on your own. I would much rather use pgvector directly.
        
       | throwaway77384 wrote:
       | What is RAG in this context? I only know it as red, amber,
       | green...
        
         | vmfunction wrote:
          | Given the context of a vector DB, it would be Retrieval-Augmented
         | Generation (RAG) [1]
         | 
         | 1: https://medium.com/@mutahar789/optimizing-rag-a-guide-to-
         | cho...
        
           | throwaway77384 wrote:
           | Thanks for the information :)
        
         | armanboyaci wrote:
         | I found this explanation:
         | https://www.promptingguide.ai/techniques/rag
         | 
         | > General-purpose language models can be fine-tuned to achieve
         | several common tasks such as sentiment analysis and named
         | entity recognition. These tasks generally don't require
         | additional background knowledge.
         | 
         | > For more complex and knowledge-intensive tasks, it's possible
         | to build a language model-based system that accesses external
         | knowledge sources to complete tasks. This enables more factual
         | consistency, improves reliability of the generated responses,
         | and helps to mitigate the problem of "hallucination".
         | 
         | > Meta AI researchers introduced a method called Retrieval
         | Augmented Generation (RAG) to address such knowledge-intensive
         | tasks. RAG combines an information retrieval component with a
         | text generator model. RAG can be fine-tuned and its internal
         | knowledge can be modified in an efficient manner and without
         | needing retraining of the entire model.
        
           | throwaway77384 wrote:
           | Thank you!
        
       | mattkevan wrote:
       | Funny this has come up - I've literally just finished building a
       | RAG search with Postgres this morning.
       | 
       | I run a directory site for UX articles and tools, which I've
       | recently rebuilt in Django. It has links to over 5000 articles,
       | making it hard to browse, so I thought it'd be fun to use RAG
       | with citations to create a knowledge search tool.
       | 
       | The site fetches new articles via RSS, which are chunked,
       | embedded and added to the vector store. On conducting a search,
       | the site returns a summary as well as links to the original
       | articles. I'm using LlamaIndex, OpenAI and Supabase.
       | 
       | It's taken a little while to figure out as I really didn't know
       | what I was doing and there's loads of improvements to make, but
       | you can try it out here: https://www.uxlift.org/search/
       | 
       | I'd love to hear what you think.
        
         | rjrodger wrote:
         | Any chance you'd Open Source it?
        
           | mattkevan wrote:
           | Yes, I'm planning to write an article showing the code,
           | mostly so I don't forget how I did it but hopefully it'll
           | also be useful to others.
           | 
           | I relied on a lot of articles and code examples when putting
           | it together so I'm happy to do the same.
        
             | WesleyJohnson wrote:
             | Looking forward to this. We're running a Django codebase
             | and would like to pull in Confluence articles and run GPT
             | queries against them.
        
         | gcanyon wrote:
         | This is awesome! I just tried it out, and I have a few bits of
         | feedback:
         | 
         | Show the question I asked. I asked something like "What are the
         | latest trends in UX for mobile?" and got a good answer that
         | started with "The latest trends in UX design for mobile
         | include..." but that's not the same as just listing the exact
         | thing I asked.
         | 
         | Either show a chat log or stuff it into the browser history
         | or...something. There's no way to see the answer to the
         | question I asked before this one.
         | 
         | I just went back and did another search, and the results had 6
         | links, while the response (only) had cites labeled [7] [6] and
         | [8] in that order.
         | 
         | Again, great stuff!
        
           | mattkevan wrote:
           | Hey thanks for trying it out and giving your feedback, much
           | appreciated!
           | 
           | Good idea about showing the question, I've just added that to
           | the search result.
           | 
           | Yes, I agree a chat history is really needed - it's on the
           | list, along with a chat-style interface.
           | 
           | Also need to figure out what's going on with the citations in
           | the text - the numbers seem pretty random and don't link up
           | to the cited articles.
           | 
           | Thanks again!
        
         | netdur wrote:
         | Could you share more about your strategy or the approach you
         | took for chunking the articles? I'm curious about the criteria
         | or methods you used to decide the boundaries of each chunk and
         | how you ensured the chunks remained meaningful for the search
         | functionality. Thanks!
        
           | mattkevan wrote:
           | As this was my first attempt, I decided to take a pretty
           | basic approach, see what the results were like and optimise
           | it later.
           | 
           | Content is stored in Django as posts, so I wrote a custom
           | document reader that created a new LlamaIndex document for
           | each post, attaching the post id, title, link and published
           | date as metadata. This gave better results than just loading
           | in all the content as a text or CSV file, which I tried
           | first.
           | 
           | I did try with a bunch of different techniques to split the
           | chunks, including by sentence count and a larger and smaller
           | number of tokens. In the end I decided to leave it to the
           | LlamaIndex default just to get it working.
        
           | drittich wrote:
           | I wrote a C# library to do this, which is similar to other
           | chunking approaches that are common, like the way langchain
           | does it: https://github.com/drittich/SemanticSlicer
           | 
           | Given a list of separators (regexes), it goes through them in
           | order and keeps splitting the text by them until the chunk
           | fits within the desired size. By putting the higher level
           | separators first (e.g., for HTML split by <h1> before <h2>),
           | it's a pretty good proxy for maintaining context.
           | 
           | Which chunk size you decide on largely depends on your data,
           | so I typically eyeball a sample of the results to determine
           | if the splitting is satisfactory.
           | 
           | You can see the separators here: https://github.com/drittich/
           | SemanticSlicer/blob/main/Semanti...
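            | 
            | In Python the core idea is only a few lines (a sketch of
            | the approach, not the library itself):
            | 
            |     import re
            | 
            |     def slice_text(text, seps, max_chars=1500):
            |         # try separators in priority order until pieces fit
            |         if len(text) <= max_chars or not seps:
            |             return [text]
            |         head, *rest = seps
            |         out = []
            |         for part in re.split(head, text):
            |             if part:
            |                 out.extend(slice_text(part, rest, max_chars))
            |         return out
            | 
            |     # e.g. headings first, then paragraphs, then sentences
            |     seps = [r"(?=<h1)", r"(?=<h2)", r"\n\n", r"(?<=\. )"]
            |     chunks = slice_text(open("page.html").read(), seps)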
        
       | nico wrote:
       | Has anyone used sqlite for storing embeddings? Are there any
       | extensions or tips for making it easier?
       | 
       | I have a small command line python app that uses sqlite for a db.
       | Postgres would be a huge overkill for the app
       | 
        | PS: is sqlite-vss good? https://github.com/asg017/sqlite-vss
        
         | berkes wrote:
         | Personally, I'd use meilisearch in your case (though, obv, I
         | know next to nothing about all the requirements and details of
         | your case).
         | 
         | Stuffing everything in PG makes sense, because PG is Good
         | Enough in most situations that one throws at it, but hardly
         | "very good". It's just "the best" because it is already there1.
         | PG truly is "very good", maybe even "the best" at relational
         | data. But all other cases: there are better alternatives out
         | there.
         | 
         | Meilisearch is small. It's not embeddable (last time I looked)
          | so it runs as a separate service, but does so with minimal
         | effort and overhead. And aside from being blazingly fast in
         | searching (often faster than elasticsearch; definitely much
         | easier and lighter), it has stellar vector storage.
         | 
          | [1] I wrote e.g. github.com/berkes/postgres_key_value, a Ruby
          | (so, slow) hashtable-like interface that uses existing Postgres
          | tech for a situation where Redis or Memcached would've been a
          | much better candidate, but where we already had Postgres and,
          | with this library, could postpone the introduction, complexity
          | and management of a Redis server.
         | 
         | Edit: forgot this footnote.
        
           | ComputerGuru wrote:
           | Meilisearch is embeddable as of recently, but you have to do
           | some legwork since it's not the intended use case.
        
         | ComputerGuru wrote:
          | We do, from years ago, with embeddings stored as blobs (well
         | before sqlite_vss). No SQLite extension but some user-defined
         | functions for some distance operations and for others we are
         | able to load all weights into memory and perform the ops there
         | (data is small enough).
         | 
         | Another comment mentioned meilisearch. You might be interested
         | in the fact that a recent rearchitecture of their codebase
         | split it into a library that can be embedded into your app
         | without needing a separate process.
        
         | wiredfool wrote:
          | DuckDB seems pretty reasonable for embeddings, as there's a
         | native vector type and you can insert it from a dataframe. For
         | similarity, there's array_cosine and array_dot_product.
        
       | swalsh wrote:
        | Sorry if I'm completely missing it, but I noticed in the code
        | there is something around chat:
       | 
       | https://github.com/tembo-io/pg_vectorize/blob/main/src/chat....
       | 
       | This would lead me to believe there is some way to actually use
       | SQL for not just embeddings, but also prompting/querying the
       | LLM... which would be crazy powerful. Are there any examples on
       | how to do this?
        
         | chuckhend wrote:
         | There is a RAG example here https://github.com/tembo-
         | io/pg_vectorize?tab=readme-ov-file#...
         | 
         | You can provide your own prompts by adding them to the
         | `vectorize.prompts` table. There's an API for this in the
         | works. It is poorly documented at the moment.
        
       | falling_myshkin wrote:
        | There have been a lot of these RAG abstractions posted recently.
        | As someone working on this problem, it's unclear to me whether the
       | calculation and ingestion of embeddings from source data should
       | be abstracted into the same software package as their search and
       | retrieval. I guess it probably depends on the complexity of the
       | problem. This does seem interesting in that it does make
       | intuitive sense to have a built-in db extension if the source
       | data itself is coming from the same place as the embeddings are
       | going. But so far I have preferred a separation of concerns in
       | this respect, as it seems that in some cases the models will be
       | used to compute embeddings outside the db context (for example,
        | the user search query needs to get vectorized. Why not have the
       | frontend and the backend query the same embedding service?)
       | Anyone else have thoughts on this?
        
         | chuckhend wrote:
          | It's certainly up for debate and there is a lot of nuance. I
          | think it can simplify the system's architecture quite a bit if
          | all the consumers of the data do not need to keep track of which
         | transformer model to use. After all, once the embeddings are
         | first derived from the source data, any subsequent search query
         | will need to use the same transformer model that created the
         | embeddings in the first place.
         | 
         | I think the same problem exists with classical/supervised
          | machine learning. Most models' features went through some sort
          | of transformation, and when it's time to call the model for
          | inference, those same transformations will need to happen again.
        
       | pdabbadabba wrote:
       | There's a fair amount of skepticism towards the efficacy of RAG
       | in these comments--often in contrast to simply using a model with
       | a huge context window to analyze the corpus in one giant chunk.
       | But that will not be a viable alternative in all use cases.
       | 
       | For example, one might need to analyze/search a very large corpus
       | composed of many documents which, as a whole, is very unlikely to
       | fit within any realistic context window. Or one might be
       | constrained to only use local models and may not have access to
       | models with these huge windows. Or both!
       | 
       | In cases like these, can anyone recommend a more promising
       | approach than RAG?
        
         | make3 wrote:
          | RAG is also so much faster, cheaper, and more energy-efficient
          | than running a giant transformer over every single document;
          | it's not remotely close.
        
         | Atotalnoob wrote:
         | RAG.
         | 
          | Large context windows can cause LLMs to get confused or to only
          | use part of the window. I think Google's new 1M context window
          | is promising for its recall, but it needs more testing to be
          | certain.
         | 
          | Other LLMs have shown reduced performance with larger
          | contexts.
         | 
         | Additionally, the LLM might discard instructions and
         | hallucinate.
         | 
         | Also, if you are concatenating user input, you are opening
         | yourself up to prompt injection.
         | 
         | The larger the context, the larger the prompt injection can be.
         | 
         | There is also the cost. Why load up 1M tokens and pay for a
         | massive request when you can load a fraction that is actually
         | relevant?
         | 
          | So regardless of whether Google's 1M context is perfect and
          | others can match it, I would steer away from just throwing
          | everything into a massive context.
         | 
         | To me, it's easier to mentally think of this as SQL query
         | performance. Is a full table scan better or worse than using an
         | index? Even if you can do a full table scan in a reasonable
         | amount of time, why bother? Just do things the right way...
        
       | softwaredoug wrote:
       | I think people assume RAG will just be a vector search problem.
       | You take the user's input and somehow get relevant context.
       | 
       | It's really hard to coordinate between LLMs, a vector store,
       | chunking the embeddings, turning user's chats into query
       | embeddings - and other queries - etc. It's a complicated search
        | relevance problem that's extremely multifaceted, use-case and
        | domain specific. Just doing search well from a search bar, without
       | all this complexity is hard enough. And vector search is just one
       | data store you might use alongside others (alongside keyword
       | search, SQL, whatever else).
       | 
        | I say this to get across that Postgres is uniquely situated to
        | bring a multifaceted approach that's not just about vector
       | storage / retrieval.
        
         | ofermend wrote:
         | Totally agree - the "R" in RAG is about retrieval which is a
         | complex problem and much more than just similarity between
         | embedding vectors.
        
         | kevinqualters wrote:
         | Not sure if postgres is uniquely situated, elasticsearch can do
         | everything you mentioned and much more.
        
         | BeetleB wrote:
         | Yes ... and no.
         | 
         | You're right for many applications. Yet, for many other
         | applications, simply converting each document into an
         | embedding, converting the search string into an embedding, and
         | doing a search (e.g. cosine similarity), is all that is needed.
         | 
         | My first attempt at RAG was just that, and I was blown away at
         | how effective it was. It worked primarily because each document
         | was short enough (a few lines of text) that you didn't lose
         | much by making an embedding for the whole document.
         | 
         | My next attempt failed, for the reasons you mention.
         | 
         | Point being: Don't be afraid to give the simple/trivial
         | approach a try.
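          | 
          | That simple version is only a few lines with numpy (docs is a
          | list of short texts and embed() is whatever embedding call you
          | use -- both assumed):
          | 
          |     import numpy as np
          | 
          |     doc_vecs = np.array([embed(d) for d in docs])
          | 
          |     query_vec = np.array(embed("how do I reset my password?"))
          |     scores = doc_vecs @ query_vec / (
          |         np.linalg.norm(doc_vecs, axis=1)
          |         * np.linalg.norm(query_vec))
          | 
          |     top = [docs[i] for i in np.argsort(-scores)[:5]]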
        
       | ravenstine wrote:
       | Is RAG just a fancy term for sticking an LLM in front of a search
       | engine?
        
         | chuckhend wrote:
          | To oversimplify, it's more like sending a block of text to an
         | LLM and asking it to answer a query based on that block of
         | text.
        
         | politelemon wrote:
         | I'll also attempt an answer.
         | 
         | You chop up your documents into chunks. You create fancy
         | numbers for those chunks. You take the user's question and find
         | the chunks that kind of match it. You pass the user's question
         | with the document to an LLM and tell it to produce a nice
         | looking answer.
         | 
         | So it's a fancy way of getting an LLM to produce a natural
         | looking answer over your chunky choppy search.
        
       | cybereporter wrote:
       | How does this compare to other vector search solutions (LanceDB,
       | Chroma, etc.)? Curious to know which one I should choose.
        
       | gijoegg wrote:
       | How should developers think about using this extension versus
       | PostgresML's pgml extension?
        
         | chuckhend wrote:
         | One difference between the two projects is that pg_vectorize
         | does not run the embedding or chat models on the same host as
         | postgres, rather they are always separate. The extension makes
         | http requests to those models, and provides background workers
          | that help with that orchestration. Last I checked, the
          | PostgresML extension interoperates with a Python runtime on the
          | same host as Postgres.
         | 
          | PostgresML has a bunch of features for supervised machine
          | learning and MLOps, and pg_vectorize does not do any of that.
          | E.g. you cannot train an XGBoost model with pg_vectorize, but
          | PostgresML does a great job with that. PGML is a great
          | extension; there is some overlap, but they are architecturally
          | very different projects.
        
       | samaysharma wrote:
        | A few relevant blogs on using pg_vectorize:
       | 
       | * Doing vector search with just 2 commands
       | https://tembo.io/blog/introducing-pg_vectorize
       | 
       | * Connecting Postgres to any huggingface sentence transformer
       | https://tembo.io/blog/sentence-transformers
       | 
       | * Building a question answer chatbot natively on Postgres
       | https://tembo.io/blog/tembo-rag-stack
        
       | politelemon wrote:
       | I don't know how to articulate the uncomfortable feeling I'd be
       | having, about something 'inside' the database doing the download
        | and making requests to other systems outside a boundary. It might
        | be a security threat, or just my inexperience -- how common is it
        | for Postgres extensions to do this?
        
       ___________________________________________________________________
       (page generated 2024-03-06 23:01 UTC)