[HN Gopher] RAG Using Unstructured Data and Role of Knowledge Gr...
       ___________________________________________________________________
        
       RAG Using Unstructured Data and Role of Knowledge Graphs
        
       Author : semihsalihoglu
       Score  : 155 points
       Date   : 2024-01-15 12:45 UTC (2 days ago)
        
 (HTM) web link (kuzudb.com)
 (TXT) w3m dump (kuzudb.com)
        
       | semihsalihoglu wrote:
       | This is a post that summarizes some reading that I had done in
       | the space of LLMs + Knowledge Graphs with the goal of identifying
        | technically deep and interesting directions. The post covers
        | retrieval augmented generation (RAG) systems that use
        | unstructured data (RAG-U) and the role folks envision knowledge
        | graphs playing in them. Briefly, the design spectrum of RAG-U
        | systems has two dimensions: 1) what additional data to put into
        | LLM prompts, such as documents or triples extracted from
        | documents; and 2) how to store and fetch that data, such as in a
        | vector index, a GDBMS, or both.
       | 
       | The standard RAG-U uses vector embeddings of chunks, which are
       | fetched from a vector index. An envisioned role of knowledge
       | graphs is to improve standard RAG-U by explicitly linking the
       | chunks through the entities they mention. This is a promising
        | idea, but one that needs to be subjected to rigorous evaluation as
       | done in prominent IR publications, e.g., SIGIR.
       | 
       | The post then discusses the scenario when an enterprise does not
        | have a knowledge graph and discusses the idea of automatically
        | extracting knowledge graphs from unstructured PDFs and text
       | documents. It covers the recent work that uses LLMs for this task
       | (they're not yet competitive with specialized models) and
       | highlights many interesting open questions.
       | 
        | Hope this is interesting to people who are curious about the area
        | but intimidated by the flood of activity (don't be; I think the
        | area is easier to digest than it may look).
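The chunk-linking idea above can be sketched in a few lines of Python (the chunks, entities, and function names here are hypothetical, purely for illustration): chunks become graph nodes, two chunks are linked when they mention the same entity, and a vector-search hit can then be expanded to its entity-linked neighbors.

```python
from collections import defaultdict

# Hypothetical chunks with pre-extracted entity mentions.
chunks = {
    "c1": {"text": "Alice founded Acme in 2001.", "entities": {"Alice", "Acme"}},
    "c2": {"text": "Acme acquired BetaCorp.", "entities": {"Acme", "BetaCorp"}},
    "c3": {"text": "Bob joined BetaCorp in 2010.", "entities": {"Bob", "BetaCorp"}},
}

# Invert: entity -> the chunks that mention it.
entity_to_chunks = defaultdict(set)
for cid, chunk in chunks.items():
    for ent in chunk["entities"]:
        entity_to_chunks[ent].add(cid)

def neighbors(cid):
    """Chunks sharing at least one entity with `cid`."""
    out = set()
    for ent in chunks[cid]["entities"]:
        out |= entity_to_chunks[ent]
    out.discard(cid)
    return out

def expand_context(seed_ids):
    """Expand vector-search hits with their entity-linked neighbors."""
    ctx = set(seed_ids)
    for cid in seed_ids:
        ctx |= neighbors(cid)
    return sorted(ctx)
```

In a real system the seed chunks would come from a vector index and the entity sets from an NER pass at ingestion time.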
        
         | daxfohl wrote:
         | Having just started from zero, I agree on the easy to digest
         | point. You can get a pretty good understanding of how most
         | things work in a couple days, and the field is moving so fast
         | that a lot of papers are just exploring different iterative
         | improvements on basic concepts.
        
         | kordlessagain wrote:
         | Knowledge graphs improve vector search by providing a "back of
         | the book" index for the content. This can be done using
         | knowledge extraction from an LLM during indexing, such as
         | pulling out keyterms of a given chunk before embedding, or
         | asking a question of the content and then answering it using
         | the keyterms in addition to the embeddings. One challenge I
         | found with this is determining keyterms to use with prompts
         | that have light context, but using a time window helps with
         | this, as does hitting the vector store for related content,
         | then finding the keyterms for THAT content to use with the
         | current query.
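A rough sketch of the keyterm-plus-embedding retrieval described above, assuming keyterms have already been extracted at indexing time (sparse `{term: weight}` dicts stand in for real embeddings; all names and data are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyterm_overlap(q_terms, c_terms):
    """Jaccard overlap between query and chunk keyterm sets."""
    if not q_terms or not c_terms:
        return 0.0
    return len(q_terms & c_terms) / len(q_terms | c_terms)

def hybrid_search(query_vec, query_terms, index, alpha=0.5, k=2):
    """Blend embedding similarity with keyterm overlap and keep top k."""
    scored = []
    for cid, (vec, terms) in index.items():
        score = (alpha * cosine(query_vec, vec)
                 + (1 - alpha) * keyterm_overlap(query_terms, terms))
        scored.append((score, cid))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]
```

The index maps each chunk ID to an `(embedding, keyterm set)` pair, so both signals are available at query time.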
        
           | sroussey wrote:
           | What open source model is good at pulling keyterms?
        
             | semihsalihoglu wrote:
             | For entity extraction you can look at SpanMarker:
             | https://tomaarsen.github.io/SpanMarkerNER/. I'm sure other
              | tools exist, and others can hopefully point to more.
        
             | laminarflow027 wrote:
             | OpenNRE (https://github.com/thunlp/OpenNRE) is another good
             | approach to neural relation extraction, though it's
             | slightly dated. What would be particularly interesting is
             | to combine models like OpenNRE or SpanMarker with entity-
             | linking models to construct KG triples. And a solid,
             | scalable graph database underneath would make for a great
             | knowledge base that can be constructed from unstructured
             | text.
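The NER + entity-linking + relation-extraction pipeline suggested here might look roughly like this (the mention and relation lists are hard-coded stand-ins for what SpanMarker/OpenNRE-style models would output, and the KB table stands in for a real entity linker):

```python
# Hard-coded stand-ins for model outputs (hypothetical).
mentions = [("Marie Curie", "PER"), ("Sorbonne", "ORG")]   # NER output
relations = [("Marie Curie", "worked_at", "Sorbonne")]     # relation extraction output

# Tiny entity-linking table mapping surface forms to canonical IDs.
kb = {"Marie Curie": "Q7186", "Sorbonne": "Q209842"}

def link(surface):
    """Resolve a surface form to a canonical entity ID (or None)."""
    return kb.get(surface)

def build_triples(relations):
    """Turn relation predictions over linked mentions into KG triples."""
    triples = []
    for head, rel, tail in relations:
        h, t = link(head), link(tail)
        if h and t:  # keep only fully linked triples
            triples.append((h, rel, t))
    return triples
```

The resulting triples could then be loaded into a graph database as the knowledge base described above.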
        
         | mark_l_watson wrote:
         | I really liked the idea of creating linked data to connect
         | chunks. That is an idea that deserves some play time (I just
         | added it to my TODO list). Thanks for the good ideas!
        
       | iAkashPaul wrote:
        | One quick check for any RAG system is to ask what the bot can
        | answer about. Generating scalable metadata at ingestion, along
        | with knowledge graphs, makes for a good closed-domain experience.
        
       | formercoder wrote:
       | It's interesting to see more developed KG + LLM use cases that
       | aren't just NL to Graph DB Query Lang.
        
         | laminarflow027 wrote:
         | Totally agree! The wave of blog posts and examples one sees
         | where it's just text-to-SQL or text-to-Cypher or any other
          | query lang isn't really exploring the topic at any level of
         | technical depth, and we need to see more evaluations and
         | technical papers that characterize them, so that we can
         | understand how to build better systems.
        
           | semihsalihoglu wrote:
            | I think even in the LLMs + KGs space the work is not very
            | deep. In fact, there is more technical depth in the text-to-
            | SQL work than in anything else I have seen on LLMs. Maybe
            | ColBERT-like late-interaction models are another topic with
            | good technical depth.
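For readers unfamiliar with the term: ColBERT-style "late interaction" keeps one vector per token, and scores a document as the sum, over query tokens, of the maximum similarity to any document token. A toy version (plain lists stand in for learned token embeddings):

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query tokens of the
    maximum dot-product similarity with any document token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Unlike single-vector retrieval, this preserves token-level matching at scoring time, which is where much of the technical depth lies.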
        
       | softwaredoug wrote:
       | When I started working in search 10+ years ago, people would
       | build a beautiful UI, and then, only on shipping, realize the
       | search results were trash + irrelevant. They imagined a search
       | system like Elasticsearch was basically Google. When in reality,
       | Elasticsearch is just a bit of infrastructure. A framework, not a
       | solution.
       | 
        | There's a similar thing happening with RAG, where people think
        | building the chat interaction is the hard part. The hard part
       | is extracting + searching to get relevant context. A lot of
       | founders I talk to suddenly realize this at the last minute,
       | right before shipping, similar to search back in the day. It's
       | harder than just throwing chunks in a vector DB. It involves a
       | lot of different backend data sources potentially, and is in many
       | ways harder than a standard search relevance problem (which is
       | itself hard enough).
        
         | MattDaEskimo wrote:
         | Especially considering the additional logic that some queries
         | require. Stacked questions, comparative questions,
         | recommendations, questions that assume information found in
         | previous statements / questions.
         | 
         | It becomes a very frustrating experience matching the inherent
         | chaos of a conversation.
        
           | softwaredoug wrote:
           | Yeah, and to do it well you have to focus on a subset of
           | tasks. Then find a way to gracefully reject anything you
           | can't retrieve well.
           | 
           | In many ways it makes the chat more Siri-like than ChatGPT
           | like. Which may not be what users actually expect.
        
         | laminarflow027 wrote:
         | Very good points. Have you seen any examples of systems (or
         | projects) that successfully combine multiple backend data
         | sources, including databases, that perform better than the
         | single backend alone? This seems like an important enough
         | question that it ought to have been documented somewhere.
        
         | opisthenar84 wrote:
         | True. Pure vectorstores seem limited and kind of overrated.
         | Combining many sources of data is challenging but the right
         | thing to do.
        
         | Keyframe wrote:
         | Great observation. I've seen it often in tech, across the
         | board. It's no better, maybe a step up, than 'idea guy' who
         | 'just' needs someone to build his idea. Hand-waving or complete
         | lack of awareness on the actual value (hard) part.
        
         | hobs wrote:
         | I spent 8 months telling people this before I got laid off
         | while the CEO continues to chase LLM money with no new ideas or
         | even the talent to solve the problem.
         | 
         | They spent so much time on the UI and basically left the actual
         | search to the last minute, and it was a hilarious failure on
         | launch.
        
         | dbish wrote:
         | Yep, we're doing RAG-ish search and ranking across many context
          | types and modalities. You definitely can't just use a vector DB
          | and do some chunking/search; there is a wide variety of
          | search-like ranking, clustering, and domain-specific work
          | needed for relevance, and it's very hard to measure and prove
         | improvements.
         | 
         | It's going to just evolve into recreating the various search
         | and ranking processes of old just on top of a bit more semantic
         | understanding with some smarter NLG layered in :). It won't be
         | just LLMs, we'll have intent classification, named entity
         | recognition, a personalization layer, reranking, all that fun
         | stuff again.
        
         | hackernoteng wrote:
         | This is a great comment. Good search is really hard. RAG is
          | much harder. At least with search, the user can pick the best
          | result manually or refine their search. With RAG you pass the
          | top-k results to the LLM and assume they're good. The
          | assumption is that it's "semantic search" with vectors, so it
          | will just work... wrong.
        
         | moralestapia wrote:
         | Hmm, RAG is not "the chat interaction", that's GPT or any other
         | "brain" you choose.
         | 
         | Last week I finished building my 3rd RAG stack for legal
         | document retrieval. Almost-vanilla RAG got me 90-95% of the
         | way. Only drawback is cost, still 10x-100x above the ideal
         | price point; but that will only improve in the future.
        
       | dmezzetti wrote:
       | If you're interested in graphs + RAG and want an alternate
       | approach, txtai has a semantic graph component.
       | 
       | https://neuml.hashnode.dev/introducing-the-semantic-graph
       | 
       | https://github.com/neuml/txtai
       | 
       | Disclaimer: I'm the primary author of txtai
        
         | bryan0 wrote:
         | This is really cool, I'm surprised I never heard of this
         | project before. The examples look really clean.
         | 
         | Most RAG tools seem to start with the LLM and add Vector
         | building and retrieval around it, while this tool seems like it
         | started with Vector / Graph building and retrieval, then added
         | LLM support later.
        
           | dmezzetti wrote:
           | Thanks, that's an accurate assessment. The main reason for
           | this approach is that txtai has been around since 2020 before
           | the LLM era.
        
         | Der_Einzige wrote:
         | Note for those who aren't aware, a "Semantic Graph" means a
         | knowledge graph built using a "sentence(pooled) transformer"
         | language model to draw edges between the vertices (text data at
         | whatever granularity the user decides) according to semantic
         | similarity.
         | 
         | What's awesome about them is that they essentially form in my
         | mind the "extractive" analogue to LLMs "generative" nature.
         | 
         | Semantic Graphs give every single graph theory algorithm a
         | unique epistemological twist given any particular dataset. In
         | my case, I've built and released pre-trained semantic graphs
         | for my debate evidence. I observe that path traversals form
         | "debate cases", and that graph centrality in this case finds
         | the most "generic/universally applicable" evidence. Given a
         | different dataset, the same algorithms will have different
         | interpretations.
         | 
         | What makes txtai so awesome is that it creates a synchronized
         | interface between an underlying vector DB, SQL DB, and a
         | semantic knowledge graph. The flexibility and power this offers
         | compared to other vector DB solutions is simply unparalleled. I
         | have seen zero meaningful competition from a vectorDB industry
         | which is flooded with money despite little product
         | differentiation among themselves.
         | 
         | Disclaimer: I wrote an NLP paper with dmezzetti as my co-author
         | about semantic graphs:
         | https://aclanthology.org/2023.newsum-1.10.pdf
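A toy version of the semantic-graph construction described above (hand-written 2-d vectors stand in for sentence-transformer embeddings): edges connect nodes whose cosine similarity clears a threshold, and degree centrality then surfaces the most broadly connected node.

```python
import math

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_semantic_graph(embeddings, threshold=0.8):
    """Add an edge between any two nodes whose cosine similarity
    clears the threshold."""
    edges = {n: set() for n in embeddings}
    nodes = list(embeddings)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if cos(embeddings[u], embeddings[v]) >= threshold:
                edges[u].add(v)
                edges[v].add(u)
    return edges

def most_central(edges):
    """Degree centrality: the node with the most edges."""
    return max(edges, key=lambda n: len(edges[n]))
```

With real sentence embeddings, the high-degree nodes would correspond to the "generic/universally applicable" items described above.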
        
           | dmezzetti wrote:
           | Thank you for taking the time to share these excellent
           | additional details!
        
       | Oras wrote:
        | The article is a good summary of RAG in the enterprise. It shed
        | some light for me on the quality of building KGs using LLMs, an
        | approach that Neo4j has recently been proposing [0].
        | 
        | According to the article, it is either costly (if using OpenAI)
        | or slow (if using open-source models). In either case, predicting
        | the quality of the generated KG is hard.
       | 
       | [0] https://github.com/neo4j/NaLLM
        
       | laminarflow027 wrote:
       | This is an excellent article that asks some much-needed questions
       | on the _literature_ that exists connecting LLMs and RAGs on
        | unstructured data, with knowledge graphs in between. We've seen
       | plenty of articles that speculate on how one can build a simple
       | retrieval system on top of a KG, but there are two challenges: a)
       | constructing a high quality KG isn't easy, and b) keyword or
       | phrase embedding on metadata for pre-filtering on relevant
       | sections of the graph is required.
       | 
       | As some others here have pointed out, information extraction and
       | searching with relevant context are the hardest parts of any
       | search system, and it's clear that simply chunking vectors up and
       | throwing them into a vector DB has limitations, no matter what
       | the vector DB vendors tell you. Just like this article says, I
       | hope that 2024 is the year where we actually get some papers that
       | perform more rigorous evaluations of systems that use vector DBs,
       | graph DBs, or a combination of them for building RAGs.
        
       ___________________________________________________________________
       (page generated 2024-01-17 23:01 UTC)