[HN Gopher] RAG Using Unstructured Data and Role of Knowledge Gr...
___________________________________________________________________
RAG Using Unstructured Data and Role of Knowledge Graphs
Author : semihsalihoglu
Score : 155 points
Date : 2024-01-15 12:45 UTC (2 days ago)
(HTM) web link (kuzudb.com)
(TXT) w3m dump (kuzudb.com)
| semihsalihoglu wrote:
| This is a post that summarizes some reading that I had done in
| the space of LLMs + Knowledge Graphs with the goal of identifying
| technically deep and interesting directions. The post covers
| retrieval-augmented generation (RAG) systems that use
| unstructured data (RAG-U) and the role folks envision knowledge
| graphs playing in it. Briefly, the design spectrum of RAG-U
| systems has two dimensions: 1) What additional data to put into
| LLM prompts: e.g., documents, or triples extracted from
| documents. 2) How to store and fetch that data: e.g., a vector
| index, a GDBMS, or both.
|
| The standard RAG-U uses vector embeddings of chunks, which are
| fetched from a vector index. An envisioned role of knowledge
| graphs is to improve standard RAG-U by explicitly linking the
| chunks through the entities they mention. This is a promising
| idea, but one that needs to be subjected to the kind of rigorous
| evaluation found in prominent IR publications, e.g., SIGIR.
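|
| For readers newer to the area, here is a minimal sketch of the
| standard RAG-U pipeline (the embedding model and prompt template
| are illustrative choices, not something the post prescribes):
|
|     # Embed chunks, retrieve top-k by cosine similarity, and
|     # stuff them into the prompt sent to the LLM.
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     chunks = ["Kuzu is an embedded graph database.",
|               "RAG systems retrieve context before generation.",
|               "Knowledge graphs store entities and relations."]
|     # Normalized embeddings, so dot product == cosine similarity
|     index = model.encode(chunks, normalize_embeddings=True)
|
|     def retrieve(query, k=2):
|         q = model.encode([query], normalize_embeddings=True)[0]
|         top = np.argsort(index @ q)[::-1][:k]
|         return [chunks[i] for i in top]
|
|     question = "What is Kuzu?"
|     prompt = ("Answer using only this context:\n"
|               + "\n".join(retrieve(question))
|               + "\n\nQuestion: " + question)
|     # prompt is then passed to the LLM of your choice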
|
| The post then discusses the scenario where an enterprise does not
| have a knowledge graph, and the idea of automatically extracting
| knowledge graphs from unstructured PDFs and text documents. It
| covers recent work that uses LLMs for this task
| (they're not yet competitive with specialized models) and
| highlights many interesting open questions.
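|
| As a rough illustration of what LLM-based extraction looks like
| in practice (the prompt and the OpenAI client here are just one
| possible setup, not the specific methods the post surveys):
|
|     # Ask an LLM to emit (subject, predicate, object) triples
|     # for a chunk of text; prompt and model are illustrative.
|     import json
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     def extract_triples(chunk):
|         resp = client.chat.completions.create(
|             model="gpt-4",
|             messages=[{
|                 "role": "user",
|                 "content": "Extract (subject, predicate, object) "
|                            "triples from the text below as a JSON "
|                            "list of 3-element lists.\n\n" + chunk,
|             }],
|         )
|         return json.loads(resp.choices[0].message.content)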
|
| Hope this is useful to people who are interested in the area but
| intimidated by the flood of activity (don't be; I think the area
| is easier to digest than it may look).
| daxfohl wrote:
| Having just started from zero, I agree on the easy to digest
| point. You can get a pretty good understanding of how most
| things work in a couple days, and the field is moving so fast
| that a lot of papers are just exploring different iterative
| improvements on basic concepts.
| kordlessagain wrote:
| Knowledge graphs improve vector search by providing a "back of
| the book" index for the content. This can be done using
| knowledge extraction from an LLM during indexing, such as
| pulling out keyterms of a given chunk before embedding, or
| asking a question of the content and then answering it using
| the keyterms in addition to the embeddings. One challenge I
| found with this is determining keyterms to use with prompts
| that have light context, but using a time window helps with
| this, as does hitting the vector store for related content,
| then finding the keyterms for THAT content to use with the
| current query.
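|
| A toy version of the "back of the book" index, for the curious
| (the keyterm prompt and the plain-dict storage are stand-ins for
| whatever extraction and store you actually use):
|
|     # Map LLM-extracted keyterms to the chunks that mention them,
|     # so keyterm hits can be merged with vector hits at query
|     # time. ask_llm() is a placeholder for any LLM call that
|     # returns a comma-separated list of keyterms.
|     from collections import defaultdict
|
|     keyterm_index = defaultdict(set)   # keyterm -> chunk ids
|
|     def index_chunk(chunk_id, text, ask_llm):
|         terms = ask_llm("List 3-5 keyterms for this text, "
|                         "comma-separated:\n" + text)
|         for term in terms.split(","):
|             keyterm_index[term.strip().lower()].add(chunk_id)
|
|     def keyterm_lookup(query_terms):
|         hits = set()
|         for term in query_terms:
|             hits |= keyterm_index.get(term.lower(), set())
|         return hits   # merge/rerank with vector search results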
| sroussey wrote:
| What open source model is good at pulling keyterms?
| semihsalihoglu wrote:
| For entity extraction you can look at SpanMarker:
| https://tomaarsen.github.io/SpanMarkerNER/. I'm sure other
| tools exist, and others can hopefully point to more.
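|
| A minimal usage sketch (the checkpoint name is one of the
| pretrained models listed on that page; swap in whatever fits
| your domain):
|
|     # SpanMarker entity extraction; checkpoint is illustrative.
|     from span_marker import SpanMarkerModel
|
|     model = SpanMarkerModel.from_pretrained(
|         "tomaarsen/span-marker-bert-base-fewnerd-fine-super")
|     entities = model.predict(
|         "Kuzu is a graph database from the University of Waterloo.")
|     # Each entity is a dict with the span text, label, score
|     # and character offsets.
|     for e in entities:
|         print(e["span"], e["label"], round(e["score"], 2))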
| laminarflow027 wrote:
| OpenNRE (https://github.com/thunlp/OpenNRE) is another good
| approach to neural relation extraction, though it's
| slightly dated. What would be particularly interesting is
| to combine models like OpenNRE or SpanMarker with entity-
| linking models to construct KG triples. And a solid,
| scalable graph database underneath would make for a great
| knowledge base that can be constructed from unstructured
| text.
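|
| Roughly what that combination could look like, assuming entity
| spans come from something like SpanMarker and the relation from
| OpenNRE (the model name is the one from the OpenNRE README):
|
|     # Given two entity spans in a sentence, ask OpenNRE for the
|     # relation between them to form a (head, relation, tail)
|     # triple for the KG.
|     import opennre
|
|     rel_model = opennre.get_model("wiki80_cnn_softmax")
|
|     def to_triple(text, head, tail):
|         relation, score = rel_model.infer({
|             "text": text,
|             "h": {"pos": (text.index(head),
|                           text.index(head) + len(head))},
|             "t": {"pos": (text.index(tail),
|                           text.index(tail) + len(tail))},
|         })
|         return (head, relation, tail), score
|
|     triple, score = to_triple(
|         "Kuzu is developed at the University of Waterloo.",
|         "Kuzu", "University of Waterloo")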
| mark_l_watson wrote:
| I really liked the idea of creating linked data to connect
| chunks. That is an idea that deserves some play time (I just
| added it to my TODO list). Thanks for the good ideas!
| iAkashPaul wrote:
| One quick check for any RAG system is to ask what all the bot can
| answer about. Generating scalable metadata at ingestion, along
| with knowledge graphs, makes for a good closed-domain experience.
| formercoder wrote:
| It's interesting to see more developed KG + LLM use cases that
| aren't just NL to Graph DB Query Lang.
| laminarflow027 wrote:
| Totally agree! The wave of blog posts and examples one sees
| where it's just text-to-SQL or text-to-Cypher or some other
| query language isn't really exploring the topic at any level of
| technical depth. We need to see more evaluations and technical
| papers that characterize these systems so that we can understand
| how to build better ones.
| semihsalihoglu wrote:
| I think even the LLMs + KGs space is not very deep. In fact,
| there is more technical depth in text-to-SQL work than in
| anything else I have seen on LLMs. Maybe ColBERT-like
| multi-vector models are another topic where there is good
| technical depth.
| softwaredoug wrote:
| When I started working in search 10+ years ago, people would
| build a beautiful UI, and then, only on shipping, realize the
| search results were trash + irrelevant. They imagined a search
| system like Elasticsearch was basically Google, when in reality
| Elasticsearch is just a bit of infrastructure: a framework, not a
| solution.
|
| There's a similar thing happening with RAG, where people think
| building the chat interaction is the hard part. The hard part is
| extracting + searching to get relevant context. A lot of
| founders I talk to suddenly realize this at the last minute,
| right before shipping, similar to search back in the day. It's
| harder than just throwing chunks in a vector DB. It potentially
| involves a lot of different backend data sources and is in many
| ways harder than a standard search relevance problem (which is
| itself hard enough).
| MattDaEskimo wrote:
| Especially considering the additional logic that some queries
| require: stacked questions, comparative questions,
| recommendations, questions that assume information found in
| previous statements or questions.
|
| It becomes a very frustrating experience matching the inherent
| chaos of a conversation.
| softwaredoug wrote:
| Yeah, and to do it well you have to focus on a subset of
| tasks. Then find a way to gracefully reject anything you
| can't retrieve well.
|
| In many ways it makes the chat more Siri-like than ChatGPT-like,
| which may not be what users actually expect.
| laminarflow027 wrote:
| Very good points. Have you seen any examples of systems (or
| projects) that successfully combine multiple backend data
| sources, including databases, and perform better than any
| single backend alone? This seems like an important enough
| question that it ought to have been documented somewhere.
| opisthenar84 wrote:
| True. Pure vectorstores seem limited and kind of overrated.
| Combining many sources of data is challenging but the right
| thing to do.
| Keyframe wrote:
| Great observation. I've seen it often in tech, across the
| board. It's no better, maybe a step up, than the 'idea guy' who
| 'just' needs someone to build his idea: hand-waving or a complete
| lack of awareness of the actually valuable (hard) part.
| hobs wrote:
| I spent 8 months telling people this before I got laid off
| while the CEO continues to chase LLM money with no new ideas or
| even the talent to solve the problem.
|
| They spent so much time on the UI and basically left the actual
| search to the last minute, and it was a hilarious failure on
| launch.
| dbish wrote:
| Yep, we're doing RAG-ish search and ranking across many context
| types and modalities. You definitely can't just use a vector DB
| and do some chunking/search; there is a wide variety of
| search-like ranking, clustering, etc., and domain-specific work
| for relevance, and it's very hard to measure and prove
| improvements.
|
| It's going to evolve into recreating the various search and
| ranking processes of old, just on top of a bit more semantic
| understanding with some smarter NLG layered in :). It won't be
| just LLMs; we'll have intent classification, named entity
| recognition, a personalization layer, reranking, all that fun
| stuff again.
| hackernoteng wrote:
| This is a great comment. Good search is really hard. RAG is
| much harder. At least with search the user can pick the best
| result manually or refine their search. With RAG you pass the
| top-k results to the LLM and assume they're good. The assumption
| is that it's "semantic search" with vectors, so it will just
| work... wrong.
| moralestapia wrote:
| Hmm, RAG is not "the chat interaction", that's GPT or any other
| "brain" you choose.
|
| Last week I finished building my 3rd RAG stack for legal
| document retrieval. Almost-vanilla RAG got me 90-95% of the
| way. The only drawback is cost, still 10x-100x above the ideal
| price point, but that will only improve in the future.
| dmezzetti wrote:
| If you're interested in graphs + RAG and want an alternate
| approach, txtai has a semantic graph component.
|
| https://neuml.hashnode.dev/introducing-the-semantic-graph
|
| https://github.com/neuml/txtai
|
| Disclaimer: I'm the primary author of txtai
| bryan0 wrote:
| This is really cool, I'm surprised I never heard of this
| project before. The examples look really clean.
|
| Most RAG tools seem to start with the LLM and add vector
| building and retrieval around it, while this tool seems like it
| started with vector/graph building and retrieval, then added
| LLM support later.
| dmezzetti wrote:
| Thanks, that's an accurate assessment. The main reason for
| this approach is that txtai has been around since 2020 before
| the LLM era.
| Der_Einzige wrote:
| Note for those who aren't aware, a "Semantic Graph" means a
| knowledge graph built using a "sentence(pooled) transformer"
| language model to draw edges between the vertices (text data at
| whatever granularity the user decides) according to semantic
| similarity.
|
| What's awesome about them is that, in my mind, they essentially
| form the "extractive" analogue to LLMs' "generative" nature.
|
| Semantic Graphs give every single graph theory algorithm a
| unique epistemological twist given any particular dataset. In
| my case, I've built and released pre-trained semantic graphs
| for my debate evidence. I observe that path traversals form
| "debate cases", and that graph centrality in this case finds
| the most "generic/universally applicable" evidence. Given a
| different dataset, the same algorithms will have different
| interpretations.
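|
| For anyone who wants the gist without the txtai specifics, the
| construction itself is simple (the model and threshold below are
| arbitrary choices): embed the texts, add an edge whenever the
| similarity clears a threshold, then run ordinary graph
| algorithms.
|
|     # Bare-bones semantic graph: nodes are texts, edges connect
|     # pairs whose embedding similarity exceeds a threshold.
|     # Centrality and path traversals then get the dataset-
|     # specific interpretations described above.
|     import networkx as nx
|     from sentence_transformers import SentenceTransformer
|
|     texts = ["...your chunks or evidence cards here..."]
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     emb = model.encode(texts, normalize_embeddings=True)
|
|     g = nx.Graph()
|     g.add_nodes_from(range(len(texts)))
|     for i in range(len(texts)):
|         for j in range(i + 1, len(texts)):
|             sim = float(emb[i] @ emb[j])
|             if sim > 0.7:                  # similarity threshold
|                 g.add_edge(i, j, weight=sim)
|
|     central = nx.degree_centrality(g)      # "most generic" nodes
|     # nx.shortest_path(g, a, b) walks a chain of related texts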
|
| What makes txtai so awesome is that it creates a synchronized
| interface between an underlying vector DB, SQL DB, and a
| semantic knowledge graph. The flexibility and power this offers
| compared to other vector DB solutions is simply unparalleled. I
| have seen zero meaningful competition from a vector DB industry
| that is flooded with money despite little product
| differentiation among its players.
|
| Disclaimer: I wrote an NLP paper with dmezzetti as my co-author
| about semantic graphs:
| https://aclanthology.org/2023.newsum-1.10.pdf
| dmezzetti wrote:
| Thank you for taking the time to share these excellent
| additional details!
| Oras wrote:
| The article is a good summary of RAG in the enterprise. It shed
| some light for me on the quality of KGs built using LLMs, an
| approach that Neo4j has recently been proposing [0].
|
| According to the article, it is either costly (if using OpenAI)
| or slow (if using open-source models). In both cases, predicting
| the quality of the generated KG is hard.
|
| [0] https://github.com/neo4j/NaLLM
| laminarflow027 wrote:
| This is an excellent article that asks some much-needed questions
| on the _literature_ that exists connecting LLMs and RAG on
| unstructured data, with knowledge graphs in between. We've seen
| plenty of articles that speculate on how one can build a simple
| retrieval system on top of a KG, but there are two challenges: a)
| constructing a high-quality KG isn't easy, and b) keyword or
| phrase embeddings on metadata are required for pre-filtering to
| relevant sections of the graph.
|
| As some others here have pointed out, information extraction and
| searching with relevant context are the hardest parts of any
| search system, and it's clear that simply chunking text up and
| throwing the embeddings into a vector DB has limitations, no
| matter what the vector DB vendors tell you. Just like this
| article says, I hope that 2024 is the year where we actually get
| some papers that perform more rigorous evaluations of systems
| that use vector DBs, graph DBs, or a combination of them for
| building RAG systems.
___________________________________________________________________
(page generated 2024-01-17 23:01 UTC)