[HN Gopher] Wikidata, with 12B facts, can ground LLMs to improve...
___________________________________________________________________
Wikidata, with 12B facts, can ground LLMs to improve their
factuality
Author : raybb
Score : 187 points
Date : 2023-11-17 14:44 UTC (8 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| bxhdndnjdbj wrote:
| Ahh.. feed the LLM a special sauce. Then it will speak the Truth
| dang wrote:
| We've banned this account for posting unsubstantive comments.
| Can you please not create accounts to break HN's rules with? It
| will eventually get your main account banned as well.
|
| If you'd please review
| https://news.ycombinator.com/newsguidelines.html and stick to
| the rules when posting here, we'd appreciate it.
| audiala wrote:
| Wikidata is such a treasure. There is quite a learning curve to
| master the SPARQL query language but it is really powerful. We
| are testing it to provide context to LLMs when generating audio-
| guides and the results are very impressive so far.
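| For the curious, a rough sketch of what "Wikidata as LLM
| context" can look like (not our actual setup; the entity ID,
| user agent, and prompt wording are placeholders): pull facts
| from the public SPARQL endpoint, flatten them to text, and
| prepend them to the prompt.
|
| ```
| # Fetch facts about an entity from the Wikidata SPARQL endpoint
| # and format them as plain-text context for an LLM prompt.
| import requests
|
| SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
|
| QUERY = """
| SELECT ?propLabel ?valueLabel WHERE {
|   wd:Q243 ?p ?value .                # Q243 = Eiffel Tower
|   ?prop wikibase:directClaim ?p .    # keep only direct claims
|   SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
| }
| LIMIT 50
| """
|
| def fetch_facts():
|     resp = requests.get(
|         SPARQL_ENDPOINT,
|         params={"query": QUERY, "format": "json"},
|         headers={"User-Agent": "audio-guide-demo/0.1"},  # be polite to WDQS
|     )
|     resp.raise_for_status()
|     rows = resp.json()["results"]["bindings"]
|     return [f"{r['propLabel']['value']}: {r['valueLabel']['value']}"
|             for r in rows]
|
| facts = fetch_facts()
| prompt = ("Using only these Wikidata facts, write a short "
|           "audio-guide intro:\n" + "\n".join(facts))
| print(prompt)
| ```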
| karencarits wrote:
| I wish there were a way to add results from scientific papers
| to Wikidata - imagine doing meta-analyses via SPARQL queries
| gaogao wrote:
| You totally can! - https://www.wikidata.org/wiki/Q30249683
|
| It's just pretty sparse, so you would need a focused effort
| to fill out predicates of interest.
| uneekname wrote:
| Am I missing something? I do not see any results indicated
| in the statements of that entity.
| gaogao wrote:
| Right, such a result would need to be marked with a new
| predicate (verb) like:
|
| ```
| Subject   - Transformer's Paper
| Predicate - Score
| Object    - BLEU (28.4)
| ```
|
| One of the trickiest things about using a semantic triple
| store like this is that there are a lot of ways of phrasing
| the data, and lots of ambiguity. LLMs help in this case by
| being able to more gracefully handle cases like having both
| 'Score' and 'Benchmark' predicates, merging the two together.
| karencarits wrote:
| Indeed, and hopefully - if there were a structured way of
| doing it - people would make that effort when doing reviews
| or meta-analyses, making the underlying data available for
| others and easier to reproduce or update over time
| bugglebeetle wrote:
| One of my favorite things about ChatGPT is that I pretty much
| never have to write SPARQL myself anymore. I've had zero
| problems with the resulting queries either, except in cases
| where I've prompted it incorrectly.
| gaogao wrote:
| Yeah, it works so well, I wonder if it's just a natural fit
| due to the attention mechanism and graph databases sharing
| some common semantic triple foundations
| gaogao wrote:
| And in reverse, Wikidata has a lot of gaps in its
| annotations, where human labelling could be augmented by
| LLMs. I wrote some stuff a while ago on both grounding
| responses and adding more data to Wikidata:
| https://friend.computer/jekyll/update/2023/04/30/wikidata-ll...
| jsemrau wrote:
| It would be a good idea to create an annotation model like
| DALL-E 3 did.
| unshavedyak wrote:
| Yup. This is my gut feeling about where LLMs will really
| explode. Let them augment data just a bit, train on the
| improved data, augment more, train again - etc, etc. If we
| take things slow, I suspect that in the long run it'll be
| really beneficial to multiple paradigms.
|
| I know people say training bots on bot data is bad, but A: it's
| happening anyway, and B: it can't be worse than the actual
| garbage they get trained on in a lot of cases anyway.. can it?
| Filligree wrote:
| Pixart-alpha provides an example of C: The bot labels can be
| dramatically better than the human labels.
|
| Even though they used LLaVA, and LLaVA isn't all that good
| compared to GPT-4.
| gaogao wrote:
| Training on bot data can be bad when it's ungrounded and
| basically hallucinations on hallucinations.
|
| Having LLMs help curate something grounded is generally
| reasonable. Functionally, it's somewhat similar to how some
| training is using generated subtitles of videos for training
| video/text pairs; it's very feasible to also go and clean
| those up, even though it is bot data.
| ori_b wrote:
| > _Yup. This is my gut feeling about where LLMs will really
| explode_
|
| Yes, indeed. This is one place where LLMs can make it look
| like a bomb went off.
| prosqlinjector wrote:
| We can do polynomial regression to produce data sets that look
| equally plausible, but it's not real data.
| foobarchu wrote:
| > A: it's happening anyway
|
| This is never a valid defense for doing more of something.
| akjshdfkjhs wrote:
| please no!
|
| The cost of now having unknown false data in there would
| completely ruin the value of the whole effort.
|
| The entire value of the data (which is already everywhere
| anyway) is the "cost" contributors paid via heavy moderation.
| If you do not understand why that is diametrically opposed to
| adding/enriching/augmenting/whichever euphemism with LLMs, I
| don't know what else to say.
| cj wrote:
| 100%. I live in a hamlet of a larger town in the US, and was
| curious what the population of my hamlet is.
|
| There's a Wikipedia page for the hamlet, but it's empty. No
| population data, etc.
|
| I'd _much_ rather see no data than an LLM's best guess. I'm
| guessing an LLM using the data would also perform better
| without approximated or "probably right" information.
| gaogao wrote:
| Yeah, adding directly with an LLM is a bad idea. Instead,
| this would basically be making suggestions, linked back to the
| Wikipedia snippet, that a person could approve, edit, or
| reject. This is a flow for scaling up data annotation that
| works pretty well, because it also sucks to have a ton of gaps
| in the structured data when the information is sitting right
| there in the linked Wikipedia page.
| StableAlkyne wrote:
| Did they ever create the bridge from Wikipedia to Wikidata? I
| remember hearing talk about it as a way of helping the lack
| of data. The problem I had with Wikidata a couple years ago
| was that it was usually an incomplete subset of Wikipedia's
| infoboxes.
|
| Checking again for m-xylene,
| https://m.wikidata.org/wiki/Q3234708
|
| You get physical property data and citations.
|
| Now compare that to the chem infobox in wikipedia:
| https://en.m.wikipedia.org/wiki/M-Xylene
|
| You get a lot more useful data, like the dipole moment and
| solubility (kinda important for a solvent like Xylene), and
| tons of other properties that Wikidata just doesn't have. All
| in the infobox.
|
| It's weird that they don't just copy the Wikipedia infobox
| for the chemicals in Wikidata. It's already there and
| organized. And frequently cited.
|
| Maybe it's more useful for other fields, but I can't think of
| a good use I'd get from the chemical section of Wikidata over
| the databases it cites or Wikipedia itself...
| YoshiRulz wrote:
| I'm not that familiar with the subject, but I did read[1]
| that Wikidata's adoption has been slowed by the fact that
| triples can only be used on one page (per localisation).
| There is some support for using it with infoboxes
| though[2].
|
| [1]: https://meta.wikimedia.org/wiki/Help:Array#Wikidata
|
| [2]:
| https://en.wikipedia.org/wiki/Help:Wikidata#In_infoboxes
| raybb wrote:
| It would be really cool if there were a tool that could help
| extract data from, say, a news article and then populate
| Wikidata with it after human review. I find that adding even
| simple fields like "date founded" takes too many clicks with
| the default GUI.
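| A rough sketch of that review flow (the LLM extraction and the
| Wikidata write are hypothetical stubs here, not real library
| calls; only the human-approval loop is spelled out):
|
| ```
| from dataclasses import dataclass
|
| @dataclass
| class Candidate:
|     item: str        # e.g. "Q42" (the entity being edited)
|     prop: str        # e.g. "P571" (inception / date founded)
|     value: str       # e.g. "+2009-01-01T00:00:00Z"
|     source_url: str  # the news article the claim came from
|
| def llm_extract_candidates(article_text: str) -> list[Candidate]:
|     """Hypothetical: prompt an LLM to emit structured candidates."""
|     raise NotImplementedError
|
| def write_statement(c: Candidate) -> None:
|     """Hypothetical: call the Wikidata API (e.g. wbcreateclaim)
|     and attach the source URL as a reference."""
|     raise NotImplementedError
|
| def review_and_submit(article_text: str) -> None:
|     for c in llm_extract_candidates(article_text):
|         answer = input(f"Add {c.item} {c.prop} = {c.value} "
|                        f"({c.source_url})? [y/N] ")
|         if answer.lower() == "y":  # nothing is written without approval
|             write_statement(c)
| ```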
| huytersd wrote:
| Only if it is human validated and even then not really.
| gibsonf1 wrote:
| Hmm, there is a lot of opinion in Wikidata - so I would not
| call all of it facts, although some items are. Even if it were
| all factual, the statistical nature of LLMs would still invent
| things from the input, as per the nature of the technology.
| sharemywin wrote:
| You just need to tell it to use the facts:
|
| "Us the information from the following list of facts to answer
| the questions I ask without qualifications. answer
| authoritatively. If the question can't be answered with the
| following facts just say I don't know.
|
| Absolute Facts:
|
| The sky is purple.
|
| The sun is red and green
|
| When it rains animals fall from the sky."
| sharemywin wrote:
| If I were making a customer-facing chatbot, I wouldn't let
| it generate responses directly. I would have it pick from a
| list of canned responses, and/or have it contact a rep to
| intervene on complicated questions. But there's no reason
| this tech couldn't be used for some commercial situations
| now.
| prosqlinjector wrote:
| The sky is not one color and changes color depending on
| weather, sun, and global location.
| behnamoh wrote:
| But their example uses GPT-3, a completely outdated model that
| was prone to hallucinations. GPT-4 has got much better in that
| regard, so I wonder what the marginal benefit of Wikidata is
| for really huge LLMs such as GPT-4.
| not2b wrote:
| GPT-4 is not immune to making things up, and a smaller model
| that doesn't have as much garbage and nonsense in its training
| data might achieve results that are nearly as good for much
| less cost.
| behnamoh wrote:
| Clearly you read my comment wrong; I said "GPT-4 has got much
| better in that regard".
| jakobson14 wrote:
| On facts in the Wikidata dataset? Sure.
|
| But if you think this will stem the tide of LLM hallucinations,
| you're high too. LLMs' primary function is to bullshit.
|
| In chess many games play out with the same opening but within a
| few moves become a game no one has played before. Being outside
| the dataset is the default for any sufficiently long
| conversation.
| mrtesthah wrote:
| I've been waiting for the OpenCYC knowledge ontology to be used
| for this purpose as well.
| gibsonf1 wrote:
| That ontology is quite a mess actually.
| euroderf wrote:
| Me too. But if OpenCYC has been completely absent from the
| public discourse about A.I., does that mean there's a super
| secret collaboration going on? Or... hmm, maybe the NSA gets
| to throw a few hundred million bucks at the problem?
| riku_iki wrote:
| you would also need a facts base, and in my understanding
| OpenCYC is small compared to Wikidata
| stri8ed wrote:
| Do existing LLM's not already train on this data?
| kfrzcode wrote:
| Nope. Training data for the big LLMs is a corpus of text, not
| structured data. As far as I understand, structured data would
| introduce much more dimensionality with regard to
| parameterization.
| brlewis wrote:
| The linked tweet has a diagram where you can pretty quickly see
| that this isn't just about using wikidata as a training set.
| The paper linked from the tweet also gives a good summary on
| its first page.
| crazygringo wrote:
| Can it though?
|
| LLMs are currently trained on actual language patterns, and
| pick up facts that are repeated consistently, not one-off
| things -- and within all sorts of different contexts.
|
| Adding a bunch of unnatural "From Wikidata, <noun> <verb> <noun>"
| sentences to the training data, severed from any kind of context,
| seems like it would run the risk of:
|
| - Not increasing factual accuracy because there isn't enough
| repetition of them
|
| - Not increasing factual accuracy because these facts aren't
| being repeated consistently across other contexts, so they result
| in a walled-off part of the model that doesn't affect normal
| writing
|
| - And if they are massively repeated, all sorts of problems with
| overtraining and learning exact sentences rather than the
| conceptual content
|
| - Either way, introducing linguistic confusion to the LLM,
| thinking that making long lists of "From Wikidata, ..." is a
| normal way of talking
|
| If this is a technique that actually works, I'll believe it when
| I see it.
|
| (Not to mention that I don't think most of the stuff people
| are asking LLMs for is represented in Wikidata. Wikidata-type
| facts are already pretty decently handled by regular Google.)
| ivalm wrote:
| Fine-tuning. You can autogenerate all kinds of factual
| questions with one-word answers based on these triplets.
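| A rough sketch of that (the triples and templates here are
| made up for illustration):
|
| ```
| # Turn Wikidata-style triples into short question/answer pairs
| # that could serve as fine-tuning examples.
| import json
|
| triples = [
|     ("A Bronx Tale", "filming location", "New Jersey"),
|     ("Mount Everest", "elevation above sea level", "8,849 m"),
| ]
|
| TEMPLATES = [
|     "What is the {predicate} of {subject}?",
|     "Name the {predicate} of {subject}.",
| ]
|
| def to_examples(triples):
|     for subject, predicate, obj in triples:
|         for template in TEMPLATES:
|             yield {
|                 "prompt": template.format(subject=subject,
|                                           predicate=predicate),
|                 "completion": obj,
|             }
|
| with open("wikidata_qa.jsonl", "w") as f:
|     for example in to_examples(triples):
|         f.write(json.dumps(example) + "\n")
| ```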
| Closi wrote:
| Well that's not actually how it works - they are just getting a
| model (WikiSP & EntityLinker) to write a query that responds
| with the fact from Wikidata. Did you read the post or just the
| headline?
|
| Besides, let's not forget that humans are _also_ trained on
| language data, and although humans can also be wrong, if a
| human memorised all of Wikidata (by reading sentences/facts in
| 'training data') they would be pretty good at a pub quiz.
|
| Also, we obviously can't see anything inside how OpenAI train
| GPT, but I wouldn't be surprised if sources with a higher
| authority (e.g. wikidata) can be given a higher weight in the
| training data, and also if sources such as wikidata could be
| used with reinforcement learning to ensure that answers within
| the dataset are 'correctly' answered without hallucination.
| toomuchtodo wrote:
| In this context, these are more expert systems than LLMs, and
| as you enumerate, they can work well if built well. For
| example, Google surfaces search engine results directly. This
| is similar, but more powerful, because the Wikimedia Foundation
| can actually improve results, fill gaps, and improve overall
| performance, while Google DGAF.
|
| I would expect as the tide rises with regards to this tech,
| self hosting of training and providing services to prompts
| becomes easier. For Wikimedia, it'll just be another cluster
| and data pipeline system(s) at their datacenter.
| crazygringo wrote:
| Ah, I did misunderstand how it worked, thanks -- I was
| looking at the flow chart and just focusing on the part that
| said "From Wikidata, the filming location of 'A Bronx Tale'
| includes New Jersey and New York" that had an arrow feeding
| it into GPT-3...
|
| I'm not really sure how useful something this simple is,
| then. If it's not actually improving the factual accuracy in
| the training of the model itself, it's really just a hack
| that makes the whole system even harder to reason about.
| westurner wrote:
| The objectively true data part?
|
| Also there's Retrieval Augmented Generation (RAG)
| https://www.promptingguide.ai/techniques/rag :
|
| > _For more complex and knowledge-intensive tasks, it's
| possible to build a language model-based system that
| accesses external knowledge sources to complete tasks. This
| enables more factual consistency, improves reliability of
| the generated responses, and helps to mitigate the problem
| of "hallucination"._
|
| > _Meta AI researchers introduced a method called Retrieval
| Augmented Generation (RAG) to address such knowledge-
| intensive tasks. RAG combines an information retrieval
| component with a text generator model. RAG can be fine-
| tuned and its internal knowledge can be modified in an
| efficient manner and without needing retraining of the
| entire model._
|
| > _RAG takes an input and retrieves a set of
| relevant/supporting documents given a source (e.g.,
| Wikipedia)._
| The documents are concatenated as context with the original
| input prompt and fed to the text generator which produces
| the final output. This makes RAG adaptive for situations
| where facts could evolve over time. _This is very useful as
| LLMs' parametric knowledge is static._
|
| > _RAG allows language models to bypass retraining,
| enabling access to the latest information for generating
| reliable outputs via retrieval-based generation._
|
| > _Lewis et al. (2021) proposed a general-purpose fine-
| tuning recipe for RAG. A pre-trained seq2seq model is used
| as the parametric memory and a dense vector index of
| Wikipedia is used as non-parametric memory (accessed using
| a neural pre-trained retriever)._ [...]
|
| > _RAG performs strong on several benchmarks such as
| Natural Questions, WebQuestions, and CuratedTrec. RAG
| generates responses that are more factual, specific, and
| diverse when tested on MS-MARCO and Jeopardy questions. RAG
| also improves results on FEVER fact verification._
|
| > _This shows the potential of RAG as a viable option for
| enhancing outputs of language models in knowledge-intensive
| tasks._
|
| So, with various methods, I think having ground facts in
| the process somehow should improve accuracy.
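| A toy illustration of that retrieve-then-generate flow (the
| retriever here is a crude word-overlap scorer and `generate`
| is a stand-in for whatever LLM call you use, so treat it as a
| sketch rather than a real RAG stack):
|
| ```
| FACTS = [
|     "From Wikidata: the filming locations of 'A Bronx Tale' "
|     "include New Jersey and New York.",
|     "From Wikidata: Mount Everest's elevation above sea level "
|     "is 8,849 metres.",
| ]
|
| def retrieve(question: str, k: int = 1) -> list[str]:
|     q_words = set(question.lower().split())
|     ranked = sorted(FACTS,
|                     key=lambda d: -len(q_words & set(d.lower().split())))
|     return ranked[:k]
|
| def build_prompt(question: str) -> str:
|     context = "\n".join(retrieve(question))
|     return (f"Answer using only this context:\n{context}\n\n"
|             f"Question: {question}\nAnswer:")
|
| def generate(prompt: str) -> str:
|     """Stand-in for the text generator (the LLM of your choice)."""
|     raise NotImplementedError
|
| print(build_prompt("Where was 'A Bronx Tale' filmed?"))
| ```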
| rcfox wrote:
| Isn't repetition essentially a way of adding weight? If you
| could increase the inherent weight of Wikidata, wouldn't that
| provide the same effect?
| NovemberWhiskey wrote:
| If you want to increase the likelihood that answers will read
| like Wikipedia entries, sure.
| brandonasuncion wrote:
| Is it possible to finetune an LLM on the factual content
| without altering its linguistic characteristics?
|
| With Stable Diffusion, you're able to use LoRAs to
| introduce specific characters, objects, concepts, etc.
| while maintaining the same visual qualities of the base
| model.
|
| Why can't something similar be done with an LLM?
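| LoRA adapters do exist on the text side too - that's how a lot
| of open-model fine-tunes are done. A minimal sketch with
| Hugging Face's peft library (base model and hyperparameters
| purely illustrative; whether facts "stick" without changing
| style depends heavily on the data):
|
| ```
| from transformers import AutoModelForCausalLM
| from peft import LoraConfig, get_peft_model
|
| model = AutoModelForCausalLM.from_pretrained("gpt2")
|
| # Low-rank adapters on the attention projections; the base
| # weights stay frozen, so only a small set of parameters is
| # trained on the factual QA pairs.
| config = LoraConfig(
|     r=8,
|     lora_alpha=16,
|     lora_dropout=0.05,
|     target_modules=["c_attn"],   # GPT-2's fused attention projection
|     task_type="CAUSAL_LM",
| )
|
| model = get_peft_model(model, config)
| model.print_trainable_parameters()
| # ...then fine-tune on generated fact QA pairs with a normal
| # Trainer loop, and merge or swap the adapter at inference time.
| ```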
| visarga wrote:
| If I had the funds, I'd run the whole training set (GPT-4
| reportedly used 13 trillion tokens) through an LLM to mine
| factual statements, then do reconciliation - or even better,
| I'd save a summary description of the diverse results. In the
| end we'd end up with a universal KB. Even for controversial
| topics, it would at least model the distribution of opinions,
| and be able to confirm whether a statement exists in the
| database.
|
| Besides mining KB triplets I'd also use the LLM with contextual
| material to generate Wikipedia-style articles based off external
| references. It should write 1000x more articles covering all
| known names and concepts, creating trillions of synthetic tokens
| of high quality. This would be added to the pre-training stage.
| boznz wrote:
| Having an indexed database of facts - not half-facts,
| half-truths, or untruths - is the only way AI is ever going to
| be useful; and until it can fact-check for itself, these
| databases will need to be the training wheels.
| prosqlinjector wrote:
| Curating and presenting facts is a form of narrative and is not
| at all objective.
| mike_hock wrote:
| But then it needs extra filters so it doesn't accidentally say
| something based.
| Vysak wrote:
| I think this typo works
| TaylorAlexander wrote:
| I don't think it was a typo.
| night-rider wrote:
| 'Facts' based on citations that no longer exist, or if they do
| exist, they remain on Archive.org's Wayback Machine. And then
| when you visit the resource in question, the author is not
| credible enough to be believed and their 'facts' are on shaky
| ground. It's turtles all the way down.
| benopal64 wrote:
| I question the sentiment. I think people CAN argue the basis of
| a fact, however, being pragmatic and holistic can help provide
| some understanding. Truth is always relative and always has
| been. However, the human perspective holds real, tangible,
| recordable, and testable evidence. We rely on the multitude of
| perspectives to fully flesh out reality and determine the
| details and TRUTH of reality at multiple scales. The value of
| diverse human perspectives is similar to the value of
| perceiving an idea, concept, or object at different scales.
| riku_iki wrote:
| sounds like an algorithmically solvable problem..
| Racing0461 wrote:
| What if wiki articles are written using LLMs from now on? That
| would be "AI incest" if it's used as training/ground-truth
| data.
|
| I foresee data created before AI/LLMs being very valuable
| going forward, in much the same way that steel made before the
| detonation of the first atomic bomb is used for nuclear
| devices/MRIs/etc.
| ElectricalUnion wrote:
| There is even a XKCD for that: https://xkcd.com/978/
|
| s/a user's brain/llm/g
| klysm wrote:
| Strong doubt. The problem is LLMs don't have a robust
| epistemology (they can't), and are structurally unable to provide
| a basis over which they've "reasoned".
| blackbear_ wrote:
| Don't get too hung up on the present technical definition of
| LLM. Perhaps it is possible to find new architectures that are
| more suited to ground their claims.
| SrslyJosh wrote:
| > Don't get too hung up on the present technical definition
| of LLM.
|
| The paper is literally about LLMs. Speculation about future
| model architectures is irrelevant.
| lukev wrote:
| Using retrieval to look up a fact and then citing that fact in
| response to a query (with attribution) is absolutely within the
| capabilities of current LLMs.
|
| LLMs "hallucinate" all the time when pulling data from their
| weights (because that's how that works, it's all just token
| generation). But if the correct data is placed within their
| context they are very capable of presenting the data in natural
| language.
| kelseyfrog wrote:
| Humans, when probed, don't have a robust epistemology either.
|
| Our knowledge (and reality) is grounded in heuristics we reify
| into natural order and it's easy for us to forget that our
| conclusions exist as a veneer on our perceptions. Nearly every
| bit of knowledge we hold has an opposite twin that we hold as
| well. We favor completeness over consistency.
|
| When pressed, humans tend to justify their heuristics rather
| than reexamine them, because our minds have a clarity bias -
| i.e. we would rather feel like things are clear even if they
| are wrong. Oftentimes we can't go back and test whether they
| are wrong, which biases epistemological justifications even
| more.
|
| So no, our rationality, the vast majority of the time, is used
| to rationalize rather than to conclude.
| Animats wrote:
| Isn't this usually done by having something that takes in a
| query, finds possibly relevant info in a database, and adds it to
| the LLM prompt? That allows the use of very large databases
| without trying to store them inside the LLM weights.
| Linell wrote:
| Yes, it's called retrieval augmented generation.
| dmezzetti wrote:
| I think a better approach is using retrieval augmented generation
| with Wikipedia.
|
| This data source is designed for that:
| https://huggingface.co/NeuML/txtai-wikipedia.
|
| With this source, you can also select articles that are viewed
| the most, which is another important factor in validating facts.
| An article which has no views might not be the best source of
| information.
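| Roughly what the neuml/txtai-wikipedia card shows (exact API
| may differ across txtai versions): load the prebuilt index and
| filter to the most-viewed articles via the percentile field.
|
| ```
| from txtai.embeddings import Embeddings
|
| # Load the prebuilt Wikipedia embeddings index from the HF Hub
| embeddings = Embeddings()
| embeddings.load(provider="huggingface-hub",
|                 container="neuml/txtai-wikipedia")
|
| # Top matches for a topic, restricted to the top 1% of pages by views
| results = embeddings.search("""
|     SELECT id, text, score, percentile FROM txtai
|     WHERE similar('Roman Empire') AND percentile >= 0.99
| """)
| print(results[:3])
| ```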
| audiala wrote:
| Part of the issue is selecting the right Wikipedia article.
| Wikidata offers a way to know for sure that you are querying
| the LLM with the right data. Also, the Wikipedia txtai dataset
| is English-only.
| itissid wrote:
| I feel that if you use a pre-trained model to do these things
| without knowing the intersection of the test set and that
| dataset, it's very tough to know weather inference is in the
| transitive closure of the generated text the models were
| trained on or weather they really improved.
|
| There was another approach to grounding LLMs the other day
| from Normal Computing:
| https://blog.normalcomputing.ai/posts/2023-09-12-supersizing...
| in which they use Mosaic, but they also did not mention
| whether this was actually done.
|
| Sentient or not, I feel there should be a standard on
| aggressively filtering out overlap on training and test datasets
| for approaches like this.
| chrisweekly wrote:
| It's "whether". (weather is eg sunny or raining)
| thewanderer1983 wrote:
| Please don't. Wikipedia long abandoned neutrality. They aren't
| the bearers of truth.
| antipaul wrote:
| Is it AI, or just a lookup table?
| Scene_Cast2 wrote:
| For some more reading on using facts for ML, check out this
| discussion: https://news.ycombinator.com/item?id=37354000
| xacky wrote:
| I know for a fact that there are a lot of unreverted vandal
| edits in Wikidata, because Wikidata's bots enter data faster
| than Special:RecentChanges can be monitored. Even Wikipedia
| still regularly gets 15+ year old hoaxes added to its hoax
| list.
| born-jre wrote:
| When a 100+B model hallucinates, that's a problem, but does a
| Mistral 7B (a Q4_K quant is around a 4GB file) even have the
| parameters to encode enough information to be
| hallucination-proof? LLMs cannot know what they do not know.
|
| So maybe we should be building smaller models where we use
| their generation abilities, not their facts, and instead teach
| them to query another knowledge base system (reverse RAG) for
| facts.
___________________________________________________________________
(page generated 2023-11-17 23:00 UTC)