[HN Gopher] Citation Needed - Wikimedia Foundation's Experimenta...
___________________________________________________________________
Citation Needed - Wikimedia Foundation's Experimental LLM/RAG
Chrome Extension
Author : brokensegue
Score : 143 points
Date : 2024-05-11 21:12 UTC (1 days ago)
(HTM) web link (chromewebstore.google.com)
(TXT) w3m dump (chromewebstore.google.com)
| brokensegue wrote:
| Experimental LLM powered RAG application for checking claims on
| the Internet against Wikipedia.
|
| You can read more about it at
| https://meta.wikimedia.org/wiki/Future_Audiences/Experiment:...
|
| Disclaimer: I worked on this.
| NegativeLatency wrote:
| Any plans for a Safari extension?
| andybak wrote:
| Or Firefox
| purple-leafy wrote:
| Hey matey, do you folks need any more devs to bring ideas or
| code?
|
| I'm heavily into browser extension development! I've done some
| insane things with them.
|
| I've built about 8 browser extensions in the last 6 months,
| most of them have thousands of users and one of them had half a
| million users
|
| Currently building an LLM powered design assistant extension.
|
| If you'd like to chat im reachable by email at
| "hello[at]papillonsoftware[dot]dev"
| squigz wrote:
| > I've built about 8 browser extensions in the last 6 months,
| most of them have thousands of users and one of them had half
| a million users
|
| Do you have any concerns about your ability to properly
| maintain so many extensions?
| purple-leafy wrote:
| Not really, most of them are "feature complete" as they
| target a very small surface area or feature.
|
| For instance, one of them reveals salaries on job seeker
| sites, and is feature complete as it does what it's meant
| to do bug free and fast
| squigz wrote:
| No web extension is going to remain feature complete or
| bug free for very long. What happens when the job seeker
| sites change their HTML/CSS/JS?
| card_zero wrote:
| It looks like this doesn't check whether the article itself
| cites a source for the claim. Is that why it's called "Citation
| Needed"? Because it doesn't actually cite anything?
| bitsinthesky wrote:
| Anywhere that gets into technical details and design?
| input_sh wrote:
| Not the person you've responded to, but I found this:
| https://gitlab.wikimedia.org/repos/future-
| audiences/citation...
| aurareturn wrote:
| This is a great idea. Hopefully we also have LLMs fact checking
| and flagging Wikipedia articles.
| ysavir wrote:
| Wouldn't that just be checking wikipedia articles against the
| same wikipedia articles which the LLM originally trained on?
| relyks wrote:
| No, citations are supposed to be from reliable secondary
| sources or authoritative primary ones external to Wikipedia
| josefx wrote:
| Many of those sources are not available online (books),
| point at paywals (research papers) or are dead . Unless you
| have an API that can bypass these issues reliable you are
| stuck with a tool that has already landed several lawyers
| in hot water for making up citations on the fly.
| falcor84 wrote:
| The Wikipedia Library project [0] grants active editors
| access to a wide range of otherwise paywalled sources. I
| wonder if it could not be extended to this sort of bot.
|
| [0] https://diff.wikimedia.org/2022/01/19/the-wikipedia-
| library-...
| ysavir wrote:
| What LLM will we check those against? How do we trust that
| its source materials are accurate and correct?
| shiomiru wrote:
| I think the idea is that you feed the LLM the article &
| the source material (from citations) and it checks if
| they match up.
| ysavir wrote:
| Sure, but what happens when the article is updated at a
| later date, or rescinded, etc.? Should the LLMs be
| trained to repeat the article verbatim, or to say
| "according to this article[0], blah blah blah" with links
| to the sources?
|
| Wikipedia works because we can update it in real time in
| response to changes. LLMs that need to constantly recrawl
| every time a page on the internet is updated, and that
| properly contextualize the content of that page, is a
| huge ask. Because at that point, it stops being an LLM
| and starts being a very energy-hungry search engine.
| shiomiru wrote:
| Well, it's just a bot, so no need for it to instantly
| react to any and every update.
|
| I also have my doubts on whether it is possible to
| implement efficiently (or at all). I suspect that just
| yanking in the article and all the sources is non-
| feasible, and any smaller chunking would be missing too
| much context. Plus LLM logical capabilities are
| questionable too, so I don't know how well the comparison
| would work...
| visarga wrote:
| Couple this with a search based article generator so you have an
| article generator and an article checker, and then off you go to
| generating 1 trillion pages. Could be useful training content for
| LLMs but also used by people.
| bawolff wrote:
| I wonder how this will go over politically with the wikipedia
| community. AI is such a hot button issue, and the risk of
| hallucinating and saying something wrong seems more pressing in
| this application then most.
| jszymborski wrote:
| I haven't used this, but reading the description, it sounds
| like it's primarily a search engine for wiki articles related
| to selected text. If so, I imagine it wouldn't be super
| susceptible to hallucinations.
| Kwpolska wrote:
| It uses AI to parse the selected text to choose search
| keywords, and to parse the related Wikipedia article to
| decide if it agrees with the selected text. It obviously can
| bullshit in both cases.
| falcor84 wrote:
| Searching for keywords shouldn't be likely to hallucinate.
| And I would assume they would have a subsequent step to run
| a quick check to see they're really in the text. And if
| there is some issue, I suppose we can always fall back to
| something like TF-IDF.
|
| The second part does seem more problematic, but still, as
| essentially a yes/no question, it should be significantly
| less likely to hallucinate/confabulate than for other
| tasks.
| Kwpolska wrote:
| The extension appears to also produce an explanation of
| its decision, so there is potential to bullshit: https://
| meta.wikimedia.org/wiki/Future_Audiences/Experiment:...
|
| Also, if you look into their "wrong" example closer, it
| is a bit misleading, as both sources are correct. Joe
| Biden was 29 on election day, but 30 when he was sworn
| in. Understanding this requires more context than the LLM
| was apparently provided.
| AJRF wrote:
| Honest question - do you expect there to be hallucinations
| in this case? I have extensive experience in LLMs and also
| from talking to peers with similar experience, it is
| uncontroversial to say that given grounding like this LLMs
| won't hallucinate.
|
| I am not sure if when people say this they just don't have
| experience building with LLMs or they do and have
| experience that would make for a very popular and
| interesting research paper.
| Kwpolska wrote:
| The burden of the proof of no bullshit is on the LLM
| proponents.
| hombre_fatal wrote:
| Well wait a sec, you'd be just as guilty of confident
| bullshit if your claims above don't pan out, and they
| didn't even come from an LLM so it's worse.
| simonw wrote:
| "it is uncontroversial to say that given grounding like
| this LLMs won't hallucinate"
|
| I disagree. LLMs that have been "grounded" in text that
| has been injected into their context (RAG style) can
| absolutely stoln hallucinate.
|
| They are less likely to, but that's not the same as
| saying they "won't hallucinate" at all.
|
| I've spotted this myself. It's not uncommon for example
| for Google Gemini to include a citation link which, when
| followed, doesn't support the "facts" it reported.
|
| Furthermore, if you think about how most RAG
| implementations work you'll spot plenty of potential for
| hallucination. What if the text that was pulled into the
| context was part of a longer paragraph that started "The
| following is a common misconception:" - but that prefix
| was omitted from the extract?
| qrian wrote:
| Asking about itself returns this:
|
| > "Citation Needed is an experimental feature developed in 2024"
|
| - The provided passages do not contain any information about a
| feature called 'Citation Needed' being developed in 2024.
|
| - Wikipedia
|
| - Discouragement in education
|
| - not be relied
|
| I know I'm not using it for intended purposes but it seemed
| funny.
| mattyyeung wrote:
| Can quotations be hallucinated? Or are you using something like
| "deterministic quoting"[1]?
|
| Disclosure: author on that work.
|
| [1] https://mattyyeung.github.io/deterministic-quoting
| tuananh wrote:
| No firefox extension though.
| neilv wrote:
| If Wikimedia Foundation didn't get paid for basing this project
| on an aggressively embrace&extend browser, they should look
| into whether they're leaving money on the table, since projects
| like this help extinguish the browser's competition.
| youssefabdelm wrote:
| Does this only work on Chrome? Can't seem to make it work on Arc.
| Daub wrote:
| > A chrome extension for finding citations in Wikipedia
|
| In academia, Wikepedia citations are generally a no no. One
| reason is their unrelaibity (the author is citing a source that
| they themselves can edit). More importantly, Wikepedia may be a
| good place to find primary sources, but in itself it is a
| secondary source.
| jimbobthrowawy wrote:
| It really shouldn't, considering they have a rule about not
| using primary sources.
| _notreallyme_ wrote:
| Yes, they explicitly classify wikipedia as a tertiary source
| [1].
|
| Wikipedia is good for finding secondary source, and then
| primary sources by following the links.
|
| [1] https://en.wikipedia.org/wiki/Wikipedia:Primary_Secondary
| _an...
| boxed wrote:
| That justification seems a bit behind the times honestly. We've
| now seen actual academic fraud on a massive scale with
| extremely little in the way of a correction to fix this, while
| at the same time we've seen Wikipedia handle abuse extremely
| well. The academic fraud is a threat to Wikipedia, more than
| using wikipedia links is a threat to academia.
| Waterluvian wrote:
| This seems no different than it's always been. Even before
| Wikipedia you would not cite secondary sources. But you sure
| would use them to get a foothold on a topic and find some of
| those sources.
| lozenge wrote:
| The point is to find the information in Wikipedia which often
| then has a citation to another source. If you search Google you
| often find repetitions of the information but most sites don't
| cite sources.
| bawolff wrote:
| > One reason is their unrelaibity (the author is citing a
| source that they themselves can edit)
|
| There is plenty of reasons why wikipedia is an inapropriate
| source to cite most of the time in academia, but that surely is
| not one of them.
|
| Acedemics cite their own papers or other sources they have
| editorial control over all the time.
| ruined wrote:
| wikimedia's git repo for this extension
|
| https://gitlab.wikimedia.org/repos/future-audiences/citation...
|
| edit: the readme build instructions are incomplete and i don't
| think hotreload works. use `npm run build-dev` to get a working
| build.
|
| it's not obvious to me what prevents this from being a firefox
| extension as well - it might be the sidebar/sidepanel api
| differences, but i haven't played with those much
| ale42 wrote:
| @wikimedia: Firefox version please
___________________________________________________________________
(page generated 2024-05-12 23:02 UTC)