[HN Gopher] llms.txt directory
___________________________________________________________________
llms.txt directory
Author : pizza
Score : 40 points
Date : 2024-12-23 18:02 UTC (4 hours ago)
(HTM) web link (directory.llmstxt.cloud)
(TXT) w3m dump (directory.llmstxt.cloud)
| whoistraitor wrote:
| Perplexity is listed, but do they actually abide by llms.txt? And
| how can we prove they do? Is it all good faith? I wish there were
| a better way.
| jsheard wrote:
| llms.txt isn't an opt-out signal like robots.txt, it's a way to
| provide a simplified version of pages that are easier for LLMs
| to ingest. It's more of an opt-in for being scraped more
| accurately.
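| For reference, per the llmstxt.org proposal the file is plain
| markdown: an H1 title, a blockquote summary, then H2 sections
| of links. Roughly (hypothetical site):
|
|     # Example Project
|
|     > Short summary of what the project does and who it's for.
|
|     ## Docs
|
|     - [Quick start](https://example.com/quickstart.md): setup
|     - [API reference](https://example.com/api.md): full API
|
|     ## Optional
|
|     - [Changelog](https://example.com/changelog.md)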
| jsheard wrote:
| It's telling that nearly every site listed is directly involved
| with AI in some way; unsurprisingly, the broader internet isn't
| particularly interested in going out of its way to make its
| content easier for AI companies to scrape.
|
| Deliberately putting garbage data in your llms.txt could be funny
| though.
| KTibow wrote:
| I've seen many people joke about intentionally poisoning
| training data, but has that ever actually worked?
| jsheard wrote:
| It's hard to gauge the effectiveness of poisoning huge
| training sets since anything you do is a figurative drop in
| the ocean, but if you can poison the small amount of data
| that an AI agent requests on-the-fly to use with RAG then I
| would guess it's much easier to derail it.
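| A sketch of the difference (hypothetical Python; call_llm
| stands in for whatever model API the agent uses):
|
|     import requests
|
|     def answer_with_rag(question: str, url: str) -> str:
|         # Whatever the page serves right now goes into the
|         # context verbatim; there is no trillion-token ocean
|         # to dilute it.
|         page = requests.get(url, timeout=10).text
|         prompt = f"Context:\n{page}\n\nQuestion: {question}"
|         return call_llm(prompt)  # hypothetical model call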
| nyrikki wrote:
| This study shows that controlling 0.1% of the training data may
| be enough.
|
| https://arxiv.org/abs/2410.13722v1
|
| I have noticed some widely copied but incorrect LeetCode
| examples leaking into training data.
|
| I suspect it depends on domain specificity, but that scale
| seems within the reach of an SEO spammer or a decentralized
| group of individuals.
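| (For scale, assuming a 10-trillion-token pretraining corpus:
| 0.1% is still roughly 10 billion tokens.)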
| ALittleLight wrote:
| Seems silly to put garbage data there. Like intentionally doing
| bad SEO so Google doesn't link to you.
|
| I think you should think about it as: I want the LLM to
| recognize my site as a high quality resource and direct traffic
| to me.
|
| Imagine a user asks ChatGPT a question. The LLM has scraped
| your website and answers the question. The user wants some kind
| of follow-up - read more, what's the source, how can I buy
| this, whatever - so the LLM links the page it got the data
| from.
|
| LLMs seem like they're supplanting search. Being early to work
| with them is an advantage. Working to make your pages look low
| quality seems like an odd choice.
| vouaobrasil wrote:
| I'd prefer not to play that game. I'd rather lose a bit of
| money and traffic than help LLMs, as far as humanly possible.
| spencerchubb wrote:
| You seem to be misunderstanding why a website would make
| llms.txt
|
| Obviously, they would not make it just for an AI company to
| scrape
|
| Here's an example. Let's say I run a dev tools company, and I
| want users to be able to find info about me as easily as
| possible. Maybe a user's preferred way of searching the web is
| through a chatbot. If that chatbot also uses llms.txt, it's
| easy for me to deliver the info, and easy for them to consume.
| Win-win
|
| Of course adoption is not very widespread, but such is the case
| for every new standard.
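| A chatbot that knows the convention might do something like
| this (hypothetical Python, not any vendor's pipeline):
|
|     import requests
|
|     def fetch_docs(domain: str) -> str:
|         # Prefer the LLM-friendly markdown if the site
|         # publishes one; otherwise fall back to the HTML.
|         r = requests.get(f"https://{domain}/llms.txt",
|                          timeout=10)
|         if r.ok:
|             return r.text
|         return requests.get(f"https://{domain}/",
|                             timeout=10).text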
| riffraff wrote:
| llms.txt has a section on "Existing standards" which completely
| forgets about well-known URIs[0]; an issue was opened three
| months ago[1] but seems to have been ignored.
|
| [0] https://en.wikipedia.org/wiki/Well-known_URI
|
| [1] https://github.com/AnswerDotAI/llms-txt/issues/2
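| (Concretely, under RFC 8615 the file would live at
| https://example.com/.well-known/llms.txt rather than at the
| site root, https://example.com/llms.txt, as the proposal
| currently specifies.)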
| bradarner wrote:
| Have there been any declarations by various AI companies (e.g.
| OpenAI, Anthropic, Perplexity) that they are actually relying
| upon these llms.txt files?
|
| Is there any evidence that the presence of the llms.txt files
| will lead to increased inclusion in LLM responses?
| ashenke wrote:
| And if they are, can I put subtly incorrect data in this file
| to poison LLM responses, while keeping the content I serve to
| humans at full quality?
| bradarner wrote:
| I'm curious, what would be the reason for doing this?
| nunodonato wrote:
| Anthropic itself publishes a bunch of its own llms.txt files,
| so I guess that means something.
| Juliate wrote:
| Why should websites implement yet another custom output format
| for people^Wsoftware that won't bother to use existing, loosely
| yet somewhat structured, open formats?
| caseyy wrote:
| Yes! Please standardize the web into simple hypertext so "LLMs
| can use it". I promise I won't build any tools to read it
| without the ads, tracking, and client-side JavaScript garbage
| myself. I will not partake in any such efforts to surf the web
| as it was intended to be, before its commercialization and
| commodification. No, sir, I could never!
| Lariscus wrote:
| Making it easier for tech companies to steal my art. Sure, I will
| get right to it. In what world do these thieves live? I hope they
| catch something nasty!
| vouaobrasil wrote:
| This is a great resource to at least figure out all the LLMs out
| there and block them. I already updated my robots.txt file. Of
| course, that is not sufficient, but at least it's a start and
| hopefully the blocking can get more sophisticated as time goes
| on.
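| A starting point, using crawler tokens the major vendors
| document (and which, as noted above, are honored only on good
| faith):
|
|     User-agent: GPTBot
|     User-agent: ClaudeBot
|     User-agent: PerplexityBot
|     User-agent: CCBot
|     User-agent: Google-Extended
|     Disallow: /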
___________________________________________________________________
(page generated 2024-12-23 23:00 UTC)