[HN Gopher] llms.txt directory
       ___________________________________________________________________
        
       llms.txt directory
        
       Author : pizza
       Score  : 40 points
       Date   : 2024-12-23 18:02 UTC (4 hours ago)
        
 (HTM) web link (directory.llmstxt.cloud)
 (TXT) w3m dump (directory.llmstxt.cloud)
        
       | whoistraitor wrote:
       | Perplexity is listed, but do they actually abide by llms.txt? And
       | how can we prove they do? Is it all good faith? I wish there were
       | a better way.
        
         | jsheard wrote:
          | llms.txt isn't an opt-out signal like robots.txt; it's a
          | way to provide simplified versions of pages that are easier
          | for LLMs to ingest. It's more of an opt-in to being scraped
          | more accurately.
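          | 
          | For illustration, a minimal llms.txt is just a markdown file
          | served from the site root, per the llmstxt.org proposal (the
          | URLs below are hypothetical):
          | 
          |     # Example Corp
          | 
          |     > Example Corp builds developer tools.
          | 
          |     ## Docs
          | 
          |     - [Quickstart](https://example.com/docs/quickstart.md):
          |       getting-started guide
          |     - [API reference](https://example.com/docs/api.md):
          |       full endpoint reference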
        
       | jsheard wrote:
        | It's telling that nearly every site listed is directly
        | involved with AI in some way; unsurprisingly, the broader
        | internet isn't particularly interested in going out of its
        | way to make its content easier for AI companies to scrape.
       | 
       | Deliberately putting garbage data in your llms.txt could be funny
       | though.
        
         | KTibow wrote:
          | I've seen many people joke about intentionally poisoning
          | training data, but has that ever actually worked?
        
           | jsheard wrote:
            | It's hard to gauge the effectiveness of poisoning huge
            | training sets, since anything you do is a figurative drop
            | in the ocean. But if you can poison the small amount of
            | data that an AI agent requests on the fly to use with
            | RAG, then I would guess it's much easier to derail it.
        
             | nyrikki wrote:
              | This study shows that controlling as little as 0.1% of
              | the training data may be enough:
             | 
             | https://arxiv.org/abs/2410.13722v1
             | 
              | I have noticed some widely copied but incorrect
              | leetcode examples leaking into the training data.
             | 
              | I suspect it depends on domain specificity, but that
              | seems within the ability of an SEO spammer or a
              | decentralized group of individuals.
        
         | ALittleLight wrote:
         | Seems silly to put garbage data there. Like intentionally doing
         | bad SEO so Google doesn't link you.
         | 
          | I think you should look at it as: I want the LLM to
          | recognize my site as a high-quality resource and direct
          | traffic to me.
         | 
          | Imagine a user asks ChatGPT a question. The LLM has scraped
          | your website and answers the question. The user wants some
          | kind of follow-up - read more, what's the source, how can I
          | buy this, whatever - so the LLM links to the page it got
          | the data from.
         | 
         | LLMs seem like they're supplanting search. Being early to work
         | with them is an advantage. Working to make your pages look low
         | quality seems like an odd choice.
        
           | vouaobrasil wrote:
           | I'd prefer not to play that game. I'd rather lose a bit of
           | money and traffic and not help LLMs as far as humanly
           | possible.
        
         | spencerchubb wrote:
          | You seem to be misunderstanding why a website would make an
          | llms.txt file.
         | 
          | Obviously, they would not make it just for an AI company to
          | scrape.
         | 
         | Here's an example. Let's say I run a dev tools company, and I
         | want users to be able to find info about me as easily as
         | possible. Maybe a user's preferred way of searching the web is
         | through a chatbot. If that chatbot also uses llms.txt, it's
         | easy for me to deliver the info, and easy for them to consume.
          | Win-win.
         | 
         | Of course adoption is not very widespread, but such is the case
         | for every new standard.
        
       | riffraff wrote:
        | The llms.txt spec has a section on "Existing standards" that
        | completely overlooks well-known URIs[0]; there's an issue
        | opened three months ago[1], but it seems to have been
        | ignored.
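        | 
        | For comparison, the RFC 8615 convention would put the file at
        | something like
        | 
        |     https://example.com/.well-known/llms.txt
        | 
        | rather than at the site root as the spec currently proposes.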
       | 
       | [0] https://en.wikipedia.org/wiki/Well-known_URI
       | 
       | [1] https://github.com/AnswerDotAI/llms-txt/issues/2
        
       | bradarner wrote:
       | Have there been any declarations by various AI companies (e.g.
       | OpenAI, Anthropic, Perplexity) that they are actually relying
       | upon these llms.txt files?
       | 
       | Is there any evidence that the presence of the llms.txt files
       | will lead to increased inclusion in LLM responses?
        
         | ashenke wrote:
          | And if they are, can I put subtly incorrect data in this
          | file to poison LLM responses while keeping the content
          | designed for humans at the highest quality?
        
           | bradarner wrote:
           | I'm curious, what would be the reason for doing this?
        
         | nunodonato wrote:
          | Anthropic itself publishes a bunch of its own llms.txt
          | files, so I guess that means something.
        
       | Juliate wrote:
       | Why should websites implement yet another custom output format
       | for people^Wsoftware that won't bother to use existing, loosely
       | yet somewhat structured, open formats?
        
       | caseyy wrote:
        | Yes! Please standardize the web into simple hypertext so
        | "LLMs can use it". I promise I won't build any tools to read
        | it without the ads, tracking, and client-side JavaScript
        | garbage myself. I will not partake in any such efforts to
        | surf the web as it was intended to be, before its
        | commercialization and commodification. No, sir, I could
        | never!
        
       | Lariscus wrote:
       | Making it easier for tech companies to steal my art. Sure, I will
       | get right to it. In what world do these thieves live? I hope they
       | catch something nasty!
        
       | vouaobrasil wrote:
        | This is a great resource for at least figuring out which LLM
        | crawlers are out there and blocking them. I already updated
        | my robots.txt file. Of course, that is not sufficient, but at
        | least it's a start, and hopefully the blocking can get more
        | sophisticated as time goes on.
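        | 
        | As a sketch, a robots.txt along these lines blocks several of
        | the better-known AI crawlers (the user-agent strings are
        | worth double-checking against each vendor's current
        | documentation):
        | 
        |     User-agent: GPTBot
        |     User-agent: ClaudeBot
        |     User-agent: PerplexityBot
        |     User-agent: Google-Extended
        |     User-agent: CCBot
        |     Disallow: /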
        
       ___________________________________________________________________
       (page generated 2024-12-23 23:00 UTC)